kubespray-troubleshooting
npx skills add https://github.com/sigridjineth/kubespray-skills --skill kubespray-troubleshooting
Kubespray Troubleshooting
Overview
Diagnose and fix common Kubespray deployment failures. Most failures stem from network misconfiguration, etcd issues, or stale state from previous attempts.
Core principle: Read the exact task name that failed, check logs on that specific node, then fix and re-run (Ansible is idempotent).
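For example, the failing task and node appear directly in the Ansible output; the sketch below uses the etcd health-check failure shown later in this document (the node name controller-0 comes from that example):
# The TASK line names the component, the fatal: line names the node:
#   TASK [etcd : Configure | Wait for etcd cluster to be healthy]
#   fatal: [controller-0]: FAILED! => ...
# SSH to that node and read the matching service log
ssh controller-0
journalctl -u etcd --since "30 min ago"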
When to Use
- Deployment fails mid-playbook
- kubeadm join errors
- etcd health check timeouts
- Nodes stuck in NotReady state
- Certificate-related failures
Not for: Initial deployment setup (use kubespray-deployment), upgrades (use kubespray-operations), certificate renewal (use kubespray-certificates)
Quick Diagnostic Flow
Playbook failed
        │
        ▼
┌─────────────────┐
│   Which task?   │
└────────┬────────┘
         │
   ┌─────┼─────┬───────────┐
   │     │     │           │
   ▼     ▼     ▼           ▼
 etcd   join  containerd  other
   │     │     │           │
   ▼     ▼     ▼           ▼
 Check  Check  Check       Check
 etcd   IP     containerd  Ansible
 logs   config status      logs -vvv
| Task Failed | First Check | Command |
|---|---|---|
| etcd health | etcd logs | journalctl -u etcd -f |
| kubeadm join | IP configuration | Verify ip= in inventory |
| container-engine | containerd status | systemctl status containerd |
| download | Network/proxy | Check internet connectivity |
| any task | Ansible debug | Re-run with -vvv flag |
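For the last row, a practical pattern is to capture the verbose output to a file so the failing task and node can be found later (the inventory path matches the examples in this document; the log file name is arbitrary):
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -vvv 2>&1 | tee deploy.log
grep -n "fatal:" deploy.log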
Problem: VirtualBox NAT IP (10.0.2.15)
Symptom:
error execution phase preflight: couldn't validate the identity of the API Server:
Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info":
dial tcp 10.0.2.15:6443: connect: connection refused
Cause: Kubespray detected the VirtualBox NAT interface (10.0.2.15) instead of the host-only network.
Fix: Add an explicit ip= to the inventory:
k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10
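A minimal inventory sketch with an explicit ip= on every host (host names and addresses are examples; adjust to your environment):
[all]
k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10
k8s-w1  ansible_host=192.168.10.11 ip=192.168.10.11
k8s-w2  ansible_host=192.168.10.12 ip=192.168.10.12

[kube_control_plane]
k8s-ctr

[etcd]
k8s-ctr

[kube_node]
k8s-w1
k8s-w2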
If the cluster was already deployed with the wrong IP, you must reset and redeploy:
ansible-playbook -i inventory/mycluster/inventory.ini reset.yml -b
# Fix inventory, then:
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
Problem: etcd Health Check Failure
Symptom:
TASK [etcd : Configure | Wait for etcd cluster to be healthy]
fatal: [controller-0]: FAILED! => {"cmd": "etcdctl endpoint health"...
"dial tcp 192.168.10.100:2379: connect: connection refused"
Diagnose:
# On etcd node
systemctl status etcd
journalctl -u etcd -f
# Check if listening
ss -tlnp | grep 2379
Common causes:
- Wrong IP in etcd config – Reset and redeploy with the correct ip=
- Certificate mismatch – Check permissions under /etc/ssl/etcd/ssl/
- Firewall blocking – Ensure ports 2379/2380 are open (see the checks below)
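Quick checks for the last two causes (the IP is the etcd address from the symptom above; adjust ports and tooling to your firewall):
# On the etcd node: certs must be readable by the etcd process
ls -l /etc/ssl/etcd/ssl/
# From another node: client (2379) and peer (2380) ports must be reachable
nc -zv 192.168.10.100 2379
nc -zv 192.168.10.100 2380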
Fix for stale state:
ansible-playbook -i inventory/mycluster/inventory.ini reset.yml -b
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
Problem: Nodes Stuck NotReady
Symptom: kubectl get nodes shows NotReady status
Diagnose:
# Check kubelet
systemctl status kubelet
journalctl -u kubelet -f
# Check CNI
ls /etc/cni/net.d/
ls /opt/cni/bin/
# Check node conditions
kubectl describe node <node-name>
Common causes:
- CNI not installed – Check that the network_plugin role completed
- containerd not running – systemctl restart containerd
- kubelet misconfigured – Check /etc/kubernetes/kubelet-config.yaml (verification sketch below)
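A verification sketch after applying one of the fixes above (the CNI pod name depends on your network plugin, e.g. calico-node or kube-flannel):
# Restart the runtime and kubelet, then confirm the node recovers
systemctl restart containerd
systemctl restart kubelet
# Confirm the CNI pod on this node is Running, then re-check the node
kubectl get pods -n kube-system -o wide | grep <node-name>
kubectl get nodes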
Problem: “No hosts matched”
Symptom:
[WARNING]: Could not match supplied host pattern, ignoring: etcd
skipping: no hosts matched
Cause: Inventory path or syntax error
Fix:
# Use file path, not directory
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
# Verify inventory parses correctly
ansible -i inventory/mycluster/inventory.ini etcd --list-hosts
ansible -i inventory/mycluster/inventory.ini kube_control_plane --list-hosts
Problem: Container Runtime Not Running
Symptom:
[ERROR CRI]: container runtime is not running:
"transport: Error while dialing dial unix /var/run/containerd/containerd.sock:
connect: no such file or directory"
Fix:
# Check containerd
systemctl status containerd
journalctl -u containerd
# Restart if needed
systemctl restart containerd
# Verify socket exists
ls -la /var/run/containerd/containerd.sock
Problem: Certificate Errors
Symptom:
x509: certificate has expired or is not yet valid
Diagnose:
# Check cert expiration
kubeadm certs check-expiration
# Check specific cert
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
Fix: See kubespray-certificates skill for renewal procedures.
Reset Procedure
When the deployment is corrupted beyond repair:
# Full reset - removes all K8s components
ansible-playbook -i inventory/mycluster/inventory.ini reset.yml -b
# Confirm with "yes" when prompted
# After reset, verify clean state
systemctl status kubelet # should be inactive
ls /etc/kubernetes/ # should be empty/minimal
# Redeploy
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
Note: Reset removes etcd data. All cluster state is lost.
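If you want to keep the etcd data for post-mortem analysis, snapshot it before running reset.yml. A minimal sketch, assuming etcdctl is available on the etcd node and Kubespray's default certificate layout (cert file names are per-node; adjust to match yours):
ETCDCTL_API=3 etcdctl snapshot save /root/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-<node>.pem \
  --key=/etc/ssl/etcd/ssl/admin-<node>-key.pem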
Log Locations
| Component | Log Command |
|---|---|
| etcd | journalctl -u etcd |
| kubelet | journalctl -u kubelet |
| containerd | journalctl -u containerd |
| API server | kubectl logs -n kube-system kube-apiserver-<node> |
| Ansible | Run with -vvv for debug output |
Re-running After Failure
Ansible is idempotent – safe to re-run after fixing issues:
# Re-run full playbook (skips completed tasks)
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
# Re-run specific tags only
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b --tags etcd
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b --tags network
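To retry only the host that failed, --limit can shorten the run (the host name below is an example); note that many Kubespray tasks need facts from the etcd and control-plane hosts, so fall back to the full inventory if the limited run fails on undefined variables:
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b --limit k8s-w1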