kubespray-troubleshooting
npx skills add https://github.com/sigridjineth/kubespray-skills --skill kubespray-troubleshooting
Kubespray Troubleshooting
Overview
Diagnose and fix common Kubespray deployment failures. Most failures stem from network misconfiguration, etcd issues, or stale state from previous attempts.
Core principle: Read the exact task name that failed, check logs on that specific node, then fix and re-run (Ansible is idempotent).
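For example, the failing task and node appear directly in the Ansible output; the sketch below uses the etcd health-check failure shown later in this document (the node name controller-0 comes from that example):
# The TASK line names the component, the fatal: line names the node:
#   TASK [etcd : Configure | Wait for etcd cluster to be healthy]
#   fatal: [controller-0]: FAILED! => ...
# SSH to that node and read the matching service log
ssh controller-0
journalctl -u etcd --since "30 min ago"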
When to Use
- Deployment fails mid-playbook
- kubeadm join errors
- etcd health check timeouts
- Nodes stuck in NotReady state
- Certificate-related failures
Not for: Initial deployment setup (use kubespray-deployment), upgrades (use kubespray-operations), certificate renewal (use kubespray-certificates)
Quick Diagnostic Flow
Playbook failed
        │
        ▼
┌─────────────────┐
│   Which task?   │
└────────┬────────┘
         │
   ┌─────┼─────┬───────────┐
   │     │     │           │
   ▼     ▼     ▼           ▼
 etcd   join  containerd  other
   │     │     │           │
   ▼     ▼     ▼           ▼
 Check  Check  Check       Check
 etcd   IP     containerd  Ansible
 logs   config status      logs -vvv
| Task Failed | First Check | Command |
|---|---|---|
| etcd health | etcd logs | journalctl -u etcd -f |
| kubeadm join | IP configuration | Verify ip= in inventory |
| container-engine | containerd status | systemctl status containerd |
| download | Network/proxy | Check internet connectivity |
| any task | Ansible debug | Re-run with -vvv flag |
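For the last row, a practical pattern is to capture the verbose output to a file so the failing task and node can be found later (the inventory path matches the examples in this document; the log file name is arbitrary):
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -vvv 2>&1 | tee deploy.log
grep -n "fatal:" deploy.log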
Problem: VirtualBox NAT IP (10.0.2.15)
Symptom:
error execution phase preflight: couldn't validate the identity of the API Server:
Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info":
dial tcp 10.0.2.15:6443: connect: connection refused
Cause: Kubespray detected the VirtualBox NAT interface (10.0.2.15) instead of the host-only network.
Fix: Add an explicit ip= to the inventory:
k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10
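A minimal inventory sketch with an explicit ip= on every host (host names and addresses are examples; adjust to your environment):
[all]
k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10
k8s-w1  ansible_host=192.168.10.11 ip=192.168.10.11
k8s-w2  ansible_host=192.168.10.12 ip=192.168.10.12

[kube_control_plane]
k8s-ctr

[etcd]
k8s-ctr

[kube_node]
k8s-w1
k8s-w2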
If the cluster was already deployed with the wrong IP, you must reset and redeploy:
ansible-playbook -i inventory/mycluster/inventory.ini reset.yml -b
# Fix inventory, then:
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
Problem: etcd Health Check Failure
Symptom:
TASK [etcd : Configure | Wait for etcd cluster to be healthy]
fatal: [controller-0]: FAILED! => {"cmd": "etcdctl endpoint health"...
"dial tcp 192.168.10.100:2379: connect: connection refused"
Diagnose:
# On etcd node
systemctl status etcd
journalctl -u etcd -f
# Check if listening
ss -tlnp | grep 2379
Common causes:
- Wrong IP in etcd config – Reset and redeploy with the correct ip=
- Certificate mismatch – Check permissions under /etc/ssl/etcd/ssl/
- Firewall blocking – Ensure ports 2379/2380 are open (see the checks below)
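Quick checks for the last two causes (the IP is the etcd address from the symptom above; adjust ports and tooling to your firewall):
# On the etcd node: certs must be readable by the etcd process
ls -l /etc/ssl/etcd/ssl/
# From another node: client (2379) and peer (2380) ports must be reachable
nc -zv 192.168.10.100 2379
nc -zv 192.168.10.100 2380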
Fix for stale state:
ansible-playbook -i inventory/mycluster/inventory.ini reset.yml -b
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
Problem: Nodes Stuck NotReady
Symptom: kubectl get nodes shows NotReady status
Diagnose:
# Check kubelet
systemctl status kubelet
journalctl -u kubelet -f
# Check CNI
ls /etc/cni/net.d/
ls /opt/cni/bin/
# Check node conditions
kubectl describe node <node-name>
Common causes:
- CNI not installed – Check that the network_plugin role completed
- containerd not running – systemctl restart containerd
- kubelet misconfigured – Check /etc/kubernetes/kubelet-config.yaml (verification sketch below)
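A verification sketch after applying one of the fixes above (the CNI pod name depends on your network plugin, e.g. calico-node or kube-flannel):
# Restart the runtime and kubelet, then confirm the node recovers
systemctl restart containerd
systemctl restart kubelet
# Confirm the CNI pod on this node is Running, then re-check the node
kubectl get pods -n kube-system -o wide | grep <node-name>
kubectl get nodes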
Problem: “No hosts matched”
Symptom:
[WARNING]: Could not match supplied host pattern, ignoring: etcd
skipping: no hosts matched
Cause: Inventory path or syntax error
Fix:
# Use file path, not directory
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
# Verify inventory parses correctly
ansible -i inventory/mycluster/inventory.ini etcd --list-hosts
ansible -i inventory/mycluster/inventory.ini kube_control_plane --list-hosts
Problem: Container Runtime Not Running
Symptom:
[ERROR CRI]: container runtime is not running:
"transport: Error while dialing dial unix /var/run/containerd/containerd.sock:
connect: no such file or directory"
Fix:
# Check containerd
systemctl status containerd
journalctl -u containerd
# Restart if needed
systemctl restart containerd
# Verify socket exists
ls -la /var/run/containerd/containerd.sock
Problem: Certificate Errors
Symptom:
x509: certificate has expired or is not yet valid
Diagnose:
# Check cert expiration
kubeadm certs check-expiration
# Check specific cert
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
Fix: See kubespray-certificates skill for renewal procedures.
Reset Procedure
When the deployment is corrupted beyond repair:
# Full reset - removes all K8s components
ansible-playbook -i inventory/mycluster/inventory.ini reset.yml -b
# Confirm with "yes" when prompted
# After reset, verify clean state
systemctl status kubelet # should be inactive
ls /etc/kubernetes/ # should be empty/minimal
# Redeploy
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
Note: Reset removes etcd data. All cluster state is lost.
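If you want to keep the etcd data for post-mortem analysis, snapshot it before running reset.yml. A minimal sketch, assuming etcdctl is available on the etcd node and Kubespray's default certificate layout (cert file names are per-node; adjust to match yours):
ETCDCTL_API=3 etcdctl snapshot save /root/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-<node>.pem \
  --key=/etc/ssl/etcd/ssl/admin-<node>-key.pem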
Log Locations
| Component | Log Command |
|---|---|
| etcd | journalctl -u etcd |
| kubelet | journalctl -u kubelet |
| containerd | journalctl -u containerd |
| API server | kubectl logs -n kube-system kube-apiserver-<node> |
| Ansible | Run with -vvv for debug output |
Re-running After Failure
Ansible is idempotent – safe to re-run after fixing issues:
# Re-run full playbook (skips completed tasks)
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b
# Re-run specific tags only
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b --tags etcd
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b --tags network
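To retry only the host that failed, --limit can shorten the run (the host name below is an example); note that many Kubespray tasks need facts from the etcd and control-plane hosts, so fall back to the full inventory if the limited run fails on undefined variables:
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b --limit k8s-w1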