kubernetes-troubleshooting
Install command
npx skills add https://github.com/latestaiagents/agent-skills --skill kubernetes-troubleshooting
Skill Documentation
Kubernetes Troubleshooting
Systematic approaches to diagnose and fix common Kubernetes issues.
Troubleshooting Framework
1. What's the symptom? (pod not starting, service unreachable, etc.)
2. Where's the problem? (pod, service, ingress, node, cluster)
3. What do the events say?
4. What do the logs say?
5. What changed recently?
Pod Issues
Pod Status Quick Reference
| Status | Meaning | First Check |
|---|---|---|
| Pending | Can’t be scheduled | kubectl describe pod |
| ContainerCreating | Image pulling or volume mounting | Events, kubectl get events |
| CrashLoopBackOff | Container crashes repeatedly | kubectl logs --previous |
| ImagePullBackOff | Can’t pull container image | Image name, credentials |
| Error | Container exited with error | kubectl logs |
| OOMKilled | Killed for exceeding memory limit | kubectl describe pod, then raise memory limits |
| Evicted | Node under pressure | Node resources, pod priority |
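The table above can be folded into a small dispatch helper. `first_check` is a hypothetical name, and it only prints the suggested first command for a given status:

```shell
#!/bin/sh
# Hypothetical helper: map a pod status (as shown by `kubectl get pods`)
# to the first diagnostic command from the table above.
first_check() {
  case "$1" in
    Pending)           echo "kubectl describe pod" ;;
    ContainerCreating) echo "kubectl get events" ;;
    CrashLoopBackOff)  echo "kubectl logs --previous" ;;
    ImagePullBackOff)  echo "kubectl describe pod" ;;  # check image name, credentials
    Error)             echo "kubectl logs" ;;
    OOMKilled)         echo "kubectl describe pod" ;;  # check memory limits
    Evicted)           echo "kubectl describe node" ;; # check node pressure
    *)                 echo "kubectl describe pod" ;;
  esac
}

first_check CrashLoopBackOff
```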
Debugging Commands
# Get pod status
kubectl get pod <pod-name> -o wide
# Describe pod (events, conditions)
kubectl describe pod <pod-name>
# Get logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container> # specific container
kubectl logs <pod-name> --previous # previous crash
# Execute into pod
kubectl exec -it <pod-name> -- /bin/sh
# Get all events sorted by time
kubectl get events --sort-by='.lastTimestamp'
CrashLoopBackOff
Symptoms: Pod restarts repeatedly
Common Causes:
- Application error on startup
- Missing config/secrets
- Liveness probe failing too soon
- Resource limits too low
- Dependency not ready
Debug Steps:
1. kubectl logs <pod> --previous
2. kubectl describe pod <pod> # check events
3. Check liveness probe configuration
4. Check resource limits
5. Verify ConfigMaps/Secrets exist
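A quick way to spot the probe-related causes above is to grep the Events section of `kubectl describe pod`. The heredoc below is made-up sample output standing in for a real pod:

```shell
#!/bin/sh
# Sketch: scan describe-pod events for the usual CrashLoopBackOff culprits.
# The sample events below stand in for real `kubectl describe pod` output.
events=$(cat <<'EOF'
Warning  Unhealthy  2m  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 500
Warning  BackOff    1m  kubelet  Back-off restarting failed container
EOF
)

# A liveness probe failing right after startup often means
# initialDelaySeconds is too low for the app's boot time.
if echo "$events" | grep -q "Liveness probe failed"; then
  echo "suspect: liveness probe (check initialDelaySeconds / failureThreshold)"
fi
```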
ImagePullBackOff
Symptoms: Container image can't be pulled
Common Causes:
- Image doesn't exist
- Wrong image name/tag
- Private registry, missing credentials
- Registry rate limiting
- Network issues
Debug Steps:
1. Verify image name: kubectl describe pod <pod>
2. Try pulling manually: docker pull <image>
3. Check imagePullSecrets in pod spec
4. Verify secret exists: kubectl get secret <secret-name>
5. Check registry status
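Step 1 above often ends the investigation: a malformed image reference is a frequent cause. A minimal sketch (hypothetical `check_image_ref` helper) that flags the most common typos; note a registry host with a port but no tag would need extra handling:

```shell
#!/bin/sh
# Hypothetical helper: sanity-check an image reference before chasing
# registry or credential problems.
check_image_ref() {
  case "$1" in
    "")   echo "invalid: empty reference"; return 1 ;;
    *:)   echo "invalid: trailing colon, empty tag"; return 1 ;;
    *:*)  echo "ok: explicitly tagged" ;;
    *)    echo "warning: no tag, Kubernetes will pull :latest" ;;
  esac
}

check_image_ref "nginx:"      # common copy/paste mistake
check_image_ref "nginx:1.25"
```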
Pending Pods
Symptoms: Pod stuck in Pending state
Common Causes:
- Insufficient resources (CPU/memory)
- No nodes match nodeSelector/affinity
- PVC can't be bound
- Taint with no toleration
- ResourceQuota exceeded
Debug Steps:
1. kubectl describe pod <pod> # check Events
2. kubectl get nodes -o wide # check node capacity
3. kubectl describe node <node> # check allocatable
4. kubectl get pvc # check volume claims
5. kubectl get resourcequota # check quotas
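The "insufficient resources" case comes down to simple arithmetic: the pod's request must fit into allocatable minus what is already requested. A sketch with made-up numbers; on a real cluster they come from `kubectl describe node` (Allocatable and Allocated resources):

```shell
#!/bin/sh
# Sketch of the scheduler's fit check behind "Insufficient cpu".
# Values are an assumed example, in millicores.
allocatable_mcpu=4000   # node allocatable: 4 CPU
requested_mcpu=3800     # already requested by scheduled pods
pod_request_mcpu=500    # the Pending pod's CPU request

free_mcpu=$((allocatable_mcpu - requested_mcpu))
if [ "$pod_request_mcpu" -gt "$free_mcpu" ]; then
  echo "unschedulable: pod wants ${pod_request_mcpu}m, node has ${free_mcpu}m free"
fi
```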
Service & Networking Issues
Service Not Working
# Check service exists and has endpoints
kubectl get svc <service>
kubectl get endpoints <service>
# If no endpoints, check selector matches pods
kubectl get pods -l <selector-from-service>
# Test from inside cluster
kubectl run test --rm -it --image=busybox -- wget -qO- <service>:<port>
# Check DNS resolution
kubectl run test --rm -it --image=busybox -- nslookup <service>
Debugging Checklist
- [ ] Service exists and has correct port
- [ ] Endpoints exist (pods are selected)
- [ ] Pod selector labels match
- [ ] Pods are Running and Ready
- [ ] Container is listening on correct port
- [ ] NetworkPolicy isn't blocking traffic
- [ ] DNS resolves correctly
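The selector items in the checklist above hinge on one rule: every key/value pair in the service selector must be present on the pod. A sketch with hypothetical label sets; on a real cluster, compare `kubectl get svc <svc> -o jsonpath='{.spec.selector}'` against the pod's labels:

```shell
#!/bin/sh
# Sketch: check that each service selector pair appears in the pod's labels.
# Label sets below are assumed examples.
selector="app=web tier=frontend"
pod_labels="app=web tier=backend version=v2"

match=yes
for kv in $selector; do
  case " $pod_labels " in
    *" $kv "*) ;;                          # selector pair present on the pod
    *) match=no; echo "missing on pod: $kv" ;;
  esac
done
echo "selected: $match"   # "no" -> the service will have no endpoints
```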
Ingress Issues
# Check ingress configuration
kubectl describe ingress <ingress-name>
# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# Verify backend service
kubectl get svc <backend-service>
# Check TLS secret
kubectl get secret <tls-secret>
Deployment Issues
Deployment Not Rolling Out
# Check rollout status
kubectl rollout status deployment/<name>
# Check deployment events
kubectl describe deployment <name>
# Check replicaset
kubectl get rs -l app=<name>
kubectl describe rs <replicaset-name>
# Rollback if needed
kubectl rollout undo deployment/<name>
Common Deployment Problems
Symptom: New pods not creating
Check:
- ResourceQuota limits
- PodDisruptionBudget blocking
- Node capacity
Symptom: Old pods not terminating
Check:
- terminationGracePeriodSeconds
- PreStop hooks stuck
- Finalizers blocking deletion
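For the finalizer case above: a resource stuck in Terminating usually clears once its finalizers are removed. The sketch below only assembles and prints the commands (POD/NS are placeholders); clearing finalizers bypasses whatever cleanup they guard, so treat it as a last resort:

```shell
#!/bin/sh
# Sketch: inspect, then clear, finalizers on a stuck pod.
# POD and NS are placeholder values; the script only prints the commands.
POD=stuck-pod
NS=default
PATCH='{"metadata":{"finalizers":null}}'

# Inspect first:
echo "kubectl get pod $POD -n $NS -o jsonpath='{.metadata.finalizers}'"
# Then, only if you understand what the finalizers guard, clear them:
echo "kubectl patch pod $POD -n $NS -p '$PATCH'"
```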
Symptom: Rollout stuck
Check:
- maxUnavailable settings
- Readiness probe never passes
- PVC can't be detached
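The maxUnavailable interaction above is just rounding: a percentage is converted to a pod count and rounded down, and a pod whose readiness probe never passes consumes that budget. A sketch with assumed values:

```shell
#!/bin/sh
# Sketch: how a percentage maxUnavailable becomes a pod count.
# Example values; real ones come from the Deployment's rollingUpdate strategy.
replicas=4
max_unavailable_pct=25

# 4 * 25 / 100 = 1: at most one pod may be unavailable. If that one pod's
# readiness probe never passes, the rollout stalls with zero budget left.
budget=$((replicas * max_unavailable_pct / 100))
echo "unavailable budget: $budget pod(s)"
```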
Node Issues
Node Not Ready
# Check node status
kubectl get nodes
kubectl describe node <node>
# Check node conditions
kubectl get node <node> -o jsonpath='{.status.conditions[*].type}'
# Check kubelet logs (on node)
journalctl -u kubelet -f
# Drain node if needed
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
Resource Pressure
# Check node resources
kubectl top nodes
# Check which pods are using resources
kubectl top pods --all-namespaces
# Find pods on specific node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node>
Quick Diagnostic Commands
# Overall cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Specific namespace health
kubectl get all -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>
# Network debugging pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash
Systematic Debug Template
## Issue: [Brief description]
### Symptom
[What's happening]
### Affected Resources
- Namespace:
- Deployment/Pod:
- Service:
### Investigation
#### Step 1: Check Status
kubectl get pod
kubectl describe pod
Findings: [...]
#### Step 2: Check Logs
kubectl logs
Findings: [...]
#### Step 3: Check Events
kubectl get events --sort-by='.lastTimestamp'
Findings: [...]
### Root Cause
[What caused the issue]
### Resolution
[What fixed it]
### Prevention
[How to prevent recurrence]
Emergency Procedures
Force Delete Stuck Pod
# Only use when pod is truly stuck
kubectl delete pod <pod> --grace-period=0 --force
Emergency Rollback
# Immediate rollback
kubectl rollout undo deployment/<name>
# Rollback to specific revision
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name> --to-revision=<n>
Scale Down Quickly
# Scale to zero
kubectl scale deployment/<name> --replicas=0
# Scale back up
kubectl scale deployment/<name> --replicas=3