debug:kubernetes
npx skills add https://github.com/snakeo/claude-debug-and-refactor-skills-plugin --skill debug:kubernetes
Skill Documentation
Kubernetes Debugging Guide
A systematic approach to diagnosing and resolving Kubernetes issues. Always start with the basics: check events and logs first.
Common Error Patterns
CrashLoopBackOff
What it means: Container repeatedly crashes and fails to start. Kubernetes restarts it with exponential backoff (10s, 20s, 40s… up to 5 minutes).
Common causes:
- Insufficient memory/CPU resources
- Missing dependencies in container image
- Misconfigured liveness/readiness probes
- Application code errors or misconfigurations
- Missing environment variables or secrets
Debug steps:
# Check pod events and status
kubectl describe pod <pod-name> -n <namespace>
# View current and previous container logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
# Check resource limits vs actual usage
kubectl top pod <pod-name> -n <namespace>
Solutions:
- Tune probe initialDelaySeconds and timeoutSeconds
- Increase resource limits if hitting memory/CPU caps
- Fix missing dependencies in Dockerfile
- Review application startup code for errors
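As a sketch of the probe and resource fixes above, a hypothetical deployment fragment (container name, image, ports, and all values are placeholders to adapt):

```yaml
# Hypothetical container spec; tune values to your application's startup time and footprint.
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.2.3
      resources:
        requests:
          memory: "256Mi"
          cpu: "250m"
        limits:
          memory: "512Mi"       # raise if the container is hitting its memory cap
          cpu: "500m"
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 30 # give slow-starting apps time before the first check
        timeoutSeconds: 5
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 10
        timeoutSeconds: 3
```

A liveness probe that fires before the app finishes booting is a classic CrashLoopBackOff trigger: the kubelet kills a healthy-but-slow container, and the restart loop begins.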
ImagePullBackOff
What it means: Kubernetes cannot pull the container image. Retries with increasing delay (5s, 10s, 20s… up to 5 minutes).
Common causes:
- Incorrect image name or tag
- Missing registry authentication credentials
- Private registry without imagePullSecrets configured
- Network connectivity issues to registry
- Image does not exist in registry
Debug steps:
# Check pod events for specific error
kubectl describe pod <pod-name> -n <namespace>
# Verify image name in deployment
kubectl get deployment <name> -n <namespace> -o yaml | grep image:
# Check if imagePullSecrets are configured
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A5 imagePullSecrets
# Test pulling image from node (if you have node access)
docker pull <image-name>
Solutions:
- Correct image name/tag in deployment spec
- Create and attach imagePullSecret for private registries
- Verify network access to container registry
- Check registry credentials haven’t expired
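For the private-registry case, a minimal sketch (registry URL, secret name, and namespace are placeholders):

```yaml
# Create the credential first, e.g.:
#   kubectl create secret docker-registry regcred \
#     --docker-server=registry.example.com \
#     --docker-username=<user> --docker-password=<password> \
#     -n <namespace>
# Then reference it from the pod spec so the kubelet can authenticate the pull:
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: registry.example.com/app:1.2.3
```

Note the secret is namespace-scoped: it must exist in the same namespace as the pod that references it.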
Pending Pods (Scheduling Failures)
What it means: Pod cannot be scheduled to any node.
Common causes:
- Insufficient cluster resources (CPU/memory)
- Node selectors or affinity rules cannot be satisfied
- Taints without matching tolerations
- PersistentVolumeClaim not bound
- Resource quotas exceeded
Debug steps:
# Check why pod is pending
kubectl describe pod <pod-name> -n <namespace>
# View cluster events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl top nodes
# Check PVC status if using persistent storage
kubectl get pvc -n <namespace>
Solutions:
- Scale up cluster or reduce resource requests
- Adjust nodeSelector/affinity rules
- Add tolerations for node taints
- Create or fix PersistentVolume bindings
- Increase namespace resource quotas
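To illustrate the scheduling fixes above, a hypothetical pod spec fragment (label keys, values, and the taint are placeholders):

```yaml
# Hypothetical scheduling constraints; both must be satisfiable by at least one node.
spec:
  nodeSelector:
    disktype: ssd            # must match a label actually present on some node
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"   # permits scheduling onto nodes tainted dedicated=batch:NoSchedule
```

A toleration only permits placement on a tainted node; it does not force it. If you need the pod to land on specific nodes, combine it with a nodeSelector or node affinity.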
OOMKilled
What it means: Container was forcefully terminated (SIGKILL, exit code 137) for exceeding memory limit.
Common causes:
- Memory limit set too low for application
- Memory leak in application code
- Processing large files or datasets
- High concurrency causing memory spikes
- JVM/runtime heap misconfiguration
Debug steps:
# Check termination reason
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Last State"
# View logs before termination
kubectl logs <pod-name> -n <namespace> --previous
# Check memory limits vs usage
kubectl top pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A5 resources:
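The exit code shown under Last State decodes mechanically: codes above 128 mean the process died from a signal (code minus 128). SIGKILL is signal 9, hence 137 for OOMKilled. This can be reproduced locally with no cluster:

```shell
# Exit codes above 128 encode a fatal signal: code = 128 + signal number.
# SIGKILL is 9, so a SIGKILLed process (like an OOM-killed container) reports 137.
sh -c 'kill -KILL $$'
echo "exit code: $?"    # prints "exit code: 137" on POSIX shells
```

The same arithmetic explains other codes you may see, e.g. 143 (128 + 15, SIGTERM) for a pod that was terminated gracefully.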
Solutions:
- Increase memory limits in deployment spec
- Profile application for memory leaks
- Configure application memory settings (e.g., JVM -Xmx)
- Implement memory-efficient processing patterns
- Add horizontal pod autoscaling for load distribution
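For JVM workloads specifically, a hypothetical fragment showing the heap-below-limit pattern (names and sizes are placeholders):

```yaml
# Cap the JVM heap below the container limit so the runtime fails with an
# OutOfMemoryError it can log, rather than being SIGKILLed by the kernel.
spec:
  containers:
    - name: app
      resources:
        limits:
          memory: "1Gi"
      env:
        - name: JAVA_TOOL_OPTIONS
          value: "-Xmx768m"  # leave headroom for metaspace, threads, and native memory
```

Modern JVMs can also derive the heap from the container limit via -XX:MaxRAMPercentage instead of a fixed -Xmx.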
Service Not Reachable
What it means: Cannot connect to service from within or outside cluster.
Common causes:
- Service selector doesn’t match pod labels
- Pod not ready (failing readiness probe)
- NetworkPolicy blocking traffic
- Service port mismatch with container port
- Ingress/LoadBalancer misconfiguration
Debug steps:
# Check service and endpoints
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
# Verify pod labels match service selector
kubectl get pods -n <namespace> --show-labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep -A5 selector
# Test connectivity from within cluster
kubectl run debug --rm -it --restart=Never --image=busybox -- wget -qO- http://<service-name>.<namespace>:<port>
# Check network policies
kubectl get networkpolicy -n <namespace>
Solutions:
- Fix service selector to match pod labels
- Ensure pods are passing readiness probes
- Update NetworkPolicy to allow required traffic
- Verify port configurations match
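The selector and port relationships above can be sketched with a hypothetical Service (names and ports are placeholders):

```yaml
# The selector must exactly match the pod template labels, and targetPort
# must match the port the container actually listens on.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp          # pods must carry this exact label to appear in the endpoints
  ports:
    - port: 80          # port clients connect to on the Service
      targetPort: 8080  # containerPort inside the pods
```

If kubectl get endpoints shows no addresses, the selector matches no ready pods; that single check distinguishes a label mismatch from a network problem.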
PVC Binding Failures
What it means: PersistentVolumeClaim cannot bind to a PersistentVolume.
Common causes:
- No PV available matching PVC requirements
- StorageClass not found or misconfigured
- Access mode mismatch (RWO vs RWX)
- Storage capacity insufficient
- Zone/region constraints not met
Debug steps:
# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
# List available PVs
kubectl get pv
# Check StorageClass
kubectl get storageclass
kubectl describe storageclass <name>
# View provisioner events
kubectl get events -n <namespace> --field-selector reason=ProvisioningFailed
Solutions:
- Create matching PersistentVolume manually
- Fix StorageClass name or create required class
- Adjust access mode or capacity requirements
- Enable dynamic provisioning if available
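A minimal PVC sketch tying the fixes together (class name, size, and claim name are placeholders):

```yaml
# storageClassName must name an existing StorageClass, and the accessModes
# must be supported by the underlying provisioner.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  storageClassName: standard
  accessModes:
    - ReadWriteOnce     # ReadWriteMany requires a provisioner that supports shared access
  resources:
    requests:
      storage: 10Gi
```

An access-mode mismatch is a frequent silent failure: many block-storage provisioners only support ReadWriteOnce, so a ReadWriteMany claim stays Pending forever.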
RBAC Permission Denied
What it means: Service account lacks required permissions for API operations.
Common causes:
- Missing Role or ClusterRole
- RoleBinding not created for service account
- Wrong namespace for RoleBinding
- Insufficient permissions in Role
Debug steps:
# Check what service account pod uses
kubectl get pod <pod-name> -n <namespace> -o yaml | grep serviceAccountName
# Test permissions
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<sa-name>
# List roles and bindings
kubectl get roles,rolebindings -n <namespace>
kubectl get clusterroles,clusterrolebindings | grep <relevant-name>
# Describe specific binding
kubectl describe rolebinding <name> -n <namespace>
Solutions:
- Create Role/ClusterRole with required permissions
- Create RoleBinding/ClusterRoleBinding
- Verify binding references correct service account
- Use namespace-scoped roles when possible
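As an illustration of the Role/RoleBinding fix, a hypothetical pair granting a service account read access to pods (all names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: my-namespace
rules:
  - apiGroups: [""]              # "" is the core API group (pods, services, ...)
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: my-namespace        # a RoleBinding lives in the same namespace as its Role
subjects:
  - kind: ServiceAccount
    name: my-sa
    namespace: my-namespace
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

After applying, verify with the can-i check from the debug steps: kubectl auth can-i list pods --as=system:serviceaccount:my-namespace:my-sa -n my-namespace.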
Debugging Tools Reference
kubectl describe
Get detailed information about any resource including events.
kubectl describe pod <pod-name> -n <namespace>
kubectl describe node <node-name>
kubectl describe svc <service-name> -n <namespace>
kubectl describe deployment <name> -n <namespace>
kubectl logs
View container stdout/stderr logs.
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> -c <container-name> # specific container
kubectl logs <pod-name> -n <namespace> --previous # previous instance
kubectl logs <pod-name> -n <namespace> -f # follow/stream
kubectl logs <pod-name> -n <namespace> --tail=100 # last 100 lines
kubectl logs -l app=myapp -n <namespace> # by label selector
kubectl exec
Execute commands inside running container.
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
kubectl exec <pod-name> -n <namespace> -- cat /etc/config/app.conf
kubectl exec <pod-name> -n <namespace> -c <container> -- env # specific container
kubectl debug
Debug nodes or pods with ephemeral containers (K8s 1.23+).
# Debug a running pod with ephemeral container
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container>
# Debug a node
kubectl debug node/<node-name> -it --image=busybox
# Create debug copy of pod with different image
kubectl debug <pod-name> -it --copy-to=debug-pod --container=app --image=busybox
kubectl get events
View cluster events for troubleshooting.
kubectl get events -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl get events -n <namespace> --field-selector type=Warning
kubectl get events -n <namespace> -w # watch for new events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20 # cluster-wide recent
kubectl top
View resource usage metrics (requires metrics-server).
kubectl top pods -n <namespace>
kubectl top pods -n <namespace> --sort-by=memory
kubectl top nodes
kubectl top pod <pod-name> -n <namespace> --containers
The Four Phases of Kubernetes Debugging
Phase 1: Gather Information
Start broad, then narrow down. Never assume the cause.
# Quick status overview
kubectl get pods,svc,deploy,rs -n <namespace>
# Recent events (often reveals the issue immediately)
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Describe the problematic resource
kubectl describe <resource-type> <name> -n <namespace>
Phase 2: Check Logs and Metrics
Logs reveal application-level issues; metrics reveal resource issues.
# Application logs
kubectl logs <pod-name> -n <namespace> --tail=200
kubectl logs <pod-name> -n <namespace> --previous # if crashed
# Resource metrics
kubectl top pod <pod-name> -n <namespace>
kubectl top nodes
Phase 3: Interactive Investigation
Get inside the environment when logs aren’t enough.
# Shell into the container
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Use ephemeral debug container
kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot
# Check connectivity from inside
wget -qO- http://service-name:port
nslookup service-name
curl -v http://endpoint
Phase 4: Validate and Fix
Make changes, verify they work, document the solution.
# Apply fix
kubectl apply -f fixed-manifest.yaml
# Watch for success
kubectl get pods -n <namespace> -w
kubectl get events -n <namespace> -w
# Verify health
kubectl logs <pod-name> -n <namespace> -f
Quick Reference Commands
Essential One-Liners
# Get all pods with their status across namespaces
kubectl get pods -A -o wide
# Find pods not in Running state
kubectl get pods -A --field-selector=status.phase!=Running
# Get pod restart counts
kubectl get pods -n <namespace> -o=custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'
# Show pod resource requests and limits
kubectl get pods -n <namespace> -o=custom-columns='NAME:.metadata.name,MEM_REQ:.spec.containers[0].resources.requests.memory,MEM_LIM:.spec.containers[0].resources.limits.memory'
# Get events sorted by time (most recent last)
kubectl get events --sort-by='.lastTimestamp' -n <namespace>
# Watch pods in real-time
kubectl get pods -n <namespace> -w
# Get logs from all pods with label
kubectl logs -l app=myapp -n <namespace> --all-containers=true
# Check endpoints for a service
kubectl get endpoints <service-name> -n <namespace>
# Test DNS resolution
kubectl run dnstest --rm -it --image=busybox --restart=Never -- nslookup <service-name>.<namespace>
# Check RBAC permissions
kubectl auth can-i --list --as=system:serviceaccount:<namespace>:<sa-name>
Debugging Network Issues
# Run network debug pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash
# Test service connectivity
kubectl run curl --rm -it --image=curlimages/curl -- curl -v http://<service>:<port>
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Trace network path
kubectl exec -it <pod> -- traceroute <destination>
Debugging Storage Issues
# List all PVCs with status
kubectl get pvc -A
# Describe PVC for binding issues
kubectl describe pvc <name> -n <namespace>
# Check storage provisioner logs
kubectl logs -n kube-system -l app=<provisioner-name>
# Verify mount inside pod
kubectl exec -it <pod> -n <namespace> -- df -h
kubectl exec -it <pod> -n <namespace> -- ls -la /path/to/mount
Useful Debug Images
| Image | Use Case |
|---|---|
| busybox | Basic shell, networking tools |
| nicolaka/netshoot | Comprehensive network debugging |
| curlimages/curl | HTTP testing |
| alpine | Minimal Linux with package manager |
| gcr.io/kubernetes-e2e-test-images/jessie-dnsutils | DNS debugging |
Prevention Best Practices
- Always set resource requests and limits – Prevents noisy neighbor issues and OOMKilled
- Configure proper health probes – Liveness, readiness, and startup probes with appropriate delays
- Use namespaces – Isolate workloads for easier debugging
- Label everything – Makes filtering and selection reliable
- Implement monitoring – Prometheus, Grafana, ELK stack for visibility
- Validate manifests before deployment – Use kubeval, kube-linter, or similar tools
- Use GitOps – Track changes and enable rollback
- Document runbooks – For common issues specific to your applications