k8s-debug
Install command
npx skills add https://github.com/martin-janci/claude-marketplace --skill k8s-debug
Kubernetes Troubleshooting
Diagnostic Commands Cheatsheet
# Quick cluster health
kubectl get nodes
kubectl get pods -A | grep -Ev 'Running|Completed'
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Namespace health
kubectl get all -n <ns>
kubectl get events -n <ns> --sort-by='.lastTimestamp'
kubectl top pods -n <ns>
Pod Status Troubleshooting
CrashLoopBackOff
Symptoms: Pod restarts repeatedly, status shows CrashLoopBackOff
Diagnosis:
# Check exit code and reason
kubectl describe pod <pod> -n <ns> | grep -A10 "Last State"
# View logs from crashed container
kubectl logs <pod> -n <ns> --previous
# Check if OOMKilled
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
Common Causes:
- Application crash (check logs)
- OOMKilled (increase memory limits)
- Liveness probe too aggressive
- Missing dependencies/config
- Permission issues
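The exit code reported under Last State helps narrow down which of these causes applies. The following is a minimal sketch, assuming a hypothetical helper named `explain_exit_code` (it is not part of kubectl). Codes above 128 follow the shell convention of 128 plus the signal number, and the hint text is illustrative only.

```shell
# explain_exit_code: map a container exit code to a likely cause.
# Hypothetical helper; the hint wording is an assumption, not kubectl output.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited normally (the command finished; check restartPolicy)" ;;
    1)   echo "application error (check logs)" ;;
    126) echo "command found but not executable (check permissions)" ;;
    127) echo "command not found (check image entrypoint/args)" ;;
    137) echo "killed by SIGKILL (often OOMKilled)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)
      # 128 + N means the process died from signal N
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-specific exit code $code"
      fi
      ;;
  esac
}

# Usage (placeholders as elsewhere in this document):
#   code=$(kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```

The mapping is deliberately coarse: it points at where to look next, and the container logs remain the authoritative source.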
Fixes:
# Increase memory if OOMKilled
kubectl set resources deployment/<name> -n <ns> --limits=memory=1Gi
# Disable liveness probe temporarily for debugging
kubectl edit deployment/<name> -n <ns> # Remove livenessProbe
# Shell into the container while it is running (exec fails between restarts)
kubectl exec -it <pod> -n <ns> -- /bin/sh
ImagePullBackOff
Symptoms: Pod stuck in ImagePullBackOff or ErrImagePull
Diagnosis:
kubectl describe pod <pod> -n <ns> | grep -A5 "Events"
# Look for: "Failed to pull image" messages
Common Causes:
- Image doesn’t exist (typo in name/tag)
- Private registry without imagePullSecrets
- Registry authentication failed
- Network issues to registry
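A frequent source of name/tag typos is an image reference with no tag at all, which resolves to `latest`. The sketch below shows which tag a reference resolves to. It assumes a hypothetical helper named `image_tag` and handles only the common `[registry[:port]/]path[:tag][@digest]` shape.

```shell
# image_tag: print the tag of an image reference, or "latest" if none is given.
# Hypothetical helper for spotting untagged or mistyped references.
image_tag() {
  local ref="${1%%@*}"      # drop any @sha256:... digest
  local last="${ref##*/}"   # last path component, so a registry port is ignored
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;
  esac
}

# Usage:
#   image_tag "$(kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].image}')"
```

If this prints `latest` for an image that was meant to be pinned, the tag is probably missing from the manifest.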
Fixes:
# Verify image exists
docker pull <image>
# Check imagePullSecrets
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.imagePullSecrets}'
# Create registry secret
kubectl create secret docker-registry regcred \
--docker-server=<registry> \
--docker-username=<user> \
--docker-password=<pass> \
-n <ns>
# Add to deployment
kubectl patch deployment <name> -n <ns> -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'
Pending
Symptoms: Pod stuck in Pending state
Diagnosis:
kubectl describe pod <pod> -n <ns> | grep -A10 "Events"
# Look for scheduling failure reasons
Common Causes:
- Insufficient CPU/memory
- Node selector/affinity mismatch
- Taint without toleration
- PVC not bound
- Resource quota exceeded
Fixes:
# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
# Check taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check PVC status
kubectl get pvc -n <ns>
# Check resource quota
kubectl describe resourcequota -n <ns>
CreateContainerConfigError
Symptoms: Pod in CreateContainerConfigError state
Diagnosis:
kubectl describe pod <pod> -n <ns> | grep -A5 "Warning"
Common Causes:
- ConfigMap doesn’t exist
- Secret doesn’t exist
- Volume mount issues
Fixes:
# Check if configmap exists
kubectl get configmap <name> -n <ns>
# Check if secret exists
kubectl get secret <name> -n <ns>
# List expected volumes
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.volumes[*].name}'
OOMKilled
Symptoms: Container terminated with OOMKilled reason, exit code 137
Diagnosis:
kubectl describe pod <pod> -n <ns> | grep -A5 "Last State"
kubectl top pod <pod> -n <ns>
# Check limits
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].resources}'
Fixes:
# Increase memory limit
kubectl set resources deployment/<name> -n <ns> --limits=memory=2Gi
# Or patch (the replace op fails if no memory limit is set yet)
kubectl patch deployment <name> -n <ns> --type='json' -p='[
{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"2Gi"}
]'
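Before raising a limit, it can help to see how close current usage is to it. The sketch below compares a usage figure from `kubectl top pod` with the configured limit. The helper names `to_mib` and `memory_usage_pct` are hypothetical, and only integer quantities with a Ki, Mi, or Gi suffix are handled.

```shell
# to_mib: convert a memory quantity with a Ki, Mi, or Gi suffix to whole MiB.
# Assumption: integer values only; other Kubernetes suffixes are rejected.
to_mib() {
  local q="$1"
  case "$q" in
    *Gi) echo $(( ${q%Gi} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} )) ;;
    *Ki) echo $(( ${q%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

# memory_usage_pct: usage as an integer percentage of the limit (rounded down).
memory_usage_pct() {
  local used limit
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}

# Usage:
#   memory_usage_pct 480Mi 512Mi   # prints 93
```

A pod that sits persistently above roughly 90% is a plausible OOMKilled candidate. That threshold is a rule of thumb rather than a Kubernetes setting.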
Service & Networking Issues
Service Not Reaching Pods
Diagnosis:
# Check endpoints
kubectl get endpoints <service> -n <ns>
# Empty endpoints = selector mismatch or no ready pods
# Check service selector
kubectl get svc <service> -n <ns> -o jsonpath='{.spec.selector}'
# Check pod labels
kubectl get pods -n <ns> --show-labels
# Test from within cluster
kubectl run debug --rm -it --restart=Never --image=busybox -n <ns> -- wget -qO- <service>:<port>
Common Causes:
- Selector doesn’t match pod labels
- No ready pods (readiness probe failing)
- Wrong port configuration
- Network policy blocking traffic
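A selector mismatch can be checked mechanically: every key=value pair in the service selector must appear among the pod's labels. The sketch below assumes a hypothetical helper named `selector_matches` that takes both sides as comma-separated key=value lists, the form printed in the SELECTOR column of `kubectl get svc -o wide` and by `--show-labels`.

```shell
# selector_matches: succeed if every selector pair is present in the pod labels.
# Hypothetical helper; arguments are comma-separated key=value lists.
selector_matches() {
  local selector="$1" labels=",$2,"
  local pair
  local IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                                   # pair present
      *) echo "missing label: $pair"; return 1 ;;       # first mismatch
    esac
  done
  return 0
}

# Usage:
#   selector_matches "app=web,tier=api" "app=web,tier=frontend"
#   # prints "missing label: tier=api" and returns 1
```

Wrapping the label list in commas makes the comparison exact, so `app=web` does not accidentally match `app=webserver`.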
Fixes:
# Fix selector
kubectl patch svc <service> -n <ns> -p '{"spec":{"selector":{"app":"correct-label"}}}'
# Check readiness
kubectl get pods -n <ns> -o wide # Check READY column
kubectl describe pod <pod> -n <ns> | grep -A10 "Readiness"
DNS Issues
Diagnosis:
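# Test DNS from a pod (busybox:1.28 is pinned because nslookup in later busybox images is known to misbehave)
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup <service>.<namespace>.svc.cluster.local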
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
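When a lookup fails, trying the service name at each level of qualification can help separate search-path problems from resolver problems. The sketch below generates the names to try. The helper name `service_dns_names` is hypothetical, and `cluster.local` is assumed as the cluster domain unless a third argument overrides it.

```shell
# service_dns_names: list the DNS names for a service, least to most qualified.
# Hypothetical helper; cluster.local is an assumed default cluster domain.
service_dns_names() {
  local svc="$1" ns="$2" domain="${3:-cluster.local}"
  printf '%s\n' "$svc" "$svc.$ns" "$svc.$ns.svc" "$svc.$ns.svc.$domain"
}

# Usage (inside a debug pod):
#   for name in $(service_dns_names api prod); do nslookup "$name"; done
```

If only the fully qualified name resolves, the pod's DNS search path in `/etc/resolv.conf` is the likely culprit rather than CoreDNS itself.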
Ingress Not Working
Diagnosis:
# Check ingress status
kubectl describe ingress <name> -n <ns>
# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# Verify backend service
kubectl get svc <backend-service> -n <ns>
kubectl get endpoints <backend-service> -n <ns>
Volume Issues
PVC Stuck in Pending
Diagnosis:
kubectl describe pvc <name> -n <ns>
kubectl get storageclass
kubectl get pv
Common Causes:
- No matching StorageClass
- StorageClass provisioner not working
- Zone/region mismatch
- Capacity exceeded
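In a namespace with many claims, it can be quicker to filter for the ones that are not bound. The sketch below reads the default table output of `kubectl get pvc -n <ns>` on stdin. The helper name `unbound_pvcs` is hypothetical, and it assumes the default column order with STATUS in the second column.

```shell
# unbound_pvcs: print claims whose STATUS column is not "Bound".
# Hypothetical helper; expects the default `kubectl get pvc -n <ns>` table on stdin.
unbound_pvcs() {
  awk 'NR > 1 && $2 != "Bound" { print $1 " (" $2 ")" }'
}

# Usage:
#   kubectl get pvc -n <ns> | unbound_pvcs
```

Each name it prints is a candidate for `kubectl describe pvc`, where the events usually state why provisioning or binding failed.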
Volume Mount Failures
Diagnosis:
kubectl describe pod <pod> -n <ns> | grep -A10 "Volumes"
kubectl describe pod <pod> -n <ns> | grep -A10 "Mounts"
kubectl get events -n <ns> | grep -i volume
Node Issues
Node NotReady
Diagnosis:
kubectl describe node <node> | grep -A10 "Conditions"
kubectl get events --field-selector involvedObject.name=<node>
# Check kubelet logs (on node)
journalctl -u kubelet -n 100
Common Causes:
- Kubelet not running
- Network issues
- Disk pressure
- Memory pressure
- Too many pods
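To find which nodes need this diagnosis in a larger cluster, the STATUS column of `kubectl get nodes` can be filtered. The sketch below assumes a hypothetical helper named `not_ready_nodes` and the default table layout. Cordoned but healthy nodes (`Ready,SchedulingDisabled`) are treated as ready.

```shell
# not_ready_nodes: print nodes whose STATUS does not start with "Ready".
# Hypothetical helper; expects the default `kubectl get nodes` table on stdin.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready(,|$)/ { print $1 " (" $2 ")" }'
}

# Usage:
#   kubectl get nodes | not_ready_nodes
```

Anchoring the pattern at the start of the field matters here, because `NotReady` also contains the substring `Ready`.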
Drain & Cordon
# Mark node unschedulable
kubectl cordon <node>
# Drain (evict pods)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# Uncordon
kubectl uncordon <node>
Resource Debugging
Check Resource Pressure
# Node resources
kubectl top nodes
kubectl describe nodes | grep -A5 "Allocated resources"
# Pod resources
kubectl top pods -n <ns> --sort-by=memory
kubectl top pods -n <ns> --sort-by=cpu
# Resource quotas
kubectl describe resourcequota -n <ns>
kubectl describe limitrange -n <ns>
Find Resource Hogs
# Top memory consumers
kubectl top pods -A --sort-by=memory | head -20
# Top CPU consumers
kubectl top pods -A --sort-by=cpu | head -20
# Pods without limits
kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits == null)) | .metadata.namespace + "/" + .metadata.name'
Debug Containers
Ephemeral Debug Container (k8s 1.25+)
# Add debug container to running pod
kubectl debug -it <pod> -n <ns> --image=busybox --target=<container>
# Debug with network tools
kubectl debug -it <pod> -n <ns> --image=nicolaka/netshoot --target=<container>
Debug Pod Copy
# Create copy with different command
kubectl debug <pod> -n <ns> --copy-to=debug-pod --container=<container> -- sleep infinity
# Then exec into it
kubectl exec -it debug-pod -n <ns> -- /bin/sh
Standalone Debug Pod
# Network debugging
kubectl run netshoot --rm -it --image=nicolaka/netshoot -n <ns> -- /bin/bash
# Common debug commands inside:
curl -v http://<service>:<port>/
nslookup <service>
ping <pod-ip>
traceroute <service>
nc -zv <service> <port>
Log Analysis
# Follow logs
kubectl logs -f <pod> -n <ns>
# All containers
kubectl logs <pod> -n <ns> --all-containers
# Previous container (after crash)
kubectl logs <pod> -n <ns> --previous
# Since time
kubectl logs <pod> -n <ns> --since=1h
kubectl logs <pod> -n <ns> --since-time="2024-01-01T00:00:00Z"
# Last N lines only
kubectl logs <pod> -n <ns> --tail=100
# Multiple pods
kubectl logs -l app=<label> -n <ns> --all-containers
# With timestamps
kubectl logs <pod> -n <ns> --timestamps
Events
# Namespace events
kubectl get events -n <ns> --sort-by='.lastTimestamp'
# Watch events
kubectl get events -n <ns> -w
# Filter warnings
kubectl get events -n <ns> --field-selector type=Warning
# Pod-specific events
kubectl get events -n <ns> --field-selector involvedObject.name=<pod>
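In a noisy namespace, counting warnings by reason gives a quicker overview than reading the events one by one. The sketch below assumes a hypothetical helper named `warning_reasons` and the default column order of `kubectl get events -n <ns>` (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE). With `-A`, an extra NAMESPACE column shifts the fields and the helper would need adjusting.

```shell
# warning_reasons: count Warning events per reason, most frequent first.
# Hypothetical helper; expects the default `kubectl get events -n <ns>` table on stdin.
warning_reasons() {
  awk 'NR > 1 && $2 == "Warning" { count[$3]++ }
       END { for (r in count) print count[r], r }' | sort -rn
}

# Usage:
#   kubectl get events -n <ns> | warning_reasons
```

The most frequent reason is usually the best place to start, for example `BackOff` pointing at a crash loop or `FailedMount` at a volume problem.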
Quick Diagnostic Script
#!/usr/bin/env bash
# k8s-diag.sh - Quick namespace diagnostic
NS="${1:?Usage: $0 <namespace>}"
echo "=== Namespace: $NS ==="
echo ""
echo "--- Pod Status ---"
kubectl get pods -n "$NS" -o wide
echo ""
echo "--- Non-Running Pods ---"
kubectl get pods -n "$NS" | grep -Ev 'Running|Completed'
echo ""
echo "--- Recent Events ---"
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | tail -20
echo ""
echo "--- Resource Usage ---"
kubectl top pods -n "$NS" 2>/dev/null || echo "Metrics not available"
echo ""
echo "--- Services ---"
kubectl get svc -n "$NS"
echo ""
echo "--- Endpoints ---"
kubectl get endpoints -n "$NS"
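The grep-based filter in the script matches text anywhere on the line and also lets through pods that are Running but not ready. A column-aware alternative is sketched below. The helper name `unhealthy_pods` is hypothetical, and it assumes the default `kubectl get pods -n <ns>` layout (NAME, READY, STATUS, ...).

```shell
# unhealthy_pods: print pods that are not Completed and are either
# not Running or Running with fewer ready containers than expected.
# Hypothetical helper; expects the default `kubectl get pods -n <ns>` table on stdin.
unhealthy_pods() {
  awk 'NR == 1 { next }                    # skip the header row
       $3 == "Completed" { next }          # finished Jobs are fine
       {
         split($2, ready, "/")             # READY column, e.g. 0/1
         if ($3 != "Running" || ready[1] != ready[2])
           print $1, $2, $3
       }'
}

# Usage:
#   kubectl get pods -n "$NS" | unhealthy_pods
```

Including the READY check surfaces pods whose readiness probe is failing, which a plain STATUS filter would report as healthy.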
Common Fixes Quick Reference
| Problem | Quick Fix |
|---|---|
| CrashLoopBackOff | Run kubectl logs <pod> --previous, then fix the application |
| ImagePullBackOff | Verify image name, create imagePullSecret |
| Pending (resources) | Scale down other pods or add nodes |
| Pending (PVC) | Check StorageClass and provisioner |
| OOMKilled | Increase memory limits |
| Readiness failing | Check probe endpoint, increase timeout |
| No endpoints | Fix service selector to match pod labels |
| DNS failure | Check CoreDNS pods in kube-system |
| Permission denied | Check RBAC, ServiceAccount |