debug:kubernetes
npx skills add https://github.com/snakeo/claude-debug-and-refactor-skills-plugin --skill debug:kubernetes
Skill Documentation
Kubernetes Debugging Guide
A systematic approach to diagnosing and resolving Kubernetes issues. Always start with the basics: check events and logs first.
Common Error Patterns
CrashLoopBackOff
What it means: Container repeatedly crashes and fails to start. Kubernetes restarts it with exponential backoff (10s, 20s, 40s… up to 5 minutes).
Common causes:
- Insufficient memory/CPU resources
- Missing dependencies in container image
- Misconfigured liveness/readiness probes
- Application code errors or misconfigurations
- Missing environment variables or secrets
Debug steps:
# Check pod events and status
kubectl describe pod <pod-name> -n <namespace>
# View current and previous container logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
# Check resource limits vs actual usage
kubectl top pod <pod-name> -n <namespace>
Solutions:
- Tune probe initialDelaySeconds and timeoutSeconds
- Increase resource limits if hitting memory/CPU caps
- Fix missing dependencies in Dockerfile
- Review application startup code for errors
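As a sketch of the probe and resource fixes above, a hypothetical deployment fragment (container name, image, ports, and all values are placeholders to adapt):

```yaml
# Hypothetical container spec; tune values to your application's startup time and footprint.
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.2.3
      resources:
        requests:
          memory: "256Mi"
          cpu: "250m"
        limits:
          memory: "512Mi"       # raise if the container is hitting its memory cap
          cpu: "500m"
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 30 # give slow-starting apps time before the first check
        timeoutSeconds: 5
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 10
        timeoutSeconds: 3
```

A liveness probe that fires before the app finishes booting is a classic CrashLoopBackOff trigger: the kubelet kills a healthy-but-slow container, and the restart loop begins.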
ImagePullBackOff
What it means: Kubernetes cannot pull the container image. Retries with increasing delay (5s, 10s, 20s… up to 5 minutes).
Common causes:
- Incorrect image name or tag
- Missing registry authentication credentials
- Private registry without imagePullSecrets configured
- Network connectivity issues to registry
- Image does not exist in registry
Debug steps:
# Check pod events for specific error
kubectl describe pod <pod-name> -n <namespace>
# Verify image name in deployment
kubectl get deployment <name> -n <namespace> -o yaml | grep image:
# Check if imagePullSecrets are configured
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A5 imagePullSecrets
# Test pulling image from node (if you have node access)
docker pull <image-name>
Solutions:
- Correct image name/tag in deployment spec
- Create and attach imagePullSecret for private registries
- Verify network access to container registry
- Check registry credentials haven’t expired
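For the private-registry case, a minimal sketch (registry URL, secret name, and namespace are placeholders):

```yaml
# Create the credential first, e.g.:
#   kubectl create secret docker-registry regcred \
#     --docker-server=registry.example.com \
#     --docker-username=<user> --docker-password=<password> \
#     -n <namespace>
# Then reference it from the pod spec so the kubelet can authenticate the pull:
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: registry.example.com/app:1.2.3
```

Note the secret is namespace-scoped: it must exist in the same namespace as the pod that references it.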
Pending Pods (Scheduling Failures)
What it means: Pod cannot be scheduled to any node.
Common causes:
- Insufficient cluster resources (CPU/memory)
- Node selectors or affinity rules cannot be satisfied
- Taints without matching tolerations
- PersistentVolumeClaim not bound
- Resource quotas exceeded
Debug steps:
# Check why pod is pending
kubectl describe pod <pod-name> -n <namespace>
# View cluster events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl top nodes
# Check PVC status if using persistent storage
kubectl get pvc -n <namespace>
Solutions:
- Scale up cluster or reduce resource requests
- Adjust nodeSelector/affinity rules
- Add tolerations for node taints
- Create or fix PersistentVolume bindings
- Increase namespace resource quotas
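To illustrate the scheduling fixes above, a hypothetical pod spec fragment (label keys, values, and the taint are placeholders):

```yaml
# Hypothetical scheduling constraints; both must be satisfiable by at least one node.
spec:
  nodeSelector:
    disktype: ssd            # must match a label actually present on some node
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"   # permits scheduling onto nodes tainted dedicated=batch:NoSchedule
```

A toleration only permits placement on a tainted node; it does not force it. If you need the pod to land on specific nodes, combine it with a nodeSelector or node affinity.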
OOMKilled
What it means: Container was forcefully terminated (SIGKILL, exit code 137) for exceeding memory limit.
Common causes:
- Memory limit set too low for application
- Memory leak in application code
- Processing large files or datasets
- High concurrency causing memory spikes
- JVM/runtime heap misconfiguration
Debug steps:
# Check termination reason
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Last State"
# View logs before termination
kubectl logs <pod-name> -n <namespace> --previous
# Check memory limits vs usage
kubectl top pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A5 resources:
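The exit code shown under Last State decodes mechanically: codes above 128 mean the process died from a signal (code minus 128). SIGKILL is signal 9, hence 137 for OOMKilled. This can be reproduced locally with no cluster:

```shell
# Exit codes above 128 encode a fatal signal: code = 128 + signal number.
# SIGKILL is 9, so a SIGKILLed process (like an OOM-killed container) reports 137.
sh -c 'kill -KILL $$'
echo "exit code: $?"    # prints "exit code: 137" on POSIX shells
```

The same arithmetic explains other codes you may see, e.g. 143 (128 + 15, SIGTERM) for a pod that was terminated gracefully.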
Solutions:
- Increase memory limits in deployment spec
- Profile application for memory leaks
- Configure application memory settings (e.g., JVM -Xmx)
- Implement memory-efficient processing patterns
- Add horizontal pod autoscaling for load distribution
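For JVM workloads specifically, a hypothetical fragment showing the heap-below-limit pattern (names and sizes are placeholders):

```yaml
# Cap the JVM heap below the container limit so the runtime fails with an
# OutOfMemoryError it can log, rather than being SIGKILLed by the kernel.
spec:
  containers:
    - name: app
      resources:
        limits:
          memory: "1Gi"
      env:
        - name: JAVA_TOOL_OPTIONS
          value: "-Xmx768m"  # leave headroom for metaspace, threads, and native memory
```

Modern JVMs can also derive the heap from the container limit via -XX:MaxRAMPercentage instead of a fixed -Xmx.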
Service Not Reachable
What it means: Cannot connect to service from within or outside cluster.
Common causes:
- Service selector doesn’t match pod labels
- Pod not ready (failing readiness probe)
- NetworkPolicy blocking traffic
- Service port mismatch with container port
- Ingress/LoadBalancer misconfiguration
Debug steps:
# Check service and endpoints
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
# Verify pod labels match service selector
kubectl get pods -n <namespace> --show-labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep -A5 selector
# Test connectivity from within cluster
kubectl run debug --rm -it --restart=Never --image=busybox -- wget -qO- http://<service-name>.<namespace>:<port>
# Check network policies
kubectl get networkpolicy -n <namespace>
Solutions:
- Fix service selector to match pod labels
- Ensure pods are passing readiness probes
- Update NetworkPolicy to allow required traffic
- Verify port configurations match
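The selector and port relationships above can be sketched with a hypothetical Service (names and ports are placeholders):

```yaml
# The selector must exactly match the pod template labels, and targetPort
# must match the port the container actually listens on.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp          # pods must carry this exact label to appear in the endpoints
  ports:
    - port: 80          # port clients connect to on the Service
      targetPort: 8080  # containerPort inside the pods
```

If kubectl get endpoints shows no addresses, the selector matches no ready pods; that single check distinguishes a label mismatch from a network problem.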
PVC Binding Failures
What it means: PersistentVolumeClaim cannot bind to a PersistentVolume.
Common causes:
- No PV available matching PVC requirements
- StorageClass not found or misconfigured
- Access mode mismatch (RWO vs RWX)
- Storage capacity insufficient
- Zone/region constraints not met
Debug steps:
# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
# List available PVs
kubectl get pv
# Check StorageClass
kubectl get storageclass
kubectl describe storageclass <name>
# View provisioner events
kubectl get events -n <namespace> --field-selector reason=ProvisioningFailed
Solutions:
- Create matching PersistentVolume manually
- Fix StorageClass name or create required class
- Adjust access mode or capacity requirements
- Enable dynamic provisioning if available
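A minimal PVC sketch tying the fixes together (class name, size, and claim name are placeholders):

```yaml
# storageClassName must name an existing StorageClass, and the accessModes
# must be supported by the underlying provisioner.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  storageClassName: standard
  accessModes:
    - ReadWriteOnce     # ReadWriteMany requires a provisioner that supports shared access
  resources:
    requests:
      storage: 10Gi
```

An access-mode mismatch is a frequent silent failure: many block-storage provisioners only support ReadWriteOnce, so a ReadWriteMany claim stays Pending forever.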
RBAC Permission Denied
What it means: Service account lacks required permissions for API operations.
Common causes:
- Missing Role or ClusterRole
- RoleBinding not created for service account
- Wrong namespace for RoleBinding
- Insufficient permissions in Role
Debug steps:
# Check what service account pod uses
kubectl get pod <pod-name> -n <namespace> -o yaml | grep serviceAccountName
# Test permissions
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<sa-name>
# List roles and bindings
kubectl get roles,rolebindings -n <namespace>
kubectl get clusterroles,clusterrolebindings | grep <relevant-name>
# Describe specific binding
kubectl describe rolebinding <name> -n <namespace>
Solutions:
- Create Role/ClusterRole with required permissions
- Create RoleBinding/ClusterRoleBinding
- Verify binding references correct service account
- Use namespace-scoped roles when possible
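As an illustration of the Role/RoleBinding fix, a hypothetical pair granting a service account read access to pods (all names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: my-namespace
rules:
  - apiGroups: [""]              # "" is the core API group (pods, services, ...)
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: my-namespace        # a RoleBinding lives in the same namespace as its Role
subjects:
  - kind: ServiceAccount
    name: my-sa
    namespace: my-namespace
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

After applying, verify with the can-i check from the debug steps: kubectl auth can-i list pods --as=system:serviceaccount:my-namespace:my-sa -n my-namespace.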
Debugging Tools Reference
kubectl describe
Get detailed information about any resource including events.
kubectl describe pod <pod-name> -n <namespace>
kubectl describe node <node-name>
kubectl describe svc <service-name> -n <namespace>
kubectl describe deployment <name> -n <namespace>
kubectl logs
View container stdout/stderr logs.
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> -c <container-name> # specific container
kubectl logs <pod-name> -n <namespace> --previous # previous instance
kubectl logs <pod-name> -n <namespace> -f # follow/stream
kubectl logs <pod-name> -n <namespace> --tail=100 # last 100 lines
kubectl logs -l app=myapp -n <namespace> # by label selector
kubectl exec
Execute commands inside running container.
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
kubectl exec <pod-name> -n <namespace> -- cat /etc/config/app.conf
kubectl exec <pod-name> -n <namespace> -c <container> -- env # specific container
kubectl debug
Debug nodes or pods with ephemeral containers (K8s 1.23+).
# Debug a running pod with ephemeral container
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container>
# Debug a node
kubectl debug node/<node-name> -it --image=busybox
# Create debug copy of pod with different image
kubectl debug <pod-name> -it --copy-to=debug-pod --container=app --image=busybox
kubectl get events
View cluster events for troubleshooting.
kubectl get events -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl get events -n <namespace> --field-selector type=Warning
kubectl get events -n <namespace> -w # watch for new events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20 # cluster-wide recent
kubectl top
View resource usage metrics (requires metrics-server).
kubectl top pods -n <namespace>
kubectl top pods -n <namespace> --sort-by=memory
kubectl top nodes
kubectl top pod <pod-name> -n <namespace> --containers
The Four Phases of Kubernetes Debugging
Phase 1: Gather Information
Start broad, then narrow down. Never assume the cause.
# Quick status overview
kubectl get pods,svc,deploy,rs -n <namespace>
# Recent events (often reveals the issue immediately)
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Describe the problematic resource
kubectl describe <resource-type> <name> -n <namespace>
Phase 2: Check Logs and Metrics
Logs reveal application-level issues; metrics reveal resource issues.
# Application logs
kubectl logs <pod-name> -n <namespace> --tail=200
kubectl logs <pod-name> -n <namespace> --previous # if crashed
# Resource metrics
kubectl top pod <pod-name> -n <namespace>
kubectl top nodes
Phase 3: Interactive Investigation
Get inside the environment when logs aren’t enough.
# Shell into the container
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Use ephemeral debug container
kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot
# Check connectivity from inside
wget -qO- http://service-name:port
nslookup service-name
curl -v http://endpoint
Phase 4: Validate and Fix
Make changes, verify they work, document the solution.
# Apply fix
kubectl apply -f fixed-manifest.yaml
# Watch for success
kubectl get pods -n <namespace> -w
kubectl get events -n <namespace> -w
# Verify health
kubectl logs <pod-name> -n <namespace> -f
Quick Reference Commands
Essential One-Liners
# Get all pods with their status across namespaces
kubectl get pods -A -o wide
# Find pods not in Running state
kubectl get pods -A --field-selector=status.phase!=Running
# Get pod restart counts
kubectl get pods -n <namespace> -o=custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'
# Show pod resource requests and limits
kubectl get pods -n <namespace> -o=custom-columns='NAME:.metadata.name,MEM_REQ:.spec.containers[0].resources.requests.memory,MEM_LIM:.spec.containers[0].resources.limits.memory'
# Get events sorted by time (most recent last)
kubectl get events --sort-by='.lastTimestamp' -n <namespace>
# Watch pods in real-time
kubectl get pods -n <namespace> -w
# Get logs from all pods with label
kubectl logs -l app=myapp -n <namespace> --all-containers=true
# Check endpoints for a service
kubectl get endpoints <service-name> -n <namespace>
# Test DNS resolution
kubectl run dnstest --rm -it --image=busybox --restart=Never -- nslookup <service-name>.<namespace>
# Check RBAC permissions
kubectl auth can-i --list --as=system:serviceaccount:<namespace>:<sa-name>
Debugging Network Issues
# Run network debug pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash
# Test service connectivity
kubectl run curl --rm -it --image=curlimages/curl -- curl -v http://<service>:<port>
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Trace network path
kubectl exec -it <pod> -- traceroute <destination>
Debugging Storage Issues
# List all PVCs with status
kubectl get pvc -A
# Describe PVC for binding issues
kubectl describe pvc <name> -n <namespace>
# Check storage provisioner logs
kubectl logs -n kube-system -l app=<provisioner-name>
# Verify mount inside pod
kubectl exec -it <pod> -n <namespace> -- df -h
kubectl exec -it <pod> -n <namespace> -- ls -la /path/to/mount
Useful Debug Images
| Image | Use Case |
|---|---|
| busybox | Basic shell, networking tools |
| nicolaka/netshoot | Comprehensive network debugging |
| curlimages/curl | HTTP testing |
| alpine | Minimal Linux with package manager |
| gcr.io/kubernetes-e2e-test-images/jessie-dnsutils | DNS debugging |
Prevention Best Practices
- Always set resource requests and limits – Prevents noisy neighbor issues and OOMKilled
- Configure proper health probes – Liveness, readiness, and startup probes with appropriate delays
- Use namespaces – Isolate workloads for easier debugging
- Label everything – Makes filtering and selection reliable
- Implement monitoring – Prometheus, Grafana, ELK stack for visibility
- Validate manifests before deployment – Use kubeval, kube-linter, or similar tools
- Use GitOps – Track changes and enable rollback
- Document runbooks – For common issues specific to your applications