kubernetes-troubleshooting

📁 latestaiagents/agent-skills 📅 8 days ago

Install command

npx skills add https://github.com/latestaiagents/agent-skills --skill kubernetes-troubleshooting


Kubernetes Troubleshooting

Systematic approaches to diagnose and fix common Kubernetes issues.

Troubleshooting Framework

1. What's the symptom? (pod not starting, service unreachable, etc.)
2. Where's the problem? (pod, service, ingress, node, cluster)
3. What do the events say?
4. What do the logs say?
5. What changed recently?
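The five questions above can be walked in order for a single pod. The following is a minimal sketch, not a definitive tool: `k8s_triage` is a hypothetical helper name, and listing ReplicaSets by creation time is only one assumed way to answer "what changed recently".

```shell
# Hypothetical helper: walk the framework questions for one pod.
# Usage: k8s_triage <namespace> <pod>
k8s_triage() (
  ns="$1"; pod="$2"
  echo "== 1-2. Symptom and location =="
  kubectl get pod "$pod" -n "$ns" -o wide
  echo "== 3. Events =="
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by='.lastTimestamp'
  echo "== 4. Logs (last 50 lines) =="
  kubectl logs "$pod" -n "$ns" --tail=50
  echo "== 5. Recent changes (newest ReplicaSets last) =="
  kubectl get rs -n "$ns" --sort-by='.metadata.creationTimestamp'
)
```

A failing step (for example, no logs yet) does not stop the later steps, so the output still covers the remaining questions.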

Pod Issues

Pod Status Quick Reference

Status	Meaning	First Check
Pending	Can’t be scheduled	`kubectl describe pod`
ContainerCreating	Image pulling or volume mounting	Events, `kubectl get events`
CrashLoopBackOff	Container crashes repeatedly	`kubectl logs --previous`
ImagePullBackOff	Can’t pull container image	Image name, credentials
Error	Container exited with error	`kubectl logs`
OOMKilled	Container exceeded its memory limit	Memory limits vs. actual usage
Evicted	Node under pressure	Node resources, pod priority
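The table can also be kept as a small lookup for scripts or shell prompts. This sketch simply restates the rows above; `first_check` is a hypothetical name, and statuses not listed in the table fall through to a generic suggestion.

```shell
# Hypothetical lookup: map a pod status from the table to its first check.
first_check() {
  case "$1" in
    Pending)           echo "kubectl describe pod <pod-name>" ;;
    ContainerCreating) echo "kubectl get events --sort-by='.lastTimestamp'" ;;
    CrashLoopBackOff)  echo "kubectl logs <pod-name> --previous" ;;
    ImagePullBackOff)  echo "check image name and registry credentials" ;;
    Error)             echo "kubectl logs <pod-name>" ;;
    OOMKilled)         echo "check memory limits" ;;
    Evicted)           echo "check node resources and pod priority" ;;
    *)                 echo "status not in table; start with kubectl describe pod <pod-name>" ;;
  esac
}
```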

Debugging Commands

# Get pod status
kubectl get pod <pod-name> -o wide

# Describe pod (events, conditions)
kubectl describe pod <pod-name>

# Get logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container>  # specific container
kubectl logs <pod-name> --previous       # previous crash

# Execute into pod
kubectl exec -it <pod-name> -- /bin/sh

# Get all events sorted by time
kubectl get events --sort-by='.lastTimestamp'

CrashLoopBackOff

Symptoms: Pod restarts repeatedly
Common Causes:
├── Application error on startup
├── Missing config/secrets
├── Liveness probe failing too soon
├── Resource limits too low
└── Dependency not ready

Debug Steps:
1. kubectl logs <pod> --previous
2. kubectl describe pod <pod>  # check events
3. Check liveness probe configuration
4. Check resource limits
5. Verify ConfigMaps/Secrets exist
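The exit code of the previous container run often narrows down which of the causes above applies. The sketch below assumes a single-container pod (it reads only `containerStatuses[0]`); both function names are hypothetical, and the interpretations are common conventions (128 + signal number), not guarantees.

```shell
# Print the exit code of the last terminated run of the first container.
last_exit_code() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Hypothetical helper: translate an exit code into a likely next step.
explain_exit_code() {
  case "$1" in
    0)       echo "clean exit; the main process may not be long-running" ;;
    1)       echo "application error; read kubectl logs --previous" ;;
    126|127) echo "command not executable or not found; check command/args and image" ;;
    137)     echo "SIGKILL (128+9); commonly OOMKilled, check memory limits" ;;
    143)     echo "SIGTERM (128+15); the container was asked to stop, check liveness probe" ;;
    *)       echo "exit code $1; read kubectl logs --previous" ;;
  esac
}

# Usage:
#   explain_exit_code "$(last_exit_code <pod>)"
```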

ImagePullBackOff

Symptoms: Container image can't be pulled
Common Causes:
├── Image doesn't exist
├── Wrong image name/tag
├── Private registry, missing credentials
├── Registry rate limiting
└── Network issues

Debug Steps:
1. Verify image name: kubectl describe pod <pod>
2. Try pulling manually: docker pull <image> (or crictl pull <image> on a containerd node)
3. Check imagePullSecrets in pod spec
4. Verify secret exists: kubectl get secret <secret-name>
5. Check registry status
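Steps 3 and 4 can be combined: read the `imagePullSecrets` the pod references and confirm each one exists in the same namespace. This is a sketch under that assumption; `check_pull_secrets` is a hypothetical name, and it does not validate the credentials inside the secret.

```shell
# Hypothetical helper: verify every imagePullSecret a pod references exists.
# Usage: check_pull_secrets <namespace> <pod>
check_pull_secrets() (
  ns="$1"; pod="$2"
  secrets="$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.imagePullSecrets[*].name}')"
  if [ -z "$secrets" ]; then
    echo "no imagePullSecrets set on pod $pod"
  else
    # jsonpath prints the names space-separated, so word splitting is intended
    for s in $secrets; do
      if kubectl get secret "$s" -n "$ns" >/dev/null 2>&1; then
        echo "ok: $s"
      else
        echo "missing: $s"
      fi
    done
  fi
)
```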

Pending Pods

Symptoms: Pod stuck in Pending state
Common Causes:
├── Insufficient resources (CPU/memory)
├── No nodes match nodeSelector/affinity
├── PVC can't be bound
├── Taint with no toleration
└── ResourceQuota exceeded

Debug Steps:
1. kubectl describe pod <pod>  # check Events
2. kubectl get nodes -o wide   # check node capacity
3. kubectl describe node <node> # check allocatable
4. kubectl get pvc              # check volume claims
5. kubectl get resourcequota    # check quotas
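The event message from step 1 usually names the cause directly. The following sketch maps message text to the causes listed above; `classify_pending` is a hypothetical helper, and the keywords are assumptions based on typical scheduler and quota messages, whose exact wording varies between Kubernetes versions.

```shell
# Hypothetical helper: read event text on stdin, print every likely cause found.
classify_pending() (
  msg="$(cat)"
  match() { printf '%s\n' "$msg" | grep -Eqi "$1"; }
  match "Insufficient"          && echo "insufficient resources"
  match "affinity|selector"     && echo "nodeSelector/affinity mismatch"
  match "taint"                 && echo "taint without toleration"
  match "PersistentVolumeClaim" && echo "unbound PVC"
  match "exceeded quota"        && echo "ResourceQuota exceeded"
  true
)

# Usage:
#   kubectl get events \
#     --field-selector involvedObject.name=<pod>,reason=FailedScheduling \
#     -o jsonpath='{.items[*].message}' | classify_pending
```

A single message can list several reasons, so the helper reports all matches rather than stopping at the first.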

Service & Networking Issues

Service Not Working

# Check service exists and has endpoints
kubectl get svc <service>
kubectl get endpoints <service>

# If no endpoints, check selector matches pods
kubectl get pods -l <selector-from-service>

# Test from inside cluster
kubectl run test --rm -it --image=busybox -- wget -qO- <service>:<port>

# Check DNS resolution (busybox:1.28 is pinned because nslookup in some newer busybox images is known to misbehave)
kubectl run test --rm -it --image=busybox:1.28 -- nslookup <service>
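The `<selector-from-service>` value does not have to be copied by hand. As a sketch, the helper below converts the selector map into the `key=value,key=value` form that `-l` accepts; `selector_to_labels` is a hypothetical name, and it assumes `kubectl` prints the selector as compact JSON, as recent versions do.

```shell
# Hypothetical helper: turn {"app":"web","tier":"frontend"} into app=web,tier=frontend.
# Label keys and values cannot contain ':' or ',', so plain character
# substitution is sufficient here.
selector_to_labels() {
  tr -d '{}" ' | tr ':' '='
}

# Usage:
#   sel="$(kubectl get svc <service> -o jsonpath='{.spec.selector}' | selector_to_labels)"
#   kubectl get pods -l "$sel" -o wide
```

If the second command returns no pods, the Service selector and the pod labels do not match.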

Debugging Checklist

□ Service exists and has correct port
□ Endpoints exist (pods are selected)
□ Pod selector labels match
□ Pods are Running and Ready
□ Container is listening on correct port
□ NetworkPolicy isn't blocking traffic
□ DNS resolves correctly
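The "Endpoints exist" and "Pods are Running and Ready" items can be checked together, because the Endpoints object lists ready and not-ready addresses separately. A minimal sketch, with `endpoint_summary` as a hypothetical helper name, run against the current namespace:

```shell
# Hypothetical helper: count ready and not-ready addresses behind a Service.
# Usage: endpoint_summary <service>
endpoint_summary() (
  svc="$1"
  count() { set -- $1; echo $#; }   # number of space-separated words
  ready="$(kubectl get endpoints "$svc" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"
  notready="$(kubectl get endpoints "$svc" \
    -o jsonpath='{.subsets[*].notReadyAddresses[*].ip}')"
  echo "ready=$(count "$ready") notReady=$(count "$notready")"
)
```

`ready=0 notReady=0` points at a selector mismatch, while `ready=0` with a non-zero `notReady` points at failing readiness probes.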

Ingress Issues

# Check ingress configuration
kubectl describe ingress <ingress-name>

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Verify backend service
kubectl get svc <backend-service>

# Check TLS secret
kubectl get secret <tls-secret>
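Beyond confirming that the TLS secret exists, its type is worth a look. The sketch below assumes the ingress controller expects a secret of type `kubernetes.io/tls` (holding `tls.crt` and `tls.key`), which is the usual convention; `check_tls_secret` is a hypothetical name and it checks the current namespace only.

```shell
# Hypothetical helper: report whether a secret exists and has the TLS type.
# Usage: check_tls_secret <secret-name>
check_tls_secret() (
  stype="$(kubectl get secret "$1" -o jsonpath='{.type}' 2>/dev/null)"
  case "$stype" in
    kubernetes.io/tls) echo "ok: $1 is a TLS secret" ;;
    "")                echo "missing: secret $1 not found in the current namespace" ;;
    *)                 echo "wrong type: $1 is $stype, expected kubernetes.io/tls" ;;
  esac
)
```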

Deployment Issues

Deployment Not Rolling Out

# Check rollout status
kubectl rollout status deployment/<name>

# Check deployment events
kubectl describe deployment <name>

# Check replicaset
kubectl get rs -l app=<name>  # assumes the pods carry an app=<name> label
kubectl describe rs <replicaset-name>

# Rollback if needed
kubectl rollout undo deployment/<name>
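The status check and the rollback can be chained so that a rollout that does not finish in time is undone automatically. This is a sketch under assumptions: `rollout_or_undo` is a hypothetical helper and the 120s default timeout is arbitrary. It relies on `kubectl rollout status` returning a non-zero exit code when `--timeout` expires.

```shell
# Hypothetical helper: wait for a rollout, roll back if it does not complete.
# Usage: rollout_or_undo <deployment> [timeout]
rollout_or_undo() (
  name="$1"; timeout="${2:-120s}"
  if kubectl rollout status "deployment/$name" --timeout="$timeout"; then
    echo "rollout of $name completed"
  else
    echo "rollout of $name did not complete within $timeout; rolling back"
    kubectl rollout undo "deployment/$name"
  fi
)
```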

Common Deployment Problems

Symptom: New pods not creating
Check:
├── ResourceQuota limits
├── PodDisruptionBudget blocking
└── Node capacity

Symptom: Old pods not terminating
Check:
├── terminationGracePeriodSeconds
├── PreStop hooks stuck
└── Finalizers blocking deletion

Symptom: Rollout stuck
Check:
├── maxUnavailable settings
├── Readiness probe never passes
└── PVC can't be detached

Node Issues

Node Not Ready

# Check node status
kubectl get nodes
kubectl describe node <node>

# Check node conditions
kubectl get node <node> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

# Check kubelet logs (on node)
journalctl -u kubelet -f

# Drain node if needed
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
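Node conditions are easier to scan when only the unhealthy ones are shown: `Ready` should be `True`, and the pressure-style conditions should be `False`. A minimal sketch, with `unhealthy_conditions` as a hypothetical filter that reads `Type=Status` lines on stdin:

```shell
# Hypothetical filter: print only node conditions that indicate a problem.
unhealthy_conditions() {
  awk -F= '
    NF != 2 { next }                                  # skip blank or malformed lines
    $1 == "Ready" && $2 != "True"  { print "Ready is " $2 }
    $1 != "Ready" && $2 != "False" { print $1 " is " $2 }
  '
}

# Usage:
#   kubectl get node <node> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | unhealthy_conditions
```

Empty output means every reported condition is in its healthy state.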

Resource Pressure

# Check node resources
kubectl top nodes

# Check which pods are using resources
kubectl top pods --all-namespaces

# Find pods on specific node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node>

Quick Diagnostic Commands

# Overall cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

# Specific namespace health
kubectl get all -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>

# Network debugging pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash
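`grep -v Running` hides pods that are Running but not Ready and still lists Completed jobs. As a sketch, the filter below keeps the header, drops Completed pods, and shows anything that is not Running or whose READY count is incomplete; `unhealthy_pods` is a hypothetical name, and it assumes the default column layout of `kubectl get pods --all-namespaces` (READY in column 3, STATUS in column 4).

```shell
# Hypothetical filter for `kubectl get pods --all-namespaces` output.
unhealthy_pods() {
  awk '
    NR == 1           { print; next }   # keep the header row
    $4 == "Completed" { next }          # finished jobs are not a problem
    { split($3, r, "/") }               # READY column, e.g. 0/1
    $4 != "Running" || r[1] != r[2] { print }
  '
}

# Usage:
#   kubectl get pods --all-namespaces | unhealthy_pods
```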

Systematic Debug Template

## Issue: [Brief description]

### Symptom
[What's happening]

### Affected Resources
- Namespace:
- Deployment/Pod:
- Service:

### Investigation

#### Step 1: Check Status

kubectl get pod <pod>
kubectl describe pod <pod>

Findings: [...]

#### Step 2: Check Logs

kubectl logs <pod>

Findings: [...]

#### Step 3: Check Events

kubectl get events --sort-by='.lastTimestamp'

Findings: [...]

### Root Cause
[What caused the issue]

### Resolution
[What fixed it]

### Prevention
[How to prevent recurrence]

Emergency Procedures

Force Delete Stuck Pod

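# Only use when the pod is truly stuck: this removes the pod from the API
# without waiting for confirmation that its containers have stopped
kubectl delete pod <pod> --grace-period=0 --force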

Emergency Rollback

# Immediate rollback
kubectl rollout undo deployment/<name>

# Rollback to specific revision
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name> --to-revision=<n>

Scale Down Quickly

# Scale to zero
kubectl scale deployment/<name> --replicas=0

# Scale back up (use the replica count recorded before scaling down)
kubectl scale deployment/<name> --replicas=<original-count>
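Scaling back up requires knowing the previous replica count, which is easy to forget during an incident. The sketch below records it before scaling to zero; both function names and the state file under `${TMPDIR:-/tmp}` are hypothetical choices, and it assumes no autoscaler is managing the Deployment.

```shell
# Hypothetical helpers: remember the replica count across an emergency scale-down.
# Usage: scale_down_saving <deployment>   then later   scale_restore <deployment>
scale_down_saving() (
  name="$1"; state="${TMPDIR:-/tmp}/replicas-$name"
  # Save the current desired count first; only scale down if that succeeded
  kubectl get deployment "$name" -o jsonpath='{.spec.replicas}' > "$state" &&
    kubectl scale "deployment/$name" --replicas=0
)

scale_restore() (
  name="$1"; state="${TMPDIR:-/tmp}/replicas-$name"
  kubectl scale "deployment/$name" --replicas="$(cat "$state")"
)
```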
