k8s-debug
9
总安装量
7
周安装量
#31580
全站排名
安装命令
npx skills add https://github.com/incidentfox/incidentfox --skill k8s-debug
Agent install distribution
- amp: 7
- gemini-cli: 7
- claude-code: 7
- github-copilot: 7
- codex: 7
- kimi-cli: 7
Skill documentation
Kubernetes Debugging Expertise
Golden Rule: Events Before Logs
When debugging Kubernetes issues, ALWAYS check events first:
- get_pod_events – Shows scheduling, pulling, starting, probes, OOM
- THEN get_pod_logs – Application-level errors
Events explain most crash/scheduling issues faster than logs.
Typical Investigation Flow
1. list_pods → Get overview of pod health in namespace
2. get_pod_events → Understand WHY pods are in their state
3. get_pod_logs → Only if events don't explain the issue
4. get_pod_resources → For performance/resource issues
5. describe_deployment → Check deployment status and conditions
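The events-before-logs ordering can be sketched as a small driver. This is only an illustration: the tool call signature (namespace and pod name in, lines of text out) and the `SELF_EXPLANATORY` marker set are assumptions, not the skill's actual interface.

```python
from typing import Callable, Dict, List

# Hypothetical tool signature: (namespace, pod) -> lines of text.
# The real skill tools may accept different arguments.
Tool = Callable[[str, str], List[str]]

# Assumed markers that usually explain the failure without reading logs.
SELF_EXPLANATORY = {"OOMKilled", "FailedScheduling", "ImagePullBackOff", "Evicted"}


def investigate(namespace: str, pod: str, tools: Dict[str, Tool]) -> List[str]:
    """Return the ordered list of tools consulted for one unhealthy pod."""
    consulted = ["get_pod_events"]
    events = tools["get_pod_events"](namespace, pod)
    explained = any(
        marker in line for line in events for marker in SELF_EXPLANATORY
    )
    if not explained:
        # Logs are read only when the events leave the cause open.
        consulted.append("get_pod_logs")
        tools["get_pod_logs"](namespace, pod)
    return consulted
```

Passing the tools in as a dictionary keeps the sketch testable with stand-in functions and avoids guessing how the real tools are invoked.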
Common Issue Patterns
CrashLoopBackOff
First check: get_pod_events
| Event Reason | Likely Cause | Next Step |
|---|---|---|
| OOMKilled | Memory limit too low or memory leak | Check get_pod_resources, increase limits |
| Error | Application crash | Check get_pod_logs for stack trace |
| BackOff | Repeated failures | Check logs for startup errors |
Checklist:
- Memory limits vs actual usage
- Recent deployment changes (get_deployment_history)
- Missing config/secrets
- Dependency failures (database, external services)
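The CrashLoopBackOff table above can be encoded as a lookup so that triage is repeatable. A minimal sketch; the function name `next_step` and the fallback for unknown reasons are assumptions added for illustration.

```python
from typing import Tuple

# Mirrors the CrashLoopBackOff table: event reason -> (likely cause, next tool).
CRASHLOOP_TRIAGE = {
    "OOMKilled": ("Memory limit too low or memory leak", "get_pod_resources"),
    "Error": ("Application crash", "get_pod_logs"),
    "BackOff": ("Repeated failures", "get_pod_logs"),
}


def next_step(event_reason: str) -> Tuple[str, str]:
    """Return (likely cause, next tool) for a CrashLoopBackOff event reason."""
    # Falling back to logs for unknown reasons is an assumption, not part of the table.
    return CRASHLOOP_TRIAGE.get(event_reason, ("Unknown", "get_pod_logs"))
```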
OOMKilled
First check: get_pod_events (confirms OOMKilled)
Then: get_pod_resources (compare usage to limits)
Common causes:
- Memory limit set too low for workload
- Memory leak (usage increases over time)
- Sudden traffic spike causing memory pressure
- Large request payloads cached in memory
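Comparing usage to limits means converting Kubernetes memory quantities such as `512Mi` into bytes first. The sketch below handles only the common binary and decimal suffixes; the helper names are hypothetical, and it assumes get_pod_resources reports values in the standard quantity notation.

```python
import re

_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
_MULTIPLIERS = {**_BINARY, **_DECIMAL}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '1G' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = match.groups()
    # A missing suffix means the value is already in bytes.
    return int(float(number) * _MULTIPLIERS.get(suffix, 1))


def memory_headroom(usage: str, limit: str) -> float:
    """Fraction of the limit still unused; values near 0.0 suggest OOM risk."""
    return 1.0 - parse_memory(usage) / parse_memory(limit)
```

A headroom that shrinks steadily across samples points toward a leak, while a sudden drop is more consistent with a traffic spike or large cached payloads.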
ImagePullBackOff
First check: get_pod_events
Common causes:
- Wrong image name or tag
- Private registry without imagePullSecrets
- Rate limiting from registry
- Network issues reaching registry
Pending Pods
First check: get_pod_events
Look for:
- FailedScheduling – Insufficient resources
- Unschedulable – Node affinity/taints
- No matching nodes for nodeSelector
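A FailedScheduling message can be bucketed into the causes listed above with simple substring checks. This is a rough sketch: scheduler message wording varies between Kubernetes versions, so the keywords and the function name `classify_pending` are assumptions.

```python
def classify_pending(message: str) -> str:
    """Bucket a FailedScheduling event message into a likely cause."""
    text = message.lower()
    if "insufficient" in text:
        return "insufficient resources"
    if "taint" in text:
        return "taint not tolerated"
    if "affinity" in text or "selector" in text:
        return "affinity or nodeSelector mismatch"
    # Anything else needs a manual read of the full event message.
    return "unknown"
```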
Readiness/Liveness Probe Failures
First check: describe_pod (shows probe config)
Then: get_pod_events (probe failure events)
Then: get_pod_logs (why endpoint isn’t responding)
Evicted Pods
First check: get_pod_events
Causes:
- Node resource pressure (disk, memory)
- Priority preemption
- Taint-based eviction
Deployment Issues
Stuck Rollout
describe_deployment → Check replicas (desired vs ready vs available)
get_deployment_history → Compare current vs previous revision
get_pod_events → For pods in new ReplicaSet
Common causes:
- New pods failing (CrashLoopBackOff)
- Readiness probes failing
- Resource constraints preventing scheduling
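The desired vs ready vs available comparison can be sketched against a Deployment object in its JSON form. This assumes the input is shaped like the Kubernetes Deployment resource (`spec.replicas`, `status.updatedReplicas`, `status.readyReplicas`, `status.availableReplicas`); the output format of describe_deployment itself is not specified here, so treat the input shape as an assumption.

```python
from typing import Any, Dict


def rollout_summary(deployment: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize replica counts and flag a rollout that has not converged."""
    # spec.replicas defaults to 1 when unset; zero-valued status fields are omitted.
    desired = deployment.get("spec", {}).get("replicas", 1)
    status = deployment.get("status", {})
    updated = status.get("updatedReplicas", 0)
    ready = status.get("readyReplicas", 0)
    available = status.get("availableReplicas", 0)
    return {
        "desired": desired,
        "updated": updated,
        "ready": ready,
        "available": available,
        # Incomplete alone does not mean stuck; it means events are worth checking.
        "incomplete": updated < desired or available < desired,
    }
```

An incomplete summary that does not change between two checks is the signal to look at get_pod_events for the pods in the new ReplicaSet.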
Rollback Decision
Use get_deployment_history to see previous working versions.
Error Classification
Non-Retryable (Stop Immediately)
- 401 Unauthorized – Invalid credentials
- 403 Forbidden – No permission
- 404 Not Found – Resource doesn’t exist
- "config_required": true – Integration not configured
Retryable (May retry once)
- 429 Too Many Requests
- 500/502/503/504 Server errors
- Timeout
- Connection refused
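The classification above might be applied with a wrapper that retries at most once. A minimal sketch: `ToolError` is a hypothetical exception type invented for this example, and how the real tools surface status codes or the `config_required` flag is an assumption.

```python
from typing import Callable, Optional, TypeVar

NON_RETRYABLE = {401, 403, 404}
RETRYABLE = {429, 500, 502, 503, 504}

T = TypeVar("T")


class ToolError(Exception):
    """Hypothetical error carrying an HTTP-like status code."""

    def __init__(self, status: Optional[int], message: str = "",
                 config_required: bool = False) -> None:
        super().__init__(message)
        self.status = status
        self.config_required = config_required


def is_retryable(err: Exception) -> bool:
    """Apply the retryable / non-retryable split from the lists above."""
    if isinstance(err, (TimeoutError, ConnectionRefusedError)):
        return True
    if isinstance(err, ToolError):
        if err.config_required or err.status in NON_RETRYABLE:
            return False
        return err.status in RETRYABLE
    return False


def call_with_one_retry(fn: Callable[[], T]) -> T:
    """Call fn, retrying exactly once if the first failure is retryable."""
    try:
        return fn()
    except Exception as err:
        if not is_retryable(err):
            raise
    # Single retry; a second failure propagates to the caller.
    return fn()
```

Unrecognized errors are treated as non-retryable here, which is the conservative reading of "stop immediately" for anything not explicitly listed as retryable.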
Resource Investigation Pattern
For memory/CPU issues:
1. get_pod_resources → See allocation vs usage
2. describe_pod → See full container spec
3. get_cloudwatch_metrics/query_datadog_metrics → Historical usage
4. detect_anomalies on historical data → Find when issue started
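To illustrate what "find when the issue started" means in step 4, the sketch below flags the first sample that rises well above an early baseline. It is not the detect_anomalies implementation, whose method is not described here; the function name, the baseline window, and the threshold factor `k` are all assumptions.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def first_anomaly_index(samples: Sequence[float], baseline: int = 5,
                        k: float = 3.0) -> Optional[int]:
    """Index of the first sample above baseline mean + k standard deviations."""
    if len(samples) <= baseline:
        return None
    base = samples[:baseline]
    # The small floor keeps a perfectly flat baseline from flagging noise-free data.
    threshold = mean(base) + k * max(pstdev(base), 1e-9)
    for i in range(baseline, len(samples)):
        if samples[i] > threshold:
            return i
    return None
```

The returned index can be mapped back to a timestamp and compared with get_deployment_history to see whether the change coincides with a rollout.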