kubernetes-debug
Install command
npx skills add https://github.com/incidentfox/incidentfox --skill kubernetes-debug
Kubernetes Debugging
Core Principle: Events Before Logs
ALWAYS check pod events BEFORE logs. Events explain about 80% of issues, and faster than logs do:
- OOMKilled → Memory limit exceeded
- ImagePullBackOff → Image not found or auth issue
- FailedScheduling → No nodes with enough resources
- CrashLoopBackOff → Container crashing repeatedly
Available Scripts
All scripts are in .claude/skills/infrastructure-kubernetes/scripts/
list_pods.py – List pods with status
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n <namespace> [--label <selector>]
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo --label app.kubernetes.io/name=payment
get_events.py – Get pod events (USE FIRST!)
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py <pod-name> -n <namespace>
# Example:
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n otel-demo
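The skill does not show what get_events.py does internally. As a rough sketch of one way to fetch events for a single pod, assuming `kubectl` is on the PATH and has access to the cluster, something like the following could work. The function names here are hypothetical and are not taken from the actual script.

```python
import json
import subprocess


def build_events_command(pod_name: str, namespace: str) -> list:
    """Build a kubectl command that lists events for one pod as JSON."""
    return [
        "kubectl", "get", "events",
        "-n", namespace,
        "--field-selector", f"involvedObject.name={pod_name}",
        "-o", "json",
    ]


def summarize_events(events_json: str) -> list:
    """Turn kubectl's JSON event list into 'Type Reason: message' lines, oldest first."""
    items = json.loads(events_json).get("items", [])
    # lastTimestamp may be null on some events, so fall back to an empty string.
    items.sort(key=lambda e: e.get("lastTimestamp") or "")
    return [
        f"{e.get('type', '?')} {e.get('reason', '?')}: {e.get('message', '')}"
        for e in items
    ]


def get_pod_events(pod_name: str, namespace: str) -> list:
    """Run kubectl and return summarized events (needs a reachable cluster)."""
    result = subprocess.run(
        build_events_command(pod_name, namespace),
        capture_output=True, text=True, check=True,
    )
    return summarize_events(result.stdout)
```

Splitting command construction from parsing keeps the parsing step usable on saved JSON, without a live cluster.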
get_logs.py – Get pod logs
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py <pod-name> -n <namespace> [--tail N] [--container NAME]
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --tail 100
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --container payment
describe_pod.py – Detailed pod info
python .claude/skills/infrastructure-kubernetes/scripts/describe_pod.py <pod-name> -n <namespace>
get_resources.py – Resource usage vs limits
python .claude/skills/infrastructure-kubernetes/scripts/get_resources.py <pod-name> -n <namespace>
describe_deployment.py – Deployment status
python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py <deployment-name> -n <namespace>
get_history.py – Rollout history
python .claude/skills/infrastructure-kubernetes/scripts/get_history.py <deployment-name> -n <namespace>
Debugging Workflows
Pod Not Starting (Pending/CrashLoopBackOff)
1. list_pods.py – Check pod status
2. get_events.py – Look for scheduling/pull/crash events
3. describe_pod.py – Check conditions and container states
4. get_logs.py – Only if events don’t explain
Pod Restarting (OOMKilled/Crashes)
1. get_events.py – Check for OOMKilled or error events
2. get_resources.py – Compare usage vs limits
3. get_logs.py – Check for errors before crash
4. describe_pod.py – Check restart count and state
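Comparing usage against limits means converting Kubernetes memory quantities to a common unit first. The sketch below is illustrative only: it handles a subset of the quantity suffixes, and the helper names are hypothetical, not part of get_resources.py.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities (subset).
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '128Mi' or '1G' to bytes."""
    # Try two-letter suffixes before one-letter ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(float(quantity))  # no suffix: plain byte count


def memory_ratio(usage: str, limit: str) -> float:
    """Return usage as a fraction of the limit (1.0 means at the limit)."""
    return parse_memory(usage) / parse_memory(limit)


print(round(memory_ratio("450Mi", "512Mi"), 2))  # 0.88
```

A ratio close to 1.0 on a pod that has recorded OOMKilled suggests the limit is too tight for the workload, or that memory is leaking.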
Deployment Not Progressing
1. describe_deployment.py – Check replica counts
2. list_pods.py – Find stuck pods
3. get_events.py – Check events on stuck pods
4. get_history.py – Check rollout history for rollback
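The three workflows above are ordered checklists, so they can be written down as data. The following is a minimal sketch; the symptom keys and the `next_step` helper are hypothetical names chosen for illustration, and only the script order comes from the workflows themselves.

```python
from typing import Optional

# Script order per symptom, copied from the workflows above.
WORKFLOWS = {
    "pod_not_starting": [
        "list_pods.py", "get_events.py", "describe_pod.py", "get_logs.py",
    ],
    "pod_restarting": [
        "get_events.py", "get_resources.py", "get_logs.py", "describe_pod.py",
    ],
    "deployment_not_progressing": [
        "describe_deployment.py", "list_pods.py", "get_events.py", "get_history.py",
    ],
}


def next_step(symptom: str, completed: list) -> Optional[str]:
    """Return the first script in the workflow that has not been run yet."""
    for script in WORKFLOWS[symptom]:
        if script not in completed:
            return script
    return None  # workflow exhausted


print(next_step("pod_not_starting", ["list_pods.py"]))  # get_events.py
```

In every workflow get_events.py comes before get_logs.py, in line with the events-before-logs principle.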
Common Issues & Solutions
| Event Reason | Meaning | Action |
|---|---|---|
| OOMKilled | Container exceeded memory limit | Increase limits or fix memory leak |
| ImagePullBackOff | Can’t pull image | Check image name, registry auth |
| CrashLoopBackOff | Container keeps crashing | Check logs for startup errors |
| FailedScheduling | No node can run pod | Check node resources, taints |
| Unhealthy | Liveness or readiness probe failed | Check probe config, app health |
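The table can double as a lookup when triaging a list of event reasons. The sketch below is an illustration, not part of the skill's scripts: `recommend` is a hypothetical helper, and the action strings are copied from the Action column.

```python
from collections import Counter

# Reason -> action, copied from the table above.
ACTIONS = {
    "OOMKilled": "Increase limits or fix memory leak",
    "ImagePullBackOff": "Check image name, registry auth",
    "CrashLoopBackOff": "Check logs for startup errors",
    "FailedScheduling": "Check node resources, taints",
    "Unhealthy": "Check probe config, app health",
}


def recommend(reasons: list) -> list:
    """Return one line per known reason, most frequent first; unknown reasons are skipped."""
    counts = Counter(r for r in reasons if r in ACTIONS)
    return [
        f"{reason} (x{n}): {ACTIONS[reason]}"
        for reason, n in counts.most_common()
    ]


for line in recommend(["OOMKilled", "Pulled", "OOMKilled", "Unhealthy"]):
    print(line)
```

Reasons not in the table (such as `Pulled` here) are ignored rather than guessed at.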
Output Format
When reporting findings, use this structure:
## Kubernetes Analysis
**Pod**: <name>
**Namespace**: <namespace>
**Status**: <phase> (Restarts: N)
### Events
- [timestamp] <reason>: <message>
### Issues Found
1. [Issue description with evidence]
### Root Cause Hypothesis
[Based on events and logs]
### Recommended Action
[Specific remediation step]
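One way to fill in this structure programmatically is sketched below. `render_report` and its parameters are hypothetical names, and the values in the usage example are made up for illustration.

```python
def render_report(pod, namespace, phase, restarts, events, issues, hypothesis, action):
    """Render findings in the report structure above.

    events is a list of (timestamp, reason, message) tuples;
    issues is a list of strings.
    """
    lines = [
        "## Kubernetes Analysis",
        f"**Pod**: {pod}",
        f"**Namespace**: {namespace}",
        f"**Status**: {phase} (Restarts: {restarts})",
        "### Events",
    ]
    lines += [f"- [{ts}] {reason}: {msg}" for ts, reason, msg in events]
    lines.append("### Issues Found")
    lines += [f"{i}. {issue}" for i, issue in enumerate(issues, start=1)]
    lines += [
        "### Root Cause Hypothesis", hypothesis,
        "### Recommended Action", action,
    ]
    return "\n".join(lines)


# Illustrative values only.
print(render_report(
    pod="payment-7f8b9c6d5-x2k4m",
    namespace="otel-demo",
    phase="Running",
    restarts=5,
    events=[("2024-01-01T00:00:00Z", "OOMKilled", "Container exceeded memory limit")],
    issues=["Container was OOMKilled 5 times"],
    hypothesis="Memory limit is too low for the workload",
    action="Increase the memory limit or investigate a leak",
))
```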