kubernetes-debug

📁 incidentfox/incidentfox 📅 9 days ago

Install command

npx skills add https://github.com/incidentfox/incidentfox --skill kubernetes-debug

Skill documentation

Kubernetes Debugging

Core Principle: Events Before Logs

ALWAYS check pod events BEFORE logs. Events explain most issues (roughly 80%) faster than logs do:

OOMKilled → Memory limit exceeded
ImagePullBackOff → Image not found or auth issue
FailedScheduling → No nodes with enough resources
CrashLoopBackOff → Container crashing repeatedly
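
To illustrate the principle, here is a minimal sketch of filtering one pod's warning events out of `kubectl get events -o json` output. It is not the skill's get_events.py, whose implementation is not shown in this document. The function name `warning_events` and the sample data are hypothetical, and the field names assume the core/v1 Event object.

```python
import json

def warning_events(events_json, pod_name):
    """Return (timestamp, reason, message) for Warning events on one pod, oldest first."""
    items = json.loads(events_json).get("items", [])
    rows = []
    for ev in items:
        # Keep only events attached to the pod we are debugging.
        if ev.get("involvedObject", {}).get("name") != pod_name:
            continue
        # Normal events (Scheduled, Pulled, Started) rarely explain a failure.
        if ev.get("type") != "Warning":
            continue
        rows.append((ev.get("lastTimestamp") or "", ev.get("reason", ""), ev.get("message", "")))
    # ISO 8601 timestamps sort chronologically when compared as strings.
    return sorted(rows)

# Hypothetical sample shaped like `kubectl get events -n otel-demo -o json`.
SAMPLE = json.dumps({
    "items": [
        {"type": "Normal", "reason": "Scheduled", "message": "Assigned to node",
         "lastTimestamp": "2024-01-01T10:00:00Z",
         "involvedObject": {"kind": "Pod", "name": "payment-7f8b9c6d5-x2k4m"}},
        {"type": "Warning", "reason": "BackOff", "message": "Back-off restarting failed container",
         "lastTimestamp": "2024-01-01T10:05:00Z",
         "involvedObject": {"kind": "Pod", "name": "payment-7f8b9c6d5-x2k4m"}},
        {"type": "Warning", "reason": "Unhealthy", "message": "Liveness probe failed",
         "lastTimestamp": "2024-01-01T10:02:00Z",
         "involvedObject": {"kind": "Pod", "name": "payment-7f8b9c6d5-x2k4m"}},
        {"type": "Warning", "reason": "FailedScheduling", "message": "Insufficient memory",
         "lastTimestamp": "2024-01-01T10:01:00Z",
         "involvedObject": {"kind": "Pod", "name": "some-other-pod"}},
    ]
})

for ts, reason, message in warning_events(SAMPLE, "payment-7f8b9c6d5-x2k4m"):
    print(f"[{ts}] {reason}: {message}")
```

Filtering to Warning events first keeps the output short enough to read before deciding whether logs are needed at all.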

Available Scripts

All scripts are in .claude/skills/infrastructure-kubernetes/scripts/

list_pods.py – List pods with status

python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n <namespace> [--label <selector>]

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo --label app.kubernetes.io/name=payment

get_events.py – Get pod events (USE FIRST!)

python .claude/skills/infrastructure-kubernetes/scripts/get_events.py <pod-name> -n <namespace>

# Example:
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n otel-demo

get_logs.py – Get pod logs

python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py <pod-name> -n <namespace> [--tail N] [--container NAME]

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --tail 100
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --container payment

describe_pod.py – Detailed pod info

python .claude/skills/infrastructure-kubernetes/scripts/describe_pod.py <pod-name> -n <namespace>

get_resources.py – Resource usage vs limits

python .claude/skills/infrastructure-kubernetes/scripts/get_resources.py <pod-name> -n <namespace>

describe_deployment.py – Deployment status

python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py <deployment-name> -n <namespace>

get_history.py – Rollout history

python .claude/skills/infrastructure-kubernetes/scripts/get_history.py <deployment-name> -n <namespace>

Debugging Workflows

Pod Not Starting (Pending/CrashLoopBackOff)

1. list_pods.py – Check pod status
2. get_events.py – Look for scheduling/pull/crash events
3. describe_pod.py – Check conditions and container states
4. get_logs.py – Only if the events don't explain the failure
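
The workflow above can be sketched as an ordered list of command lines. This is only an illustration: the helper name `pod_not_starting_steps` is hypothetical, the script paths and flags are the ones listed under Available Scripts, and the `--tail 100` value is an arbitrary choice.

```python
from typing import List, Optional

SCRIPTS_DIR = ".claude/skills/infrastructure-kubernetes/scripts"

def pod_not_starting_steps(pod: str, namespace: str, label: Optional[str] = None) -> List[List[str]]:
    """Build the argv for each step of the 'Pod Not Starting' workflow, in order."""
    list_pods = ["python", f"{SCRIPTS_DIR}/list_pods.py", "-n", namespace]
    if label:
        list_pods += ["--label", label]
    return [
        list_pods,                                                            # 1. pod status
        ["python", f"{SCRIPTS_DIR}/get_events.py", pod, "-n", namespace],     # 2. events first
        ["python", f"{SCRIPTS_DIR}/describe_pod.py", pod, "-n", namespace],   # 3. conditions
        # 4. logs last, and only if the events did not explain the failure
        ["python", f"{SCRIPTS_DIR}/get_logs.py", pod, "-n", namespace, "--tail", "100"],
    ]

steps = pod_not_starting_steps("payment-7f8b9c6d5-x2k4m", "otel-demo")
for argv in steps:
    print(" ".join(argv))
```

Each entry is a plain argument list, so it could be passed to `subprocess.run(argv, check=True)` one step at a time, stopping as soon as a step explains the problem.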

Pod Restarting (OOMKilled/Crashes)

1. get_events.py – Check for OOMKilled or error events
2. get_resources.py – Compare usage vs limits
3. get_logs.py – Check for errors logged before the crash
4. describe_pod.py – Check restart count and state
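
Comparing usage against limits means converting Kubernetes memory quantities such as `480Mi` and `512Mi` to a common unit. The sketch below shows one way to do that; it is not taken from get_resources.py, the function names are hypothetical, and it handles only a subset of the quantity suffixes (any other suffix raises `ValueError`).

```python
# Binary (power-of-two) and decimal suffixes used by Kubernetes quantities (subset).
_BINARY = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
_DECIMAL = {"k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}

def memory_to_bytes(quantity: str) -> int:
    """Convert a quantity like '512Mi' or '1G' to bytes."""
    # Check two-letter binary suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * factor)
    for suffix, factor in _DECIMAL.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * factor)
    return int(quantity)  # no suffix: already a byte count

def memory_pressure(usage: str, limit: str) -> float:
    """Fraction of the memory limit currently in use (1.0 means at the limit)."""
    return memory_to_bytes(usage) / memory_to_bytes(limit)

ratio = memory_pressure("480Mi", "512Mi")
print(f"memory usage is {ratio:.0%} of the limit")
```

A ratio that sits close to 1.0 before each restart supports an OOMKilled diagnosis; a low ratio suggests looking at the logs for a different crash cause.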

Deployment Not Progressing

1. describe_deployment.py – Check replica counts
2. list_pods.py – Find stuck pods
3. get_events.py – Check events on stuck pods
4. get_history.py – Check rollout history for a revision to roll back to
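
As a rough illustration of the replica-count check, the sketch below inspects a Deployment object in the shape returned by `kubectl get deployment <name> -o json`. The function name `rollout_problems` and the sample dictionaries are hypothetical; the fields used (`spec.replicas`, `status.updatedReplicas`, `status.availableReplicas`, and the `Progressing` condition) are assumed from the apps/v1 Deployment API, not from describe_deployment.py.

```python
def rollout_problems(deployment: dict) -> list:
    """List reasons a Deployment rollout looks stuck; empty list means it looks healthy."""
    desired = deployment.get("spec", {}).get("replicas", 1)  # API default is 1
    status = deployment.get("status", {})
    updated = status.get("updatedReplicas", 0)
    available = status.get("availableReplicas", 0)

    problems = []
    if updated < desired:
        problems.append(f"only {updated}/{desired} replicas updated to the new template")
    if available < desired:
        problems.append(f"only {available}/{desired} replicas available")
    for cond in status.get("conditions", []):
        # The controller sets this reason when the rollout makes no progress in time.
        if cond.get("type") == "Progressing" and cond.get("reason") == "ProgressDeadlineExceeded":
            problems.append("rollout exceeded its progress deadline")
    return problems

# Hypothetical examples: one stuck rollout, one healthy deployment.
stuck = {
    "spec": {"replicas": 3},
    "status": {
        "updatedReplicas": 1,
        "availableReplicas": 2,
        "conditions": [{"type": "Progressing", "status": "False",
                        "reason": "ProgressDeadlineExceeded"}],
    },
}
healthy = {"spec": {"replicas": 3}, "status": {"updatedReplicas": 3, "availableReplicas": 3}}

for problem in rollout_problems(stuck):
    print(problem)
```

When the list is non-empty, the next step in the workflow is to find the pods that are not becoming ready and read their events.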

Common Issues & Solutions

Event Reason	Meaning	Action
OOMKilled	Container exceeded memory limit	Increase limits or fix memory leak
ImagePullBackOff	Can’t pull image	Check image name, registry auth
CrashLoopBackOff	Container keeps crashing	Check logs for startup errors
FailedScheduling	No node can run pod	Check node resources, taints
Unhealthy	Liveness, readiness, or startup probe failed	Check probe config, app health
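
The table's Action column can also be kept as a small lookup, which is handy when summarizing many pods at once. This is a sketch only: `suggest_action` and its fallback text are hypothetical, and the entries are copied from the table rather than from any script in the skill.

```python
# Reason -> recommended action, taken from the table above.
ACTIONS = {
    "OOMKilled": "Increase limits or fix memory leak",
    "ImagePullBackOff": "Check image name, registry auth",
    "CrashLoopBackOff": "Check logs for startup errors",
    "FailedScheduling": "Check node resources, taints",
    "Unhealthy": "Check probe config, app health",
}

def suggest_action(reason: str) -> str:
    """Return the table's action for a reason, or a generic fallback for unknown reasons."""
    return ACTIONS.get(reason, "Not in the table: inspect describe_pod.py output and logs")

for reason in ("OOMKilled", "FailedScheduling", "SomethingElse"):
    print(f"{reason}: {suggest_action(reason)}")
```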

Output Format

When reporting findings, use this structure:

## Kubernetes Analysis

**Pod**: <name>
**Namespace**: <namespace>
**Status**: <phase> (Restarts: N)

### Events
- [timestamp] <reason>: <message>

### Issues Found
1. [Issue description with evidence]

### Root Cause Hypothesis
[Based on events and logs]

### Recommended Action
[Specific remediation step]
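
If the report is assembled programmatically, the structure above could be filled in along these lines. The function `render_report`, its parameters, and the example values are all hypothetical; only the headings and field labels come from the template.

```python
def render_report(pod, namespace, phase, restarts, events, issues, hypothesis, action):
    """Render findings in the report structure shown above.

    events is a list of (timestamp, reason, message) tuples; issues is a list of strings.
    """
    lines = [
        "## Kubernetes Analysis",
        "",
        f"**Pod**: {pod}",
        f"**Namespace**: {namespace}",
        f"**Status**: {phase} (Restarts: {restarts})",
        "",
        "### Events",
    ]
    lines += [f"- [{ts}] {reason}: {message}" for ts, reason, message in events]
    lines += ["", "### Issues Found"]
    lines += [f"{i}. {issue}" for i, issue in enumerate(issues, start=1)]
    lines += ["", "### Root Cause Hypothesis", hypothesis]
    lines += ["", "### Recommended Action", action]
    return "\n".join(lines)

# Hypothetical example values for illustration only.
report = render_report(
    pod="payment-7f8b9c6d5-x2k4m",
    namespace="otel-demo",
    phase="Running",
    restarts=7,
    events=[("2024-01-01T10:05:00Z", "BackOff", "Back-off restarting failed container")],
    issues=["Container restarts repeatedly; last state shows OOMKilled"],
    hypothesis="Memory limit is too low for the current workload",
    action="Increase the memory limit or fix the memory leak",
)
print(report)
```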
