incident-response
npx skills add https://github.com/nik-kale/sre-skills --skill incident-response
Incident Response
Systematic framework for investigating and resolving production incidents.
When to Use This Skill
- Production alert firing
- Service outage or degradation reported
- Error rates spiking
- User-reported issues affecting production
- On-call escalation received
Initial Triage (First 5 Minutes)
Copy and track progress:
Incident Triage:
- [ ] Confirm the incident is real (not false positive)
- [ ] Identify affected service(s)
- [ ] Assess severity level
- [ ] Start incident channel/thread
- [ ] Page additional responders if needed
Severity Assessment
Quickly determine severity to guide response urgency:
| Severity | Criteria | Response Time |
|---|---|---|
| SEV1 | Complete outage, data loss risk, security breach | Immediate, all hands |
| SEV2 | Major degradation, significant user impact | < 15 min, primary on-call |
| SEV3 | Partial degradation, limited user impact | < 1 hour |
| SEV4 | Minor issue, workaround available | Next business day |
For detailed severity definitions, see references/severity-levels.md.
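The table above can be encoded as a small helper so responders get the response target without re-reading the matrix. This is a hypothetical convenience function, not part of this skill's tooling; the strings mirror the table.

```shell
# Hypothetical helper: map a severity level to the response-time
# target from the severity table. Unknown levels fail loudly.
severity_target() {
  case "$1" in
    SEV1) echo "immediate, all hands" ;;
    SEV2) echo "<15m, primary on-call" ;;
    SEV3) echo "<1h" ;;
    SEV4) echo "next business day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

severity_target SEV2
```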
Data Gathering
Collect evidence systematically. Don’t jump to conclusions.
1. Timeline Construction
Timeline:
- [TIME] First alert/report
- [TIME] ...
- [TIME] Current status
2. Key Data Sources
Metrics – Check in this order:
- Error rates (5xx, exceptions)
- Latency (p50, p95, p99)
- Traffic volume (requests/sec)
- Resource utilization (CPU, memory, disk, connections)
- Dependency health
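The first metric in the order above, error rate, is just a ratio of counters. A minimal sketch, with illustrative counts standing in for whatever your metrics backend returns:

```shell
# Sketch: turn raw counters into the 5xx error-rate percentage.
# The counts below are illustrative stand-ins for metrics-backend values.
errors_5xx=42
total_requests=12000
error_rate=$(awk -v e="$errors_5xx" -v t="$total_requests" \
  'BEGIN { printf "%.2f", e / t * 100 }')
echo "5xx error rate: ${error_rate}%"
```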
Logs – Search for:
# Error patterns
level:error OR level:fatal
exception OR panic OR crash
# Correlation
trace_id:<id> OR request_id:<id>
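The same patterns translate directly to `grep` when you have an exported log file. The sample log below is a stand-in for logs pulled from your aggregator; the field names are assumptions about your log format.

```shell
# Sketch: apply the error and correlation patterns locally with grep.
log=$(mktemp)
cat > "$log" <<'EOF'
2024-05-01T10:00:01Z level=info  msg="request ok"  request_id=abc123
2024-05-01T10:00:02Z level=error msg="db timeout"  request_id=abc124
2024-05-01T10:00:03Z level=fatal msg="panic: nil deref" request_id=abc124
EOF

# Error patterns
grep -E 'level=(error|fatal)|exception|panic|crash' "$log"
# Correlation: everything for one request
grep 'request_id=abc124' "$log"
```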
Traces – Find:
- Slowest traces in affected timeframe
- Error traces
- Traces crossing service boundaries
3. Change Correlation
Recent changes are the most common cause of incidents. Check:
Change Audit:
- [ ] Recent deployments (last 24h)
- [ ] Config changes
- [ ] Feature flag changes
- [ ] Infrastructure changes
- [ ] Database migrations
- [ ] Dependency updates
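When the service deploys from a git repository, the first three audit items can be answered with `git log`. The throwaway repo below exists only so the sketch is self-contained; during an incident, run the `git log` lines in the actual service repo (the `config/` and `migrations/` paths are assumptions about its layout).

```shell
# Sketch: surface the last 24h of changes with git.
repo=$(mktemp -d)                     # stand-in for the service repo
git -C "$repo" init -q
git -C "$repo" -c user.name=oncall -c user.email=oncall@example.com \
  commit -q --allow-empty -m "deploy: api v2.3.1"

# Deployments and other changes in the last 24 hours
git -C "$repo" log --since="24 hours ago" --oneline
# Narrow to config or migration paths, if they live in the repo:
# git log --since="24 hours ago" --oneline -- config/ migrations/
```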
Impact Assessment
Quantify the blast radius:
Impact Assessment:
- Affected users: [count or percentage]
- Affected regions: [list]
- Revenue impact: [if calculable]
- Data integrity: [confirmed OK / under investigation]
- Duration so far: [time]
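Two of the fields above are simple arithmetic once you have the inputs. A sketch with illustrative figures (the user counts and alert timestamp are stand-ins):

```shell
# Sketch: compute the affected-user percentage and duration-so-far.
affected_users=1800
total_users=120000
pct=$(awk -v a="$affected_users" -v t="$total_users" \
  'BEGIN { printf "%.1f", a / t * 100 }')
echo "Affected users: ${affected_users} (${pct}%)"

# Duration so far, from the first-alert epoch timestamp
now=$(date +%s)
first_alert=$(( now - 1500 ))   # stand-in: first alert fired 25 min ago
echo "Duration so far: $(( (now - first_alert) / 60 )) min"
```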
Mitigation Actions
Prioritize stopping the bleeding over finding root cause.
Quick Mitigation Options
| Action | When to Use | Risk |
|---|---|---|
| Rollback | Bad deployment identified | Low |
| Feature flag disable | New feature causing issues | Low |
| Scale up | Capacity exhaustion | Low |
| Restart | Memory leak, stuck process | Medium |
| Failover | Regional/AZ issue | Medium |
| Circuit breaker | Dependency failure | Low |
Mitigation Checklist
Mitigation:
- [ ] Identify mitigation action
- [ ] Assess rollback risk
- [ ] Execute mitigation
- [ ] Verify improvement
- [ ] Monitor for recurrence
Communication
Status Update Template
**Incident Update - [SERVICE] - [SEV LEVEL]**
**Status**: Investigating / Identified / Mitigating / Resolved
**Impact**: [Brief description of user impact]
**Current Actions**: [What's being done now]
**Next Update**: [Time or "when we have new information"]
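Under pressure it helps if every update has the same shape. A hypothetical helper (not part of this skill) that renders the template above; the service name and wording in the usage example are illustrative:

```shell
# Hypothetical helper: render the status-update template.
# Arguments: service, severity, status, impact, actions, next-update.
status_update() {
  printf '**Incident Update - %s - %s**\n' "$1" "$2"
  printf '**Status**: %s\n' "$3"
  printf '**Impact**: %s\n' "$4"
  printf '**Current Actions**: %s\n' "$5"
  printf '**Next Update**: %s\n' "$6"
}

status_update checkout-api SEV2 Mitigating \
  "Elevated 5xx for ~2% of EU checkouts" \
  "Rolling back v2.3.1" "11:30 UTC"
```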
Stakeholder Communication
- SEV1/SEV2: Proactive updates every 15-30 minutes
- SEV3: Update when status changes
- SEV4: Update in ticket
Root Cause Analysis
Only after mitigation. Don’t debug while the site is down.
The 5 Whys
Why 1: [Immediate cause]
  ↓
Why 2: [Underlying cause]
  ↓
Why 3: [Contributing factor]
  ↓
Why 4: [Process/system gap]
  ↓
Why 5: [Root cause]
Common Failure Modes
Before deep investigation, check common patterns in references/common-failure-modes.md.
Evidence Collection
Preserve for post-incident:
- Screenshots of dashboards
- Relevant log snippets
- Timeline of events
- Commands executed
- Configuration at time of incident
Resolution & Handoff
Resolution Checklist:
- [ ] Service restored to normal
- [ ] Monitoring confirms stability (15+ min)
- [ ] Incident channel updated with resolution
- [ ] Follow-up items captured
- [ ] Post-incident review scheduled (SEV1/SEV2)
Handoff Template
If handing off to another responder:
Incident Handoff:
- Summary: [1-2 sentences]
- Current status: [state]
- What's been tried: [list]
- Working theory: [hypothesis]
- Next steps: [recommended actions]
- Key links: [dashboards, logs, docs]
Post-Incident Review
For SEV1/SEV2 incidents, schedule within 48-72 hours.
Blameless Review Questions
- What happened? (Timeline)
- What was the impact?
- How was it detected?
- How was it mitigated?
- What was the root cause?
- What could we do differently?
- What action items will prevent recurrence?
Action Item Template
Action Item:
- Title: [Brief description]
- Owner: [Person/team]
- Priority: [P0/P1/P2]
- Due: [Date]
- Type: [Prevention / Detection / Response]
Quick Reference Commands
Kubernetes
# Pod status
kubectl get pods -n <namespace> | grep -v Running
# Recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Pod logs
kubectl logs <pod> -n <namespace> --tail=100
Docker
# Container status
docker ps -a | head -20
# Container logs
docker logs --tail 100 <container>
# Resource usage
docker stats --no-stream
System
# Resource pressure
top -bn1 | head -20
df -h
free -m
# Network connections
netstat -tuln | grep LISTEN
ss -tuln