incident-response

📁 nickcrew/claude-ctx-plugin 📅 4 days ago

Total installs: 8
Weekly installs: 8
Site-wide rank: #36010

Install command:

npx skills add https://github.com/nickcrew/claude-ctx-plugin --skill incident-response

Agent install distribution:

opencode 8
gemini-cli 7
codebuddy 7
github-copilot 7
codex 7
kimi-cli 7

Skill Documentation

Incident Response

Structured incident management from detection through postmortem, with resilience patterns for preventing and containing cascading failures.

When to Use

  • Production incident in progress (outage, degradation, data loss)
  • Designing circuit breakers, bulkheads, or fallback strategies
  • Conducting or planning chaos engineering exercises
  • Writing or reviewing postmortem documents
  • Establishing on-call procedures and escalation paths

Avoid when:

  • The issue is a development-time bug with no production impact
  • Designing general system architecture (use system-design instead)

Quick Reference

| Topic | Load reference |
| --- | --- |
| Triage Framework | skills/incident-response/references/triage-framework.md |
| Postmortem Patterns | skills/incident-response/references/postmortem-patterns.md |

Incident Response Workflow

Phase 1: Detect

  • Alert fires or user report received
  • Confirm the issue is real (not a false positive)
  • Identify affected services and user impact scope

Phase 2: Triage

  • Classify severity (P0-P3)
  • Assign incident commander
  • Open communication channel (war room, Slack channel)
  • Begin status page updates
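The triage state above can be captured in a small incident record so the timeline is machine-collected from the start. This is a minimal sketch, not part of the skill itself; the `Incident` class and its field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Incident:
    """Hypothetical triage record: severity, commander, and comms channel
    are assigned at triage; the timeline feeds the postmortem later."""
    title: str
    severity: str                 # "P0".."P3"
    commander: str = ""           # incident commander, assigned at triage
    channel: str = ""             # war room / Slack channel
    opened_at: float = field(default_factory=time.time)
    timeline: list = field(default_factory=list)

    def log(self, event: str) -> None:
        # Append a timestamped entry for the postmortem timeline.
        self.timeline.append((time.time(), event))
```

Logging every action as it happens is far cheaper than reconstructing the timeline from memory 48 hours later.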

Phase 3: Contain

  • Stop the bleeding: rollback, feature flag, traffic shift
  • Prevent cascade: circuit breakers, load shedding, bulkhead isolation
  • Communicate: stakeholder updates every 15 minutes for P0/P1
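One of the cascade-prevention tools named above, the circuit breaker, can be sketched in a few lines: after enough consecutive failures it fails fast instead of hammering a struggling dependency, then lets a trial call through once a cooldown elapses. This is a minimal illustration, assuming in-process state and simple consecutive-failure counting; production implementations usually add half-open call limits and metrics.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: opens after N consecutive failures,
    rejects calls while open, and half-opens after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast: shed load instead of cascading the failure.
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial call through.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

During containment the same idea applies manually: if a downstream dependency is the problem, stop calling it and serve a fallback rather than letting retries amplify the outage.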

Phase 4: Resolve

  • Implement fix (minimal viable fix first)
  • Validate in staging if time permits
  • Deploy with monitoring and rollback plan ready
  • Confirm recovery with metrics returning to baseline
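"Metrics returning to baseline" is worth making concrete: compare each key metric against its pre-incident value within a tolerance, and only declare recovery when all of them pass. A minimal sketch, assuming metrics arrive as plain name-to-value dicts (the `recovered` helper and its tolerance default are illustrative):

```python
def recovered(current: dict, baseline: dict, tolerance: float = 0.10) -> bool:
    """Return True when every baseline metric is present in `current`
    and within `tolerance` (fractional) of its pre-incident value."""
    for name, base in baseline.items():
        if name not in current:
            return False  # a missing metric is not a recovered metric
        if base == 0:
            # No meaningful ratio against a zero baseline; require the
            # current value itself to be within the tolerance of zero.
            if abs(current[name]) > tolerance:
                return False
        elif abs(current[name] - base) / abs(base) > tolerance:
            return False
    return True
```

Checking all metrics, not just the one that alerted, guards against declaring victory while a secondary symptom (error rate, queue depth) is still elevated.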

Phase 5: Postmortem

  • Document timeline within 48 hours
  • Conduct blameless review with all participants
  • Identify root cause and contributing factors
  • Assign action items with owners and deadlines
  • Update runbooks and alerting based on lessons learned

Severity Framework

| Level | Impact | Response Time | Examples |
| --- | --- | --- | --- |
| P0 | Complete outage, data loss, security breach | Immediate (< 5 min) | Service down, data corruption, credential leak |
| P1 | Major feature broken, significant user impact | < 30 min | Payment processing failed, auth broken for a region |
| P2 | Degraded performance, partial feature loss | < 4 hours | Elevated latency, non-critical feature unavailable |
| P3 | Minor issue, workaround available | Next business day | UI glitch, slow report generation, cosmetic error |
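The framework above also pins down the communication cadence mechanically: P0/P1 get the 15-minute stakeholder updates the Contain phase calls for, while P2/P3 update on milestones. A small sketch encoding that rule (the table data and helper are illustrative, mirroring the framework as written):

```python
# Severity table from the framework above, keyed by level.
SEVERITY = {
    "P0": ("Complete outage, data loss, security breach", "Immediate (< 5 min)"),
    "P1": ("Major feature broken, significant user impact", "< 30 min"),
    "P2": ("Degraded performance, partial feature loss", "< 4 hours"),
    "P3": ("Minor issue, workaround available", "Next business day"),
}

def stakeholder_update_interval(severity: str):
    """Minutes between stakeholder updates, or None for milestone-only
    updates. P0/P1 get the 15-minute cadence from the Contain phase."""
    if severity not in SEVERITY:
        raise ValueError(f"unknown severity: {severity}")
    return 15 if severity in ("P0", "P1") else None
```

Encoding the policy this way keeps paging and status tooling consistent with the table instead of relying on each responder's memory of it.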

Output

  • Incident timeline and severity classification
  • Containment actions taken
  • Postmortem document with action items
  • Updated runbooks and alerting rules

Common Mistakes

  • Skipping severity classification and treating everything as P0
  • Making changes without a rollback plan
  • Forgetting to communicate status to stakeholders
  • Writing postmortems that assign blame instead of identifying systemic issues
  • Not following up on postmortem action items