incident-management

Repository: nguyenhuuca/assessment (updated 10 days ago)
Install command
npx skills add https://github.com/nguyenhuuca/assessment --skill incident-management

Skill Documentation

Incident Management

Incident Severity

| Level | Impact            | Response Time     |
|-------|-------------------|-------------------|
| SEV1  | Complete outage   | Immediate         |
| SEV2  | Major degradation | < 15 min          |
| SEV3  | Minor degradation | < 1 hour          |
| SEV4  | Low impact        | Next business day |
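
The response-time targets above can be turned into a concrete deadline. The following is a minimal sketch, not part of the original skill: the names `RESPONSE_WINDOWS` and `response_deadline` are hypothetical, "Immediate" is modeled as a zero delay, and "Next business day" is left as `None` because it depends on a team's working calendar.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response windows taken from the severity table.
# SEV1 "Immediate" is modeled as a zero delay (an assumption).
RESPONSE_WINDOWS = {
    "SEV1": timedelta(0),
    "SEV2": timedelta(minutes=15),
    "SEV3": timedelta(hours=1),
}


def response_deadline(level: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest time a first response is due, or None for SEV4.

    SEV4 ("Next business day") needs a business calendar, which this
    sketch does not model, so no fixed deadline is returned for it.
    """
    if level == "SEV4":
        return None
    if level not in RESPONSE_WINDOWS:
        raise ValueError(f"unknown severity level: {level!r}")
    return detected_at + RESPONSE_WINDOWS[level]
```

For example, a SEV2 incident detected at 12:00 would be due a response by 12:15 under these assumptions.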

Incident Response

1. Detect

  • Monitoring alerts
  • Customer reports
  • Error logs

2. Triage

  • Assess severity
  • Assign incident commander
  • Create communication channel
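
The triage outputs (a severity, a commander, and a channel) can be captured in one record. This is an illustrative sketch only: the `Incident` class, the `open_incident` helper, and the `#inc-<date>-<slug>` channel naming scheme are all assumptions, and no real chat service is called.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

VALID_LEVELS = {"SEV1", "SEV2", "SEV3", "SEV4"}


@dataclass(frozen=True)
class Incident:
    title: str
    severity: str
    commander: str
    channel: str
    opened_at: datetime


def open_incident(
    title: str,
    severity: str,
    commander: str,
    opened_at: Optional[datetime] = None,
) -> Incident:
    """Record the triage decisions and derive a channel name."""
    if severity not in VALID_LEVELS:
        raise ValueError(f"unknown severity level: {severity!r}")
    if not commander:
        raise ValueError("an incident commander must be assigned")
    opened_at = opened_at or datetime.now(timezone.utc)
    # Hypothetical naming convention: lowercase title, non-alphanumerics to hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    channel = f"#inc-{opened_at:%Y%m%d}-{slug}"
    return Incident(title, severity, commander, channel, opened_at)
```

Rejecting an empty commander at creation time is a design choice in this sketch: it keeps an incident from being opened without a clear owner.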

3. Investigate

  • Check recent changes
  • Review logs and metrics
  • Identify root cause
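
"Check recent changes" usually means listing what shipped shortly before the incident began. A minimal sketch follows, assuming changes are available as simple records; the `Change` type, the `recent_changes` helper, and the 24-hour default lookback are hypothetical choices, not part of the original skill.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Change:
    description: str
    deployed_at: datetime


def recent_changes(
    changes: Iterable[Change],
    incident_start: datetime,
    lookback: timedelta = timedelta(hours=24),
) -> List[Change]:
    """Return changes deployed within the lookback window, newest first."""
    window_start = incident_start - lookback
    suspects = [c for c in changes if window_start <= c.deployed_at <= incident_start]
    # Newest first: the most recent change is often the first one to examine.
    return sorted(suspects, key=lambda c: c.deployed_at, reverse=True)
```

The result is only a list of candidates; a change inside the window is not necessarily the root cause.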

4. Mitigate

  • Apply quick fix
  • Rollback if needed
  • Communicate status

5. Resolve

  • Confirm fix
  • Monitor for recurrence
  • Close incident
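
"Monitor for recurrence" can be expressed as a quiet-period check before closing. The sketch below assumes error rates are sampled at regular intervals; the function name `is_stable`, the threshold, and the number of quiet samples are illustrative parameters that a team would choose for itself.

```python
from typing import Sequence


def is_stable(error_rates: Sequence[float], threshold: float, quiet_samples: int) -> bool:
    """Return True if the most recent `quiet_samples` readings are all below `threshold`.

    With too few readings the answer is False: there is not yet enough
    evidence to close the incident.
    """
    if quiet_samples <= 0:
        raise ValueError("quiet_samples must be positive")
    if len(error_rates) < quiet_samples:
        return False
    return all(rate < threshold for rate in error_rates[-quiet_samples:])
```

For instance, `is_stable([0.20, 0.01, 0.01, 0.02], threshold=0.05, quiet_samples=3)` considers only the last three readings, so the earlier spike does not block closing.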

6. Learn

  • Post-mortem meeting
  • Document findings
  • Create action items

Post-Mortem Template

# Post-Mortem: [Incident Title]

## Summary
[Brief description of what happened]

## Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Resolution]

## Impact
- Duration: [X hours]
- Users affected: [X]
- Revenue impact: [if applicable]

## Root Cause
[What caused this incident]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## What Went Well
- [Positive 1]
- [Positive 2]

## What Could Be Improved
- [Improvement 1]
- [Improvement 2]

## Action Items
- [ ] [Action 1] - Owner: [Name]
- [ ] [Action 2] - Owner: [Name]
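
The template above can also be filled in programmatically. This is a hedged sketch rather than tooling shipped with the skill: the `PostMortem` dataclass and `render` function are hypothetical names, and the output simply mirrors the template's headings, omitting the revenue line when no value is given.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PostMortem:
    title: str
    summary: str
    duration: str
    users_affected: str
    root_cause: str
    timeline: List[Tuple[str, str]] = field(default_factory=list)      # (HH:MM, event)
    contributing_factors: List[str] = field(default_factory=list)
    went_well: List[str] = field(default_factory=list)
    improvements: List[str] = field(default_factory=list)
    action_items: List[Tuple[str, str]] = field(default_factory=list)  # (action, owner)
    revenue_impact: Optional[str] = None


def render(pm: PostMortem) -> str:
    """Render the post-mortem in the same layout as the template."""
    lines = [f"# Post-Mortem: {pm.title}", "", "## Summary", pm.summary, "", "## Timeline"]
    lines += [f"- {time} - {event}" for time, event in pm.timeline]
    lines += ["", "## Impact", f"- Duration: {pm.duration}", f"- Users affected: {pm.users_affected}"]
    if pm.revenue_impact:  # the template marks this line "if applicable"
        lines.append(f"- Revenue impact: {pm.revenue_impact}")
    lines += ["", "## Root Cause", pm.root_cause]
    sections = [
        ("Contributing Factors", pm.contributing_factors),
        ("What Went Well", pm.went_well),
        ("What Could Be Improved", pm.improvements),
    ]
    for heading, items in sections:
        lines += ["", f"## {heading}"] + [f"- {item}" for item in items]
    lines += ["", "## Action Items"]
    lines += [f"- [ ] {action} - Owner: {owner}" for action, owner in pm.action_items]
    return "\n".join(lines) + "\n"
```

Pairing each action item with an owner in the data structure makes it harder to record an action without one.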

Blameless Culture

  • Focus on systems, not people
  • “What failed?” not “Who failed?”
  • Share learnings openly
  • Celebrate near-misses