incident-management
8
总安装量
7
周安装量
#34020
全站排名
Install command
npx skills add https://github.com/nguyenhuuca/assessment --skill incident-management
Skill Documentation
Incident Management
Incident Severity
| Level | Impact | Response Time |
|---|---|---|
| SEV1 | Complete outage | Immediate |
| SEV2 | Major degradation | < 15 min |
| SEV3 | Minor degradation | < 1 hour |
| SEV4 | Low impact | Next business day |
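The severity table can be encoded as a lookup, which makes response targets checkable in tooling. The following is a minimal sketch, not part of the skill itself: the names `Severity`, `RESPONSE_WINDOW`, and `response_deadline` are hypothetical, "Immediate" is modelled as a zero-length window, and "Next business day" is left as `None` because the table gives no fixed duration for it.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "Complete outage"
    SEV2 = "Major degradation"
    SEV3 = "Minor degradation"
    SEV4 = "Low impact"


# Response windows taken from the table above.
# Assumptions: SEV1 "Immediate" is a zero-length window, and SEV4
# "Next business day" has no fixed duration, so it maps to None.
RESPONSE_WINDOW = {
    Severity.SEV1: timedelta(0),
    Severity.SEV2: timedelta(minutes=15),
    Severity.SEV3: timedelta(hours=1),
    Severity.SEV4: None,
}


def response_deadline(severity: Severity, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None if not time-boxed."""
    window = RESPONSE_WINDOW[severity]
    return None if window is None else detected_at + window
```

A caller that needs a concrete SEV4 deadline would have to supply its own business calendar; that is deliberately left out here.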
Incident Response
1. Detect
- Monitoring alerts
- Customer reports
- Error logs
2. Triage
- Assess severity
- Assign incident commander
- Create communication channel
3. Investigate
- Check recent changes
- Review logs and metrics
- Identify root cause
4. Mitigate
- Apply quick fix
- Roll back if needed
- Communicate status
5. Resolve
- Confirm fix
- Monitor for recurrence
- Close incident
6. Learn
- Post-mortem meeting
- Document findings
- Create action items
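The six phases above can be tracked as a simple state machine. This is a sketch under stated assumptions: the `Phase` and `Incident` names are hypothetical, and the phases are treated as strictly linear, whereas a real incident may loop back (for example from Mitigate to Investigate).

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    DETECT = 1
    TRIAGE = 2
    INVESTIGATE = 3
    MITIGATE = 4
    RESOLVE = 5
    LEARN = 6


@dataclass
class Incident:
    title: str
    phase: Phase = Phase.DETECT
    history: list = field(default_factory=list)  # (phase, note) pairs

    def advance(self, note: str) -> Phase:
        """Record a note for the current phase, then move to the next one."""
        if self.phase is Phase.LEARN:
            raise ValueError("incident is already in the final phase")
        self.history.append((self.phase, note))
        # Assumption: phases are numbered consecutively and visited in order.
        self.phase = Phase(self.phase.value + 1)
        return self.phase


incident = Incident("Checkout errors")
incident.advance("Monitoring alert fired")
incident.advance("Assessed as SEV2, commander assigned")
```

Keeping the note history per phase gives the post-mortem timeline a starting point.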
Post-Mortem Template
# Post-Mortem: [Incident Title]
## Summary
[Brief description of what happened]
## Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Resolution]
## Impact
- Duration: [X hours]
- Users affected: [X]
- Revenue impact: [if applicable]
## Root Cause
[What caused this incident]
## Contributing Factors
- [Factor 1]
- [Factor 2]
## What Went Well
- [Positive 1]
- [Positive 2]
## What Could Be Improved
- [Improvement 1]
- [Improvement 2]
## Action Items
- [ ] [Action 1] - Owner: [Name]
- [ ] [Action 2] - Owner: [Name]
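If post-mortems are generated from structured incident data, the template can be filled programmatically. The sketch below is illustrative only: `PostMortem` and `render` are hypothetical names, and it renders just the Summary, Timeline, and Action Items sections to stay short; the remaining sections would follow the same pattern.

```python
from dataclasses import dataclass, field


@dataclass
class PostMortem:
    title: str
    summary: str
    timeline: list = field(default_factory=list)      # (HH:MM, event) pairs
    action_items: list = field(default_factory=list)  # (action, owner) pairs


def render(pm: PostMortem) -> str:
    """Render a subset of the post-mortem template as markdown text."""
    lines = [f"# Post-Mortem: {pm.title}", "## Summary", pm.summary, "## Timeline"]
    lines += [f"- {time} - {event}" for time, event in pm.timeline]
    lines.append("## Action Items")
    lines += [f"- [ ] {action} - Owner: {owner}" for action, owner in pm.action_items]
    return "\n".join(lines)
```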
Blameless Culture
- Focus on systems, not people
- “What failed?” not “Who failed?”
- Share learnings openly
- Celebrate near-misses