root-cause-analysis
Install command
npx skills add https://github.com/latestaiagents/agent-skills --skill root-cause-analysis
Root Cause Analysis (RCA)
Find the real cause, not just the symptoms, to prevent recurrence.
RCA Principles
- Look for systems failures, not human errors
- Ask “why” until you find actionable causes
- Multiple contributing factors are common
- Prevention > blame
Method 1: 5 Whys
Keep asking “why” until you reach an actionable root cause.
Example: API Outage
Problem: API returned 500 errors for 45 minutes
Why #1: Why did the API return 500 errors?
→ The database connection pool was exhausted
Why #2: Why was the connection pool exhausted?
→ Connections weren't being released after queries
Why #3: Why weren't connections being released?
→ A code change introduced a bug that skipped connection.close()
Why #4: Why wasn't this caught before production?
→ Our integration tests don't check for connection leaks
Why #5: Why don't integration tests check for connection leaks?
→ We haven't implemented connection pool monitoring in tests
ROOT CAUSE: Missing connection leak detection in test suite
ACTION: Add connection pool assertions to integration tests
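The action above ("Add connection pool assertions to integration tests") could look roughly like the following. This is a minimal sketch, not the incident's actual fix: `ToyPool`, `assert_no_leak`, `leaky_query`, and `safe_query` are hypothetical names, and the toy pool stands in for whatever pool your database driver provides.

```python
import contextlib


class ToyPool:
    """Hypothetical stand-in for a real pool; it only counts checked-out connections."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    def acquire(self) -> object:
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()

    def release(self, conn: object) -> None:
        self.in_use -= 1


@contextlib.contextmanager
def assert_no_leak(pool: ToyPool):
    """Fail if the wrapped block leaves extra connections checked out."""
    before = pool.in_use
    yield
    leaked = pool.in_use - before
    assert leaked == 0, f"{leaked} connection(s) not released"


def leaky_query(pool: ToyPool) -> None:
    pool.acquire()  # bug under test: the connection is never released


def safe_query(pool: ToyPool) -> None:
    conn = pool.acquire()
    try:
        pass  # the real query would run here
    finally:
        pool.release(conn)  # always return the connection to the pool
```

Wrapping each integration test body in `with assert_no_leak(pool):` turns a silent leak such as `leaky_query` into a test failure instead of a production outage.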
5 Whys Guidelines
| Do | Don’t |
|---|---|
| Use data, not assumptions | Stop at “human error” |
| Consider multiple branches | Accept vague answers |
| Verify each “because” | Skip to conclusions |
| Look for systemic issues | Blame individuals |
Method 2: Contributing Factors Analysis
Most incidents have multiple contributing factors.
┌─────────────────────────────────────────────────────────────┐
│                    INCIDENT: API OUTAGE                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Direct Cause:                                              │
│  └─ Database connection pool exhaustion                     │
│                                                             │
│  Contributing Factors:                                      │
│  ├─ [Code] Connection leak bug in PR #1234                  │
│  ├─ [Process] Code review didn't catch the bug              │
│  ├─ [Testing] No connection leak tests                      │
│  ├─ [Monitoring] No alert for connection pool usage         │
│  ├─ [Deploy] Deployed during high-traffic period            │
│  └─ [Recovery] Runbook for this scenario was outdated       │
│                                                             │
│  Environmental Factors:                                     │
│  ├─ Team was understaffed (vacation season)                 │
│  └─ Similar incident 6 months ago, action items incomplete  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
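One way to keep a contributing factors analysis honest is to record each factor with its category and check which categories nobody has examined yet. The sketch below uses the factors from the API outage example; the `Factor` class, the function names, and the idea of a category checklist are illustrative assumptions, not part of the method itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    category: str      # e.g. "Code", "Process", "Testing"
    description: str


FACTORS = [
    Factor("Code", "Connection leak bug in PR #1234"),
    Factor("Process", "Code review didn't catch the bug"),
    Factor("Testing", "No connection leak tests"),
    Factor("Monitoring", "No alert for connection pool usage"),
    Factor("Deploy", "Deployed during high-traffic period"),
    Factor("Recovery", "Runbook for this scenario was outdated"),
]


def group_by_category(factors):
    """Return {category: [descriptions]} in first-seen order."""
    grouped = defaultdict(list)
    for factor in factors:
        grouped[factor.category].append(factor.description)
    return dict(grouped)


def uncovered_categories(factors, checklist):
    """List checklist categories for which no factor has been recorded."""
    present = {factor.category for factor in factors}
    return [category for category in checklist if category not in present]
```

An empty result from `uncovered_categories` does not prove the analysis is complete; it only shows that each category on the checklist was at least considered.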
Method 3: Fault Tree Analysis
Work backwards from failure to identify all paths.
[API Outage]
├─ [DB Connections Exhausted]
│  ├─ [Connection Leak]
│  │  └─ [Bug in Code]
│  └─ [Too Many Requests]
│     ├─ [Traffic Spike]
│     │  └─ [Marketing Campaign]
│     └─ [Missing Rate Limit]
└─ [App Server Crashed]
   └─ [OOM Error]
      └─ [Memory Leak]
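The fault tree above can also be held as data so that every path from the failure down to a leaf cause is listed mechanically. This is a simplified sketch: it treats each branch as an independent alternative and does not model the AND/OR gates that formal fault tree analysis uses. `FAULT_TREE` and `failure_paths` are hypothetical names.

```python
# Each key is an event; its value lists the events that can cause it.
FAULT_TREE = {
    "API Outage": ["DB Connections Exhausted", "App Server Crashed"],
    "DB Connections Exhausted": ["Connection Leak", "Too Many Requests"],
    "Connection Leak": ["Bug in Code"],
    "Too Many Requests": ["Traffic Spike", "Missing Rate Limit"],
    "Traffic Spike": ["Marketing Campaign"],
    "App Server Crashed": ["OOM Error"],
    "OOM Error": ["Memory Leak"],
}


def failure_paths(tree, node):
    """Return every path from `node` down to a leaf cause (depth-first)."""
    children = tree.get(node, [])
    if not children:
        return [[node]]  # a leaf: nothing further explains this event
    paths = []
    for child in children:
        for tail in failure_paths(tree, child):
            paths.append([node] + tail)
    return paths


for path in failure_paths(FAULT_TREE, "API Outage"):
    print(" <- ".join(path))
```

Each printed line is one candidate explanation to confirm or rule out with evidence from logs and metrics.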
Method 4: Timeline Reconstruction
A detailed timeline helps identify the chain of events.
Timeline: API Outage - 2026-01-15
Time (UTC) | Event                        | Source
-----------|------------------------------|----------
09:00      | Deploy v2.3.4 started        | GitHub
09:15      | Deploy completed             | K8s
09:45      | Marketing email sent (50k)   | Marketing
10:02      | Traffic spike begins         | Datadog
10:15      | Connection pool at 80%       | Metrics
10:23      | First 500 errors             | Logs
10:25      | Alert fired                  | PagerDuty
10:27      | On-call acknowledged         | PagerDuty
10:35      | Root cause identified        | Slack
10:42      | Rollback initiated           | K8s
10:48      | Service recovering           | Datadog
11:00      | All clear declared           | Slack
Key Finding: 70 minutes between deploy completion (09:15) and the alert firing (10:25)
Deploy + traffic spike = perfect storm
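Once the timeline is written down, response intervals can be computed from it rather than estimated. The sketch below takes a few timestamps from the timeline above; the event keys and metric names are assumptions chosen for illustration, and the code assumes all events fall on the same UTC day.

```python
from datetime import datetime

# Selected events from the timeline (times in UTC, same day).
EVENTS = {
    "first_errors": "10:23",
    "alert_fired": "10:25",
    "acknowledged": "10:27",
    "rollback_initiated": "10:42",
    "all_clear": "11:00",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from `start` to `end`, both given as HH:MM."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def response_metrics(events):
    """Derive common incident-response intervals from the event times."""
    return {
        "time_to_detect": minutes_between(events["first_errors"], events["alert_fired"]),
        "time_to_acknowledge": minutes_between(events["alert_fired"], events["acknowledged"]),
        "time_to_mitigate": minutes_between(events["first_errors"], events["rollback_initiated"]),
        "time_to_resolve": minutes_between(events["first_errors"], events["all_clear"]),
    }
```

For this timeline, `response_metrics(EVENTS)` gives 2 minutes to detect, 2 to acknowledge, 19 to start mitigation, and 37 to resolve, measured from the first 500 errors.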
Common Root Cause Categories
Technical
- Code bugs
- Configuration errors
- Infrastructure failures
- Dependency failures
- Capacity limits
Process
- Inadequate testing
- Missed code review
- Incomplete runbooks
- Poor change management
- Insufficient monitoring
Organizational
- Understaffing
- Knowledge silos
- Communication gaps
- Incomplete training
- Technical debt
Action Item Quality
Good action items are SMART:
| Criteria | Bad Example | Good Example |
|---|---|---|
| Specific | “Improve testing” | “Add connection pool leak test to CI” |
| Measurable | “Monitor better” | “Alert when pool > 80% for 5 min” |
| Assignable | “Team should fix” | “@jane owns implementation” |
| Realistic | “Rewrite entire system” | “Add circuit breaker to DB calls” |
| Time-bound | “Soon” | “Complete by 2026-02-01” |
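A measurable action item such as "Alert when pool > 80% for 5 min" is specific enough to turn into a check. The following is a minimal sketch of that rule, assuming utilization samples arrive as `(minute, utilization)` pairs in time order; `should_alert` and its parameters are hypothetical, and a real monitoring system would express this in its own rule syntax.

```python
def should_alert(samples, threshold=0.80, sustain_minutes=5):
    """Return True if utilization stays above `threshold` for `sustain_minutes`.

    `samples` is an iterable of (minute, utilization) pairs sorted by minute,
    with utilization expressed as a fraction between 0 and 1.
    """
    breach_start = None
    for minute, utilization in samples:
        if utilization > threshold:
            if breach_start is None:
                breach_start = minute  # start of the current breach streak
            if minute - breach_start >= sustain_minutes:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the streak
    return False
```

Requiring the breach to be sustained avoids paging on a brief spike, at the cost of delaying the alert by the length of the window.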
RCA Template
## Root Cause Analysis
### Direct Cause
[What directly caused the incident]
### 5 Whys Analysis
1. Why? → [Answer]
2. Why? → [Answer]
3. Why? → [Answer]
4. Why? → [Answer]
5. Why? → [Root cause]
### Contributing Factors
- **Technical:** [List]
- **Process:** [List]
- **Organizational:** [List]
### Why Wasn't This Caught?
- In development: [Why]
- In code review: [Why]
- In testing: [Why]
- In staging: [Why]
- By monitoring: [Why]
### Action Items
| Priority | Action | Owner | Due | Prevents |
|----------|--------|-------|-----|----------|
| P0 | [Action] | @name | [Date] | Direct cause |
| P1 | [Action] | @name | [Date] | Detection |
| P2 | [Action] | @name | [Date] | Future risk |
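If action items are tracked as structured data, the table in the template can be generated and basic SMART checks applied before the RCA is published. This is a sketch under assumptions: the `ActionItem` class, `problems`, and `render_action_items` are hypothetical, and the sample values are borrowed from the examples earlier in this document.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    priority: str   # "P0", "P1", or "P2"
    action: str
    owner: str      # a named individual, written as "@handle"
    due: date
    prevents: str


def problems(item: ActionItem) -> list:
    """Return a list of SMART violations for one action item."""
    issues = []
    if item.priority not in {"P0", "P1", "P2"}:
        issues.append("unknown priority")
    if not item.action.strip():
        issues.append("action is empty")
    if not item.owner.startswith("@"):
        issues.append("owner must be a named individual (@handle)")
    return issues


def render_action_items(items) -> str:
    """Render the items as the template's table, highest priority first."""
    lines = [
        "| Priority | Action | Owner | Due | Prevents |",
        "|----------|--------|-------|-----|----------|",
    ]
    for item in sorted(items, key=lambda i: i.priority):
        lines.append(
            f"| {item.priority} | {item.action} | {item.owner} "
            f"| {item.due.isoformat()} | {item.prevents} |"
        )
    return "\n".join(lines)
```

Typing `due` as a `date` makes a vague deadline such as "Soon" impossible to record, and `problems` rejects a team name in place of an owner.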
Anti-Patterns to Avoid
- “Human error” – The human made an error, but the system allowed it
- “Lack of attention” – Why did the system require such attention?
- “Should have known” – How could they have known?
- “Didn’t follow procedure” – Why was the procedure not followed?
- Single root cause – Usually there are multiple contributing factors
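A draft RCA can be scanned for the phrases above before review. The sketch below is a simple substring check, so it will miss paraphrases and is only a prompt for the reviewer; `ANTI_PATTERNS` and `flag_anti_patterns` are hypothetical names, and the follow-up questions are taken from the list above.

```python
# Blame-oriented phrases mapped to the question that should replace them.
ANTI_PATTERNS = {
    "human error": "What in the system allowed the error?",
    "lack of attention": "Why did the system require such attention?",
    "should have known": "How could they have known?",
    "didn't follow procedure": "Why was the procedure not followed?",
}


def flag_anti_patterns(text: str) -> list:
    """Return (phrase, follow-up question) pairs for each phrase found in `text`."""
    # Normalize case and curly apostrophes so "Didn’t" matches "didn't".
    normalized = text.lower().replace("\u2019", "'")
    return [
        (phrase, question)
        for phrase, question in ANTI_PATTERNS.items()
        if phrase in normalized
    ]
```

For example, a draft that reads "Root cause: human error" would be returned with the question "What in the system allowed the error?" attached.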