juicebox-incident-runbook
Install command
npx skills add https://github.com/jeremylongshore/claude-code-plugins-plus-skills --skill juicebox-incident-runbook
# Juicebox Incident Runbook
## Overview
Standardized incident response procedures for Juicebox integration issues.
## Incident Severity Levels
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P1 | Critical | < 15 min | Complete outage, data loss |
| P2 | High | < 1 hour | Major feature broken, degraded performance |
| P3 | Medium | < 4 hours | Minor feature issue, workaround exists |
| P4 | Low | < 24 hours | Cosmetic, non-blocking |
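The response windows in the table can be encoded so that tooling can compute a first-response deadline. The sketch below is illustrative only: the function names `response_window_minutes` and `response_deadline_epoch` are hypothetical, and each "< N" window is treated as an upper bound in minutes.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; only the severity labels and windows come from the table above.

# Print the response window in minutes for a severity label.
response_window_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    P4) echo 1440 ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Print the epoch second by which a first response is due.
# Usage: response_deadline_epoch SEVERITY DETECTED_EPOCH
response_deadline_epoch() {
  local severity=$1 detected_epoch=$2 minutes
  minutes=$(response_window_minutes "$severity") || return 1
  echo $(( detected_epoch + minutes * 60 ))
}
```

For example, `response_deadline_epoch P1 "$(date +%s)"` prints the epoch second 15 minutes after detection.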
## Quick Diagnostics
### Step 1: Immediate Assessment
```bash
#!/bin/bash
# quick-diag.sh - Run immediately when an incident is detected
echo "=== Juicebox Quick Diagnostics ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Check Juicebox status page
echo ""
echo "=== Juicebox Status ==="
curl -s https://status.juicebox.ai/api/status | jq '.status'
# Check our API health
echo ""
echo "=== Our API Health ==="
curl -s http://localhost:8080/health/ready | jq '.'
# Check recent error logs
echo ""
echo "=== Recent Errors (last 5 min) ==="
kubectl logs -l app=juicebox-integration --since=5m | grep -i error | tail -20
# Check metrics
echo ""
echo "=== Error Rate ==="
# -G with --data-urlencode sends the query URL-encoded, so the braces and
# brackets are not expanded by the shell or by curl's URL globbing
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(juicebox_requests_total{status="error"}[5m])' \
  | jq '.data.result[0].value[1]'
```
### Step 2: Identify Root Cause
#### Incident Triage Decision Tree
1. Is the Juicebox status page showing issues?
   - YES → External outage, skip to "External Outage Response"
   - NO → Continue
2. Are we getting authentication errors (401)?
   - YES → Check API key validity, skip to "Auth Issues Response"
   - NO → Continue
3. Are we getting rate limited (429)?
   - YES → Skip to "Rate Limit Response"
   - NO → Continue
4. Are requests timing out?
   - YES → Skip to "Timeout Response"
   - NO → Continue
5. Are we getting unexpected errors?
   - YES → Skip to "Application Error Response"
   - NO → Gather more data
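The decision tree can be sketched as a small shell function that maps two observations to the name of the procedure to follow. This is an illustration, not part of the original runbook: the function name `triage` and its arguments are hypothetical, and treating `000` (what curl's `%{http_code}` reports when no HTTP response arrived) as a timeout is an assumption, since it can also indicate DNS or connection failures.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the triage decision tree.
# Usage: triage STATUS_PAGE_OK HTTP_CODE
#   STATUS_PAGE_OK  "yes" if the status page shows no issues, anything else otherwise
#   HTTP_CODE       HTTP status of a failing request, e.g. from: curl -s -o /dev/null -w '%{http_code}' URL
triage() {
  local status_page_ok=$1 http_code=$2
  # Step 1: an issue on the status page takes precedence over everything else.
  if [ "$status_page_ok" != "yes" ]; then
    echo "External Outage Response"
    return
  fi
  # Steps 2-5: checked in the same order as the tree.
  case "$http_code" in
    401)    echo "Auth Issues Response" ;;
    429)    echo "Rate Limit Response" ;;
    000)    echo "Timeout Response" ;;            # assumption: no response means timeout
    [45]??) echo "Application Error Response" ;;
    *)      echo "Gather more data" ;;
  esac
}
```

Calling `triage yes 429` prints `Rate Limit Response`; the order of the `case` arms matters because `401` and `429` would otherwise match the generic `[45]??` pattern.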
## Response Procedures
### External Outage Response
#### When Juicebox is Down
1. **Confirm Outage**
   - Check https://status.juicebox.ai
   - Verify with curl test to API
2. **Enable Fallback Mode**
   ```bash
   kubectl set env deployment/juicebox-integration JUICEBOX_FALLBACK=true
   ```
3. **Notify Stakeholders**
   - Post to #incidents channel
   - Update status page if customer-facing
4. **Monitor Recovery**
   - Set up alert for Juicebox status change
   - Prepare to disable fallback mode
5. **Post-Incident**
   - Disable fallback when Juicebox recovers
   - Document timeline and impact
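One way to watch for recovery is a simple polling loop around the status endpoint used in the diagnostics script. The sketch below rests on assumptions that the runbook does not confirm: that the endpoint returns JSON with a top-level `.status` string, and that `operational` is the healthy value. The names `fetch_status`, `wait_for_recovery`, `STATUS_URL`, and `HEALTHY_VALUE` are hypothetical.

```bash
#!/usr/bin/env bash
# Assumed response shape and healthy value; adjust both to what the endpoint really returns.
STATUS_URL=${STATUS_URL:-https://status.juicebox.ai/api/status}
HEALTHY_VALUE=${HEALTHY_VALUE:-operational}

# Print the current status string (requires curl and jq).
fetch_status() {
  curl -s --max-time 10 "$STATUS_URL" | jq -r '.status'
}

# Poll until the status equals HEALTHY_VALUE.
# Usage: wait_for_recovery MAX_ATTEMPTS SLEEP_SECONDS
# Returns 0 on recovery, 1 if all attempts are used up.
wait_for_recovery() {
  local max_attempts=$1 sleep_seconds=$2 attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if [ "$(fetch_status)" = "$HEALTHY_VALUE" ]; then
      return 0
    fi
    sleep "$sleep_seconds"
  done
  return 1
}
```

A call such as `wait_for_recovery 60 30 && echo "Juicebox recovered, disable fallback mode"` polls for up to 30 minutes. Disabling fallback is left as a manual step here because the runbook only shows the command that enables it.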
### Auth Issues Response
#### When Authentication Fails
1. **Verify API Key**
   ```bash
   # Mask key for logging
   echo "Key prefix: ${JUICEBOX_API_KEY:0:10}..."
   # Test auth
   curl -H "Authorization: Bearer $JUICEBOX_API_KEY" \
     https://api.juicebox.ai/v1/auth/me
   ```
2. **Check Key Status in Dashboard**
   - Log into https://app.juicebox.ai
   - Verify key is active and not revoked
3. **Rotate Key if Compromised**
   - Generate new key in dashboard
   - Update secret manager
   - Restart pods
   ```bash
   kubectl rollout restart deployment/juicebox-integration
   ```
4. **Verify Recovery**
   - Check health endpoint
   - Monitor error rate
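When key fragments end up in incident logs or chat, it can help to route them through a single masking helper. The function below is a hypothetical sketch (the name `mask_key` and the default of four visible characters are assumptions, not something the runbook prescribes); it shows a short prefix plus the key length, which is usually enough to tell two keys apart.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a short prefix and the length of a secret.
# Usage: mask_key KEY [VISIBLE_CHARS]
mask_key() {
  local key=$1 visible=${2:-4}
  if [ "${#key}" -le "$visible" ]; then
    # Printing any prefix would reveal the whole key.
    echo "(key too short to mask safely)"
  else
    printf '%s...(%d chars)\n' "${key:0:visible}" "${#key}"
  fi
}
```

For example, `mask_key "$JUICEBOX_API_KEY"` prints the first four characters followed by the total length.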
### Rate Limit Response
#### When Rate Limited
1. **Check Current Usage**
   ```bash
   curl -H "Authorization: Bearer $JUICEBOX_API_KEY" \
     https://api.juicebox.ai/v1/usage
   ```
2. **Immediate Mitigation**
   - Enable aggressive caching
   - Reduce request rate
   ```bash
   kubectl set env deployment/juicebox-integration JUICEBOX_RATE_LIMIT=10
   ```
3. **If Quota Exhausted**
   - Contact Juicebox support for temporary increase
   - Implement request queuing
4. **Long-term Fix**
   - Review usage patterns
   - Implement better caching
   - Consider plan upgrade
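Reducing the request rate on the client side is commonly done with capped exponential backoff between retries. The sketch below is illustrative: the function name `backoff_delay`, the 1-second base, and the 60-second cap are assumptions, and whether the Juicebox API sends a `Retry-After` header is not confirmed by this runbook. If a numeric value is supplied, it takes precedence over the computed delay.

```bash
#!/usr/bin/env bash
# Hypothetical helper: seconds to wait before retry number ATTEMPT (1-based).
# Usage: backoff_delay ATTEMPT [RETRY_AFTER_SECONDS]
backoff_delay() {
  local attempt=$1 retry_after=${2:-} base=1 cap=60 delay i
  # Prefer a server-provided delay when it is a plain number of seconds.
  if [[ "$retry_after" =~ ^[0-9]+$ ]]; then
    echo "$retry_after"
    return
  fi
  # Double the delay once per previous attempt, stopping once the cap is reached.
  delay=$base
  for (( i = 1; i < attempt && delay < cap; i++ )); do
    delay=$(( delay * 2 ))
  done
  if (( delay > cap )); then
    delay=$cap
  fi
  echo "$delay"
}
```

A retry loop would call `sleep "$(backoff_delay "$attempt")"` after each 429 response. Adding random jitter to the delay is a common refinement that is omitted here to keep the output deterministic.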
### Timeout Response
#### When Requests Time Out
1. **Check Network**
   ```bash
   # DNS resolution
   nslookup api.juicebox.ai
   # Connectivity
   curl -v --connect-timeout 5 https://api.juicebox.ai/v1/health
   ```
2. **Check Load**
   - Review query complexity
   - Check for unusually large requests
3. **Increase Timeout**
   ```bash
   kubectl set env deployment/juicebox-integration JUICEBOX_TIMEOUT=60000
   ```
4. **Implement Circuit Breaker**
   - Enable the circuit breaker if timeouts repeat
   - Serve cached/fallback data
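The circuit breaker idea can be sketched in a few lines of shell: after a number of consecutive timeouts the breaker opens and callers serve cached or fallback data, and after a cooldown one trial request is allowed through. Everything named here (`cb_allow`, `cb_record_failure`, `cb_record_success`, `CB_THRESHOLD`, `CB_COOLDOWN`) is hypothetical, and the thresholds are placeholder values. The runbook does not say how the integration implements its breaker.

```bash
#!/usr/bin/env bash
# Minimal in-process circuit breaker; state lives in shell variables,
# so these functions must run in the current shell, not in a subshell.
CB_THRESHOLD=${CB_THRESHOLD:-3}   # consecutive failures before the breaker opens
CB_COOLDOWN=${CB_COOLDOWN:-30}    # seconds the breaker stays open
cb_failures=0
cb_opened_at=0                    # epoch seconds; 0 means closed

cb_now() { date +%s; }

# Return 0 if a request may be sent, 1 if fallback data should be served.
# An optional epoch argument overrides the clock (useful for testing).
cb_allow() {
  local now=${1:-$(cb_now)}
  if (( cb_opened_at == 0 )); then
    return 0                      # closed: requests flow normally
  fi
  if (( now - cb_opened_at >= CB_COOLDOWN )); then
    return 0                      # half-open: let a trial request through
  fi
  return 1                        # open: skip the call
}

# Reset the breaker after a successful request.
cb_record_success() {
  cb_failures=0
  cb_opened_at=0
}

# Count a failure and open (or re-open) the breaker at the threshold.
cb_record_failure() {
  local now=${1:-$(cb_now)}
  cb_failures=$(( cb_failures + 1 ))
  if (( cb_failures >= CB_THRESHOLD )); then
    cb_opened_at=$now
  fi
}
```

A caller would wrap each API request as `if cb_allow; then ...request...; else ...serve cache...; fi` and record the outcome afterwards.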
## Incident Communication Template
```markdown
## Incident Report Template
**Incident ID:** INC-YYYY-MM-DD-XXX
**Status:** Investigating | Identified | Monitoring | Resolved
**Severity:** P1 | P2 | P3 | P4
**Start Time:** YYYY-MM-DD HH:MM UTC
**End Time:** (when resolved)
### Summary
[Brief description of the incident]
### Impact
- Users affected: [number/percentage]
- Features affected: [list]
- Duration: [time]
### Timeline
- HH:MM - Incident detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Incident resolved
### Root Cause
[Description of what caused the incident]
### Resolution
[What was done to fix it]
### Action Items
- [ ] Action 1 (Owner, Due Date)
- [ ] Action 2 (Owner, Due Date)
```
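The header fields of the report template can be pre-filled from a script so that IDs stay consistent. This is a sketch under assumptions: `XXX` in `INC-YYYY-MM-DD-XXX` is read as a three-digit sequence number, which the template does not actually specify, and the function names `incident_id` and `incident_header` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the first lines of the incident report.

# Usage: incident_id SEQ [YYYY-MM-DD]   (date defaults to today, UTC)
incident_id() {
  local seq=$1 day=${2:-$(date -u +%Y-%m-%d)}
  # 10# forces base 10 so inputs like 08 are not parsed as octal.
  printf 'INC-%s-%03d\n' "$day" "$(( 10#$seq ))"
}

# Usage: incident_header SEQ SEVERITY ["YYYY-MM-DD HH:MM UTC"]
incident_header() {
  local seq=$1 severity=$2 start=${3:-$(date -u '+%Y-%m-%d %H:%M UTC')}
  cat <<EOF
**Incident ID:** $(incident_id "$seq" "${start%% *}")
**Status:** Investigating
**Severity:** $severity
**Start Time:** $start
EOF
}
```

Running `incident_header 7 P2 '2024-01-15 09:30 UTC'` prints the first four fields of the template with the ID `INC-2024-01-15-007`.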
## On-Call Handoff Checklist
### Before Shift
- [ ] Access to all monitoring dashboards
- [ ] VPN/access to production systems
- [ ] Runbook bookmarked
- [ ] Escalation contacts available
### During Shift
- [ ] Check dashboards every 30 min
- [ ] Respond to alerts within SLA
- [ ] Document all incidents
- [ ] Escalate P1/P2 immediately
### End of Shift
- [ ] Handoff open incidents
- [ ] Update incident log
- [ ] Brief incoming on-call
## Output
- Diagnostic scripts
- Response procedures
- Communication templates
- On-call checklists
## Next Steps
After an incident, see juicebox-data-handling for data management.