juicebox-incident-runbook

📁 jeremylongshore/claude-code-plugins-plus-skills 📅 11 days ago

总安装量

周安装量

#32512

全站排名

安装命令

npx skills add https://github.com/jeremylongshore/claude-code-plugins-plus-skills --skill juicebox-incident-runbook

Agent 安装分布

claude-code 8

mcpjam 7

mistral-vibe 7

junie 7

windsurf 7

zencoder 7

Skill 文档

Juicebox Incident Runbook

Overview

Standardized incident response procedures for Juicebox integration issues.

Incident Severity Levels

Severity	Description	Response Time	Examples
P1	Critical	< 15 min	Complete outage, data loss
P2	High	< 1 hour	Major feature broken, degraded performance
P3	Medium	< 4 hours	Minor feature issue, workaround exists
P4	Low	< 24 hours	Cosmetic, non-blocking

Quick Diagnostics

Step 1: Immediate Assessment

#!/bin/bash
# quick-diag.sh - Run immediately when incident detected

echo "=== Juicebox Quick Diagnostics ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Check Juicebox status page
echo ""
echo "=== Juicebox Status ==="
curl -s https://status.juicebox.ai/api/status | jq '.status'

# Check our API health
echo ""
echo "=== Our API Health ==="
curl -s http://localhost:8080/health/ready | jq '.'

# Check recent error logs
echo ""
echo "=== Recent Errors (last 5 min) ==="
kubectl logs -l app=juicebox-integration --since=5m | grep -i error | tail -20

# Check metrics
echo ""
echo "=== Error Rate ==="
curl -s http://localhost:9090/api/v1/query?query=rate\(juicebox_requests_total\{status=\"error\"\}\[5m\]\) | jq '.data.result[0].value[1]'

Step 2: Identify Root Cause

## Incident Triage Decision Tree

1. Is Juicebox status page showing issues?
   - YES â External outage, skip to "External Outage Response"
   - NO â Continue

2. Are we getting authentication errors (401)?
   - YES â Check API key validity, skip to "Auth Issues"
   - NO â Continue

3. Are we getting rate limited (429)?
   - YES â Skip to "Rate Limit Response"
   - NO â Continue

4. Are requests timing out?
   - YES â Skip to "Timeout Response"
   - NO â Continue

5. Are we getting unexpected errors?
   - YES â Skip to "Application Error Response"
   - NO â Gather more data

Response Procedures

External Outage Response

## When Juicebox is Down

1. **Confirm Outage**
   - Check https://status.juicebox.ai
   - Verify with curl test to API

2. **Enable Fallback Mode**
   ```bash
   kubectl set env deployment/juicebox-integration JUICEBOX_FALLBACK=true

Notify Stakeholders
- Post to #incidents channel
- Update status page if customer-facing
Monitor Recovery
- Set up alert for Juicebox status change
- Prepare to disable fallback mode
Post-Incident
- Disable fallback when Juicebox recovers
- Document timeline and impact


### Auth Issues Response
```markdown
## When Authentication Fails

1. **Verify API Key**
   ```bash
   # Mask key for logging
   echo "Key prefix: ${JUICEBOX_API_KEY:0:10}..."

   # Test auth
   curl -H "Authorization: Bearer $JUICEBOX_API_KEY" \
     https://api.juicebox.ai/v1/auth/me

Check Key Status in Dashboard
- Log into https://app.juicebox.ai
- Verify key is active and not revoked
Rotate Key if Compromised
- Generate new key in dashboard
- Update secret manager
- Restart pods
```
kubectl rollout restart deployment/juicebox-integration
```
Verify Recovery
- Check health endpoint
- Monitor error rate


### Rate Limit Response
```markdown
## When Rate Limited

1. **Check Current Usage**
   ```bash
   curl -H "Authorization: Bearer $JUICEBOX_API_KEY" \
     https://api.juicebox.ai/v1/usage

Immediate Mitigation

Enable aggressive caching
Reduce request rate

kubectl set env deployment/juicebox-integration JUICEBOX_RATE_LIMIT=10

If Quota Exhausted
- Contact Juicebox support for temporary increase
- Implement request queuing
Long-term Fix
- Review usage patterns
- Implement better caching
- Consider plan upgrade


### Timeout Response
```markdown
## When Requests Timeout

1. **Check Network**
   ```bash
   # DNS resolution
   nslookup api.juicebox.ai

   # Connectivity
   curl -v --connect-timeout 5 https://api.juicebox.ai/v1/health

Check Load
- Review query complexity
- Check for unusually large requests

Increase Timeout

kubectl set env deployment/juicebox-integration JUICEBOX_TIMEOUT=60000

Implement Circuit Breaker
- Enable circuit breaker if repeated timeouts
- Serve cached/fallback data


## Incident Communication Template

```markdown
## Incident Report Template

**Incident ID:** INC-YYYY-MM-DD-XXX
**Status:** Investigating | Identified | Monitoring | Resolved
**Severity:** P1 | P2 | P3 | P4
**Start Time:** YYYY-MM-DD HH:MM UTC
**End Time:** (when resolved)

### Summary
[Brief description of the incident]

### Impact
- Users affected: [number/percentage]
- Features affected: [list]
- Duration: [time]

### Timeline
- HH:MM - Incident detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Incident resolved

### Root Cause
[Description of what caused the incident]

### Resolution
[What was done to fix it]

### Action Items
- [ ] Action 1 (Owner, Due Date)
- [ ] Action 2 (Owner, Due Date)

On-Call Checklist

## On-Call Handoff Checklist

### Before Shift
- [ ] Access to all monitoring dashboards
- [ ] VPN/access to production systems
- [ ] Runbook bookmarked
- [ ] Escalation contacts available

### During Shift
- [ ] Check dashboards every 30 min
- [ ] Respond to alerts within SLA
- [ ] Document all incidents
- [ ] Escalate P1/P2 immediately

### End of Shift
- [ ] Handoff open incidents
- [ ] Update incident log
- [ ] Brief incoming on-call

Output

Diagnostic scripts
Response procedures
Communication templates
On-call checklists

Resources

Next Steps

After incident, see juicebox-data-handling for data management.

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台