Install command
npx skills add https://github.com/jeremylongshore/claude-code-plugins-plus-skills --skill guidewire-incident-runbook
Guidewire Incident Runbook
Overview
Production incident response procedures for Guidewire InsuranceSuite, covering triage, diagnosis, resolution, and post-incident review.
Incident Severity Levels
| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Complete outage | 15 minutes | All users cannot access system |
| SEV-2 | Major degradation | 30 minutes | Critical workflow blocked |
| SEV-3 | Partial degradation | 2 hours | Non-critical feature unavailable |
| SEV-4 | Minor issue | 24 hours | Cosmetic or low-impact issue |
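The response-time targets in the table can be encoded so that alerting or paging tooling can compute a deadline for each incident. The following is a minimal sketch; the names `Severity`, `RESPONSE_MINUTES`, `responseDeadline`, and `isResponseOverdue` are illustrative and not part of any Guidewire API.

```typescript
// Response-time targets taken from the severity table, in minutes.
type Severity = "SEV-1" | "SEV-2" | "SEV-3" | "SEV-4";

const RESPONSE_MINUTES: Record<Severity, number> = {
  "SEV-1": 15,
  "SEV-2": 30,
  "SEV-3": 120, // 2 hours
  "SEV-4": 1440, // 24 hours
};

// Returns the time by which a responder should have engaged.
function responseDeadline(severity: Severity, detectedAt: Date): Date {
  return new Date(detectedAt.getTime() + RESPONSE_MINUTES[severity] * 60000);
}

// True when the response target for this severity has already been missed.
function isResponseOverdue(severity: Severity, detectedAt: Date, now: Date): boolean {
  return now.getTime() > responseDeadline(severity, detectedAt).getTime();
}
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check easy to test and replay during a post-incident review.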
Incident Response Flow
┌────────────────────────────────────────────────────────────────────┐
│                     Incident Response Workflow                     │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐  │
│  │  DETECT   │───▶│  TRIAGE   │───▶│  RESPOND  │───▶│  RESOLVE  │  │
│  │           │    │           │    │           │    │           │  │
│  │ • Alert   │    │ • Severity│    │ • Diagnose│    │ • Fix     │  │
│  │ • Monitor │    │ • Impact  │    │ • Mitigate│    │ • Verify  │  │
│  │ • Report  │    │ • Assign  │    │ • Escalate│    │ • Document│  │
│  └─────┬─────┘    └─────┬─────┘    └─────┬─────┘    └─────┬─────┘  │
│        │                │                │                │        │
│        └────────────────┴───────┬────────┴────────────────┘        │
│                                 │                                  │
│                      ┌──────────▼──────────┐                       │
│                      │    POST-INCIDENT    │                       │
│                      │                     │                       │
│                      │ • Review            │                       │
│                      │ • Root Cause        │                       │
│                      │ • Action Items      │                       │
│                      └─────────────────────┘                       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
Common Incident Scenarios
Scenario 1: API Unavailability
Symptoms:
- HTTP 503 responses
- Connection timeouts
- “Service Unavailable” errors
Diagnosis Steps:
# 1. Check Guidewire Cloud Console health
curl -s https://gcc.guidewire.com/api/v1/status \
-H "Authorization: Bearer ${TOKEN}" | jq
# 2. Check application health endpoints
curl -s "${POLICYCENTER_URL}/common/v1/system-info" \
-H "Authorization: Bearer ${TOKEN}"
# 3. Review recent logs
# In Guidewire Cloud Console: Observability > Logs
# Query: level:ERROR AND timestamp:[now-15m TO now]
# 4. Check for recent deployments
# GCC: Deployments > Recent Activity
Resolution Steps:
- If Guidewire infrastructure issue:
  - Check Guidewire Status Page
  - Open support ticket with Guidewire
  - Notify stakeholders of vendor issue
- If integration/configuration issue:
  # Check integration health
  curl -s "${API_URL}/health" | jq
  # Restart affected services (if self-managed components)
  kubectl rollout restart deployment/integration-service
- If capacity issue:
  - Scale up instances in GCC
  - Review rate limiting settings
  - Implement request throttling
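For the "implement request throttling" step, one common client-side approach is a token bucket in front of outbound calls. The sketch below is an assumption about how an integration layer might do this; the class `TokenBucket` and its parameters are hypothetical and unrelated to any rate limiting Guidewire applies on its side.

```typescript
// Token-bucket limiter: allows bursts up to `capacity` and refills at
// `refillPerSecond` tokens per second. Illustrative only.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request may proceed, false if it should be rejected or queued.
  tryAcquire(nowMs: number = Date.now()): boolean {
    // Add the tokens earned since the last call, capped at the bucket capacity.
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A caller would check `bucket.tryAcquire()` before each outbound request and back off or queue the request when it returns `false`.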
Scenario 2: High Error Rate
Symptoms:
- Elevated 4xx/5xx responses
- Failed transactions
- User-reported errors
Diagnosis Steps:
// Error analysis script
// Assumes fetchLogs() and the ErrorAnalysis type are provided by your own tooling.
async function analyzeErrors(timeRange: string = '1h'): Promise<ErrorAnalysis> {
  const logs = await fetchLogs({
    level: 'ERROR',
    timeRange,
    limit: 1000
  });

  // Group by error type
  const byType = logs.reduce((acc, log) => {
    const key = log.context?.error_type || 'unknown';
    acc[key] = (acc[key] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);

  // Group by endpoint
  const byEndpoint = logs.reduce((acc, log) => {
    const key = log.context?.endpoint || 'unknown';
    acc[key] = (acc[key] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);

  // Find most common error
  const topError = Object.entries(byType)
    .sort((a, b) => b[1] - a[1])[0];

  return {
    totalErrors: logs.length,
    errorsByType: byType,
    errorsByEndpoint: byEndpoint,
    topError: topError ? { type: topError[0], count: topError[1] } : null,
    sampleErrors: logs.slice(0, 10)
  };
}
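The `analyzeErrors` function above returns an `ErrorAnalysis` and reads `context.error_type` and `context.endpoint` from each log entry. One possible shape for those types is sketched here, together with a small helper for ranking the resulting count maps. The `LogEntry` fields and the `rankCounts` helper are assumptions for illustration; adjust them to the schema of your log store.

```typescript
// Hypothetical log entry shape; only `context` is implied by the script above.
interface LogEntry {
  timestamp: string;
  level: "ERROR" | "WARN" | "INFO";
  message: string;
  context?: { error_type?: string; endpoint?: string };
}

// Mirrors the object literal returned by analyzeErrors.
interface ErrorAnalysis {
  totalErrors: number;
  errorsByType: Record<string, number>;
  errorsByEndpoint: Record<string, number>;
  topError: { type: string; count: number } | null;
  sampleErrors: LogEntry[];
}

// Ranks a count map (for example errorsByEndpoint) from most to least frequent.
function rankCounts(counts: Record<string, number>, limit = 5): Array<[string, number]> {
  return Object.entries(counts)
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}
```

Ranking `errorsByEndpoint` in this way gives responders a short list of the endpoints to inspect first.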
Resolution Steps:
- Authentication errors (401):
  # Verify OAuth configuration
  curl -s "${GW_HUB_URL}/oauth/token" \
    -d "grant_type=client_credentials&client_id=${CLIENT_ID}&client_secret=${CLIENT_SECRET}" \
    | jq
  # Check if credentials were rotated
  # Verify client ID/secret in secret manager
- Validation errors (422):
  # Review recent code changes
  git log --oneline --since="2 hours ago"
  # Check if data schema changed
  # Review API response details for specific fields
- Server errors (500):
  # Check application logs for stack traces
  # Review memory/CPU utilization
  # Check database connection pool
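The three resolution branches above are keyed on HTTP status. A small classifier can route sampled errors to the matching branch; this is a sketch with hypothetical names (`ErrorBranch`, `runbookBranch`, `countByBranch`), and treating every 5xx status as a server error is an assumption that goes slightly beyond the 500 named in the runbook.

```typescript
type ErrorBranch = "authentication" | "validation" | "server" | "other";

// Maps an HTTP status code to the resolution branch listed in this scenario.
function runbookBranch(status: number): ErrorBranch {
  if (status === 401) return "authentication";
  if (status === 422) return "validation";
  if (status >= 500 && status <= 599) return "server"; // assumption: all 5xx
  return "other";
}

// Counts how many sampled responses fall into each branch.
function countByBranch(statuses: number[]): Record<ErrorBranch, number> {
  const counts: Record<ErrorBranch, number> = {
    authentication: 0,
    validation: 0,
    server: 0,
    other: 0,
  };
  for (const status of statuses) {
    counts[runbookBranch(status)] += 1;
  }
  return counts;
}
```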
Scenario 3: Performance Degradation
Symptoms:
- Slow page loads
- High API latency
- Timeout errors
Diagnosis Steps:
// Performance diagnostic Gosu script
// Note: DiagnosticReport, findSlowQueries(), and countActiveSessions() are not defined in this excerpt.
package gw.incident.diagnosis

uses gw.api.database.Query
uses gw.api.util.Logger

class PerformanceDiagnostics {
  private static final var LOG = Logger.forCategory("PerformanceDiagnostics")

  static function runDiagnostics() : DiagnosticReport {
    var report = new DiagnosticReport()
    // Check database connection pool
    report.DatabasePoolStatus = checkDatabasePool()
    // Check slow queries
    report.SlowQueries = findSlowQueries()
    // Check memory usage
    report.MemoryStatus = checkMemory()
    // Check active sessions
    report.ActiveSessions = countActiveSessions()
    return report
  }

  private static function checkDatabasePool() : String {
    // Verify this class and its properties against your InsuranceSuite version before relying on them
    var pool = gw.api.database.DatabaseConnectionPool.getInstance()
    var available = pool.AvailableConnections
    var total = pool.TotalConnections
    var waiting = pool.WaitingThreads
    if (waiting > 10) {
      return "CRITICAL: ${waiting} threads waiting for connections"
    } else if (available < total * 0.1) {
      return "WARNING: Only ${available}/${total} connections available"
    }
    return "OK: ${available}/${total} connections available"
  }

  private static function checkMemory() : String {
    var runtime = Runtime.getRuntime()
    // Heap figures in MB
    var used = (runtime.totalMemory() - runtime.freeMemory()) / 1024 / 1024
    var max = runtime.maxMemory() / 1024 / 1024
    var usedPercent = (used * 100.0 / max) as int
    if (usedPercent > 90) {
      return "CRITICAL: ${usedPercent}% memory used (${used}MB/${max}MB)"
    } else if (usedPercent > 80) {
      return "WARNING: ${usedPercent}% memory used (${used}MB/${max}MB)"
    }
    return "OK: ${usedPercent}% memory used (${used}MB/${max}MB)"
  }
}
Resolution Steps:
- Database bottleneck:
  -- Find long-running queries
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - query_start > interval '30 seconds'
  ORDER BY duration DESC;
  -- Terminate a long-running query if necessary, substituting a pid returned by the query above
  SELECT pg_terminate_backend(<pid>);
- Memory pressure:
  # Trigger garbage collection (if necessary)
  # This is typically managed by the JVM
  # Consider scaling up instances
  # In GCC: Infrastructure > Scaling > Adjust instance size
- Integration bottleneck:
  # Check external service health
  for endpoint in $INTEGRATION_ENDPOINTS; do
    curl -s -w "\n%{time_total}s\n" "${endpoint}/health"
  done
  # Enable circuit breaker if external service is slow
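The "enable circuit breaker" step can be pictured with a minimal in-process breaker: after a run of failures it stops calling the slow dependency for a cooldown period, then lets trial calls through. This is a generic sketch of the pattern; the `CircuitBreaker` class, its default threshold of 5 failures, and its 30-second cooldown are assumptions, not settings from Guidewire or any specific library.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;

  constructor(failureThreshold = 5, resetTimeoutMs = 30000) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  // Whether a call to the external service may be attempted right now.
  canRequest(nowMs: number = Date.now()): boolean {
    if (this.state === "open" && nowMs - this.openedAtMs >= this.resetTimeoutMs) {
      // Cooldown elapsed: let trial calls through until one succeeds or fails.
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  // A successful call closes the breaker and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // A failed trial call, or too many consecutive failures, opens the breaker.
  recordFailure(nowMs: number = Date.now()): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAtMs = nowMs;
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

Callers check `canRequest()` before each outbound call and report the outcome with `recordSuccess()` or `recordFailure()`; while the breaker is open they should fail fast or serve a fallback.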
Scenario 4: Data Integrity Issue
Symptoms:
- Incorrect calculations
- Missing data
- Data inconsistency between systems
Diagnosis Steps:
// Data integrity check
package gw.incident.data

uses gw.api.database.Query
uses gw.api.util.Logger

class DataIntegrityChecker {
  private static final var LOG = Logger.forCategory("DataIntegrity")

  static function checkPolicyIntegrity() : List<String> {
    var issues = new ArrayList<String>()

    // Check for policies without accounts
    var orphanPolicies = Query.make(Policy)
      .compare(Policy#Account, Equals, null)
      .select()
      .Count
    if (orphanPolicies > 0) {
      issues.add("Found ${orphanPolicies} policies without accounts")
    }

    // Check for claims without policies
    var orphanClaims = Query.make(Claim)
      .compare(Claim#Policy, Equals, null)
      .select()
      .Count
    if (orphanClaims > 0) {
      issues.add("Found ${orphanClaims} claims without policies")
    }

    // Check for premium calculation mismatches
    // Caution: where() filters in memory after select(), so this loads every
    // PolicyPeriod; restrict the query before running it on a large database
    var premiumMismatches = Query.make(PolicyPeriod)
      .select()
      .where(\pp -> pp.TotalPremiumRPT != pp.calculatePremium())
      .Count
    if (premiumMismatches > 0) {
      issues.add("Found ${premiumMismatches} premium calculation mismatches")
    }

    return issues
  }
}
Resolution Steps:
- Immediate mitigation:
  - Disable affected functionality if critical
  - Notify affected users
  - Create support ticket
- Data correction:
  // Careful: Run in test environment first!
  gw.transaction.Transaction.runWithNewBundle(\bundle -> {
    // Fix specific data issue
    var affectedRecords = Query.make(Entity)
      .compare(Entity#Field, Equals, badValue)
      .select()
    affectedRecords.each(\record -> {
      var r = bundle.add(record)
      r.Field = correctValue
      LOG.info("Corrected record: ${r.PublicID}")
    })
  })
Incident Communication Templates
Initial Notification (SEV-1/SEV-2)
INCIDENT: [Brief Description]
SEVERITY: SEV-[1/2]
STATUS: Investigating
IMPACT:
- [Describe user impact]
- [Number of users/systems affected]
CURRENT STATUS:
- Issue detected at [TIME]
- Team is actively investigating
- Next update in 30 minutes
INCIDENT COMMANDER: [Name]
CONTACT: [Slack channel / Bridge call]
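Filling the template by hand under pressure invites omissions, so it can help to render it from structured fields. A minimal sketch follows; the `InitialNotification` interface and `renderInitialNotification` function are hypothetical names, and the output simply mirrors the template above.

```typescript
interface InitialNotification {
  description: string;
  severity: 1 | 2;
  impact: string[]; // one entry per impact bullet
  detectedAt: string; // already formatted for display, e.g. "14:05 UTC"
  commander: string;
  contact: string;
}

// Renders the SEV-1/SEV-2 initial notification in the template's layout.
function renderInitialNotification(n: InitialNotification): string {
  return [
    `INCIDENT: ${n.description}`,
    `SEVERITY: SEV-${n.severity}`,
    "STATUS: Investigating",
    "IMPACT:",
    ...n.impact.map((line) => `- ${line}`),
    "CURRENT STATUS:",
    `- Issue detected at ${n.detectedAt}`,
    "- Team is actively investigating",
    "- Next update in 30 minutes",
    `INCIDENT COMMANDER: ${n.commander}`,
    `CONTACT: ${n.contact}`,
  ].join("\n");
}
```

Restricting `severity` to `1 | 2` makes the compiler reject attempts to use this template for lower severities, for which the runbook defines no initial notification.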
Status Update
INCIDENT UPDATE: [Brief Description]
STATUS: [Investigating / Mitigating / Resolved]
UPDATES SINCE LAST COMMUNICATION:
- [Update 1]
- [Update 2]
CURRENT ACTIONS:
- [Action being taken]
NEXT STEPS:
- [Planned action]
- Next update: [TIME]
Resolution Notification
INCIDENT RESOLVED: [Brief Description]
DURATION: [Start time] - [End time] ([X] hours [Y] minutes)
ROOT CAUSE:
[Brief description of root cause]
RESOLUTION:
[What was done to fix the issue]
IMPACT SUMMARY:
- [Number of affected users/transactions]
- [Business impact if known]
FOLLOW-UP ACTIONS:
- Post-incident review scheduled for [DATE]
- [Any immediate follow-up items]
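The DURATION line in the resolution notification uses an "[X] hours [Y] minutes" format. A small helper can derive it from the start and end timestamps; `formatDuration` is an illustrative name, and truncating partial minutes is an assumption.

```typescript
// Formats an incident duration as "[X] hours [Y] minutes", dropping partial minutes.
function formatDuration(start: Date, end: Date): string {
  const totalMinutes = Math.floor((end.getTime() - start.getTime()) / 60000);
  if (totalMinutes < 0) {
    throw new RangeError("end must not precede start");
  }
  const hours = Math.floor(totalMinutes / 60);
  const minutes = totalMinutes % 60;
  return `${hours} hours ${minutes} minutes`;
}
```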
Post-Incident Review Template
# Post-Incident Review: [Incident Title]
## Incident Summary
- **Date:** [Date]
- **Duration:** [Start] - [End]
- **Severity:** SEV-[X]
- **Incident Commander:** [Name]
## Timeline
| Time | Event |
|------|-------|
| HH:MM | [Event description] |
| HH:MM | [Event description] |
## Root Cause
[Detailed description of the root cause]
## Impact
- **Users Affected:** [Number]
- **Transactions Affected:** [Number]
- **Revenue Impact:** [If applicable]
## What Went Well
- [Positive aspect 1]
- [Positive aspect 2]
## What Could Be Improved
- [Improvement area 1]
- [Improvement area 2]
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Action description] | [Name] | [Date] |
## Lessons Learned
[Key takeaways for the team]
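The Timeline table in the review template can be generated from timestamped notes captured during the incident. This is a sketch under the assumption that events are recorded as `Date` values and reported in UTC; `TimelineEvent` and `renderTimeline` are hypothetical names.

```typescript
interface TimelineEvent {
  at: Date;
  description: string;
}

// Renders events, sorted chronologically, as the review template's Timeline table.
function renderTimeline(events: TimelineEvent[]): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const rows = [...events]
    .sort((a, b) => a.at.getTime() - b.at.getTime())
    .map((e) => `| ${pad(e.at.getUTCHours())}:${pad(e.at.getUTCMinutes())} | ${e.description} |`);
  return ["| Time | Event |", "|------|-------|", ...rows].join("\n");
}
```

Copying the array before sorting leaves the caller's event list in its original order.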
Emergency Contacts
| Role | Name | Contact |
|---|---|---|
| Primary On-Call | [Rotation] | PagerDuty |
| Secondary On-Call | [Rotation] | PagerDuty |
| Engineering Manager | [Name] | [Phone] |
| Guidewire Support | – | support.guidewire.com |
| Security Incident | Security Team | security@company.com |
Output
- Incident detection and triage procedures
- Common scenario resolution steps
- Communication templates
- Post-incident review process
Resources
Next Steps
For data handling procedures, see guidewire-data-handling.