guidewire-incident-runbook

📁 jeremylongshore/claude-code-plugins-plus-skills 📅 10 days ago

Install command

npx skills add https://github.com/jeremylongshore/claude-code-plugins-plus-skills --skill guidewire-incident-runbook

Skill documentation

Guidewire Incident Runbook

Overview

Production incident response procedures for Guidewire InsuranceSuite including triage, diagnosis, resolution, and post-incident review.

Incident Severity Levels

Severity	Definition	Response Time	Examples
SEV-1	Complete outage	15 minutes	No users can access the system
SEV-2	Major degradation	30 minutes	Critical workflow blocked
SEV-3	Partial degradation	2 hours	Non-critical feature unavailable
SEV-4	Minor issue	24 hours	Cosmetic or low-impact issue
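
The response-time targets in the table can be encoded so that tooling computes a response deadline automatically. The following is a minimal TypeScript sketch under that assumption; the names `Severity`, `RESPONSE_MINUTES`, `responseDeadline`, and `isResponseOverdue` are illustrative and are not part of this skill or of any Guidewire API.

```typescript
// Hypothetical helper: encodes the severity table as response deadlines.
type Severity = "SEV-1" | "SEV-2" | "SEV-3" | "SEV-4";

// Response-time targets from the severity table, in minutes.
const RESPONSE_MINUTES: Record<Severity, number> = {
  "SEV-1": 15,
  "SEV-2": 30,
  "SEV-3": 2 * 60,
  "SEV-4": 24 * 60,
};

// Returns the time by which a responder should have engaged.
function responseDeadline(severity: Severity, detectedAt: Date): Date {
  const offsetMs = RESPONSE_MINUTES[severity] * 60 * 1000;
  return new Date(detectedAt.getTime() + offsetMs);
}

// Returns true when the response target has already been missed at `now`.
function isResponseOverdue(severity: Severity, detectedAt: Date, now: Date): boolean {
  return now.getTime() > responseDeadline(severity, detectedAt).getTime();
}
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check deterministic, which makes it easier to exercise in a drill.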

Incident Response Flow

┌──────────────────────────────────────────────────────────────────────────────────┐
│                          Incident Response Workflow                              │
├──────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐                │
│  │  DETECT   │───▶│  TRIAGE   │───▶│  RESPOND  │───▶│  RESOLVE  │                │
│  │           │    │           │    │           │    │           │                │
│  │ • Alert   │    │ • Severity│    │ • Diagnose│    │ • Fix     │                │
│  │ • Monitor │    │ • Impact  │    │ • Mitigate│    │ • Verify  │                │
│  │ • Report  │    │ • Assign  │    │ • Escalate│    │ • Document│                │
│  └───────────┘    └───────────┘    └───────────┘    └───────────┘                │
│        │                │                │                │                      │
│        └────────────────┴────────────────┴────────────────┘                      │
│                                    │                                             │
│                         ┌──────────▼──────────┐                                  │
│                         │    POST-INCIDENT    │                                  │
│                         │                     │                                  │
│                         │ • Review            │                                  │
│                         │ • Root Cause        │                                  │
│                         │ • Action Items      │                                  │
│                         └─────────────────────┘                                  │
│                                                                                  │
└──────────────────────────────────────────────────────────────────────────────────┘
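
If incident tooling needs to track which stage an incident is in, the workflow can be modelled as an ordered list of phases. The sketch below reads the diagram as a linear sequence that ends in the post-incident review, which is an interpretation rather than something the diagram states outright; `Phase`, `PHASE_ORDER`, `PHASE_ACTIVITIES`, and `nextPhase` are hypothetical names.

```typescript
// Hypothetical model of the workflow phases shown in the diagram.
type Phase = "DETECT" | "TRIAGE" | "RESPOND" | "RESOLVE" | "POST_INCIDENT";

// Assumed forward order of the phases; POST_INCIDENT is terminal.
const PHASE_ORDER: Phase[] = ["DETECT", "TRIAGE", "RESPOND", "RESOLVE", "POST_INCIDENT"];

// Activities listed under each box in the diagram.
const PHASE_ACTIVITIES: Record<Phase, string[]> = {
  DETECT: ["Alert", "Monitor", "Report"],
  TRIAGE: ["Severity", "Impact", "Assign"],
  RESPOND: ["Diagnose", "Mitigate", "Escalate"],
  RESOLVE: ["Fix", "Verify", "Document"],
  POST_INCIDENT: ["Review", "Root Cause", "Action Items"],
};

// Returns the phase that follows `current`, or null once the review is reached.
function nextPhase(current: Phase): Phase | null {
  const index = PHASE_ORDER.indexOf(current);
  return PHASE_ORDER[index + 1] ?? null;
}
```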

Common Incident Scenarios

Scenario 1: API Unavailability

Symptoms:

HTTP 503 responses
Connection timeouts
“Service Unavailable” errors

Diagnosis Steps:

# 1. Check Guidewire Cloud Console health
curl -s https://gcc.guidewire.com/api/v1/status \
  -H "Authorization: Bearer ${TOKEN}" | jq

# 2. Check application health endpoints
curl -s "${POLICYCENTER_URL}/common/v1/system-info" \
  -H "Authorization: Bearer ${TOKEN}"

# 3. Review recent logs
# In Guidewire Cloud Console: Observability > Logs
# Query: level:ERROR AND timestamp:[now-15m TO now]

# 4. Check for recent deployments
# GCC: Deployments > Recent Activity

Resolution Steps:

If Guidewire infrastructure issue:
- Check Guidewire Status Page
- Open support ticket with Guidewire
- Notify stakeholders of vendor issue

If integration/configuration issue:

# Check integration health
curl -s "${API_URL}/health" | jq

# Restart affected services (if self-managed components)
kubectl rollout restart deployment/integration-service

If capacity issue:
- Scale up instances in GCC
- Review rate limiting settings
- Implement request throttling
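
The three resolution branches above can be chosen from the outcome of the diagnosis steps. The following TypeScript sketch shows one possible ordering (vendor first, then integration, then capacity); the `AvailabilityFindings` fields and the priority order are assumptions made for illustration, not rules defined by Guidewire.

```typescript
// Hypothetical summary of what the Scenario 1 diagnosis steps found.
interface AvailabilityFindings {
  vendorStatusDegraded: boolean; // status page or Cloud Console reports a problem
  integrationHealthy: boolean;   // the integration health check passed
  rateLimited: boolean;          // throttling or saturation was observed
}

type ResolutionBranch = "vendor" | "integration" | "capacity" | "unknown";

// Picks the resolution branch to follow; the ordering is an assumed priority.
function chooseBranch(findings: AvailabilityFindings): ResolutionBranch {
  if (findings.vendorStatusDegraded) return "vendor";
  if (!findings.integrationHealthy) return "integration";
  if (findings.rateLimited) return "capacity";
  return "unknown";
}
```

An `unknown` result means none of the three branches matched, so the incident needs further diagnosis or escalation.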

Scenario 2: High Error Rate

Symptoms:

Elevated 4xx/5xx responses
Failed transactions
User-reported errors

Diagnosis Steps:

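// Error analysis script
// LogEntry, ErrorAnalysis, and fetchLogs are assumed shapes; adapt them to your log store.
interface LogEntry {
  level: string;
  context?: { error_type?: string; endpoint?: string };
}

interface ErrorAnalysis {
  totalErrors: number;
  errorsByType: Record<string, number>;
  errorsByEndpoint: Record<string, number>;
  topError: { type: string; count: number } | null;
  sampleErrors: LogEntry[];
}

// Assumed to be provided elsewhere: queries the log backend.
declare function fetchLogs(query: { level: string; timeRange: string; limit: number }): Promise<LogEntry[]>;

// Count log entries by the key that keyOf extracts ('unknown' when absent)
function countBy(logs: LogEntry[], keyOf: (log: LogEntry) => string | undefined): Record<string, number> {
  return logs.reduce((acc, log) => {
    const key = keyOf(log) || 'unknown';
    acc[key] = (acc[key] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);
}

async function analyzeErrors(timeRange: string = '1h'): Promise<ErrorAnalysis> {
  const logs = await fetchLogs({
    level: 'ERROR',
    timeRange,
    limit: 1000
  });

  // Group by error type and by endpoint
  const byType = countBy(logs, (log) => log.context?.error_type);
  const byEndpoint = countBy(logs, (log) => log.context?.endpoint);

  // Find most common error
  const topError = Object.entries(byType)
    .sort((a, b) => b[1] - a[1])[0];

  return {
    totalErrors: logs.length,
    errorsByType: byType,
    errorsByEndpoint: byEndpoint,
    topError: topError ? { type: topError[0], count: topError[1] } : null,
    sampleErrors: logs.slice(0, 10)
  };
}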

Resolution Steps:

Authentication errors (401):

# Verify OAuth configuration
curl -s "${GW_HUB_URL}/oauth/token" \
  -d "grant_type=client_credentials&client_id=${CLIENT_ID}&client_secret=${CLIENT_SECRET}" \
  | jq

# Check if credentials were rotated
# Verify client ID/secret in secret manager

Validation errors (422):

# Review recent code changes
git log --oneline --since="2 hours ago"

# Check if data schema changed
# Review API response details for specific fields

Server errors (500):

# Check application logs for stack traces
# Review memory/CPU utilization
# Check database connection pool
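
To decide which of the three resolution paths to start with, the observed HTTP status codes can be bucketed by path. This is a small TypeScript sketch under the assumption that status codes are already available as numbers; `resolutionPathFor` and `countByPath` are hypothetical helpers, and treating every 5xx code as a server error is a broadening of the "500" heading above.

```typescript
type ResolutionPath = "authentication" | "validation" | "server" | "other";

// Maps an HTTP status code to the resolution path described above.
function resolutionPathFor(status: number): ResolutionPath {
  if (status === 401) return "authentication";
  if (status === 422) return "validation";
  if (status >= 500 && status <= 599) return "server"; // assumption: all 5xx
  return "other";
}

// Counts responses per resolution path so the dominant one can be handled first.
function countByPath(statuses: number[]): Record<ResolutionPath, number> {
  const counts: Record<ResolutionPath, number> = {
    authentication: 0,
    validation: 0,
    server: 0,
    other: 0,
  };
  for (const status of statuses) {
    counts[resolutionPathFor(status)] += 1;
  }
  return counts;
}
```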

Scenario 3: Performance Degradation

Symptoms:

Slow page loads
High API latency
Timeout errors

Diagnosis Steps:

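// Performance diagnostic Gosu script
// DiagnosticReport, findSlowQueries(), and countActiveSessions() are not defined here and must be supplied.
// Confirm that the connection-pool API used below exists in your InsuranceSuite version.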
package gw.incident.diagnosis

uses gw.api.database.Query
uses gw.api.util.Logger

class PerformanceDiagnostics {
  private static final var LOG = Logger.forCategory("PerformanceDiagnostics")

  static function runDiagnostics() : DiagnosticReport {
    var report = new DiagnosticReport()

    // Check database connection pool
    report.DatabasePoolStatus = checkDatabasePool()

    // Check slow queries
    report.SlowQueries = findSlowQueries()

    // Check memory usage
    report.MemoryStatus = checkMemory()

    // Check active sessions
    report.ActiveSessions = countActiveSessions()

    return report
  }

  private static function checkDatabasePool() : String {
    var pool = gw.api.database.DatabaseConnectionPool.getInstance()
    var available = pool.AvailableConnections
    var total = pool.TotalConnections
    var waiting = pool.WaitingThreads

    if (waiting > 10) {
      return "CRITICAL: ${waiting} threads waiting for connections"
    } else if (available < total * 0.1) {
      return "WARNING: Only ${available}/${total} connections available"
    }
    return "OK: ${available}/${total} connections available"
  }

  private static function checkMemory() : String {
    var runtime = Runtime.getRuntime()
    var used = (runtime.totalMemory() - runtime.freeMemory()) / 1024 / 1024
    var max = runtime.maxMemory() / 1024 / 1024
    var usedPercent = (used * 100.0 / max) as int

    if (usedPercent > 90) {
      return "CRITICAL: ${usedPercent}% memory used (${used}MB/${max}MB)"
    } else if (usedPercent > 80) {
      return "WARNING: ${usedPercent}% memory used (${used}MB/${max}MB)"
    }
    return "OK: ${usedPercent}% memory used (${used}MB/${max}MB)"
  }
}

Resolution Steps:

Database bottleneck:

-- Find long-running queries (PostgreSQL), excluding this session
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;

-- Terminate a long-running query if necessary; replace 12345 with a pid returned above
SELECT pg_terminate_backend(12345);

Memory pressure:

# Trigger garbage collection (if necessary)
# This is typically managed by JVM

# Consider scaling up instances
# In GCC: Infrastructure > Scaling > Adjust instance size

Integration bottleneck:

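# Check external service health (prints endpoint, HTTP status, and total time)
for endpoint in $INTEGRATION_ENDPOINTS; do
  curl -s -o /dev/null --max-time 10 \
    -w "${endpoint} %{http_code} %{time_total}s\n" "${endpoint}/health"
done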

# Enable circuit breaker if external service is slow
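
For the circuit breaker mentioned above, the following is a minimal TypeScript sketch of the pattern, assuming the caller reports each call outcome and treats slow responses as failures. The `CircuitBreaker` class, its thresholds, and its method names are illustrative and do not come from a Guidewire library.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Minimal circuit breaker: opens after N consecutive failures, then allows
// a trial request once the cooldown has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Current state at time `now` (milliseconds).
  state(now: number): BreakerState {
    if (this.openedAt === null) return "closed";
    return now - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  // Whether a call to the external service may be attempted at `now`.
  allowRequest(now: number): boolean {
    return this.state(now) !== "open";
  }

  // A successful call closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failed (or too slow) call counts toward opening the breaker.
  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = now;
  }
}
```

A caller would check `allowRequest(Date.now())` before each external call and fall back to a cached or degraded response while the breaker is open.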

Scenario 4: Data Integrity Issue

Symptoms:

Incorrect calculations
Missing data
Data inconsistency between systems

Diagnosis Steps:

// Data integrity check
package gw.incident.data

uses gw.api.database.Query
uses gw.api.util.Logger
uses java.util.ArrayList

class DataIntegrityChecker {
  private static final var LOG = Logger.forCategory("DataIntegrity")

  static function checkPolicyIntegrity() : List<String> {
    var issues = new ArrayList<String>()

    // Check for policies without accounts
    var orphanPolicies = Query.make(Policy)
      .compare(Policy#Account, Equals, null)
      .select()
      .Count

    if (orphanPolicies > 0) {
      issues.add("Found ${orphanPolicies} policies without accounts")
    }

    // Check for claims without policies
    var orphanClaims = Query.make(Claim)
      .compare(Claim#Policy, Equals, null)
      .select()
      .Count

    if (orphanClaims > 0) {
      issues.add("Found ${orphanClaims} claims without policies")
    }

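    // Check for premium calculation mismatches
    // Note: where() filters in memory after loading every PolicyPeriod; narrow the query on large databases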
    var premiumMismatches = Query.make(PolicyPeriod)
      .select()
      .where(\pp -> pp.TotalPremiumRPT != pp.calculatePremium())
      .Count

    if (premiumMismatches > 0) {
      issues.add("Found ${premiumMismatches} premium calculation mismatches")
    }

    return issues
  }
}
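
For the "data inconsistency between systems" symptom, a comparison can also be run outside the application on records exported from both systems. The sketch below is one way to do that in TypeScript, assuming each export can be reduced to an `id` and a `premium`; the `PolicyRecord` shape, the `reconcile` function, and the tolerance value are hypothetical.

```typescript
// Assumed minimal shape of an exported policy record.
interface PolicyRecord {
  id: string;
  premium: number;
}

interface ReconcileResult {
  missingInTarget: string[]; // present in source only
  missingInSource: string[]; // present in target only
  mismatched: string[];      // present in both, premiums differ
}

// Compares two exports by id and reports gaps and premium differences.
function reconcile(
  source: PolicyRecord[],
  target: PolicyRecord[],
  tolerance: number = 0.005,
): ReconcileResult {
  const targetById = new Map<string, PolicyRecord>();
  for (const record of target) targetById.set(record.id, record);

  const sourceIds = new Set<string>();
  const result: ReconcileResult = { missingInTarget: [], missingInSource: [], mismatched: [] };

  for (const record of source) {
    sourceIds.add(record.id);
    const match = targetById.get(record.id);
    if (match === undefined) {
      result.missingInTarget.push(record.id);
    } else if (Math.abs(record.premium - match.premium) > tolerance) {
      result.mismatched.push(record.id);
    }
  }
  for (const record of target) {
    if (!sourceIds.has(record.id)) result.missingInSource.push(record.id);
  }
  return result;
}
```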

Resolution Steps:

Immediate mitigation:
- Disable affected functionality if critical
- Notify affected users
- Create support ticket

Data correction:

// Careful: Run in test environment first!
gw.transaction.Transaction.runWithNewBundle(\bundle -> {
  // Fix specific data issue
  var affectedRecords = Query.make(Entity)
    .compare(Entity#Field, Equals, badValue)
    .select()

  affectedRecords.each(\record -> {
    var r = bundle.add(record)
    r.Field = correctValue
    LOG.info("Corrected record: ${r.PublicID}")
  })
})

Incident Communication Templates

Initial Notification (SEV-1/SEV-2)

INCIDENT: [Brief Description]
SEVERITY: SEV-[1/2]
STATUS: Investigating

IMPACT:
- [Describe user impact]
- [Number of users/systems affected]

CURRENT STATUS:
- Issue detected at [TIME]
- Team is actively investigating
- Next update in 30 minutes

INCIDENT COMMANDER: [Name]
CONTACT: [Slack channel / Bridge call]

Status Update

INCIDENT UPDATE: [Brief Description]
STATUS: [Investigating / Mitigating / Resolved]

UPDATES SINCE LAST COMMUNICATION:
- [Update 1]
- [Update 2]

CURRENT ACTIONS:
- [Action being taken]

NEXT STEPS:
- [Planned action]
- Next update: [TIME]

Resolution Notification

INCIDENT RESOLVED: [Brief Description]

DURATION: [Start time] - [End time] ([X] hours [Y] minutes)

ROOT CAUSE:
[Brief description of root cause]

RESOLUTION:
[What was done to fix the issue]

IMPACT SUMMARY:
- [Number of affected users/transactions]
- [Business impact if known]

FOLLOW-UP ACTIONS:
- Post-incident review scheduled for [DATE]
- [Any immediate follow-up items]
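
The DURATION line of the resolution notification can be computed rather than written by hand. The following TypeScript sketch assumes start and end times are available as `Date` values and prints them in ISO 8601 (UTC); `formatDuration` and `durationLine` are hypothetical helper names.

```typescript
// Formats the elapsed time as "[X] hours [Y] minutes", rounded to the minute.
function formatDuration(start: Date, end: Date): string {
  const elapsedMs = Math.max(0, end.getTime() - start.getTime());
  const totalMinutes = Math.round(elapsedMs / (60 * 1000));
  const hours = Math.floor(totalMinutes / 60);
  const minutes = totalMinutes % 60;
  return `${hours} hours ${minutes} minutes`;
}

// Builds the DURATION line used in the resolution notification template.
function durationLine(start: Date, end: Date): string {
  return `DURATION: ${start.toISOString()} - ${end.toISOString()} (${formatDuration(start, end)})`;
}
```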

Post-Incident Review Template

# Post-Incident Review: [Incident Title]

## Incident Summary
- **Date:** [Date]
- **Duration:** [Start] - [End]
- **Severity:** SEV-[X]
- **Incident Commander:** [Name]

## Timeline
| Time | Event |
|------|-------|
| HH:MM | [Event description] |
| HH:MM | [Event description] |

## Root Cause
[Detailed description of the root cause]

## Impact
- **Users Affected:** [Number]
- **Transactions Affected:** [Number]
- **Revenue Impact:** [If applicable]

## What Went Well
- [Positive aspect 1]
- [Positive aspect 2]

## What Could Be Improved
- [Improvement area 1]
- [Improvement area 2]

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Action description] | [Name] | [Date] |

## Lessons Learned
[Key takeaways for the team]
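
The Timeline table in the review template can be generated from recorded events instead of being typed manually. This TypeScript sketch assumes events carry a `Date` and a description and that times are reported in UTC; `TimelineEvent` and `renderTimeline` are illustrative names.

```typescript
// Assumed shape of a recorded incident event.
interface TimelineEvent {
  at: Date;
  description: string;
}

// Renders events as the Timeline table from the review template,
// sorted chronologically with HH:MM times in UTC.
function renderTimeline(events: TimelineEvent[]): string {
  const rows = [...events]
    .sort((a, b) => a.at.getTime() - b.at.getTime())
    .map((event) => {
      const hh = String(event.at.getUTCHours()).padStart(2, "0");
      const mm = String(event.at.getUTCMinutes()).padStart(2, "0");
      return `| ${hh}:${mm} | ${event.description} |`;
    });
  return ["| Time | Event |", "|------|-------|", ...rows].join("\n");
}
```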

Emergency Contacts

Role	Name	Contact
Primary On-Call	[Rotation]	PagerDuty
Secondary On-Call	[Rotation]	PagerDuty
Engineering Manager	[Name]	[Phone]
Guidewire Support	–	support.guidewire.com
Security Incident	Security Team	security@company.com

Output

Incident detection and triage procedures
Common scenario resolution steps
Communication templates
Post-incident review process

Resources

Next Steps

For data handling procedures, see guidewire-data-handling.
