alerting-strategies

Install command
npx skills add https://github.com/latestaiagents/agent-skills --skill alerting-strategies


Skill Documentation

Alerting Strategies

Get paged for real problems, not noise.

Alerting Philosophy

“Every alert should be actionable, and every action should have a runbook.”

The Goal

  • Page for symptoms (user impact), not causes (internal metrics)
  • Every page should require human judgment
  • False positives erode trust; false negatives cause outages

Alert Severity Levels

Level       | Response          | Time to Ack | Example
------------|-------------------|-------------|-----------------------------
P1/Critical | Page immediately  | 5 min       | Service down, data loss
P2/High     | Page during hours | 30 min      | Degraded performance
P3/Medium   | Ticket            | Next day    | Non-critical feature broken
P4/Low      | Review weekly     | N/A         | Cleanup tasks, warnings

Alert Types

1. Symptom-Based (Recommended)

Alert on what users experience:

# Good: Users are experiencing errors
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1%"
    runbook: "https://wiki/runbooks/high-error-rate"

2. Cause-Based (Use Sparingly)

Alert on infrastructure issues that will cause symptoms:

# Acceptable: Will cause problems soon
- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Disk space below 10%"

3. SLO-Based (Best Practice)

Alert on error budget consumption:

# Excellent: Based on SLO burn rate
- alert: SLOBurnRateHigh
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Burning error budget 14x faster than sustainable"

Multi-Window, Multi-Burn-Rate Alerts

Google SRE’s recommended approach:

# Fast burn (page immediately)
- alert: SLOBurnRateFast
  expr: |
    (
      job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
      and
      job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: critical

# Slow burn (page during business hours)
- alert: SLOBurnRateSlow
  expr: |
    (
      job:slo_errors_per_request:ratio_rate6h > (1 * 0.001)
      and
      job:slo_errors_per_request:ratio_rate30m > (1 * 0.001)
    )
  for: 15m
  labels:
    severity: warning
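
The job:slo_errors_per_request:ratio_* series above are recording rules that must be defined separately. A minimal sketch of those rules, assuming the same http_requests_total metric and 5xx status convention as the earlier examples (the 1h/5m pair backs the fast-burn alert, the 6h/30m pair the slow-burn alert):

# Sketch: recording rules the burn-rate alerts above rely on
# (metric name and status label are assumptions carried over from earlier examples)
- record: job:slo_errors_per_request:ratio_rate5m
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum by (job) (rate(http_requests_total[5m]))
- record: job:slo_errors_per_request:ratio_rate30m
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[30m]))
    /
    sum by (job) (rate(http_requests_total[30m]))
- record: job:slo_errors_per_request:ratio_rate1h
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
    /
    sum by (job) (rate(http_requests_total[1h]))
- record: job:slo_errors_per_request:ratio_rate6h
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[6h]))
    /
    sum by (job) (rate(http_requests_total[6h]))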

Alert Design Patterns

Pattern 1: Percentage-Based Thresholds

# Alert when error rate exceeds normal baseline
- alert: ErrorRateAnomaly
  expr: |
    (
      sum(rate(http_errors_total[5m]))
      /
      sum(rate(http_requests_total[5m]))
    )
    >
    (
      sum(rate(http_errors_total[1d] offset 1d))
      /
      sum(rate(http_requests_total[1d] offset 1d))
    ) * 2

Pattern 2: Absence Detection

# Alert when service stops reporting
- alert: ServiceDown
  expr: absent(up{job="api-service"} == 1)
  for: 5m
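
Note that absent() only fires once every matching series has disappeared (for example, the whole job dropped out of service discovery). To also catch individual targets that are still discovered but failing their scrapes, it can be paired with a plain up check; a small sketch reusing the job label above:

# Fires per instance when a target is still scraped but reported down
- alert: InstanceDown
  expr: up{job="api-service"} == 0
  for: 5m
  labels:
    severity: warning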

Pattern 3: Derivative-Based

# Alert on rapid change in average latency
# (deriv() expects a gauge, so derive it from the rate-based average, not the raw counter sum)
- alert: LatencySpike
  expr: |
    deriv((rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))[10m:1m]) > 0.1
  for: 2m

Alert Fatigue Prevention

Checklist Before Creating Alert

□ Is this actionable? What should the responder do?
□ Does a runbook exist?
□ Is this a symptom or a cause?
□ What's the false positive rate likely to be?
□ Can this be a ticket instead of a page?
□ Is the threshold based on data, not gut feel?
□ Does it have appropriate for/pending duration?

Noisy Alert Remediation

Problem                           | Solution
----------------------------------|------------------------------------------------------
Too many pages                    | Increase threshold or duration
Flapping alerts                   | Add hysteresis (different up/down thresholds)
Duplicate alerts                  | Use alert grouping/inhibition (see the sketch below)
Low-signal alerts                 | Convert to ticket or remove
Night pages for non-urgent issues | Route to next business day
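
For the duplicate-alerts row, Alertmanager inhibition can suppress lower-severity alerts while a related critical alert is already firing. A minimal sketch, assuming the severity and service labels used elsewhere in this document:

# Mute warning-level alerts for a service that already has a critical alert open
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service']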

Alert Hygiene Process

Weekly:
- Review all alerts that fired
- Tag: actionable / noise / duplicate
- Fix or remove noisy alerts

Monthly:
- Review alert coverage vs incidents
- Identify incidents with no alerts (gaps)
- Identify alerts that never fired (remove?)

Quarterly:
- Full alert audit
- Update thresholds based on SLO performance
- Review on-call burden metrics
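
The weekly review can start from Prometheus itself: the built-in ALERTS metric records which alerts were in the firing state. A rough sketch for ranking last week's noisiest alerts; sample counts are only a proxy for firing duration and depend on the rule evaluation interval:

# Rank alerts by how many evaluation samples they spent firing over the past week
sort_desc(
  sum by (alertname) (
    count_over_time(ALERTS{alertstate="firing"}[7d])
  )
)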

Alert Routing

Example PagerDuty Integration

# Route based on service and severity
receivers:
  - name: 'platform-critical'
    pagerduty_configs:
      - service_key: '<platform-team-key>'
        severity: critical

  - name: 'platform-warning'
    pagerduty_configs:
      - service_key: '<platform-team-key>'
        severity: warning

  - name: 'tickets'
    webhook_configs:
      - url: 'https://jira.company.com/webhook'

route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'platform-warning'
  routes:
    - match:
        severity: critical
      receiver: 'platform-critical'
    - match:
        severity: low
      receiver: 'tickets'
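
To implement "page during hours" for P2 alerts and push night-time non-urgent pages to the next business day, Alertmanager time intervals can gate a route. A sketch assuming Alertmanager 0.24+; the interval name and hours are placeholders, and the extra route entry merges into the route tree above:

# Mute warning-severity notifications outside business hours; alerts still
# firing are delivered once the window opens (times are UTC unless a
# location is set on the interval)
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

# added under the existing route.routes list:
    - match:
        severity: warning
      receiver: 'platform-warning'
      active_time_intervals: ['business-hours']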

Alert Documentation Template

Every alert should have:

# Alert: HighErrorRate

## What This Means
Error rate has exceeded 1% for the past 5 minutes.
Users are experiencing failures.

## Impact
- Users see error pages
- API consumers get 500 responses
- Potential revenue impact

## First Response
1. Check deployment timeline - recent deploy?
2. Check dependency status (database, external APIs)
3. Look at error logs for specific error messages

## Runbook
[Link to detailed runbook]

## Escalation
If unresolved after 15 minutes, page @platform-lead

## Historical Context
- Normal error rate: 0.01-0.05%
- Common causes: bad deploys, DB issues, traffic spikes

Metrics for Alerting Health

Track these to improve your alerting:

Metric                          | Target | Why
--------------------------------|--------|--------------------
MTTA (Mean Time to Acknowledge) | <5 min | Are pages noticed?
Pages per week per engineer     | <10    | Alert fatigue risk
% actionable pages              | >80%   | Signal vs noise
Incidents with no alerts        | <10%   | Coverage gaps
False positive rate             | <20%   | Trust in alerts