observability-design
npx skills add https://github.com/rsmdt/the-startup --skill observability-design
Skill Documentation
Observability Patterns
When to Use
- Designing monitoring infrastructure for new services
- Defining SLIs, SLOs, and error budgets for reliability
- Implementing distributed tracing across microservices
- Creating alert rules that minimize noise and maximize signal
- Building dashboards for operations and business stakeholders
- Establishing incident response and postmortem processes
- Diagnosing production issues through telemetry correlation
Philosophy
You cannot fix what you cannot see. Observability is not about collecting data – it is about answering questions you have not thought to ask yet. Good observability turns every incident into a learning opportunity and every metric into actionable insight.
The Three Pillars
Metrics
Numeric measurements aggregated over time. Best for understanding system behavior at scale.
Characteristics:
- Highly efficient storage (aggregated values)
- Support mathematical operations (rates, percentiles)
- Enable alerting on thresholds
- Limited cardinality (avoid high-cardinality labels)
Types:
| Type | Use Case | Example |
|---|---|---|
| Counter | Cumulative values that only increase | Total requests, errors, bytes sent |
| Gauge | Values that go up and down | Current memory, active connections |
| Histogram | Distribution of values in buckets | Request latency, payload sizes |
| Summary | Similar to histogram, calculated client-side | Pre-computed percentiles |
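The sketch below shows the first three types using the Python prometheus_client library; it is a minimal illustration, and the metric names, labels, and bucket boundaries are assumptions rather than anything this skill prescribes.

```python
# Minimal sketch of counter, gauge, and histogram instrumentation using
# prometheus_client; names and buckets are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "status"]
)
ACTIVE_CONNECTIONS = Gauge(
    "active_connections", "Currently open connections"
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    buckets=(0.05, 0.1, 0.2, 0.5, 1.0, 2.0),
)

def handle_request():
    ACTIVE_CONNECTIONS.inc()
    with REQUEST_LATENCY.time():           # observes duration into buckets
        time.sleep(random.uniform(0.01, 0.2))
    REQUESTS.labels(method="GET", status="200").inc()
    ACTIVE_CONNECTIONS.dec()

if __name__ == "__main__":
    start_http_server(8000)                # exposes /metrics for scraping
    while True:
        handle_request()
```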
Logs
Immutable records of discrete events. Best for understanding specific occurrences.
Characteristics:
- Rich context and arbitrary data
- Expensive to store and query at scale
- Essential for debugging specific issues
- Should be structured (JSON) for parseability
Structure:
Required fields:
- timestamp: ISO 8601 format with timezone
- level: ERROR, WARN, INFO, DEBUG
- message: Human-readable description
- service: Service identifier
- trace_id: Correlation identifier
Context fields:
- user_id: Sanitized user identifier
- request_id: Request correlation
- duration_ms: Operation timing
- error_type: Classification for errors
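A minimal sketch of this schema using only the Python standard library; the service name and field values are illustrative.

```python
# Structured JSON logging sketch: emits the required fields above plus
# any context fields attached via logging's `extra` mechanism.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601 with timezone
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "checkout-service",            # illustrative service identifier
            "trace_id": getattr(record, "trace_id", None),
        }
        for field in ("user_id", "request_id", "duration_ms", "error_type"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorized",
            extra={"trace_id": "4bf92f3577b34da6", "user_id": "u-123",
                   "request_id": "req-789", "duration_ms": 42})
```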
Traces
Records of request flow across distributed systems. Best for understanding causality and latency.
Characteristics:
- Show request path through services
- Identify latency bottlenecks
- Reveal dependencies and failure points
- Higher overhead than metrics
Components:
- Trace: Complete request journey
- Span: Single operation within a trace
- Context: Metadata propagated across services
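A minimal sketch of these components with the OpenTelemetry Python SDK; the span names and attributes are illustrative, and a real deployment would export to a collector rather than the console.

```python
# Tracing sketch: one trace containing two spans, with context
# propagated automatically between the parent and child span.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())  # print spans; use OTLP in production
)
tracer = trace.get_tracer("checkout-service")

def handle_checkout(order_id: str):
    # Trace: the complete request journey begins at the entry-point span
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        charge_card(order_id)

def charge_card(order_id: str):
    # Span: a single operation, automatically parented to the current span
    with tracer.start_as_current_span("charge_card"):
        pass  # call the payment provider here

handle_checkout("ord-42")
```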
SLI/SLO/SLA Framework
Service Level Indicators (SLIs)
Quantitative measures of service behavior from the user's perspective.
Common SLI categories:
| Category | Measures | Example SLI |
|---|---|---|
| Availability | Service is responding | % of successful requests |
| Latency | Response speed | % of requests < 200ms |
| Throughput | Capacity | Requests processed per second |
| Error Rate | Correctness | % of requests without errors |
| Freshness | Data currency | % of data < 1 minute old |
SLI specification:
SLI: Request Latency
Definition: Time from request received to response sent
Measurement: Server-side histogram at p50, p95, p99
Exclusions: Health checks, internal tooling
Data source: Application metrics
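To make the specification concrete, here is a sketch of computing such a latency SLI from raw request records; the record format and the latency_sli helper are hypothetical, while the 200 ms threshold and health-check exclusion mirror the spec above.

```python
# Hypothetical sketch: fraction of eligible requests under the threshold.
def latency_sli(requests, threshold_ms=200):
    """Fraction of eligible requests answered below the latency threshold."""
    eligible = [r for r in requests
                if r["path"] not in ("/healthz",)]   # exclusions: health checks
    if not eligible:
        return 1.0
    good = sum(1 for r in eligible if r["duration_ms"] < threshold_ms)
    return good / len(eligible)

requests = [
    {"path": "/checkout", "duration_ms": 120},
    {"path": "/checkout", "duration_ms": 350},
    {"path": "/healthz",  "duration_ms": 2},
]
print(latency_sli(requests))  # 0.5: one of two eligible requests was under 200 ms
```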
Service Level Objectives (SLOs)
Target reliability levels for SLIs over a time window.
SLO formula:
SLO = (Good events / Total events) >= Target over Window
Example:
99.9% of requests complete successfully in < 200ms
measured over a 30-day rolling window
Setting SLO targets:
- Start with current baseline performance
- Consider user expectations and business impact
- Balance reliability investment against feature velocity
- Document error budget policy
Error Budgets
The allowed amount of unreliability within an SLO.
Calculation:
Error Budget = 1 - SLO Target
99.9% SLO = 0.1% error budget
= 43.2 minutes downtime per 30 days
= 86.4 seconds per day
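The same arithmetic as a worked example:

```python
# Worked error-budget arithmetic for the 99.9% example above.
slo_target = 0.999
window_days = 30

error_budget = 1 - slo_target                        # 0.1% of all events
budget_minutes = error_budget * window_days * 24 * 60
budget_seconds_per_day = error_budget * 24 * 60 * 60

print(f"{budget_minutes:.1f} minutes per {window_days} days")  # 43.2
print(f"{budget_seconds_per_day:.1f} seconds per day")         # 86.4
```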
Error budget policies:
- Budget remaining: Continue feature development
- Budget depleted: Focus on reliability work
- Budget burning fast: Freeze deploys, investigate
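These policies amount to a simple decision rule; the sketch below is hypothetical, and the thresholds are illustrative placeholders for your documented policy.

```python
# Hypothetical error-budget policy as code; thresholds are illustrative.
def deployment_policy(budget_remaining: float, burn_rate: float) -> str:
    if burn_rate > 10:                  # budget burning fast
        return "freeze deploys, investigate"
    if budget_remaining <= 0:           # budget depleted
        return "focus on reliability work"
    return "continue feature development"

print(deployment_policy(budget_remaining=0.4, burn_rate=1.2))
```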
Alerting Strategies
Symptom-Based Alerts
Alert on user-visible symptoms, not internal causes.
Good alerts:
- Error rate exceeds threshold (users experiencing failures)
- Latency SLO at risk (users experiencing slowness)
- Queue depth growing (backlog affecting users)
Poor alerts:
- CPU at 80% (may not affect users)
- Pod restarted (self-healing, may not affect users)
- Disk at 70% (not yet impacting service)
Multi-Window, Multi-Burn-Rate Alerts
Detect fast burns quickly, slow burns before budget depletion.
Configuration:
Fast burn: 14.4x burn rate over 1 hour
- Fires in 1 hour if issue persists
- Catches severe incidents quickly
Slow burn: 3x burn rate over 3 days
- Fires before 30-day budget depletes
- Catches gradual degradation
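A sketch of this check in Python; the error rates would come from your metrics backend, and the windows and burn-rate factors mirror the configuration above. Production implementations usually pair each window with a shorter confirmation window so alerts also stop firing promptly.

```python
# Multi-window, multi-burn-rate check for a 99.9% SLO.
SLO_TARGET = 0.999
BUDGET = 1 - SLO_TARGET  # sustainable (budget-neutral) error rate

def burn_rate(observed_error_rate: float) -> float:
    """How many times faster than sustainable the budget is burning."""
    return observed_error_rate / BUDGET

def should_page(error_rate_1h: float, error_rate_3d: float) -> bool:
    fast_burn = burn_rate(error_rate_1h) >= 14.4   # severe incident
    slow_burn = burn_rate(error_rate_3d) >= 3      # gradual degradation
    return fast_burn or slow_burn

print(should_page(error_rate_1h=0.02,   error_rate_3d=0.001))   # True: 20x fast burn
print(should_page(error_rate_1h=0.0005, error_rate_3d=0.0005))  # False: within budget
```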
Alert Fatigue Prevention
Strategies:
- Alert only on actionable issues
- Consolidate related alerts
- Set meaningful thresholds (not arbitrary)
- Require sustained condition before firing
- Include runbook links in every alert
- Review and prune alerts quarterly
Alert quality checklist:
- Can someone take action right now?
- Is the severity appropriate?
- Does it include enough context?
- Is there a runbook linked?
- Has it fired false positives recently?
Dashboard Design
Hierarchy of Dashboards
Service Health Overview:
- High-level SLO status
- Error budget consumption
- Key business metrics
- Designed for quick triage
Deep-Dive Diagnostic:
- Detailed metrics breakdown
- Resource utilization
- Dependency health
- Designed for investigation
Business Metrics:
- User-facing KPIs
- Conversion and engagement
- Revenue impact
- Designed for stakeholders
Dashboard Principles
- Answer specific questions, not show all data
- Use consistent color coding (green=good, red=bad)
- Show time ranges appropriate to the metric
- Include context (deployments, incidents) on graphs
- Mobile-responsive for on-call use
- Provide drill-down paths to detailed views
Essential Panels
| Panel | Purpose | Audience |
|---|---|---|
| SLO Status | Current reliability vs target | Everyone |
| Error Budget | Remaining budget and burn rate | Engineering |
| Request Rate | Traffic patterns and anomalies | Operations |
| Latency Distribution | p50, p95, p99 over time | Engineering |
| Error Breakdown | Errors by type and endpoint | Engineering |
| Dependency Health | Status of upstream services | Operations |
Best Practices
- Correlate metrics, logs, and traces with shared identifiers (a sketch follows this list)
- Instrument code at service boundaries, not everywhere
- Use structured logging with consistent field names
- Set retention policies appropriate to data value
- Test alerts in staging before production
- Document SLOs and share with stakeholders
- Conduct regular game days to validate observability
- Automate common diagnostic procedures in runbooks
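For the first practice above, here is a sketch of injecting the active OpenTelemetry trace ID into every log record so logs and traces share an identifier; the filter class is illustrative.

```python
# Correlate logs with traces: a logging filter that copies the active
# OpenTelemetry trace ID onto each record for the JSON formatter to emit.
import logging

from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # Format the 128-bit trace ID as hex; an invalid context means no active trace
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

logger = logging.getLogger("checkout-service")
logger.addFilter(TraceIdFilter())
```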
Anti-Patterns
- Alert on every possible metric (alert fatigue)
- Create dashboards without specific questions in mind
- Log without structure or correlation IDs
- Set SLOs without measuring current baseline
- Ignore error budget policies when convenient
- Treat all alerts with equal severity
- Store high-cardinality data in metrics (use logs/traces)
- Skip postmortems when issues resolve themselves
References
- references/monitoring-patterns.md – Detailed implementation patterns and examples