observability-engineer
Install command
npx skills add https://github.com/mileycy516-stack/skills --skill observability-engineer
Skill documentation
Observability Engineer
Expert observability engineer specializing in production-grade monitoring, logging, tracing, and reliability systems.
When to Use This Skill
- Designing Observability Stacks (Prometheus, Grafana, ELK)
- Implementing Distributed Tracing (OpenTelemetry, Jaeger, Datadog)
- Defining SLIs/SLOs (Service Level Indicators/Objectives)
- Setting up Alerting (PagerDuty, Slack)
- Investigating Incidents (Post-Mortems)
Workflow
- Define Signals: The “Three Pillars” (Logs, Metrics, Traces).
- Instrumentation: Add OpenTelemetry Auto-Instrumentation + Custom Metrics.
- Storage: Choose backend (Prometheus for metrics, Loki for logs, Tempo for traces).
- Visualize: Create actionable Grafana Dashboards (RED Method).
- Alert: Define alerts on the “Golden Signals” (latency, traffic, errors, saturation).
Instructions
1. The Three Pillars
- Logs: Discrete events ("User X logged in at 10:00"). Good for audit/debugging.
- Metrics: Aggregates ("Login rate: 50 requests/sec"). Good for trends/alerting.
- Traces: Lifecycle ("Request hit LoadBalancer -> Service A -> DB"). Good for latency analysis.
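To make the distinction concrete, here is a minimal standard-library sketch in which one request handler emits all three signals. The handler name, the metric name, and the in-memory stores are hypothetical stand-ins; a real service would send these to a logging pipeline, a metrics backend, and a tracing backend.

```python
import json
import time
import uuid
from collections import Counter

logs = []            # logs: one record per discrete event
metrics = Counter()  # metrics: aggregated counts, no per-event detail
spans = []           # traces: timed units of work tied together by a trace ID


def handle_login(user: str) -> None:
    """Hypothetical request handler that emits a log, a metric, and a span."""
    trace_id = uuid.uuid4().hex  # each request gets its own trace
    start = time.perf_counter()

    # Log: what happened, to whom, and when.
    logs.append(json.dumps(
        {"ts": time.time(), "event": "login", "user": user, "trace_id": trace_id}
    ))

    # Metric: only the running total is kept, which makes it cheap to alert on.
    metrics["login_requests_total"] += 1

    # Trace: how long this step of the request took.
    spans.append({
        "trace_id": trace_id,
        "name": "handle_login",
        "duration_s": time.perf_counter() - start,
    })


for user in ("alice", "bob"):
    handle_login(user)
```

The log keeps every event with its details, the metric keeps only a count, and the span keeps timing per request. Putting the trace ID in the log record is one common way to jump from a slow trace to the matching log lines.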
2. SLI / SLO / SLA
- SLI (Indicator): The metric. “95th percentile latency of /login”.
- SLO (Objective): The goal. “95% of requests < 200ms over 30 days”.
- SLA (Agreement): The contract. “If we miss the SLO, we pay you back”.
- Error Budget: 100% - SLO. The room you have to fail/innovate.
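As a rough illustration of the error-budget arithmetic, the sketch below treats the budget as a number of requests. The function names and the traffic figures in the note that follows are hypothetical; only the 95% target comes from the SLO example above.

```python
def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Error budget as a request count: (100% - SLO) * traffic in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the budget still unspent; negative once the SLO is missed."""
    return 1.0 - bad_requests / allowed_failures(slo_target, total_requests)
```

For example, with a 95% SLO and 1,000,000 requests in the 30-day window, about 50,000 requests may miss the 200ms target. If 10,000 have missed it so far, roughly 80% of the budget remains.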
3. Dashboarding Strategy (RED Method)
For every service, dashboard these three:
- Rate: Request throughput (req/sec).
- Errors: Error throughput (errors/sec).
- Duration: Latency (p50, p90, p99).
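The three RED numbers can be derived from raw request records. The sketch below is one way to do that with the Python standard library. The `Request` type, the choice to count HTTP 5xx responses as errors, and the function name are assumptions for illustration. In practice a metrics backend such as Prometheus would normally compute these from counters and histograms.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize Rate, Errors, and Duration for requests seen in one time window."""
    if len(requests) < 2:
        raise ValueError("need at least two requests to estimate percentiles")

    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error

    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(
        (r.duration_ms for r in requests), n=100, method="inclusive"
    )
    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": errors / window_seconds,
        "p50_ms": cuts[49],
        "p90_ms": cuts[89],
        "p99_ms": cuts[98],
    }
```

Reporting p50, p90, and p99 together rather than a mean helps show tail latency, which an average tends to hide.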
4. Distributed Tracing
- OpenTelemetry (OTel): The industry standard. Vendor-agnostic.
- Span: A single unit of work (“Query Select * From Users”).
- Context Propagation: Passing the trace ID in an HTTP header (for example, the W3C traceparent header) between services so spans connect.
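To show how spans from two services end up in the same trace, here is a simplified standard-library sketch of context propagation using the W3C `traceparent` header format (`version-traceid-parentid-flags`). The helper names and the dict-based span records are hypothetical, and the parser only handles version `00` with an exact-case header key. In a real system the OpenTelemetry SDK's propagators would do this for you.

```python
import re
import secrets

# W3C Trace Context, version 00: 32 hex trace ID, 16 hex parent span ID, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def start_span(trace_id=None, parent_span_id=None):
    """Create a span record; without a trace_id this starts a new trace."""
    return {
        "trace_id": trace_id or secrets.token_hex(16),  # 16 bytes -> 32 hex chars
        "span_id": secrets.token_hex(8),                # 8 bytes -> 16 hex chars
        "parent_span_id": parent_span_id,
    }


def inject(span, headers):
    """Caller side: write the current span's identity into outgoing headers."""
    headers["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"  # 01 = sampled


def extract(headers):
    """Callee side: return (trace_id, parent_span_id), or None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, _flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None  # all-zero IDs are invalid in the W3C format
    return trace_id, parent_span_id


# Service A starts the trace and calls Service B.
span_a = start_span()
outgoing_headers = {}
inject(span_a, outgoing_headers)

# Service B continues the same trace, with A's span as the parent.
context = extract(outgoing_headers)
span_b = start_span(*context) if context else start_span()
```

Because `span_b` reuses A's trace ID and records A's span ID as its parent, a tracing backend can stitch the two spans into one request timeline. If the header is missing or malformed, B falls back to starting a fresh trace.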