observability-engineer

📁 mileycy516-stack/skills 📅 8 days ago
Install command
npx skills add https://github.com/mileycy516-stack/skills --skill observability-engineer

Skill documentation

Observability Engineer

Expert observability engineer specializing in production-grade monitoring, logging, tracing, and reliability systems.

When to Use This Skill

  • Designing Observability Stacks (Prometheus, Grafana, ELK)
  • Implementing Distributed Tracing (OpenTelemetry, Jaeger, Datadog)
  • Defining SLIs/SLOs (Service Level Indicators/Objectives)
  • Setting up Alerting (PagerDuty, Slack)
  • Investigating Incidents (Post-Mortems)

Workflow

  1. Define Signals: The “Three Pillars” (Logs, Metrics, Traces).
  2. Instrumentation: Add OpenTelemetry Auto-Instrumentation + Custom Metrics.
  3. Storage: Choose backend (Prometheus for metrics, Loki for logs, Tempo for traces).
  4. Visualize: Create actionable Grafana Dashboards (RED Method).
  5. Alert: Define alerts on the “Golden Signals” (latency, traffic, errors, saturation).

Instructions

1. The Three Pillars

  • Logs: Discrete events ("User X logged in at 10:00"). Good for audit/debugging.
  • Metrics: Aggregates ("Login rate: 50 requests/sec"). Good for trends/alerting.
  • Traces: Lifecycle ("Request hit LoadBalancer -> Service A -> DB"). Good for latency analysis.
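
To make the distinction concrete, here is a minimal sketch in which one request produces all three signals. It uses only the Python standard library and in-memory containers. The names `handle_login`, `log_lines`, `metrics`, and `spans` are hypothetical and stand in for a real logging, metrics, and tracing backend.

```python
import json
import time
import uuid
from collections import Counter

log_lines = []       # logs: one discrete event per entry
metrics = Counter()  # metrics: aggregated counts, no per-request detail
spans = []           # traces: timed units of work tied together by a trace ID


def handle_login(user: str) -> None:
    """Hypothetical request handler that emits one log, one metric, one span."""
    trace_id = uuid.uuid4().hex  # shared by every signal of this request
    start = time.perf_counter()
    # ... the actual login work would happen here ...
    duration_ms = (time.perf_counter() - start) * 1000.0

    # Log: a discrete, structured event (good for audit and debugging).
    log_lines.append(json.dumps({"event": "login", "user": user, "trace_id": trace_id}))
    # Metric: an aggregate that only ever goes up (good for trends and alerting).
    metrics["login_requests_total"] += 1
    # Trace: a timed span (good for latency analysis).
    spans.append({"trace_id": trace_id, "name": "handle_login", "duration_ms": duration_ms})


handle_login("alice")
handle_login("bob")
```

Putting the trace ID in the log line is what lets you jump from a log entry to the matching trace during an investigation.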

2. SLI / SLO / SLA

  • SLI (Indicator): The metric. “95th percentile latency of /login”.
  • SLO (Objective): The goal. “95% of requests < 200 ms over 30 days”.
  • SLA (Agreement): The contract. “If we miss the SLO, we pay you back”.
  • Error Budget: 100% - SLO. The share of requests allowed to miss the objective; a 95% SLO leaves a 5% budget to spend on failures and risky changes.
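
The error budget arithmetic can be sketched as follows. This is an illustration under assumed numbers, not a prescribed implementation: the function name `error_budget` and the request counts are hypothetical, and the SLO is passed as a decimal string so that `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction


def error_budget(slo: str, total_requests: int, bad_requests: int) -> dict:
    """Return how much of the error budget has been used in the window.

    slo is a decimal string such as "0.95"; bad_requests counts requests
    that missed the objective (for example, slower than 200 ms).
    """
    target = Fraction(slo)
    if not 0 < target < 1:
        raise ValueError("slo must be strictly between 0 and 1")

    # Error budget = 100% - SLO, applied to the traffic seen in the window.
    allowed_bad = (1 - target) * total_requests
    consumed = Fraction(bad_requests) / allowed_bad if allowed_bad else Fraction(0)

    return {
        "allowed_bad": float(allowed_bad),
        "remaining_bad": float(allowed_bad - bad_requests),
        "consumed": float(consumed),
    }


budget = error_budget("0.95", 1_000_000, 12_500)
```

With these assumed numbers, a 95% SLO over 1,000,000 requests allows 50,000 bad requests, so 12,500 bad requests consume 25% of the budget and leave 37,500.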

3. Dashboarding Strategy (RED Method)

For every service, dashboard these three:

  • Rate: Request throughput (req/sec).
  • Errors: Failed requests per second (errors/sec).
  • Duration: Latency (p50, p90, p99).
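
As a rough sketch of what a RED panel computes, the function below derives the three numbers from a batch of request records. In practice the metrics backend does this from counters and histograms. The `Request` type, the `red_summary` name, the nearest-rank percentile, and the choice to count HTTP 5xx responses as errors are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values: list, p: int):
    """Nearest-rank percentile of an ascending list; None if the list is empty."""
    if not sorted_values:
        return None
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # integer ceil(p * n / 100)
    return sorted_values[rank - 1]


def red_summary(requests: list, window_seconds: float) -> dict:
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    return {
        "rate_per_sec": len(requests) / window_seconds,    # Rate
        "errors_per_sec": errors / window_seconds,         # Errors
        "p50_ms": percentile(durations, 50),               # Duration
        "p90_ms": percentile(durations, 90),
        "p99_ms": percentile(durations, 99),
    }


# Hypothetical sample: 100 requests in a 10-second window, 5 of them failing.
sample = [Request(duration_ms=float(i), status=500 if i <= 5 else 200) for i in range(1, 101)]
summary = red_summary(sample, window_seconds=10.0)
```

Percentiles are used instead of the mean because a handful of very slow requests can hide behind a healthy-looking average.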

4. Distributed Tracing

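  • OpenTelemetry (OTel): The de facto industry standard for instrumentation. Vendor-agnostic.
  • Span: A single unit of work (“Query: SELECT * FROM users”).
  • Context Propagation: Passing the trace context (for example, the W3C traceparent HTTP header, which carries the trace ID and the parent span ID) between services so spans connect.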
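
The propagation step can be sketched with the W3C Trace Context `traceparent` header, whose version 00 layout is `00-<trace-id>-<parent-id>-<flags>`. This is a simplified illustration: the helper names `new_traceparent` and `child_traceparent` are hypothetical, only version 00 is handled, and a real service would normally rely on the OpenTelemetry SDK's propagators instead of hand-rolled parsing.

```python
import re
import secrets

# version 00: 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: fresh trace ID, fresh span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Build the header for an outgoing call made while handling `incoming`.

    The trace ID is kept so the spans connect; only the span ID changes.
    A missing or malformed header starts a new trace instead.
    """
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Service A receives a request, then calls Service B with a derived header.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Both headers share the same trace ID, which is what allows the tracing backend to stitch the spans from Service A and Service B into one trace.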

Resources