qa-observability
Total installs: 28
Weekly installs: 28
Site rank: #7356
Install command:
npx skills add https://github.com/vasilyu1983/ai-agents-public --skill qa-observability
Install distribution by agent:
- claude-code: 19
- opencode: 16
- cursor: 16
- gemini-cli: 13
- codex: 13
Skill documentation
QA Observability and Performance Engineering
Use telemetry (logs, metrics, traces, profiles) as a QA signal and a debugging substrate.
Core references:
- OpenTelemetry: https://opentelemetry.io/docs/
- W3C Trace Context: https://www.w3.org/TR/trace-context/
- Google SRE SLOs: https://sre.google/sre-book/service-level-objectives/
Default QA stance
- Treat telemetry as part of acceptance criteria (especially for integration/E2E tests).
- Require correlation: request_id + trace_id (traceparent) across boundaries.
- Prefer SLO-based release gating and burn-rate alerting over raw infra thresholds.
- Budget overhead: sampling, cardinality, retention, and cost are quality constraints.
- Redact PII/secrets by default (logs and attributes).
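The correlation and redaction requirements above can be sketched in a few lines. This is an illustrative stdlib-only sketch, not the skill's shipped tooling: the `SENSITIVE_KEYS` list and field names are assumptions you would replace with your own schema, and real redaction should also walk nested payloads.

```python
import json

# Field names that must never reach log storage. Illustrative, not
# exhaustive -- extend for your own payloads (tokens, cookies, etc.).
SENSITIVE_KEYS = {"password", "authorization", "api_key", "ssn", "email"}

def redact(event: dict) -> dict:
    """Return a copy of the event with top-level sensitive fields masked."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in event.items()
    }

def log_line(request_id: str, trace_id: str, **fields) -> str:
    """Emit one structured JSON log line carrying both correlation IDs."""
    event = {"request_id": request_id, "trace_id": trace_id, **fields}
    return json.dumps(redact(event), sort_keys=True)

line = log_line("req-42", "4bf92f3577b34da6a3ce929d0e0e4736",
                msg="login ok", email="user@example.com")
assert "user@example.com" not in line  # PII never leaves the process
```

In tests, asserting on `json.loads(line)["trace_id"]` rather than on raw substrings keeps the acceptance criterion tied to the structured fields the backend will actually index.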
Core workflows
- Establish the minimum bar (logs + metrics + traces + correlation).
- Instrument with OpenTelemetry (auto-instrument first, then add manual spans for key paths).
- Verify context propagation across service boundaries (traceparent in/out).
- Define SLIs/SLOs and error budget policy; wire burn-rate alerts.
- Make failures diagnosable: capture a trace link + key logs on every test failure.
- Profile and load test only after telemetry is reliable; validate against baselines.
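The propagation-verification step above often reduces to checking that each hop emits a well-formed `traceparent` header. A minimal validator per the W3C Trace Context grammar (version, 32-hex trace-id, 16-hex parent-id, 2-hex flags; all-zero IDs and version `ff` are invalid) can be written with the stdlib; the function name is ours, not from any library:

```python
import re

# W3C Trace Context traceparent: version-traceid-parentid-flags,
# all lowercase hex; trace-id and parent-id must not be all zeros.
_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def valid_traceparent(header: str) -> bool:
    """Check an outgoing/incoming traceparent header in integration tests."""
    m = _TRACEPARENT.match(header)
    if not m:
        return False
    return (m.group("trace_id") != "0" * 32
            and m.group("parent_id") != "0" * 16
            and m.group("version") != "ff")

assert valid_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
assert not valid_traceparent("00-" + "0" * 32 + "-00f067aa0ba902b7-01")
```

An E2E test can then assert that the `trace_id` observed on the downstream service equals the one injected upstream, which is the actual propagation guarantee.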
Quick reference
| Task | Recommended default | Notes |
|---|---|---|
| Tracing | OpenTelemetry + Jaeger/Tempo | Prefer OTLP exporters via Collector when possible |
| Metrics | Prometheus + Grafana | Use histograms for latency; watch cardinality |
| Logging | Structured JSON + correlation IDs | Never log secrets/PII; redact aggressively |
| Reliability gates | SLOs + error budgets + burn-rate alerts | Gate releases on sustained burn/regressions |
| Performance | Profiling + load tests + budgets | Add continuous profiling for intermittent issues |
| Zero-code visibility | eBPF (OpenTelemetry zero-code) + continuous profiling (Parca/Pyroscope) | Use when code changes are not feasible |
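The burn-rate gating in the reliability row follows simple arithmetic, sketched here in the style of the Google SRE workbook's fast-burn alert; the default numbers (2% of a 30-day budget in 1 hour) are the commonly cited convention, not a requirement:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the error budget burns."""
    budget = 1.0 - slo          # e.g. a 99.9% SLO leaves a 0.1% error budget
    return error_rate / budget

def fast_burn_threshold(budget_fraction: float = 0.02,
                        window_hours: float = 1.0,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within
    `window_hours` of a `period_hours` SLO period."""
    return budget_fraction * period_hours / window_hours

# Classic fast-burn page: 2% of a 30-day budget in 1 hour -> 14.4x burn rate.
threshold = fast_burn_threshold()            # 0.02 * 720 / 1 = 14.4
# A 2% error rate against a 0.1% budget is a 20x burn -- page immediately.
assert burn_rate(0.02, 0.999) > threshold
```

Wiring this into Prometheus means expressing `error_rate` as a ratio of bad to total requests over the short and long windows and alerting when both exceed the threshold, which is what the shipped `prometheus-alert-rules.yaml` template is for.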
Navigation
Open these guides when needed:
| If the user needs… | Read | Also use |
|---|---|---|
| A minimal, production-ready baseline | references/core-observability-patterns.md | assets/checklists/template-observability-readiness-checklist.md |
| Node/Python instrumentation setup | references/opentelemetry-best-practices.md | assets/opentelemetry/nodejs/opentelemetry-nodejs-setup.md, assets/opentelemetry/python/opentelemetry-python-setup.md |
| Working trace propagation across services | references/distributed-tracing-patterns.md | assets/checklists/template-observability-readiness-checklist.md |
| SLOs, burn-rate alerts, and release gates | references/slo-design-guide.md | assets/monitoring/slo/slo-definition.yaml, assets/monitoring/slo/prometheus-alert-rules.yaml |
| Profiling/load testing with evidence | references/performance-profiling-guide.md | assets/load-testing/load-testing-k6.js, assets/load-testing/template-load-test-artillery.yaml |
| A maturity model and roadmap | references/observability-maturity-model.md | assets/checklists/template-observability-readiness-checklist.md |
| What to avoid and how to fix it | references/anti-patterns-best-practices.md | assets/checklists/template-observability-readiness-checklist.md |
Implementation guides (deep dives):
- references/core-observability-patterns.md
- references/opentelemetry-best-practices.md
- references/distributed-tracing-patterns.md
- references/slo-design-guide.md
- references/performance-profiling-guide.md
- references/observability-maturity-model.md
- references/anti-patterns-best-practices.md
Templates (copy/paste):
- assets/checklists/template-observability-readiness-checklist.md
- assets/opentelemetry/nodejs/opentelemetry-nodejs-setup.md
- assets/opentelemetry/python/opentelemetry-python-setup.md
- assets/monitoring/slo/slo-definition.yaml
- assets/monitoring/slo/prometheus-alert-rules.yaml
- assets/monitoring/grafana/grafana-dashboard-slo.json
- assets/monitoring/grafana/template-grafana-dashboard-observability.json
- assets/load-testing/load-testing-k6.js
- assets/load-testing/template-load-test-artillery.yaml
- assets/performance/frontend/template-lighthouse-ci.json
- assets/performance/backend/template-nodejs-profiling-config.js
Curated sources:
data/sources.json
Scope boundaries (handoffs)
- Pure infrastructure monitoring (Kubernetes, Docker, CI/CD): ../ops-devops-platform/SKILL.md
- Database query optimization (SQL tuning, indexing): ../data-sql-optimization/SKILL.md
- Application-level debugging (stack traces, breakpoints): ../qa-debugging/SKILL.md
- Test strategy design (coverage, test pyramids): ../qa-testing-strategy/SKILL.md
- Resilience patterns (retries, circuit breakers): ../qa-resilience/SKILL.md
- Architecture decisions (microservices, event-driven): ../software-architecture-design/SKILL.md
Tool selection notes (2026)
- Default to OpenTelemetry + OTLP + Collector where possible.
- Prefer burn-rate alerting against SLOs over alerting on raw infra metrics.
- Treat sampling, cardinality, and retention as part of quality (not an afterthought).
- When asked to pick vendors/tools, start from data/sources.json and validate time-sensitive claims with current docs/releases if the environment allows it.