devops-sre

Repository: ragnarula/cc-plugins · 3 days ago
Install command
npx skills add https://github.com/ragnarula/cc-plugins --skill devops-sre

Skill documentation

DevOps & SRE Engineering

When to Apply

Use this skill when the system involves:

  • CI/CD pipelines and deployment automation
  • Production deployments and rollback strategies
  • Monitoring, alerting, and observability
  • Incident response and on-call procedures
  • SLOs, SLIs, and error budgets
  • Capacity planning and performance management

Mindset

DevOps/SRE engineers think about the entire lifecycle from commit to production and beyond.

Questions to always ask:

  • How do we deploy this safely? How do we roll back?
  • How do we know it’s working? What do we alert on?
  • What’s the SLO? What happens when we miss it?
  • How do we debug this in production?
  • What’s the on-call burden? Is this operable at 3am?
  • How do we handle traffic spikes? Gradual degradation?
  • What’s the blast radius of a bad deploy?

Assumptions to challenge:

  • “It works on my machine” – Production is different. Test in production-like environments.
  • “We’ll monitor it later” – If you can’t observe it, you can’t operate it.
  • “Deploys are safe” – Any change can break things. Deploy progressively.
  • “More alerts are better” – Alert fatigue is real. Alert on symptoms, not causes.
  • “We’ll scale when needed” – Know your limits before you hit them.
  • “Rollback is easy” – Is it? Have you tested it? What about data migrations?

Practices

CI/CD Pipeline

Automate everything from commit to deploy. Keep feedback loops fast (under 10 minutes to know whether a change broke something). Produce reproducible builds and immutable artifacts. Avoid manual steps in the pipeline, slow feedback loops, and builds that differ between environments.
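
One way to make "immutable artifacts" concrete is to address each build output by its content hash and verify that hash before every deploy. The sketch below is illustrative only: the function names, the stand-in artifact bytes, and the environment names are assumptions, not part of any particular CI system.

```python
import hashlib


def artifact_digest(artifact: bytes) -> str:
    """Content address of a build artifact; any byte change yields a new digest."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


def verify_before_deploy(artifact: bytes, expected_digest: str) -> None:
    """Refuse to deploy bytes that differ from what CI built and tested."""
    actual = artifact_digest(artifact)
    if actual != expected_digest:
        raise RuntimeError(
            f"artifact mismatch: expected {expected_digest}, got {actual}"
        )


# Build once, record the digest, then promote the same bytes everywhere.
built = b"app-v1 binary contents"  # stand-in for a real image or package
digest = artifact_digest(built)
for environment in ("staging", "production"):  # hypothetical environment names
    verify_before_deploy(built, digest)
```

Because every environment receives the same verified bytes, a difference in behavior between staging and production points at configuration or data rather than at the build.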

Deployment Strategy

Use progressive rollouts (canary, blue-green, rolling). Define rollback triggers and automate rollback. Separate deploy from release (feature flags). Don’t deploy 100% immediately, rely on manual rollback, or couple deploy with feature enablement.
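
A progressive rollout with an automated rollback trigger might look roughly like the following. This is a minimal sketch under stated assumptions: `set_traffic_percent` and `observed_error_rate` are hypothetical hooks into a load balancer and a metrics system, and the step sizes and 1% threshold are placeholder values.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool     # True if the new version reached the final step
    final_percent: int  # traffic share on the new version when the rollout ended
    reason: str


def progressive_rollout(
    set_traffic_percent: Callable[[int], None],
    observed_error_rate: Callable[[], float],
    steps: Sequence[int] = (5, 25, 50, 100),
    max_error_rate: float = 0.01,
) -> RolloutResult:
    """Shift traffic step by step; roll back automatically when the trigger fires."""
    for percent in steps:
        set_traffic_percent(percent)
        rate = observed_error_rate()  # in practice, read after a soak period
        if rate > max_error_rate:
            set_traffic_percent(0)  # automated rollback, no human in the loop
            return RolloutResult(
                False, 0, f"error rate {rate:.2%} at {percent}% exceeded limit"
            )
    return RolloutResult(True, steps[-1], "all steps healthy")


# Simulated run: the third step sees a 5% error rate and triggers rollback.
history = []
rates = iter([0.001, 0.002, 0.05])
result = progressive_rollout(history.append, lambda: next(rates))
print(result.completed, history)  # False [5, 25, 50, 0]
```

Separating deploy from release would sit on top of this: the code ships through the rollout with the feature flag off, and the flag is turned on as a separate, independently reversible step.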

Observability

Instrument the four golden signals: latency, traffic, errors, saturation. Use structured logging with correlation IDs. Implement distributed tracing. Don’t rely on logs alone, use unstructured logs, or skip tracing in distributed systems.
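
Structured logging with correlation IDs can be sketched with the Python standard library alone. The field names, the `checkout` logger name, and the `handle_request` helper below are assumptions chosen for illustration; a real service would usually take the ID from an incoming request header.

```python
import contextvars
import json
import logging
import uuid
from typing import Optional

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs can be queried by field."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's ID if one was sent, otherwise mint a new one."""
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("payment authorized")
        return cid
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request
```

Every log line emitted while a request is in flight carries the same `correlation_id`, so a single query can reassemble the request's path across services that forward the ID.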

Alerting

Alert on symptoms (SLO breach), not causes. Page only for actionable, urgent issues. Route non-urgent to tickets. Include runbook links in alerts. Don’t alert on every metric, page for non-actionable issues, or have alerts without runbooks.
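
The routing rules above can be written down as a small decision function. This is a sketch, not a description of any alerting product: the `Alert` fields, the `Route` values, and the example runbook URL are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"      # wake someone up
    TICKET = "ticket"  # handle during working hours
    DROP = "drop"      # not an alert; keep it on a dashboard


@dataclass(frozen=True)
class Alert:
    name: str
    actionable: bool
    urgent: bool
    runbook_url: str = ""


def route_alert(alert: Alert) -> Route:
    """Page only for actionable, urgent issues; everything routed needs a runbook."""
    if not alert.actionable:
        return Route.DROP
    if not alert.runbook_url:
        raise ValueError(f"{alert.name}: alerts must link to a runbook")
    return Route.PAGE if alert.urgent else Route.TICKET


# Placeholder URL for illustration only.
slo_burn = Alert(
    "checkout-slo-burn", True, True, "https://runbooks.example.com/checkout"
)
print(route_alert(slo_burn).value)  # page
```

Rejecting alerts that lack a runbook at definition time is one way to enforce the "no alerts without runbooks" rule before the alert ever fires.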

SLOs & Error Budgets

Define SLOs based on user experience. Measure SLIs accurately. Use the error budget to balance velocity and reliability. Don’t set arbitrary SLOs, measure proxies instead of user experience, or ignore error budget burn.
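
As a worked example of the arithmetic, an error budget is simply the complement of the SLO applied to a window of time or requests. The 30-day window and the request counts below are assumed values for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 30 days leaves about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))  # 43.2

# 250 failures out of 1,000,000 requests spends a quarter of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # 0.75
```

A team might slow feature rollouts as `budget_remaining` approaches zero and resume normal velocity once the window rolls over; the exact policy is a choice the team makes, not something the formula dictates.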

Incident Response

Have clear escalation paths. Run blameless postmortems. Document incidents and learnings. Practice incident response regularly. Don’t blame individuals, skip postmortems, or let learnings rot in docs.

Runbooks

Document common operational tasks. Include debugging steps for known failure modes. Keep runbooks next to alerts. Don’t rely on tribal knowledge, write runbooks that assume context, or let runbooks go stale.

Capacity Planning

Know your limits before you hit them. Load test regularly. Plan for peak, not average. Have scaling playbooks ready. Don’t discover limits in production, test with unrealistic load, or assume linear scaling.
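
To illustrate why planning for peak rather than average matters, here is a rough sizing calculation. All numbers are assumptions, and the per-instance throughput should come from a load test rather than an estimate, since real systems rarely scale linearly.

```python
import math


def instances_needed(
    demand_rps: float,
    per_instance_rps: float,
    target_utilization: float = 0.7,
) -> int:
    """Instances required to serve demand while keeping headroom on each one."""
    if demand_rps < 0 or per_instance_rps <= 0:
        raise ValueError("demand must be >= 0 and per-instance capacity > 0")
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = per_instance_rps * target_utilization
    return math.ceil(demand_rps / usable_rps)


# Assumed load-test result: one instance sustains 100 requests per second.
average = instances_needed(demand_rps=1200, per_instance_rps=100)
peak = instances_needed(demand_rps=4000, per_instance_rps=100)
print(average, peak)  # 18 58
```

Sizing for the 1,200 rps average would leave the service more than three times short at the 4,000 rps peak, which is the gap a scaling playbook needs to cover.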

Vocabulary

Use precise terminology:

Instead of        Say
“reliable”        “99.9% availability SLO” / “< 1% error rate”
“monitored”       “SLI dashboards” / “alerting on p99 > 500ms”
“deployed”        “canary at 5%” / “blue-green with instant rollback”
“fast deploys”    “< 15 min commit-to-prod” / “10 deploys/day”
“observable”      “traces, metrics, structured logs with correlation IDs”
“on-call”         “PagerDuty rotation” / “< 5 pages/week”

SDD Integration

During Specification:

  • Define SLOs based on user-facing requirements
  • Identify operational requirements (deployment frequency, rollback needs)
  • Clarify observability requirements
  • Establish on-call expectations

During Design:

  • Design for observability from the start
  • Specify deployment strategy and rollback approach
  • Document what metrics/logs/traces each component emits
  • Plan for graceful degradation
  • Identify what runbooks will be needed

During Review:

  • Verify observability is instrumented
  • Check deployment strategy is progressive
  • Confirm rollback is automated and tested
  • Validate alerts are actionable with runbooks
  • Ensure SLIs actually measure what the SLOs promise