ops-devops-platform
npx skills add https://github.com/vasilyu1983/ai-agents-public --skill ops-devops-platform
Agent 安装分布
Skill 文档
DevOps Engineering â Quick Reference
This skill equips teams with actionable templates, checklists, and patterns for building self-service platforms, automating infrastructure with GitOps, deploying securely with DevSecOps, scaling with Kubernetes, ensuring reliability through SRE practices, and operating production systems with strong observability.
Modern baseline (2026): IaC (Terraform/OpenTofu/Pulumi), GitOps (Argo CD/Flux), Kubernetes (follow upstream supported releases), OpenTelemetry + Prometheus/Grafana, supply-chain security (SBOM + signing + provenance), policy-as-code (OPA/Gatekeeper or Kyverno), and eBPF-powered networking/security/observability (e.g., Cilium + Tetragon).
Quick Reference
| Task | Tool/Framework | Command | When to Use |
|---|---|---|---|
| Infrastructure as Code | Terraform / OpenTofu | terraform plan && terraform apply |
Provision cloud resources declaratively |
| GitOps Deployment | Argo CD / Flux | argocd app sync myapp |
Continuous reconciliation, declarative deployments |
| Container Build | Docker Engine | docker build -t app:v1 . |
Package applications with dependencies |
| Kubernetes Deployment | kubectl / Helm (Kubernetes) | kubectl apply -f deploy.yaml / helm upgrade app ./chart |
Deploy to K8s cluster, manage releases |
| CI/CD Pipeline | GitHub Actions | Define workflow in .github/workflows/ci.yml |
Automated testing, building, deploying |
| Security Scanning | Trivy / Falco / Tetragon | trivy image myapp:latest |
Vulnerability scanning, runtime security, eBPF enforcement |
| Monitoring & Alerts | Prometheus + Grafana | Configure ServiceMonitor and AlertManager | Observability, SLO tracking, incident alerts |
| Load Testing | k6 / Locust | k6 run load-test.js |
Performance validation, capacity planning |
| Incident Response | PagerDuty / Opsgenie | Configure escalation policies | On-call management, automated escalation |
| Platform Engineering | Backstage / Port | Deploy internal developer portal | Self-service infrastructure, golden paths |
Decision Tree: Choosing DevOps Approach
What do you need to accomplish?
ââ Infrastructure provisioning?
â ââ Cloud-agnostic â Terraform or OpenTofu (OSS fork)
â ââ Programming-first â Pulumi (TypeScript/Python/Go)
â ââ AWS-specific â CloudFormation or Terraform/OpenTofu
â ââ GCP-specific â Deployment Manager or Terraform/OpenTofu
â ââ Azure-specific â ARM/Bicep or Terraform/OpenTofu
â
ââ Application deployment?
â ââ Kubernetes cluster?
â â ââ Simple deploy â kubectl apply -f manifests/
â â ââ Complex app â Helm charts
â â ââ GitOps workflow â ArgoCD or FluxCD
â ââ Serverless?
â ââ AWS â Lambda + SAM/Serverless Framework
â ââ GCP â Cloud Functions
â ââ Azure â Azure Functions
â
ââ CI/CD pipeline setup?
â ââ GitHub-based â GitHub Actions (template-github-actions.md)
â ââ GitLab-based â GitLab CI
â ââ Enterprise â Jenkins or Tekton
â ââ Security-first â Add SAST/DAST/SCA scans (template-ci-cd.md)
â
ââ Observability & monitoring?
â ââ Metrics â Prometheus + Grafana
â ââ Distributed tracing â Jaeger or OpenTelemetry
â ââ Logs â Loki or ELK stack
â ââ eBPF-based â Cilium + Hubble (sidecarless)
â ââ Unified platform â Datadog or New Relic
â
ââ Incident management?
â ââ On-call rotation â PagerDuty or Opsgenie
â ââ Postmortem â template-postmortem.md
â ââ Communication â template-incident-comm.md
â
ââ Platform engineering?
â ââ Self-service â Backstage or Port (internal developer portal)
â ââ Policy enforcement â OPA/Gatekeeper
â ââ Golden paths â Template repositories + automation
â
ââ Security hardening?
ââ Container scanning â Trivy or Grype
ââ Runtime security â Falco or Sysdig
ââ Secrets management â HashiCorp Vault or cloud-native KMS
ââ Compliance â CIS Benchmarks, template-security-hardening.md
When to Use This Skill
Claude should invoke this skill when users request:
- Platform engineering patterns (self-service developer platforms, internal tools)
- GitOps workflows (ArgoCD, FluxCD, declarative infrastructure management)
- Infrastructure as Code patterns (Terraform, K8s manifests, policy as code)
- CI/CD pipelines with DevSecOps (GitHub Actions, security scanning, SAST/DAST/SCA)
- SRE incident management, escalation, and postmortem templates
- eBPF-based observability (Cilium, Hubble, kernel-level insights, OpenTelemetry)
- Kubernetes operational patterns (day-2 operations, resource management, workload placement)
- Cloud-native monitoring (Prometheus, Grafana, unified observability platforms)
- Team workflow, communication, handover guides, and runbooks
Resources (Best Practices Guides)
Operational best practices by domain:
- DevOps/SRE Operations: references/devops-best-practices.md – Core patterns for safe infrastructure changes, deployments, and incident response
- Platform Engineering: references/platform-engineering-patterns.md – Self-service platforms, golden paths, internal developer portals, policy as code
- GitOps Workflows: references/gitops-workflows.md – Continuous reconciliation, multi-environment promotion, ArgoCD/FluxCD patterns, progressive delivery
- SRE Incident Management: references/sre-incident-management.md – Severity classification, escalation procedures, blameless postmortems, alert correlation, and runbooks
- Operational Standards: references/operational-patterns.md – Platform engineering blueprints, CI/CD safety, SLOs, and reliability drills
Each guide includes:
- Checklists for completeness and safety
- Common anti-patterns and remediations
- Step-by-step patterns for safe rollout, rollback, and verification
- Decision matrices (e.g., deployment, escalation, monitoring strategy)
- Real-world examples and edge case handling
Templates (Copy-Paste Ready)
Production templates organized by tech stack:
AWS Cloud
- assets/aws/template-aws-ops.md – AWS service operations and best practices
- assets/aws/template-aws-terraform.md – Terraform modules for AWS infrastructure
- assets/aws/template-cost-optimization.md – AWS cost optimization strategies
GCP Cloud
- assets/gcp/template-gcp-ops.md – GCP service operations
- assets/gcp/template-gcp-terraform.md – Terraform modules for GCP
Azure Cloud
- assets/azure/template-azure-ops.md – Azure service operations
Kubernetes
- assets/kubernetes/template-kubernetes-ops.md – Day-to-day K8s operations
- assets/kubernetes/template-ha-dr.md – High availability and disaster recovery
- assets/kubernetes/template-platform-api.md – Platform API patterns
- assets/kubernetes/template-k8s-deploy.yaml – Deployment manifests
Docker
- assets/docker/template-docker-ops.md – Container build, security, and operations
Kafka
- assets/kafka/template-kafka-ops.md – Kafka cluster operations and streaming
Terraform & IaC
- assets/terraform-iac/template-iac-terraform.md – Infrastructure as Code patterns
- assets/terraform-iac/template-module.md – Reusable Terraform modules
- assets/terraform-iac/template-env-promotion.md – Environment promotion strategies
CI/CD Pipelines
- assets/cicd-pipelines/template-ci-cd.md – General CI/CD patterns
- assets/cicd-pipelines/template-github-actions.md – GitHub Actions workflows
- assets/cicd-pipelines/template-gitops.md – GitOps deployment patterns
- assets/cicd-pipelines/template-release-safety.md – Safe release practices
Monitoring & Observability
- assets/monitoring-observability/template-slo.md – Service level objectives
- assets/monitoring-observability/template-alert-rules.md – Alert configuration
- assets/monitoring-observability/template-observability-slo.md – Observability patterns
- assets/monitoring-observability/template-loadtest-perf.md – Load testing and performance
Incident Response
- assets/incident-response/template-postmortem.md – Incident postmortems
- assets/incident-response/template-runbook-starter.md – Runbook starter template
- assets/incident-response/template-incident-comm.md – Incident communication
- assets/incident-response/template-incident-response.md – Incident response procedures
Security
- assets/security/template-security-hardening.md – Security hardening checklists
Navigation
Resources
- references/operational-patterns.md
- references/sre-incident-management.md
- references/devops-best-practices.md
- references/platform-engineering-patterns.md
- references/gitops-workflows.md
Shared Utilities (Centralized patterns â extract, don’t duplicate)
- ../software-clean-code-standard/utilities/config-validation.md â Zod 3.24+, secrets management (Vault, 1Password, Doppler)
- ../software-clean-code-standard/utilities/resilience-utilities.md â p-retry v6, circuit breaker, OTel spans
- ../software-clean-code-standard/utilities/logging-utilities.md â pino v9 + OpenTelemetry integration
- ../software-clean-code-standard/utilities/observability-utilities.md â OpenTelemetry SDK, tracing, metrics
- ../software-clean-code-standard/utilities/testing-utilities.md â Test factories, fixtures, mocks
- ../software-clean-code-standard/references/clean-code-standard.md â Canonical clean code rules (
CC-*) for citation
Templates
- assets/incident-response/template-postmortem.md
- assets/incident-response/template-runbook-starter.md
- assets/incident-response/template-incident-comm.md
- assets/incident-response/template-incident-response.md
- assets/docker/template-docker-ops.md
- assets/security/template-security-hardening.md
- assets/azure/template-azure-ops.md
- assets/gcp/template-gcp-terraform.md
- assets/gcp/template-gcp-ops.md
- assets/cicd-pipelines/template-release-safety.md
- assets/cicd-pipelines/template-gitops.md
- assets/cicd-pipelines/template-ci-cd.md
- assets/cicd-pipelines/template-github-actions.md
- assets/kafka/template-kafka-ops.md
- assets/aws/template-aws-terraform.md
- assets/aws/template-aws-ops.md
- assets/aws/template-cost-optimization.md
- assets/monitoring-observability/template-slo.md
- assets/monitoring-observability/template-loadtest-perf.md
- assets/monitoring-observability/template-alert-rules.md
- assets/monitoring-observability/template-observability-slo.md
- assets/kubernetes/template-k8s-deploy.yaml
- assets/kubernetes/template-platform-api.md
- assets/kubernetes/template-kubernetes-ops.md
- assets/kubernetes/template-ha-dr.md
- assets/terraform-iac/template-env-promotion.md
- assets/terraform-iac/template-iac-terraform.md
- assets/terraform-iac/template-module.md
Data
- data/sources.json â Curated external references
Related Skills
Operations & Infrastructure:
- ../qa-resilience/SKILL.md â Resilience, chaos engineering, and failure handling patterns
- ../data-sql-optimization/SKILL.md â Database tuning, high availability, and migrations
- ../qa-observability/SKILL.md â Monitoring, tracing, profiling, and performance optimization
- ../qa-debugging/SKILL.md â Production debugging, log analysis, and root cause investigation
Security & Compliance:
- ../software-security-appsec/SKILL.md â Application-layer security patterns and OWASP best practices
Software Development:
- ../software-backend/SKILL.md â Service-level design and integration patterns
- ../software-architecture-design/SKILL.md â System design, scalability, and architectural patterns
- ../dev-api-design/SKILL.md â RESTful API design and versioning
- ../git-workflow/SKILL.md â Git branching strategies and CI/CD integration
Optional: AI/Automation (Related Skills):
- ../ai-mlops/SKILL.md â ML model deployment, monitoring, and lifecycle management
Cost Governance & Capacity Planning
assets/cost-governance/template-cost-governance.md â Production cost control for cloud infrastructure.
Key Sections
- Cost Governance Framework â Tagging strategy, budget alerts, anomaly detection
- Cloud Cost Optimization â Right-sizing, reserved capacity, storage tiering
- Kubernetes Cost Control â Resource requests/limits, quotas, autoscaler config
- Capacity Planning â Utilization baseline, growth projections, scaling triggers
- FinOps Practices â Monthly review agenda, optimization workflow
Do / Avoid
GOOD: Do
- Tag all resources at creation time
- Set budget alerts before hitting limits
- Review right-sizing recommendations monthly
- Use spot/preemptible for fault-tolerant workloads
- Set Kubernetes resource requests on all pods
- Enable cluster autoscaler with scale-down
- Document capacity planning assumptions
- Run postmortems after every incident
BAD: Avoid
- Deploying without cost tags
- Running dev resources 24/7
- Over-provisioning “just in case”
- Ignoring reserved capacity opportunities
- Disabling scale-down to “avoid disruption”
- Alert fatigue (too many low-priority alerts)
- Snowflake infrastructure (manual, undocumented)
- “Clickops” drift (changes outside IaC)
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| No tagging | Can’t attribute costs | Enforce tags in CI/CD |
| Dev runs 24/7 | 70% waste | Scheduled shutdown |
| Over-provisioned | Paying for unused capacity | Monthly right-sizing |
| No reservations | Paying on-demand premium | 60-70% coverage target |
| Alert fatigue | Real issues missed | SLO-based alerting, tuned thresholds |
| Snowflake infra | Undocumented, unreproducible | Everything in Terraform/IaC |
| No postmortems | Same incidents repeat | Blameless postmortem for every SEV1/2 |
Optional: AI/Automation (AIOps)
Note: AI can assist with analysis and triage, but infrastructure/cost/incident decisions require human approval and an audit trail (especially anything destructive or irreversible).
AIOps Capabilities (2026)
Self-Healing Systems:
- AI-powered anomaly detection to predict failures before they happen
- Automated remediation flows that trigger rollbacks or config changes
- Intelligent test selection and risk-based change scoring in CI/CD
- Causal graph analysis for instant root cause identification
Automated Operations:
- Unused resource detection and notification
- Right-sizing recommendation generation
- Alert summarization and correlation (reduce noise by 90%+)
- Runbook step suggestions and automated execution
AI-Assisted Analysis
- Cost trend prediction and anomaly detection
- Incident pattern identification across services
- Post-mortem theme extraction
- Capacity planning predictions
Platform Engineering + AI
Platform teams increasingly embed AI capabilities directly into the platform:
- Multi-agent orchestration for code generation, security validation, deployment
- Intelligent defaults and guardrails that scale across teams
Bounded Claims
- AI recommendations need validation before action
- Automated deletions require approval workflow
- Cost predictions are estimates, not guarantees
- Runbook suggestions need SRE verification
- Self-healing actions should have human-defined policies and audit trails
Operational Deep Dives
See references/operational-patterns.md for:
- Platform engineering blueprints and GitOps reconciliation checklists
- DevSecOps CI/CD gates, SLO/SLI playbooks, and rollout verification steps
- Observability patterns (eBPF), incident noise reduction, and reliability drills
External Resources
See data/sources.json for curated sources organized by tech stack:
- Cloud Platforms: AWS, GCP, Azure documentation and best practices
- Container Orchestration: Kubernetes, Helm, Kustomize, Docker
- Infrastructure as Code: Terraform, OpenTofu, Pulumi, CloudFormation, ARM templates
- CI/CD & GitOps: GitHub Actions, GitLab CI, Jenkins, ArgoCD, FluxCD
- Streaming: Apache Kafka, Confluent, Strimzi
- Monitoring: Prometheus, Grafana, Datadog, OpenTelemetry, Jaeger, Cilium/Hubble, Tetragon
- SRE: Google SRE books, incident response patterns
- Security: OWASP DevSecOps, CIS Benchmarks, Trivy, Falco
- Tools: kubectl, k9s, stern, Cosign, Syft, Terragrunt
Use this skill as a hub for safe, modern, and production-grade DevOps patterns. All templates and patterns are operationalâno theory or book summaries.
Trend Awareness Protocol
When users ask recommendation questions about DevOps, platform engineering, or cloud infrastructure, validate time-sensitive details (versions, deprecations, licensing, major releases) against primary sources.
Trigger Conditions
- “What’s the best tool for [Kubernetes/IaC/CI-CD/monitoring]?”
- “What should I use for [container orchestration/GitOps/observability]?”
- “What’s the latest in DevOps/platform engineering?”
- “Current best practices for [Terraform/ArgoCD/Prometheus]?”
- “Is [tool/approach] still relevant in 2026?”
- “[Kubernetes] vs [alternative]?” or “[ArgoCD] vs [FluxCD]?”
- “Best cloud provider for [use case]?”
- “What orchestration/monitoring tool should I use?”
Minimum Verification (Preferred Order)
- Check the official docs + release notes linked in data/sources.json for the specific tools you recommend.
- If internet access is available, confirm recent releases, breaking changes, and deprecations from those release pages.
- If internet access is not available, state that versions may have changed and focus on stable selection criteria (operational fit, ecosystem, maturity, team skills, compliance).
What to Report
After searching, provide:
- Current landscape: What tools/approaches are popular NOW (not 6 months ago)
- Emerging trends: New tools, patterns, or practices gaining traction
- Deprecated/declining: Tools/approaches losing relevance or support
- Recommendation: Based on fresh data, not just static knowledge
Example Topics (verify with fresh search)
- Kubernetes versions and ecosystem tools (1.33+, Cilium, Gateway API)
- Infrastructure as Code (Terraform, OpenTofu, Pulumi, CDK)
- GitOps platforms (ArgoCD, FluxCD, Codefresh)
- Observability stacks (OpenTelemetry, Grafana stack, Datadog)
- Platform engineering tools (Backstage, Port, Kratix)
- CI/CD platforms (GitHub Actions, GitLab CI, Dagger)
- Cloud-native security (Falco, Trivy, policy engines)