ops-devops-platform

📁 vasilyu1983/ai-agents-public 📅 Jan 23, 2026

总安装量

周安装量

#9563

全站排名

安装命令

npx skills add https://github.com/vasilyu1983/ai-agents-public --skill ops-devops-platform

Agent 安装分布

claude-code 20

gemini-cli 17

antigravity 16

cursor 16

trae 15

Skill 文档

DevOps Engineering â Quick Reference

This skill equips teams with actionable templates, checklists, and patterns for building self-service platforms, automating infrastructure with GitOps, deploying securely with DevSecOps, scaling with Kubernetes, ensuring reliability through SRE practices, and operating production systems with strong observability.

Modern baseline (2026): IaC (Terraform/OpenTofu/Pulumi), GitOps (Argo CD/Flux), Kubernetes (follow upstream supported releases), OpenTelemetry + Prometheus/Grafana, supply-chain security (SBOM + signing + provenance), policy-as-code (OPA/Gatekeeper or Kyverno), and eBPF-powered networking/security/observability (e.g., Cilium + Tetragon).

Quick Reference

Task	Tool/Framework	Command	When to Use
Infrastructure as Code	Terraform / OpenTofu	`terraform plan && terraform apply`	Provision cloud resources declaratively
GitOps Deployment	Argo CD / Flux	`argocd app sync myapp`	Continuous reconciliation, declarative deployments
Container Build	Docker Engine	`docker build -t app:v1 .`	Package applications with dependencies
Kubernetes Deployment	kubectl / Helm (Kubernetes)	`kubectl apply -f deploy.yaml` / `helm upgrade app ./chart`	Deploy to K8s cluster, manage releases
CI/CD Pipeline	GitHub Actions	Define workflow in `.github/workflows/ci.yml`	Automated testing, building, deploying
Security Scanning	Trivy / Falco / Tetragon	`trivy image myapp:latest`	Vulnerability scanning, runtime security, eBPF enforcement
Monitoring & Alerts	Prometheus + Grafana	Configure ServiceMonitor and AlertManager	Observability, SLO tracking, incident alerts
Load Testing	k6 / Locust	`k6 run load-test.js`	Performance validation, capacity planning
Incident Response	PagerDuty / Opsgenie	Configure escalation policies	On-call management, automated escalation
Platform Engineering	Backstage / Port	Deploy internal developer portal	Self-service infrastructure, golden paths

Decision Tree: Choosing DevOps Approach

What do you need to accomplish?
    ââ Infrastructure provisioning?
    â   ââ Cloud-agnostic â Terraform or OpenTofu (OSS fork)
    â   ââ Programming-first â Pulumi (TypeScript/Python/Go)
    â   ââ AWS-specific â CloudFormation or Terraform/OpenTofu
    â   ââ GCP-specific â Deployment Manager or Terraform/OpenTofu
    â   ââ Azure-specific â ARM/Bicep or Terraform/OpenTofu
    â
    ââ Application deployment?
    â   ââ Kubernetes cluster?
    â   â   ââ Simple deploy â kubectl apply -f manifests/
    â   â   ââ Complex app â Helm charts
    â   â   ââ GitOps workflow â ArgoCD or FluxCD
    â   ââ Serverless?
    â       ââ AWS â Lambda + SAM/Serverless Framework
    â       ââ GCP â Cloud Functions
    â       ââ Azure â Azure Functions
    â
    ââ CI/CD pipeline setup?
    â   ââ GitHub-based â GitHub Actions (template-github-actions.md)
    â   ââ GitLab-based â GitLab CI
    â   ââ Enterprise â Jenkins or Tekton
    â   ââ Security-first â Add SAST/DAST/SCA scans (template-ci-cd.md)
    â
    ââ Observability & monitoring?
    â   ââ Metrics â Prometheus + Grafana
    â   ââ Distributed tracing â Jaeger or OpenTelemetry
    â   ââ Logs â Loki or ELK stack
    â   ââ eBPF-based â Cilium + Hubble (sidecarless)
    â   ââ Unified platform â Datadog or New Relic
    â
    ââ Incident management?
    â   ââ On-call rotation â PagerDuty or Opsgenie
    â   ââ Postmortem â template-postmortem.md
    â   ââ Communication â template-incident-comm.md
    â
    ââ Platform engineering?
    â   ââ Self-service â Backstage or Port (internal developer portal)
    â   ââ Policy enforcement â OPA/Gatekeeper
    â   ââ Golden paths â Template repositories + automation
    â
    ââ Security hardening?
        ââ Container scanning â Trivy or Grype
        ââ Runtime security â Falco or Sysdig
        ââ Secrets management â HashiCorp Vault or cloud-native KMS
        ââ Compliance â CIS Benchmarks, template-security-hardening.md

When to Use This Skill

Claude should invoke this skill when users request:

Platform engineering patterns (self-service developer platforms, internal tools)
GitOps workflows (ArgoCD, FluxCD, declarative infrastructure management)
Infrastructure as Code patterns (Terraform, K8s manifests, policy as code)
CI/CD pipelines with DevSecOps (GitHub Actions, security scanning, SAST/DAST/SCA)
SRE incident management, escalation, and postmortem templates
eBPF-based observability (Cilium, Hubble, kernel-level insights, OpenTelemetry)
Kubernetes operational patterns (day-2 operations, resource management, workload placement)
Cloud-native monitoring (Prometheus, Grafana, unified observability platforms)
Team workflow, communication, handover guides, and runbooks

Resources (Best Practices Guides)

Operational best practices by domain:

DevOps/SRE Operations: references/devops-best-practices.md – Core patterns for safe infrastructure changes, deployments, and incident response
Platform Engineering: references/platform-engineering-patterns.md – Self-service platforms, golden paths, internal developer portals, policy as code
GitOps Workflows: references/gitops-workflows.md – Continuous reconciliation, multi-environment promotion, ArgoCD/FluxCD patterns, progressive delivery
SRE Incident Management: references/sre-incident-management.md – Severity classification, escalation procedures, blameless postmortems, alert correlation, and runbooks
Operational Standards: references/operational-patterns.md – Platform engineering blueprints, CI/CD safety, SLOs, and reliability drills

Each guide includes:

Checklists for completeness and safety
Common anti-patterns and remediations
Step-by-step patterns for safe rollout, rollback, and verification
Decision matrices (e.g., deployment, escalation, monitoring strategy)
Real-world examples and edge case handling

Templates (Copy-Paste Ready)

Production templates organized by tech stack:

AWS Cloud

assets/aws/template-aws-ops.md – AWS service operations and best practices
assets/aws/template-aws-terraform.md – Terraform modules for AWS infrastructure
assets/aws/template-cost-optimization.md – AWS cost optimization strategies

Docker

assets/docker/template-docker-ops.md – Container build, security, and operations

Kafka

assets/kafka/template-kafka-ops.md – Kafka cluster operations and streaming

Terraform & IaC

assets/terraform-iac/template-iac-terraform.md – Infrastructure as Code patterns
assets/terraform-iac/template-module.md – Reusable Terraform modules
assets/terraform-iac/template-env-promotion.md – Environment promotion strategies

CI/CD Pipelines

assets/cicd-pipelines/template-ci-cd.md – General CI/CD patterns
assets/cicd-pipelines/template-github-actions.md – GitHub Actions workflows
assets/cicd-pipelines/template-gitops.md – GitOps deployment patterns
assets/cicd-pipelines/template-release-safety.md – Safe release practices

Monitoring & Observability

assets/monitoring-observability/template-slo.md – Service level objectives
assets/monitoring-observability/template-alert-rules.md – Alert configuration
assets/monitoring-observability/template-observability-slo.md – Observability patterns
assets/monitoring-observability/template-loadtest-perf.md – Load testing and performance

Incident Response

assets/incident-response/template-postmortem.md – Incident postmortems
assets/incident-response/template-runbook-starter.md – Runbook starter template
assets/incident-response/template-incident-comm.md – Incident communication
assets/incident-response/template-incident-response.md – Incident response procedures

Security

assets/security/template-security-hardening.md – Security hardening checklists

Navigation

Resources

Shared Utilities (Centralized patterns â extract, don’t duplicate)

../software-clean-code-standard/utilities/config-validation.md â Zod 3.24+, secrets management (Vault, 1Password, Doppler)
../software-clean-code-standard/utilities/resilience-utilities.md â p-retry v6, circuit breaker, OTel spans
../software-clean-code-standard/utilities/logging-utilities.md â pino v9 + OpenTelemetry integration
../software-clean-code-standard/utilities/observability-utilities.md â OpenTelemetry SDK, tracing, metrics
../software-clean-code-standard/utilities/testing-utilities.md â Test factories, fixtures, mocks
../software-clean-code-standard/references/clean-code-standard.md â Canonical clean code rules (CC-*) for citation

Templates

Data

data/sources.json â Curated external references

Related Skills

Operations & Infrastructure:

../qa-resilience/SKILL.md â Resilience, chaos engineering, and failure handling patterns
../data-sql-optimization/SKILL.md â Database tuning, high availability, and migrations
../qa-observability/SKILL.md â Monitoring, tracing, profiling, and performance optimization
../qa-debugging/SKILL.md â Production debugging, log analysis, and root cause investigation

Security & Compliance:

../software-security-appsec/SKILL.md â Application-layer security patterns and OWASP best practices

Software Development:

../software-backend/SKILL.md â Service-level design and integration patterns
../software-architecture-design/SKILL.md â System design, scalability, and architectural patterns
../dev-api-design/SKILL.md â RESTful API design and versioning
../git-workflow/SKILL.md â Git branching strategies and CI/CD integration

Optional: AI/Automation (Related Skills):

../ai-mlops/SKILL.md â ML model deployment, monitoring, and lifecycle management

Cost Governance & Capacity Planning

assets/cost-governance/template-cost-governance.md â Production cost control for cloud infrastructure.

Key Sections

Cost Governance Framework â Tagging strategy, budget alerts, anomaly detection
Cloud Cost Optimization â Right-sizing, reserved capacity, storage tiering
Kubernetes Cost Control â Resource requests/limits, quotas, autoscaler config
Capacity Planning â Utilization baseline, growth projections, scaling triggers
FinOps Practices â Monthly review agenda, optimization workflow

Do / Avoid

GOOD: Do

Tag all resources at creation time
Set budget alerts before hitting limits
Review right-sizing recommendations monthly
Use spot/preemptible for fault-tolerant workloads
Set Kubernetes resource requests on all pods
Enable cluster autoscaler with scale-down
Document capacity planning assumptions
Run postmortems after every incident

BAD: Avoid

Deploying without cost tags
Running dev resources 24/7
Over-provisioning “just in case”
Ignoring reserved capacity opportunities
Disabling scale-down to “avoid disruption”
Alert fatigue (too many low-priority alerts)
Snowflake infrastructure (manual, undocumented)
“Clickops” drift (changes outside IaC)

Anti-Patterns

Anti-Pattern	Problem	Fix
No tagging	Can’t attribute costs	Enforce tags in CI/CD
Dev runs 24/7	70% waste	Scheduled shutdown
Over-provisioned	Paying for unused capacity	Monthly right-sizing
No reservations	Paying on-demand premium	60-70% coverage target
Alert fatigue	Real issues missed	SLO-based alerting, tuned thresholds
Snowflake infra	Undocumented, unreproducible	Everything in Terraform/IaC
No postmortems	Same incidents repeat	Blameless postmortem for every SEV1/2

Optional: AI/Automation (AIOps)

Note: AI can assist with analysis and triage, but infrastructure/cost/incident decisions require human approval and an audit trail (especially anything destructive or irreversible).

AIOps Capabilities (2026)

Self-Healing Systems:

AI-powered anomaly detection to predict failures before they happen
Automated remediation flows that trigger rollbacks or config changes
Intelligent test selection and risk-based change scoring in CI/CD
Causal graph analysis for instant root cause identification

Automated Operations:

Unused resource detection and notification
Right-sizing recommendation generation
Alert summarization and correlation (reduce noise by 90%+)
Runbook step suggestions and automated execution

AI-Assisted Analysis

Cost trend prediction and anomaly detection
Incident pattern identification across services
Post-mortem theme extraction
Capacity planning predictions

Platform Engineering + AI

Platform teams increasingly embed AI capabilities directly into the platform:

Multi-agent orchestration for code generation, security validation, deployment
Intelligent defaults and guardrails that scale across teams

Bounded Claims

AI recommendations need validation before action
Automated deletions require approval workflow
Cost predictions are estimates, not guarantees
Runbook suggestions need SRE verification
Self-healing actions should have human-defined policies and audit trails

Operational Deep Dives

See references/operational-patterns.md for:

Platform engineering blueprints and GitOps reconciliation checklists
DevSecOps CI/CD gates, SLO/SLI playbooks, and rollout verification steps
Observability patterns (eBPF), incident noise reduction, and reliability drills

External Resources

See data/sources.json for curated sources organized by tech stack:

Cloud Platforms: AWS, GCP, Azure documentation and best practices
Container Orchestration: Kubernetes, Helm, Kustomize, Docker
Infrastructure as Code: Terraform, OpenTofu, Pulumi, CloudFormation, ARM templates
CI/CD & GitOps: GitHub Actions, GitLab CI, Jenkins, ArgoCD, FluxCD
Streaming: Apache Kafka, Confluent, Strimzi
Monitoring: Prometheus, Grafana, Datadog, OpenTelemetry, Jaeger, Cilium/Hubble, Tetragon
SRE: Google SRE books, incident response patterns
Security: OWASP DevSecOps, CIS Benchmarks, Trivy, Falco
Tools: kubectl, k9s, stern, Cosign, Syft, Terragrunt

Use this skill as a hub for safe, modern, and production-grade DevOps patterns. All templates and patterns are operationalâno theory or book summaries.

Trend Awareness Protocol

When users ask recommendation questions about DevOps, platform engineering, or cloud infrastructure, validate time-sensitive details (versions, deprecations, licensing, major releases) against primary sources.

Trigger Conditions

“What’s the best tool for [Kubernetes/IaC/CI-CD/monitoring]?”
“What should I use for [container orchestration/GitOps/observability]?”
“What’s the latest in DevOps/platform engineering?”
“Current best practices for [Terraform/ArgoCD/Prometheus]?”
“Is [tool/approach] still relevant in 2026?”
“[Kubernetes] vs [alternative]?” or “[ArgoCD] vs [FluxCD]?”
“Best cloud provider for [use case]?”
“What orchestration/monitoring tool should I use?”

Minimum Verification (Preferred Order)

Check the official docs + release notes linked in data/sources.json for the specific tools you recommend.
If internet access is available, confirm recent releases, breaking changes, and deprecations from those release pages.
If internet access is not available, state that versions may have changed and focus on stable selection criteria (operational fit, ecosystem, maturity, team skills, compliance).

What to Report

After searching, provide:

Current landscape: What tools/approaches are popular NOW (not 6 months ago)
Emerging trends: New tools, patterns, or practices gaining traction
Deprecated/declining: Tools/approaches losing relevance or support
Recommendation: Based on fresh data, not just static knowledge

Example Topics (verify with fresh search)

Kubernetes versions and ecosystem tools (1.33+, Cilium, Gateway API)
Infrastructure as Code (Terraform, OpenTofu, Pulumi, CDK)
GitOps platforms (ArgoCD, FluxCD, Codefresh)
Observability stacks (OpenTelemetry, Grafana stack, Datadog)
Platform engineering tools (Backstage, Port, Kratix)
CI/CD platforms (GitHub Actions, GitLab CI, Dagger)
Cloud-native security (Falco, Trivy, policy engines)

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台