kubernetes-specialist

📁 404kidwiz/claude-supercode-skills 📅 Jan 24, 2026
Install command
npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill kubernetes-specialist

Skill Documentation

Kubernetes Specialist

Purpose

Provides expert guidance on Kubernetes orchestration and cloud-native applications, with deep knowledge of container orchestration, cluster management, and production-grade deployments. Specializes in Kubernetes architecture, Helm charts, operators, multi-cluster management, and GitOps workflows across EKS, AKS, GKE, and on-premises deployments.

When to Use

  • Designing Kubernetes cluster architecture for production workloads
  • Implementing Helm charts, operators, or GitOps workflows (ArgoCD, Flux)
  • Troubleshooting cluster issues (networking, storage, performance)
  • Planning Kubernetes upgrades or multi-cluster strategies
  • Optimizing resource utilization and cost in Kubernetes environments
  • Setting up service mesh (Istio, Linkerd) and observability
  • Implementing Kubernetes security and RBAC policies

Quick Start

Invoke this skill for any of the tasks listed under When to Use.

Do NOT invoke when:

  • Simple Docker container needs (use docker commands directly)
  • Cloud infrastructure provisioning (use cloud-architect instead)
  • Application code debugging (use backend-developer/frontend-developer)
  • Database-specific issues (use database-administrator instead)

Decision Framework

Deployment Strategy Selection

├─ Zero downtime required?
│   ├─ Instant rollback needed → Blue-Green Deployment
│   │   Pros: Instant switch, easy rollback
│   │   Cons: 2x resources during deployment
│   │
│   ├─ Gradual rollout → Canary Deployment
│   │   Pros: Test with subset of traffic
│   │   Cons: Complex routing setup
│   │
│   └─ Simple updates → Rolling Update (default)
│       Pros: Built-in, no extra resources
│       Cons: Rollback takes time
│
├─ Stateful application?
│   ├─ Yes (e.g., database) → StatefulSet + PVC
│   │   Pros: Stable network IDs, ordered deployment
│   │   Cons: Complex scaling
│   │
│   └─ No (stateless) → Deployment
│       Pros: Easy scaling, self-healing
│
└─ Batch processing?
    ├─ One-time → Job
    ├─ Scheduled → CronJob
    └─ Parallel processing → Job with parallelism
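
Two branches of the tree above can be sketched as manifests: the default rolling update and a Job with parallelism. This is a minimal illustration, not a recommended production configuration. The names (`web-api`, `report-batch`), image references, replica counts, and surge settings are all assumptions chosen for the example.

```yaml
# Rolling update: replace pods gradually without dropping below desired capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api                 # hypothetical name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1               # at most one extra pod during the rollout
      maxUnavailable: 0         # never go below the desired replica count
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
---
# Parallel batch processing: 10 completions, up to 3 pods running at a time.
apiVersion: batch/v1
kind: Job
metadata:
  name: report-batch            # hypothetical name
spec:
  completions: 10
  parallelism: 3
  backoffLimit: 4               # retries before the Job is marked failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/report-worker:1.0.0   # placeholder image
```

A rolling update is rolled back with `kubectl rollout undo deployment/web-api`, which triggers another gradual rollout. That is why the tree lists rollback time as its main drawback.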

Resource Configuration Matrix

Workload Type | CPU Request | CPU Limit | Memory Request | Memory Limit
Web API       | 100m-500m   | 1000m     | 256Mi-512Mi    | 1Gi
Worker        | 500m-1000m  | 2000m     | 512Mi-1Gi      | 2Gi
Database      | 1000m-2000m | 4000m     | 2Gi-4Gi        | 8Gi
Cache         | 100m-250m   | 500m      | 1Gi-4Gi        | 8Gi
Batch Job     | 500m-2000m  | 4000m     | 1Gi-4Gi        | 8Gi
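
As an illustration of how a row of this matrix maps onto a pod spec, the sketch below applies the Web API row. The pod name and image are placeholders, and the request values are simply one point inside the ranges given above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-api                 # hypothetical name
spec:
  containers:
    - name: web-api
      image: registry.example.com/web-api:1.2.3   # placeholder image
      resources:
        requests:
          cpu: 250m             # within the 100m-500m request range
          memory: 256Mi         # lower end of the 256Mi-512Mi request range
        limits:
          cpu: 1000m            # Web API CPU limit from the matrix
          memory: 1Gi           # Web API memory limit from the matrix
```

The scheduler places pods based on requests. Limits cap usage at runtime: a container that exceeds its memory limit is OOM-killed, while one that exceeds its CPU limit is throttled.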

Node Pool Strategy

Use Case      | Instance Type (AWS) | Scaling   | Cost
System pods   | t3.large (3 nodes)  | Fixed     | Low
Applications  | m5.xlarge           | Auto 3-20 | Medium
Batch/Spot    | m5.large-2xlarge    | Auto 0-50 | Very Low
GPU workloads | p3.2xlarge          | Manual    | High
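
Separate node pools only help if workloads are steered onto them. The sketch below shows one common way to pin a batch Job to the Batch/Spot pool using a node selector and a toleration. The node label `node-pool: batch-spot` and the taint `workload-type=batch:NoSchedule` are assumptions for this example; use whatever labels and taints your node pools actually carry.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-export          # hypothetical name
spec:
  backoffLimit: 6               # spot nodes can be reclaimed, so allow retries
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        node-pool: batch-spot   # assumed label on the Batch/Spot nodes
      tolerations:
        - key: workload-type    # assumed taint on the Batch/Spot nodes
          operator: Equal
          value: batch
          effect: NoSchedule
      containers:
        - name: export
          image: registry.example.com/export:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
```

The taint keeps ordinary application pods off the spot nodes, and the node selector keeps this Job off the other pools.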

Red Flags → Escalate

STOP and escalate if:

  • Cluster upgrade involves breaking API changes (workloads still use deprecated or removed API versions)
  • Multi-region active-active requirements
  • Compliance requirements (PCI-DSS, HIPAA) need validation
  • Custom scheduler or controller development needed
  • etcd corruption or cluster state issues

Quality Checklist

Cluster Configuration

  • Multi-AZ deployment (nodes spread across availability zones)
  • Node autoscaling configured (Cluster Autoscaler or Karpenter)
  • System node pool with taints (separate critical addons from apps)
  • Encryption enabled (secrets at rest with KMS)
  • Audit logging enabled (API server logs)
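
For the audit logging item, a minimal audit policy might look like the sketch below. It assumes a self-managed control plane where the API server is started with `--audit-policy-file`. On managed services such as EKS, AKS, and GKE, audit logging is usually switched on through the provider's own settings. The choice of which resources get which level is illustrative only.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - "RequestReceived"           # skip the noisy initial stage
rules:
  # Record only metadata for Secrets and ConfigMaps so that
  # sensitive payloads never end up in the audit log.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Record full request and response bodies for RBAC changes.
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
  # Catch-all: metadata for everything else.
  - level: Metadata
```

Rules are evaluated in order and the first match wins, so the Secrets rule has to come before any broader rule.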

Security

  • Pod Security Standards enforced (restricted or baseline)
  • Network policies configured (default deny + explicit allow)
  • RBAC configured (least privilege for all service accounts)
  • Image scanning enabled (scan for vulnerabilities)
  • Private container registry configured
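
The first three items in this list can be sketched as manifests for a single namespace. The namespace `payments`, the service account `payments-api`, and the role `configmap-reader` are hypothetical names. The permissions shown are only an example of what least privilege might look like for one workload.

```yaml
# Pod Security Standards: enforce the "restricted" profile for the namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: payments                # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Default deny: selects every pod and allows no ingress or egress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api            # hypothetical service account
  namespace: payments
---
# Least privilege: read-only access to ConfigMaps in this namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-api-configmap-reader
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: payments-api
    namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
```

The default-deny policy also blocks DNS lookups, so it needs companion policies that explicitly allow the required traffic. Network policies only take effect when the cluster's CNI plugin enforces them.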

Resource Management

  • All pods have resource requests and limits
  • HorizontalPodAutoscalers configured for scalable workloads
  • PodDisruptionBudgets defined (limit how many pods can be down at once during voluntary disruptions such as node drains)
  • ResourceQuotas set per namespace
  • LimitRanges defined (default limits for pods)
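
A per-namespace quota, default container limits, and a CPU-based autoscaler could be declared roughly as follows. The namespace `team-a`, the Deployment name `web-api`, and every number are assumptions. Size them from observed usage.

```yaml
# Caps the total resources the namespace may request or consume.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a             # hypothetical namespace
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
# Supplies defaults for containers that omit requests or limits.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 256Mi
      default:
        cpu: 500m
        memory: 512Mi
---
# Scales the Deployment between 2 and 10 replicas on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
  namespace: team-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api               # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Utilization is measured against the CPU request, so the autoscaler only works for pods that set one. It also relies on the metrics server being installed.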

High Availability

  • Deployments have ≥2 replicas
  • Anti-affinity rules prevent pod co-location
  • Readiness and liveness probes configured
  • PodDisruptionBudgets leave room for rolling updates and node drains (minAvailable lower than the replica count)
  • Multi-region cluster (if global scale required)
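
Most of this list fits into one Deployment plus a PodDisruptionBudget. In the sketch below, the `web-api` name, image, port, and probe paths (`/ready`, `/healthz`) are hypothetical. The anti-affinity is expressed as a preference, not a hard requirement, so pods can still schedule when nodes are scarce.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api                 # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      affinity:
        podAntiAffinity:
          # Prefer spreading replicas across different nodes.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: web-api
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:       # gates traffic until the app can serve
            httpGet:
              path: /ready      # assumed endpoint
              port: 8080
            periodSeconds: 5
          livenessProbe:        # restarts the container if it hangs
            httpGet:
              path: /healthz    # assumed endpoint
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
---
# Keeps at least 2 of the 3 replicas up during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-api
```

Setting `minAvailable` equal to the replica count would block node drains entirely, so the budget is deliberately one lower than `replicas`.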

Observability

  • Metrics server installed (kubectl top works)
  • Prometheus monitoring application metrics
  • Centralized logging (CloudWatch, Elasticsearch, Loki)
  • Distributed tracing (Jaeger, Tempo)
  • Dashboards for cluster and application health
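
If Prometheus is run through the Prometheus Operator (an assumption; the list above does not say how Prometheus is deployed), application metrics are typically scraped by declaring a ServiceMonitor. The namespaces, the `release: prometheus` label, and the port name `metrics` are hypothetical and must match your own installation.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-api
  namespace: monitoring         # assumed namespace of the Prometheus stack
  labels:
    release: prometheus         # assumed; must match Prometheus' serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - team-a                  # hypothetical application namespace
  selector:
    matchLabels:
      app: web-api              # selects the Service, not the pods
  endpoints:
    - port: metrics             # name of the Service port exposing metrics
      path: /metrics
      interval: 30s
```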

Disaster Recovery

  • Velero installed for cluster backups
  • Backup schedule configured (daily minimum)
  • Restore tested (annual drill)
  • etcd backups automated (self-managed clusters; managed services such as EKS, AKS, and GKE operate etcd for you)
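
The daily backup schedule can be expressed as a Velero Schedule resource. This sketch assumes Velero is installed in the `velero` namespace with a backup storage location already configured. The 02:00 run time and 30-day retention are arbitrary example values.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup    # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"         # cron syntax: every day at 02:00
  template:
    includedNamespaces:
      - "*"                     # back up all namespaces
    ttl: 720h0m0s               # keep each backup for 30 days
```

A restore drill then amounts to running `velero restore create --from-backup <backup-name>` against a scratch cluster and checking that the workloads come up.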

Additional Resources