sre

Install command
npx skills add https://github.com/ionfury/homelab --skill sre

Skill Documentation

Cluster access, KUBECONFIG patterns, and internal service URLs are in the k8s skill.

Debugging Kubernetes Incidents

Core Principles

  • 5 Whys Analysis – NEVER stop at symptoms. Ask “why” until you reach the root cause.
  • Read-Only Investigation – Observe and analyze; never modify resources on integration/live. The dev cluster permits direct mutations for debugging (see troubleshooter agent boundaries).
  • Multi-Source Correlation – Combine logs, events, and metrics for a complete picture
  • Research Unknown Services – Check documentation before deep investigation
  • Zero Alert Tolerance – Every firing alert must be addressed immediately: fix the root cause, or as a last resort, create a declarative Silence CR with justification. Never ignore, defer, or dismiss a firing alert.

The 5 Whys Analysis (CRITICAL)

You MUST apply 5 Whys before concluding any investigation. Stopping at symptoms leads to ineffective fixes.

How to Apply

  1. Start with the observed symptom
  2. Ask “Why did this happen?” for each answer
  3. Continue until you reach an actionable root cause (typically 5 levels)

Example

Symptom: Helm install failed with "context deadline exceeded"

Why #1: Why did Helm timeout?
  → Pods never became Ready

Why #2: Why weren't pods Ready?
  → Pods stuck in Pending state

Why #3: Why were pods Pending?
  → PVCs couldn't bind (StorageClass "fast" not found)

Why #4: Why was StorageClass missing?
  → longhorn-storage Kustomization failed to apply

Why #5: Why did the Kustomization fail?
  → numberOfReplicas was integer instead of string

ROOT CAUSE: YAML type coercion issue
FIX: Use properly typed variable for StorageClass parameters

Red Flags You Haven’t Reached Root Cause

  • Your “fix” is increasing a timeout or retry count
  • Your “fix” addresses the symptom, not what caused it
  • You can still ask “but why did THAT happen?”
  • Multiple issues share the same underlying cause

BAD:  "Helm timed out → increase timeout to 15m"
GOOD: "Helm timed out → ... → Kustomization type error → fix YAML"

Investigation Phases

Phase 1: Triage

  1. Confirm cluster – Ask user: “Which cluster? (dev/integration/live)”
  2. Assess severity – P1 (down) / P2 (degraded) / P3 (minor) / P4 (cosmetic)
  3. Identify scope – Pod / Deployment / Namespace / Cluster-wide
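
Once the cluster is confirmed, a quick scope check helps separate a single bad pod from a cluster-wide problem (a sketch using standard kubectl; the field selector just filters out healthy pods):

# Any pods not Running/Succeeded, across all namespaces
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Node health - a NotReady node points at cluster-wide scope
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get nodes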

Phase 2: Data Collection

# Pod status and events
kubectl get pods -n <namespace>
kubectl describe pod <pod> -n <namespace>

# Logs (current and previous)
kubectl logs <pod> -n <namespace> --tail=100
kubectl logs <pod> -n <namespace> --previous

# Events timeline
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Resource usage
kubectl top pods -n <namespace>

Metrics and alerts via kubectl exec (Prometheus is behind OAuth2 Proxy — DNS URLs won’t work for API queries):

# Check firing alerts
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/alerts' | jq '.data.alerts[] | select(.state == "firing")'

# Pod restart metrics
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=increase(kube_pod_container_status_restarts_total[1h])>0' | jq '.data.result'
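
The same exec pattern works for any instant query. For example, a check for recent OOM kills (assumes kube-state-metrics is present, which kube-prometheus-stack ships by default):

# Containers whose last termination was OOMKilled
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}' | jq '.data.result'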

Phase 3: Correlation

  1. Extract timestamps from logs, events, metrics
  2. Identify what happened FIRST (root cause)
  3. Trace the cascade of effects
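
A cluster-wide event timeline is often the fastest way to see what happened first (a sketch; custom-columns just flattens the fields for scanning):

# Unified event timeline, oldest first - scan for the earliest anomaly
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get events -A --sort-by='.lastTimestamp' \
  -o custom-columns=TIME:.lastTimestamp,NS:.metadata.namespace,REASON:.reason,OBJECT:.involvedObject.name,MSG:.message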

Phase 4: Root Cause (5 Whys)

Apply 5 Whys analysis. Validate:

  • Temporal: Did it happen BEFORE the symptom?
  • Causal: Does it logically explain the symptom?
  • Evidence: Is there supporting data?
  • Complete: Have you asked “why” enough times?

Phase 5: Remediation

Use the AskUserQuestion tool to present fix options when multiple valid approaches exist.

Provide recommendations only (read-only investigation):

  • Immediate: Rollback, scale, restart
  • Permanent: Code/config fixes
  • Prevention: Alerts, quotas, tests
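
Concrete commands to include in the "Immediate" recommendations (present these to the user rather than running them; only the dev cluster permits direct mutation):

# Rollback a bad rollout
kubectl rollout undo deployment/<app> -n <namespace>

# Scale to relieve pressure
kubectl scale deployment/<app> -n <namespace> --replicas=<n>

# Restart to pick up a fixed Secret/ConfigMap
kubectl rollout restart deployment/<app> -n <namespace>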

Quick Diagnosis

Symptom               First Check                 Common Cause
ImagePullBackOff      describe pod events         Wrong image/registry auth
Pending               Events, node capacity       Insufficient resources
CrashLoopBackOff      logs --previous             App error, missing config
OOMKilled             Memory limits               Memory leak, limits too low
Unhealthy             Probe config                Slow startup, wrong endpoint
Service unreachable   Hubble dropped traffic      Network policy blocking
Can’t reach database  Hubble + namespace labels   Missing access label
Gateway returns 503   Hubble from istio-gateway   Missing profile label

Common Failure Chains

Storage failures cascade:

StorageClass missing → PVC Pending → Pod Pending → Helm timeout

Network failures cascade:

DNS failure → Service unreachable → Health check fails → Pod restarted

Network policy failures cascade:

Missing namespace profile label → No ingress allowed → Service unreachable from gateway
Missing access label → Can't reach database → App fails health checks → CrashLoopBackOff

Secret failures cascade:

ExternalSecret fails → Secret missing → Pod CrashLoopBackOff
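
To find where the secret cascade broke (a sketch; assumes the external-secrets operator's CRDs, implied by the ExternalSecret references in this doc):

# Is the ExternalSecret synced, and does the target Secret exist?
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get externalsecrets -n <namespace>
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get secret <name> -n <namespace>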

Network Policy Debugging (Cilium + Hubble)

Network policies are ENFORCED – all traffic is implicitly denied unless allowed.

Check for Blocked Traffic

# Setup Hubble access (run once per session)
KUBECONFIG=~/.kube/<cluster>.yaml kubectl port-forward -n kube-system svc/hubble-relay 4245:80 &

# See dropped traffic in a namespace
hubble observe --verdict DROPPED --namespace <namespace> --since 5m

# See what's trying to reach a service
hubble observe --to-namespace <namespace> --verdict DROPPED --since 5m

# Check specific traffic flow
hubble observe --from-namespace <source> --to-namespace <dest> --since 5m
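
If Hubble shows drops, list the policies in play (a sketch; cnp/ccnp are Cilium's short names for its namespaced and cluster-wide policy CRDs):

# Policies selecting workloads in the namespace, plus cluster-wide policies
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get cnp -n <namespace>
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get ccnp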

Common Network Policy Issues

Symptom                            Check                                                Fix
Service unreachable from gateway   kubectl get ns <ns> --show-labels                    Add profile label
Can’t reach database               Check access.network-policy.homelab/postgres label   Add access label
Pods can’t resolve DNS             Hubble DNS drops (rare – baseline allows)            Check for custom egress blocking
Inter-pod communication fails      Hubble intra-namespace drops                         Baseline should allow – check for overrides

Namespace Labels Checklist

# Check namespace has required labels
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get ns <namespace> -o jsonpath='{.metadata.labels}' | jq

# Required for app namespaces:
# - network-policy.homelab/profile: standard|internal|internal-egress|isolated

# Optional access labels:
# - access.network-policy.homelab/postgres: "true"
# - access.network-policy.homelab/garage-s3: "true"
# - access.network-policy.homelab/kube-api: "true"

Emergency: Disable Network Policies

# Escape hatch - disables enforcement for namespace (triggers alert after 5m)
KUBECONFIG=~/.kube/<cluster>.yaml kubectl label namespace <ns> network-policy.homelab/enforcement=disabled

# Re-enable after fixing
KUBECONFIG=~/.kube/<cluster>.yaml kubectl label namespace <ns> network-policy.homelab/enforcement-

See docs/runbooks/network-policy-escape-hatch.md for full procedure.

Kickstarting Stalled HelmReleases

HelmReleases can get stuck in a Stalled state with RetriesExceeded even after the underlying issue is resolved. This happens because:

  1. The HR hit its retry limit (default: 4 attempts)
  2. The failure counter persists even if pods are now healthy
  3. Flux won’t auto-retry once Stalled condition is set

Symptoms:

STATUS: Stalled
MESSAGE: Failed to install after 4 attempt(s)
REASON: RetriesExceeded
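
To read those conditions directly (a sketch; assumes the HR lives in flux-system as in the commands below):

# Stalled/Ready conditions carry the RetriesExceeded reason
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get helmrelease <name> -n flux-system -o jsonpath='{.status.conditions}' | jq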

Diagnosis: Check if the underlying resources are actually healthy:

# HR shows Failed, but check if pods are running
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get pods -n <namespace> -l app.kubernetes.io/name=<app>

# If pods are Running but HR is Stalled, the HR just needs a reset

Fix: Suspend and resume to reset the failure counter:

KUBECONFIG=~/.kube/<cluster>.yaml flux suspend helmrelease <name> -n flux-system
KUBECONFIG=~/.kube/<cluster>.yaml flux resume helmrelease <name> -n flux-system
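
Then verify the release reconciled (a sketch; flux get prints the Ready condition):

KUBECONFIG=~/.kube/<cluster>.yaml flux get helmreleases <name> -n flux-system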

Common causes of initial failure (that may have self-healed):

  • Missing Secret/ConfigMap (ExternalSecret eventually created it)
  • Missing CRD (operator finished installing)
  • Transient network issues during image pull
  • Resource quota temporarily exceeded

Prevention: Ensure proper dependsOn ordering so prerequisites are ready before HelmRelease installs.
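
A quick way to audit that ordering (a sketch; dependsOn is a standard HelmRelease spec field listing other HelmReleases to wait on):

# Which other HelmReleases does this HR wait on?
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get helmrelease <name> -n flux-system -o jsonpath='{.spec.dependsOn}'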

Promotion Pipeline Debugging

Symptom: “Live cluster not updating after merge”

The OCI artifact promotion pipeline has multiple stages where failures can stall deployment. Walk through each stage in order to find where the pipeline is stuck.

Diagnostic Steps

1. PR merged to main
   └─ Check: Did build-platform-artifact.yaml trigger?
      └─ GitHub Actions → "Build Platform Artifact" workflow
      └─ If missing: Was kubernetes/ modified? (paths filter)

2. OCI artifact built
   └─ Check: Is the artifact in GHCR with integration-* tag?
      └─ flux list artifacts oci://ghcr.io/<repo>/platform | grep integration

3. Integration cluster picks up artifact
   └─ Check: Does OCIRepository see the new version?
      └─ KUBECONFIG=~/.kube/integration.yaml kubectl get ocirepository -n flux-system
      └─ Look at .status.artifact.revision — does it match the new RC tag?
      └─ If stale: Check semver constraint (must be ">= 0.0.0-0" to accept RCs)

4. Integration reconciliation succeeds
   └─ Check: Platform Kustomization healthy?
      └─ KUBECONFIG=~/.kube/integration.yaml flux get kustomizations -n flux-system
      └─ If failed: Read Kustomization events for the error

5. Flux Alert fires repository_dispatch
   └─ Check: Did the Alert fire?
      └─ KUBECONFIG=~/.kube/integration.yaml kubectl describe alert validation-success -n flux-system
      └─ Check Provider (GitHub) status:
         KUBECONFIG=~/.kube/integration.yaml kubectl get providers -n flux-system

6. tag-validated-artifact.yaml runs
   └─ Check: Did the workflow trigger?
      └─ GitHub Actions → "Tag Validated Artifact" workflow
      └─ If not triggered: repository_dispatch may not have fired
         (check Provider secret has repo scope)
      └─ If triggered but failed: Check workflow logs for tagging errors

7. Live cluster picks up validated artifact
   └─ Check: Does OCIRepository see the stable semver?
      └─ KUBECONFIG=~/.kube/live.yaml kubectl get ocirepository -n flux-system
      └─ Semver constraint must be ">= 0.0.0" (stable only, no RCs)

Common Failure Modes

Stage         Symptom                                Common Cause
Build         Workflow did not trigger               kubernetes/ not in changed paths
Build         Artifact push failed                   GHCR auth issue (GITHUB_TOKEN permissions)
Integration   OCIRepository not updating             Semver constraint mismatch (not accepting RCs)
Validation    Kustomization failed                   Actual config error in the merged PR
Promotion     repository_dispatch not received       Provider secret missing repo scope
Promotion     Workflow skipped (idempotency guard)   Artifact already tagged as validated
Live          OCIRepository not updating             Stable semver tag not created by tag workflow

Manual Promotion (Emergency)

If the pipeline is stuck and live needs the update:

# Authenticate to GHCR
echo $GITHUB_TOKEN | docker login ghcr.io -u $GITHUB_USER --password-stdin

# Find the integration artifact
flux list artifacts oci://ghcr.io/<repo>/platform | grep integration

# Manually tag as validated + stable semver
flux tag artifact oci://ghcr.io/<repo>/platform:<rc-tag> --tag <stable-semver>
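
After tagging, confirm live picked it up (a sketch; the OCIRepository name is a placeholder):

# Force a source scan and check the picked-up revision
KUBECONFIG=~/.kube/live.yaml flux reconcile source oci <name> -n flux-system
KUBECONFIG=~/.kube/live.yaml kubectl get ocirepository <name> -n flux-system -o jsonpath='{.status.artifact.revision}'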

See .github/CLAUDE.md for full pipeline architecture and rollback procedures.

Common Confusions

BAD:  Jump to logs without checking events first
GOOD: Check events for context first, then investigate logs

BAD:  Look only at current pod state
GOOD: Check --previous logs if the pod restarted

BAD:  Assume the first error is the root cause
GOOD: Apply 5 Whys to find the true root cause

BAD:  Investigate without confirming the cluster
GOOD: ALWAYS confirm the cluster before any kubectl command

Keywords

kubernetes, debugging, crashloopbackoff, oomkilled, pending, root cause analysis, 5 whys, incident investigation, pod logs, events, troubleshooting, network policy, hubble, stalled helmrelease, promotion pipeline, live not updating