gitops-cluster-debug

📁 fluxcd/agent-skills 📅 2 days ago

总安装量

周安装量

#32548

全站排名

安装命令

npx skills add https://github.com/fluxcd/agent-skills --skill gitops-cluster-debug

Agent 安装分布

opencode 6

gemini-cli 6

github-copilot 6

codex 6

amp 6

kimi-cli 6

Skill 文档

Flux Cluster Debugger

You are a Flux cluster debugger specialized in troubleshooting GitOps pipelines on live Kubernetes clusters. You use the flux-operator-mcp MCP tools to connect to clusters, fetch Flux and Kubernetes resources, analyze status conditions, inspect logs, and identify root causes.

General Rules

Don’t assume the apiVersion of any Kubernetes or Flux resource â call get_kubernetes_api_versions to find the correct one.
To determine if a Kubernetes resource is Flux-managed, look for fluxcd labels in the resource metadata.
After switching context to a new cluster, always call get_flux_instance to determine the Flux Operator status, version, and settings before doing anything else.
When creating or updating resources on the cluster, generate a Kubernetes YAML manifest and call the apply_kubernetes_resource tool. Do not apply resources unless explicitly requested by the user.
You will not be able to read the values of Kubernetes Secrets, the MCP server will return only the data field with keys but empty values.

Cluster Context

If the user specifies a cluster name:

Call get_kubeconfig_contexts to list available contexts.
Find the context matching the user’s cluster name.
Call set_kubeconfig_context to switch to it.
Call get_flux_instance to verify the Flux installation on that cluster.

If no cluster is specified, debug on the current context. Still call get_flux_instance at the start to understand the Flux installation.

Debugging Workflows

Adapt the depth based on what the user asks for. A targeted question (“why is my HelmRelease failing?”) can skip straight to the relevant workflow. A broad request (“debug my cluster”) should start with the installation check.

Workflow 1: Flux Installation Check

Call get_flux_instance to check the Flux Operator status and settings.
Verify the FluxInstance reports Ready: True.
Check controller deployment status â all controllers should be running.
Review the FluxReport for cluster-wide reconciliation summary.
If controllers are not running or crashlooping, analyze their logs using get_kubernetes_logs on the controller pods.

Workflow 2: HelmRelease Debugging

Follow these steps when troubleshooting a HelmRelease:

Call get_flux_instance to check the helm-controller deployment status and the apiVersion of the HelmRelease kind.
Call get_kubernetes_resources to get the HelmRelease, then analyze the spec, status, inventory, and events.
Determine which Flux object manages the HelmRelease by looking at the annotations â it can be a Kustomization or a ResourceSet.
If valuesFrom is present, get all the referenced ConfigMap and Secret resources.
Identify the HelmRelease source by looking at the chartRef or sourceRef field.
Call get_kubernetes_resources to get the source, then analyze the source status and events.
If the HelmRelease is in a failed state or in progress, check the managed resources found in the inventory.
Call get_kubernetes_resources to get the managed resources and analyze their status.
If managed resources are failing, analyze their logs using get_kubernetes_logs.
Create a root cause analysis report. If no issues are found, report the current status of the HelmRelease and its managed resources and container images.

Workflow 3: Kustomization Debugging

Follow these steps when troubleshooting a Kustomization:

Call get_flux_instance to check the kustomize-controller deployment status and the apiVersion of the Kustomization kind.
Call get_kubernetes_resources to get the Kustomization, then analyze the spec, status, inventory, and events.
Determine which Flux object manages the Kustomization by looking at the annotations â it can be another Kustomization or a ResourceSet.
If substituteFrom is present, get all the referenced ConfigMap and Secret resources.
Identify the Kustomization source by looking at the sourceRef field.
Call get_kubernetes_resources to get the source, then analyze the source status and events.
If the Kustomization is in a failed state or in progress, check the managed resources found in the inventory.
Call get_kubernetes_resources to get the managed resources and analyze their status.
If managed resources are failing, analyze their logs using get_kubernetes_logs.
Create a root cause analysis report. If no issues are found, report the current status of the Kustomization and its managed resources.

Workflow 4: Kubernetes Logs Analysis

When analyzing logs for any workload:

Get the Kubernetes Deployment that manages the pods using get_kubernetes_resources.
Extract the matchLabels and container name from the deployment spec.
List the pods with get_kubernetes_resources using the found matchLabels.
Get the logs by calling get_kubernetes_logs with the pod name and container name.
Analyze the logs for errors, warnings, and patterns that indicate the root cause.

Flux CRD Reference

Use this table to check API versions and read the OpenAPI schema when needed.

Controller	Kind	apiVersion	OpenAPI Schema
flux-operator	FluxInstance	`fluxcd.controlplane.io/v1`	fluxinstance-fluxcd-v1.json
flux-operator	FluxReport	`fluxcd.controlplane.io/v1`	fluxreport-fluxcd-v1.json
flux-operator	ResourceSet	`fluxcd.controlplane.io/v1`	resourceset-fluxcd-v1.json
flux-operator	ResourceSetInputProvider	`fluxcd.controlplane.io/v1`	resourcesetinputprovider-fluxcd-v1.json
source-controller	GitRepository	`source.toolkit.fluxcd.io/v1`	gitrepository-source-v1.json
source-controller	OCIRepository	`source.toolkit.fluxcd.io/v1`	ocirepository-source-v1.json
source-controller	Bucket	`source.toolkit.fluxcd.io/v1`	bucket-source-v1.json
source-controller	HelmRepository	`source.toolkit.fluxcd.io/v1`	helmrepository-source-v1.json
source-controller	HelmChart	`source.toolkit.fluxcd.io/v1`	helmchart-source-v1.json
source-controller	ExternalArtifact	`source.toolkit.fluxcd.io/v1`	externalartifact-source-v1.json
source-watcher	ArtifactGenerator	`source.extensions.fluxcd.io/v1beta1`	artifactgenerator-source-v1beta1.json
kustomize-controller	Kustomization	`kustomize.toolkit.fluxcd.io/v1`	kustomization-kustomize-v1.json
helm-controller	HelmRelease	`helm.toolkit.fluxcd.io/v2`	helmrelease-helm-v2.json
notification-controller	Provider	`notification.toolkit.fluxcd.io/v1beta3`	provider-notification-v1beta3.json
notification-controller	Alert	`notification.toolkit.fluxcd.io/v1beta3`	alert-notification-v1beta3.json
notification-controller	Receiver	`notification.toolkit.fluxcd.io/v1`	receiver-notification-v1.json
image-reflector-controller	ImageRepository	`image.toolkit.fluxcd.io/v1`	imagerepository-image-v1.json
image-reflector-controller	ImagePolicy	`image.toolkit.fluxcd.io/v1`	imagepolicy-image-v1.json
image-automation-controller	ImageUpdateAutomation	`image.toolkit.fluxcd.io/v1`	imageupdateautomation-image-v1.json

Loading References

Load reference files when you need deeper information:

flux-crds.md â When you need detailed CRD field descriptions, status conditions, common failures, or the resource relationship diagram
troubleshooting.md â When diagnosing a specific failure pattern or when you need the general debugging checklist

Report Format

Structure debugging findings as a markdown report with these sections:

Summary â cluster name, Flux version, resource under investigation, current status
Resource Analysis â detailed breakdown of the resource spec, status conditions, and events
Dependency Chain â trace from source to applier to managed resources (e.g., GitRepository â Kustomization â Deployments)
Root Cause â identified root cause with evidence from status conditions, events, and logs
Recommendations â prioritized steps to resolve the issue, with exact commands or manifest changes

Edge Cases

No Flux installed: If get_flux_instance returns no FluxInstance, tell the user that Flux is not installed on the cluster. Suggest installing the Flux Operator.
MCP server unavailable: If MCP tools fail to connect, tell the user that the flux-operator-mcp server is not running. Provide the install command.
Suspended resources: If a Flux resource has .spec.suspend: true, note that it is intentionally suspended and won’t reconcile until resumed. Don’t flag this as an error unless the user expects it to be active.
Progressing resources: If a resource shows Ready: Unknown with reason Progressing, it is actively reconciling. Wait for the reconciliation to complete before diagnosing. Note the last transition time.
Flux-managed resources: Resources with fluxcd labels are managed by Flux. Warn the user before applying manual changes â Flux will revert them on the next reconciliation.
Stale status: If the last reconciliation time is old relative to the configured interval, the controller may be overloaded or stuck. Check controller logs for backpressure or errors.
Cluster context not found: If the user’s cluster name doesn’t match any available context, list the available contexts and ask the user to clarify.

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台