monitoring-authoring
npx skills add https://github.com/ionfury/homelab --skill monitoring-authoring
Monitoring Resource Authoring
This skill covers creating and modifying monitoring resources. For querying Prometheus or investigating alerts, see the prometheus skill and sre skill.
Resource Types Overview
| Resource | API Group | Purpose | CRD Provider |
|---|---|---|---|
| PrometheusRule | monitoring.coreos.com/v1 | Alert rules and recording rules | kube-prometheus-stack |
| ServiceMonitor | monitoring.coreos.com/v1 | Scrape metrics from Services | kube-prometheus-stack |
| PodMonitor | monitoring.coreos.com/v1 | Scrape metrics from Pods directly | kube-prometheus-stack |
| ScrapeConfig | monitoring.coreos.com/v1alpha1 | Advanced scrape configuration (relabeling, multi-target) | kube-prometheus-stack |
| AlertmanagerConfig | monitoring.coreos.com/v1alpha1 | Routing, receivers, silencing | kube-prometheus-stack |
| Silence | observability.giantswarm.io/v1alpha2 | Declarative Alertmanager silences | silence-operator |
| Canary | canaries.flanksource.com/v1 | Synthetic health checks (HTTP, TCP, K8s) | canary-checker |
File Placement
Monitoring resources go in different locations depending on scope:
| Scope | Path | When to Use |
|---|---|---|
| Platform-wide alerts/monitors | kubernetes/platform/config/monitoring/ | Alerts for platform components (Cilium, Istio, cert-manager, etc.) |
| Subsystem-specific alerts | kubernetes/platform/config/<subsystem>/ | Alerts bundled with the subsystem they monitor (e.g., dragonfly/prometheus-rules.yaml) |
| Cluster-specific silences | kubernetes/clusters/<cluster>/config/silences/ | Silences for known issues on specific clusters |
| Cluster-specific alerts | kubernetes/clusters/<cluster>/config/ | Alerts that only apply to a specific cluster |
| Canary health checks | kubernetes/platform/config/canary-checker/ | Platform-wide synthetic checks |
File Naming Conventions
Observed patterns in the config/monitoring/ directory:
| Pattern | Example | When |
|---|---|---|
| <component>-alerts.yaml | cilium-alerts.yaml, grafana-alerts.yaml | PrometheusRule files |
| <component>-recording-rules.yaml | loki-mixin-recording-rules.yaml | Recording rules |
| <component>-servicemonitors.yaml | istio-servicemonitors.yaml | ServiceMonitor/PodMonitor files |
| <component>-canary.yaml | alertmanager-canary.yaml | Canary health checks |
| <component>-route.yaml | grafana-route.yaml | HTTPRoute for gateway access |
| <component>-secret.yaml | discord-secret.yaml | ExternalSecrets for monitoring |
| <component>-scrape.yaml | hardware-monitoring-scrape.yaml | ScrapeConfig resources |
Registration
After creating a file in config/monitoring/, add it to the kustomization:
# kubernetes/platform/config/monitoring/kustomization.yaml
resources:
  - ...existing resources...
  - my-new-alerts.yaml  # Add alphabetically by component
For subsystem-specific alerts (e.g., config/dragonfly/prometheus-rules.yaml), add to that
subsystem’s kustomization.yaml instead.
PrometheusRule Authoring
Required Structure
Every PrometheusRule must include the release: kube-prometheus-stack label for Prometheus
to discover it. The YAML schema comment enables editor validation.
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/prometheusrule_v1.json
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <component>-alerts
  labels:
    app.kubernetes.io/name: <component>
    release: kube-prometheus-stack  # REQUIRED - Prometheus selector
spec:
  groups:
    - name: <component>.rules  # or <component>-<subsystem> for sub-groups
      rules:
        - alert: AlertName
          expr: <PromQL expression>
          for: 5m
          labels:
            severity: critical  # critical | warning | info
          annotations:
            summary: "Short human-readable summary with {{ $labels.instance }}"
            description: >-
              Detailed explanation of what is happening, what it means,
              and what to investigate. Use template variables for context.
Label Requirements
| Label | Required | Purpose |
|---|---|---|
| release: kube-prometheus-stack | Yes | Prometheus discovery selector |
| app.kubernetes.io/name: <component> | Recommended | Organizational grouping |
Some files use additional labels like prometheus: kube-prometheus-stack (e.g., dragonfly),
but release: kube-prometheus-stack is the critical one for discovery.
Severity Conventions
| Severity | for Duration | Use Case | Alertmanager Routing |
|---|---|---|---|
| critical | 2m-5m | Service down, data loss risk, immediate action needed | Routed to Discord |
| warning | 5m-15m | Degraded performance, approaching limits, needs attention | Default receiver (Discord) |
| info | 10m-30m | Informational, capacity planning, non-urgent | Silenced by InfoInhibitor |
Guidelines for for duration:
- Shorter for = faster alert, more noise. Longer = quieter, slower response.
- for: 0m (immediate) only for truly instant failures (e.g., SMART health check fail).
- Most alerts: 5m is a good default.
- Flap-prone metrics (error rates, latency): 10m-15m to avoid false positives.
- Absence detection: 5m (metric may genuinely disappear briefly during restarts).
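As an illustration of the for: 0m case, a minimal sketch of an immediate-fire rule might look like the following. The alert name is hypothetical, and the metric smartctl_device_smart_status (1 = passed, 0 = failed) is an assumption based on smartctl_exporter; substitute whatever metric the hardware exporter in use actually exposes.

```yaml
# Sketch only: the alert name and metric are assumptions, not taken from this repository.
- alert: DiskSmartHealthFailed
  # Assumes smartctl_exporter semantics: 1 = SMART check passed, 0 = failed.
  expr: smartctl_device_smart_status == 0
  for: 0m  # Fire on the first failing evaluation; a failed SMART check is not expected to flap
  labels:
    severity: critical
  annotations:
    summary: "SMART health check failed for {{ $labels.device }} on {{ $labels.instance }}"
```

A state like this rarely recovers on its own, so waiting several minutes would only delay the notification without reducing noise.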
Annotation Templates
Standard annotations used across this repository:
annotations:
  summary: "Short title with {{ $labels.relevant_label }}"
  description: >-
    Multi-line description explaining what happened, the impact,
    and what to investigate. Reference threshold values and current
    values using template functions.
  runbook_url: "https://github.com/ionfury/homelab/blob/main/docs/runbooks/<runbook>.md"
The runbook_url annotation is optional but recommended for critical alerts that have
established recovery procedures.
PromQL Template Functions
Functions available in summary and description annotations:
| Function | Input | Output | Example |
|---|---|---|---|
| humanize | Number | Human-readable number | {{ $value \| humanize }} -> “1.234k” |
| humanizePercentage | Float (0-1) | Percentage string | {{ $value \| humanizePercentage }} -> “45.6%” |
| humanizeDuration | Seconds | Duration string | {{ $value \| humanizeDuration }} -> “2h 30m 0s” |
| printf | Format string | Formatted value | {{ printf "%.2f" $value }} -> “1.23” |
Label Variables in Annotations
Access alert labels via {{ $labels.<label_name> }} and the expression value via {{ $value }}:
summary: "Cilium agent down on {{ $labels.instance }}"
description: >-
  BPF map {{ $labels.map_name }} on {{ $labels.instance }} is at
  {{ $value | humanizePercentage }}.
Common Alert Patterns
Target down (availability):
- alert: <Component>Down
  expr: up{job="<job-name>"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is down on {{ $labels.instance }}"
Absence detection (component disappeared entirely):
- alert: <Component>Down
  expr: absent(up{job="<job-name>"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is unavailable"
Error rate (ratio):
- alert: <Component>HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{job="<job>",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="<job>"}[5m]))
    ) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> error rate above 5%"
    description: "Error rate is {{ $value | humanizePercentage }}"
Latency (histogram quantile):
- alert: <Component>HighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket{job="<job>"}[5m])) by (le)
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> p99 latency above 1s"
    description: "P99 latency is {{ $value | humanizeDuration }}"
Resource pressure (capacity):
- alert: <Component>ResourcePressure
  expr: <resource_used> / <resource_total> > 0.9
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> at {{ $value | humanizePercentage }} capacity"
PVC space low:
- alert: <Component>PVCLow
  expr: |
    kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*<component>.*"}
    /
    kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*<component>.*"}
    < 0.15
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "PVC {{ $labels.persistentvolumeclaim }} running low"
    description: "{{ $value | humanizePercentage }} free space remaining"
Alert Grouping
Group related alerts in named rule groups. The name field groups alerts in the Prometheus
UI. Rules within a group are evaluated sequentially, in the order listed:
spec:
  groups:
    - name: cilium-agent    # Agent availability and health
      rules: [...]
    - name: cilium-bpf      # BPF subsystem alerts
      rules: [...]
    - name: cilium-policy   # Network policy alerts
      rules: [...]
    - name: cilium-network  # General networking alerts
      rules: [...]
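Because each group is evaluated as a unit, a group can also carry its own evaluation interval. The following is a minimal sketch, assuming a group that should be evaluated more often than the Prometheus default. The 30s value and the alert shown are illustrative, not taken from this repository.

```yaml
# Sketch only: the interval value and alert contents are illustrative assumptions.
spec:
  groups:
    - name: cilium-agent
      interval: 30s  # Optional per-group evaluation interval; omit to use the global default
      rules:
        - alert: <Component>Down
          expr: up{job="<job-name>"} == 0
          for: 5m
          labels:
            severity: critical
```

Keeping availability alerts in their own group makes it possible to tune their interval without affecting slower, more expensive rules.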
Recording Rules
Recording rules pre-compute expensive queries for dashboard performance. Place them alongside
alerts in the same PrometheusRule file or in a dedicated *-recording-rules.yaml file.
spec:
  groups:
    - name: <component>-recording-rules
      rules:
        - record: <namespace>:<metric>:<aggregation>
          expr: |
            <PromQL aggregation query>
Naming Convention
Recording rule names follow the pattern level:metric:operations:
loki:request_duration_seconds:p99
loki:requests_total:rate5m
loki:requests_error_rate:ratio5m
When to Create Recording Rules
- Dashboard queries that aggregate across many series (e.g., sum/rate across all pods)
- Queries used by multiple alerts (avoids redundant computation)
- Complex multi-step computations that are hard to read inline
Example: Loki Recording Rules
- record: loki:request_duration_seconds:p99
  expr: |
    histogram_quantile(0.99,
      sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, job, namespace)
    )
- record: loki:requests_error_rate:ratio5m
  expr: |
    sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m])) by (job, namespace)
    /
    sum(rate(loki_request_duration_seconds_count[5m])) by (job, namespace)
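Once a recording rule exists, alerts can reference the recorded series instead of repeating the underlying query. A minimal sketch of an alert built on loki:requests_error_rate:ratio5m follows. The alert name, threshold, and for duration are illustrative assumptions.

```yaml
# Sketch only: alert name and threshold are illustrative assumptions.
- alert: LokiHighErrorRate
  # Reuses the recorded series rather than recomputing the ratio inline.
  expr: loki:requests_error_rate:ratio5m > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Loki error rate above 5% for {{ $labels.job }}"
    description: "Error rate in {{ $labels.namespace }} is {{ $value | humanizePercentage }}"
```

The job and namespace labels are available in the annotations because the recording rule aggregates by (job, namespace).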
ServiceMonitor and PodMonitor
Via Helm Values (Preferred)
Most charts support enabling ServiceMonitor through values. Always prefer this over manual resources:
# kubernetes/platform/charts/<app-name>.yaml
serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
Manual ServiceMonitor
When a chart does not support ServiceMonitor creation, create one manually. The resource
lives in the monitoring namespace and uses namespaceSelector to reach across namespaces.
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/servicemonitor_v1.json
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace>  # Namespace where the service lives
  selector:
    matchLabels:
      app.kubernetes.io/name: <component>  # Must match service labels
  endpoints:
    - port: http-monitoring  # Must match service port name
      path: /metrics
      interval: 30s
Manual PodMonitor
Use PodMonitor when pods expose metrics but don’t have a Service (e.g., DaemonSets, sidecars):
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/podmonitor_v1.json
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace>
  selector:
    matchLabels:
      app: <component>
  podMetricsEndpoints:
    - port: "15020"  # Port name or number (quoted if numeric)
      path: /stats/prometheus
      interval: 30s
Cross-Namespace Pattern
All ServiceMonitors and PodMonitors in this repo live in the monitoring namespace and use
namespaceSelector to reach pods in other namespaces. This centralizes monitoring configuration
and avoids needing release: kube-prometheus-stack labels on resources in app namespaces.
Advanced: matchExpressions
For selecting multiple pod labels (e.g., all Flux controllers):
selector:
  matchExpressions:
    - key: app
      operator: In
      values:
        - helm-controller
        - source-controller
        - kustomize-controller
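In context, the selector sits inside a complete PodMonitor. The sketch below shows one possible shape. The resource name, the flux-system namespace, and the http-prom port name are assumptions about a typical Flux installation and should be checked against the actual pods before use.

```yaml
# Sketch only: name, namespace, and port name are assumptions, not taken from this repository.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-controllers  # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - flux-system  # assumed namespace of the Flux controllers
  selector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - helm-controller
          - source-controller
          - kustomize-controller
  podMetricsEndpoints:
    - port: http-prom  # assumed metrics port name on the controller pods
      interval: 30s
```

A single PodMonitor with an In expression avoids maintaining one near-identical resource per controller.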
AlertmanagerConfig
The platform Alertmanager configuration lives in config/monitoring/alertmanager-config.yaml.
It defines routing and receivers for the entire platform.
Current Routing Architecture
All alerts
├── InfoInhibitor → null receiver (silenced)
├── Watchdog → heartbeat receiver (webhook to healthchecks.io, every 2m)
├── severity=critical → discord receiver
└── (default) → discord receiver
Receivers
| Receiver | Type | Purpose |
|---|---|---|
| "null" | None | Silences matched alerts (e.g., InfoInhibitor) |
| heartbeat | Webhook | Sends Watchdog heartbeat to healthchecks.io |
| discord | Discord webhook | Sends alerts to Discord channel |
Adding a New Route
To route specific alerts differently (e.g., to a different channel or receiver), add a route
entry in the alertmanager-config.yaml:
routes:
  - receiver: "<receiver-name>"
    matchers:
      - name: alertname
        value: "<AlertName>"
        matchType: "="  # Quoted for consistency with the Silence matchers
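A route only works if the receiver it names is also defined. The fragment below is a rough sketch of how a second Discord receiver and its route could fit together in an AlertmanagerConfig spec. The receiver name discord-storage, the Secret name, and the Secret key are hypothetical, and the discordConfigs field should be checked against the installed prometheus-operator CRD version.

```yaml
# Sketch only: receiver name, Secret name, and key are hypothetical.
spec:
  route:
    receiver: discord  # existing default receiver
    routes:
      - receiver: discord-storage
        matchers:
          - name: alertname
            value: "<AlertName>"
            matchType: "="
  receivers:
    - name: discord-storage
      discordConfigs:
        - apiURL:  # Secret reference holding the webhook URL
            name: alertmanager-discord-webhook-storage  # hypothetical Secret
            key: url                                     # hypothetical key
```

Alerts that do not match the new route continue to fall through to the default receiver.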
Secrets for Alertmanager
| Secret | Source | File |
|---|---|---|
| alertmanager-discord-webhook | ExternalSecret (AWS SSM) | discord-secret.yaml |
| alertmanager-heartbeat-ping-url | Replicated from kube-system | heartbeat-secret.yaml |
Silence CRs (silence-operator)
Silences suppress known alerts declaratively. They are per-cluster resources because different clusters have different expected alert profiles.
Placement
kubernetes/clusters/<cluster>/config/silences/
├── kustomization.yaml
└── <descriptive-name>.yaml
Template
---
# <Comment explaining WHY this alert is silenced>
apiVersion: observability.giantswarm.io/v1alpha2
kind: Silence
metadata:
  name: <descriptive-name>
  namespace: monitoring
spec:
  matchers:
    - name: alertname
      matchType: "=~"  # "=" exact, "=~" regex, "!=" negation, "!~" regex negation
      value: "Alert1|Alert2"
    - name: namespace
      matchType: "="
      value: <target-namespace>
Matcher Reference
| matchType | Meaning | Example |
|---|---|---|
| = | Exact match | value: "KubePodCrashLooping" |
| != | Not equal | value: "Watchdog" |
| =~ | Regex match | value: "KubePod.*\|TargetDown" |
| !~ | Regex negation | value: "Info.*" |
Requirements
- Always include a comment explaining why the silence exists (architectural limitation, expected behavior, etc.)
- Every cluster must maintain a zero firing alerts baseline (excluding Watchdog)
- Silences are a LAST RESORT: every effort must be made to fix the root cause before resorting to a silence. Only silence when the alert genuinely cannot be fixed: architectural limitations (e.g., single-node Spegel), expected environmental behavior, or confirmed upstream bugs
- Never leave alerts firing without action: either fix the cause or create a Silence CR. An ignored alert degrades trust in the entire monitoring system and leads to alert fatigue where real incidents get missed
Adding a Silence to a Cluster
- Create the config/silences/ directory if it does not exist
- Add the Silence YAML file
- Create or update config/silences/kustomization.yaml:
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - <silence-name>.yaml
- Reference silences in config/kustomization.yaml
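Put together, the steps above might produce files along these lines. This is a sketch under assumptions: the silence name, the spegel namespace, and the reason given in the comment are hypothetical, and the alert name is left as a placeholder.

```yaml
# kubernetes/clusters/<cluster>/config/silences/spegel-single-node.yaml (hypothetical file)
---
# Hypothetical reason: this cluster has a single node, so Spegel has no peers
# and the matched alert cannot resolve. Not taken from this repository.
apiVersion: observability.giantswarm.io/v1alpha2
kind: Silence
metadata:
  name: spegel-single-node
  namespace: monitoring
spec:
  matchers:
    - name: alertname
      matchType: "="
      value: "<AlertName>"
    - name: namespace
      matchType: "="
      value: spegel  # assumed namespace
---
# kubernetes/clusters/<cluster>/config/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ...existing resources...
  - silences  # Pulls in config/silences/kustomization.yaml
```

Referencing the directory rather than individual files means later silences only need to be added to config/silences/kustomization.yaml.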
Canary Health Checks
Canary resources provide synthetic monitoring using Flanksource canary-checker.
They live in config/canary-checker/ for platform checks or alongside app config for app-specific checks.
HTTP Health Check
---
# yaml-language-server: $schema=https://kubernetes-schemas.pages.dev/canaries.flanksource.com/canary_v1.json
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: http-check-<component>
spec:
  schedule: "@every 1m"
  http:
    - name: <component>-health
      url: https://<component>.${internal_domain}/health
      responseCodes: [200]
      maxSSLExpiry: 7        # Alert if TLS cert expires within 7 days
      thresholdMillis: 5000  # Fail if response takes >5s
TCP Port Check
spec:
  schedule: "@every 1m"
  tcp:
    - name: <component>-port
      host: <service>.<namespace>.svc.cluster.local
      port: 8080
      timeout: 5000
Kubernetes Resource Check with CEL
Test that pods are actually healthy using CEL expressions (preferred over ready: true
because the built-in flag penalizes pods with restart history):
spec:
  interval: 60
  kubernetes:
    - name: <component>-pods-healthy
      kind: Pod
      namespaceSelector:
        name: <namespace>
      resource:
        labelSelector: app.kubernetes.io/name=<component>
      test:
        expr: >
          dyn(results).all(pod,
            pod.Object.status.phase == "Running" &&
            pod.Object.status.conditions.exists(c, c.type == "Ready" && c.status == "True")
          )
Canary Metrics and Alerting
canary-checker exposes metrics that are already monitored by the platform:
- canary_check == 1 triggers CanaryCheckFailure (critical, 2m)
- High failure rates trigger CanaryCheckHighFailureRate (warning, 5m)
These alerts are defined in config/canary-checker/prometheus-rules.yaml — you do not
need to create separate alerts for each canary.
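For orientation, a rule of the first kind could look roughly like the sketch below. It is reconstructed from the description above, not copied from config/canary-checker/prometheus-rules.yaml, and the name label used in the summary is an assumption about the labels canary-checker attaches to canary_check.

```yaml
# Sketch only: the authoritative rule lives in config/canary-checker/prometheus-rules.yaml.
- alert: CanaryCheckFailure
  expr: canary_check == 1  # 1 means the check is currently failing
  for: 2m
  labels:
    severity: critical
  annotations:
    # The name label is an assumption; confirm the label set in Prometheus.
    summary: "Canary check {{ $labels.name }} is failing"
```

Because the expression matches every canary_check series, a new Canary is covered as soon as its metrics appear.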
Workflow: Adding Monitoring for a New Component
Step 1: Determine What Exists
Check if the Helm chart already provides monitoring:
# Search chart values for monitoring options
kubesearch <chart-name> serviceMonitor
kubesearch <chart-name> prometheusRule
Enable via Helm values if available (see deploy-app skill).
Step 2: Create Missing Resources
If the chart does not provide monitoring, create resources manually:
- ServiceMonitor or PodMonitor for metrics scraping
- PrometheusRule for alert rules
- Canary for synthetic health checks (HTTP/TCP)
Step 3: Place Files Correctly
- If the component has its own config subsystem (config/<component>/), add monitoring resources there alongside other config
- If it is a standalone monitoring addition, add to config/monitoring/
Step 4: Register in Kustomization
Add new files to the appropriate kustomization.yaml.
Step 5: Validate
task k8s:validate
Step 6: Verify After Deployment
Prometheus is behind OAuth2 Proxy, so use kubectl exec or port-forward for API queries:
# Check ServiceMonitor is discovered
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
wget -qO- 'http://localhost:9090/api/v1/targets' | \
jq '.data.activeTargets[] | select(.labels.job | contains("<component>"))'
# Check alert rules are loaded
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
wget -qO- 'http://localhost:9090/api/v1/rules' | \
jq '.data.groups[] | select(.name | contains("<component>"))'
# Check canary status
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get canaries -A | grep <component>
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Missing release: kube-prometheus-stack label | Prometheus ignores the resource | Add the label to metadata.labels |
| PrometheusRule in wrong namespace without namespaceSelector | Prometheus does not discover it | Place in monitoring namespace or ensure Prometheus watches the target namespace |
| ServiceMonitor selector does not match any service | No metrics scraped, no error raised | Verify labels match with kubectl get svc -n <ns> --show-labels |
| Using ready: true in canary-checker Kubernetes checks | False negatives after pod restarts | Use CEL test.expr instead |
| Hardcoding domains in canary URLs | Breaks across clusters | Use ${internal_domain} substitution variable |
| Very short for duration on flappy metrics | Alert noise | Use 10m+ for error rates and latencies |
| Creating alerts for metrics that do not exist yet | Alert never fires because the expression returns no data | Verify metrics exist in Prometheus before writing rules |
Reference: Existing Alert Files
| File | Component | Alert Count | Subsystem |
|---|---|---|---|
| monitoring/cilium-alerts.yaml | Cilium | 14 | Agent, BPF, Policy, Network |
| monitoring/istio-alerts.yaml | Istio | ~10 | Control plane, mTLS, Gateway |
| monitoring/cert-manager-alerts.yaml | cert-manager | 5 | Expiry, Renewal, Issuance |
| monitoring/network-policy-alerts.yaml | Network Policy | 2 | Enforcement escape hatch |
| monitoring/external-secrets-alerts.yaml | External Secrets | 3 | Sync, Ready, Store health |
| monitoring/grafana-alerts.yaml | Grafana | 4 | Datasource, Errors, Plugins, Down |
| monitoring/loki-mixin-alerts.yaml | Loki | ~5 | Requests, Latency, Ingester |
| monitoring/alloy-alerts.yaml | Alloy | 3 | Dropped entries, Errors, Lag |
| monitoring/hardware-monitoring-alerts.yaml | Hardware | 7 | Temperature, Fans, Disks, Power |
| dragonfly/prometheus-rules.yaml | Dragonfly | 2+ | Down, Memory |
| canary-checker/prometheus-rules.yaml | canary-checker | 2 | Check failure, High failure rate |
Keywords
PrometheusRule, ServiceMonitor, PodMonitor, ScrapeConfig, AlertmanagerConfig, Silence, silence-operator, canary-checker, Canary, recording rules, alert rules, monitoring, observability, scrape targets, prometheus, alertmanager, discord, heartbeat