monitoring-authoring

📁 ionfury/homelab 📅 3 days ago


Monitoring Resource Authoring

This skill covers creating and modifying monitoring resources. For querying Prometheus or investigating alerts, see the prometheus skill and sre skill.

Resource Types Overview

Resource	API Group	Purpose	CRD Provider
`PrometheusRule`	`monitoring.coreos.com/v1`	Alert rules and recording rules	kube-prometheus-stack
`ServiceMonitor`	`monitoring.coreos.com/v1`	Scrape metrics from Services	kube-prometheus-stack
`PodMonitor`	`monitoring.coreos.com/v1`	Scrape metrics from Pods directly	kube-prometheus-stack
`ScrapeConfig`	`monitoring.coreos.com/v1alpha1`	Advanced scrape configuration (relabeling, multi-target)	kube-prometheus-stack
`AlertmanagerConfig`	`monitoring.coreos.com/v1alpha1`	Routing, receivers, silencing	kube-prometheus-stack
`Silence`	`observability.giantswarm.io/v1alpha2`	Declarative Alertmanager silences	silence-operator
`Canary`	`canaries.flanksource.com/v1`	Synthetic health checks (HTTP, TCP, K8s)	canary-checker

File Placement

Monitoring resources go in different locations depending on scope:

Scope	Path	When to Use
Platform-wide alerts/monitors	`kubernetes/platform/config/monitoring/`	Alerts for platform components (Cilium, Istio, cert-manager, etc.)
Subsystem-specific alerts	`kubernetes/platform/config/<subsystem>/`	Alerts bundled with the subsystem they monitor (e.g., `dragonfly/prometheus-rules.yaml`)
Cluster-specific silences	`kubernetes/clusters/<cluster>/config/silences/`	Silences for known issues on specific clusters
Cluster-specific alerts	`kubernetes/clusters/<cluster>/config/`	Alerts that only apply to a specific cluster
Canary health checks	`kubernetes/platform/config/canary-checker/`	Platform-wide synthetic checks

File Naming Conventions

Observed patterns in the config/monitoring/ directory:

Pattern	Example	When
`<component>-alerts.yaml`	`cilium-alerts.yaml`, `grafana-alerts.yaml`	PrometheusRule files
`<component>-recording-rules.yaml`	`loki-mixin-recording-rules.yaml`	Recording rules
`<component>-servicemonitors.yaml`	`istio-servicemonitors.yaml`	ServiceMonitor/PodMonitor files
`<component>-canary.yaml`	`alertmanager-canary.yaml`	Canary health checks
`<component>-route.yaml`	`grafana-route.yaml`	HTTPRoute for gateway access
`<component>-secret.yaml`	`discord-secret.yaml`	ExternalSecrets for monitoring
`<component>-scrape.yaml`	`hardware-monitoring-scrape.yaml`	ScrapeConfig resources

Registration

After creating a file in config/monitoring/, add it to the kustomization:

# kubernetes/platform/config/monitoring/kustomization.yaml
resources:
  - ...existing resources...
  - my-new-alerts.yaml    # Add alphabetically by component

For subsystem-specific alerts (e.g., config/dragonfly/prometheus-rules.yaml), add to that subsystem’s kustomization.yaml instead.

PrometheusRule Authoring

Required Structure

Every PrometheusRule must include the release: kube-prometheus-stack label for Prometheus to discover it. The YAML schema comment enables editor validation.

---
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/prometheusrule_v1.json
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <component>-alerts
  labels:
    app.kubernetes.io/name: <component>
    release: kube-prometheus-stack    # REQUIRED - Prometheus selector
spec:
  groups:
    - name: <component>.rules        # or <component>-<subsystem> for sub-groups
      rules:
        - alert: AlertName
          expr: <PromQL expression>
          for: 5m
          labels:
            severity: critical        # critical | warning | info
          annotations:
            summary: "Short human-readable summary with {{ $labels.instance }}"
            description: >-
              Detailed explanation of what is happening, what it means,
              and what to investigate. Use template variables for context.

Label Requirements

Label	Required	Purpose
`release: kube-prometheus-stack`	Yes	Prometheus discovery selector
`app.kubernetes.io/name: <component>`	Recommended	Organizational grouping

Some files use additional labels like prometheus: kube-prometheus-stack (e.g., dragonfly), but release: kube-prometheus-stack is the critical one for discovery.
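
To illustrate why the label matters: with kube-prometheus-stack's default values, the chart renders selectors on the Prometheus resource that match its Helm release name. The excerpt below is a sketch, assuming the release is named kube-prometheus-stack and the chart's selector defaults have not been overridden. The resource name is inferred from the pod name used later in this document, and the rendered resource in a given cluster may differ.

```yaml
# Excerpt of the Prometheus custom resource as the chart might render it (assumed defaults).
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus-stack        # inferred from pod prometheus-kube-prometheus-stack-0
  namespace: monitoring
spec:
  ruleSelector:                      # selects PrometheusRule objects
    matchLabels:
      release: kube-prometheus-stack
  serviceMonitorSelector:            # selects ServiceMonitor objects
    matchLabels:
      release: kube-prometheus-stack
  podMonitorSelector:                # selects PodMonitor objects
    matchLabels:
      release: kube-prometheus-stack
```

A resource without the matching label is simply not selected, and no error is raised.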

Severity Conventions

Severity	`for` Duration	Use Case	Alertmanager Routing
`critical`	2m-5m	Service down, data loss risk, immediate action needed	Routed to Discord
`warning`	5m-15m	Degraded performance, approaching limits, needs attention	Default receiver (Discord)
`info`	10m-30m	Informational, capacity planning, non-urgent	Silenced by InfoInhibitor

Guidelines for for duration:

A shorter for fires sooner but produces more noise; a longer for is quieter but delays the response.
for: 0m (immediate) only for truly instant failures (e.g., SMART health check fail).
Most alerts: 5m is a good default.
Flap-prone metrics (error rates, latency): 10m-15m to avoid false positives.
Absence detection: 5m (metric may genuinely disappear briefly during restarts).

Annotation Templates

Standard annotations used across this repository:

annotations:
  summary: "Short title with {{ $labels.relevant_label }}"
  description: >-
    Multi-line description explaining what happened, the impact,
    and what to investigate. Reference threshold values and current
    values using template functions.
  runbook_url: "https://github.com/ionfury/homelab/blob/main/docs/runbooks/<runbook>.md"

The runbook_url annotation is optional but recommended for critical alerts that have established recovery procedures.

PromQL Template Functions

Functions available in summary and description annotations:

Function	Input	Output	Example
`humanize`	Number	Human-readable number	`{{ $value | humanize }}` -> “1.234k”
`humanizePercentage`	Float (0-1)	Percentage string	`{{ $value | humanizePercentage }}` -> “45.6%”
`humanizeDuration`	Seconds	Duration string	`{{ $value | humanizeDuration }}` -> “2h 30m 0s”
`printf`	Format string	Formatted value	`{{ printf "%.2f" $value }}` -> “1.23”

Label Variables in Annotations

Access alert labels via {{ $labels.<label_name> }} and the expression value via {{ $value }}:

summary: "Cilium agent down on {{ $labels.instance }}"
description: >-
  BPF map {{ $labels.map_name }} on {{ $labels.instance }} is at
  {{ $value | humanizePercentage }}.
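
As a further sketch combining label variables with a template function, the rule below reports the time remaining before a certificate expires. The alert name and 7-day threshold are hypothetical. The metric and its name and namespace labels are assumed to come from cert-manager's metrics endpoint, so verify them in Prometheus before reuse.

```yaml
# Hypothetical rule; $value is the number of seconds until expiry.
- alert: ExampleCertificateExpiringSoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Certificate {{ $labels.name }} expires soon"
    description: >-
      Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in
      {{ $value | humanizeDuration }}.
```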

Common Alert Patterns

Target down (availability):

- alert: <Component>Down
  expr: up{job="<job-name>"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is down on {{ $labels.instance }}"

Absence detection (component disappeared entirely):

- alert: <Component>Down
  expr: absent(up{job="<job-name>"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is unavailable"

Error rate (ratio):

- alert: <Component>HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{job="<job>",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="<job>"}[5m]))
    ) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> error rate above 5%"
    description: "Error rate is {{ $value | humanizePercentage }}"

Latency (histogram quantile):

- alert: <Component>HighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket{job="<job>"}[5m])) by (le)
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> p99 latency above 1s"
    description: "P99 latency is {{ $value | humanizeDuration }}"

Resource pressure (capacity):

- alert: <Component>ResourcePressure
  expr: <resource_used> / <resource_total> > 0.9
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> at {{ $value | humanizePercentage }} capacity"

PVC space low:

- alert: <Component>PVCLow
  expr: |
    kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*<component>.*"}
    /
    kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*<component>.*"}
    < 0.15
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "PVC {{ $labels.persistentvolumeclaim }} running low"
    description: "{{ $value | humanizePercentage }} free space remaining"

Alert Grouping

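Group related alerts in named rule groups. The name field labels the group in the Prometheus UI; rules within a group are evaluated sequentially at the group's interval, while separate groups are evaluated independently of each other: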

spec:
  groups:
    - name: cilium-agent       # Agent availability and health
      rules: [...]
    - name: cilium-bpf         # BPF subsystem alerts
      rules: [...]
    - name: cilium-policy      # Network policy alerts
      rules: [...]
    - name: cilium-network     # General networking alerts
      rules: [...]
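
A group can also set its own evaluation interval. The following is a minimal sketch of one expanded group; the interval value, alert name, and job label are assumptions for illustration, not taken from the repository's cilium-alerts.yaml.

```yaml
spec:
  groups:
    - name: cilium-agent
      interval: 30s                         # optional; assumed value, overrides the global evaluation interval
      rules:
        - alert: ExampleCiliumAgentDown     # hypothetical alert name
          expr: up{job="cilium-agent"} == 0 # job label is an assumption
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Cilium agent is down on {{ $labels.instance }}"
```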

Recording Rules

Recording rules pre-compute expensive queries for dashboard performance. Place them alongside alerts in the same PrometheusRule file or in a dedicated *-recording-rules.yaml file.

spec:
  groups:
    - name: <component>-recording-rules
      rules:
        - record: <namespace>:<metric>:<aggregation>
          expr: |
            <PromQL aggregation query>

Naming Convention

Recording rule names follow the pattern level:metric:operations:

loki:request_duration_seconds:p99
loki:requests_total:rate5m
loki:requests_error_rate:ratio5m

When to Create Recording Rules

Dashboard queries that aggregate across many series (e.g., sum/rate across all pods)
Queries used by multiple alerts (avoids redundant computation)
Complex multi-step computations that are hard to read inline

Example: Loki Recording Rules

- record: loki:request_duration_seconds:p99
  expr: |
    histogram_quantile(0.99,
      sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, job, namespace)
    )

- record: loki:requests_error_rate:ratio5m
  expr: |
    sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m])) by (job, namespace)
    /
    sum(rate(loki_request_duration_seconds_count[5m])) by (job, namespace)
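
Once a recording rule exists, alerts can reference the recorded series instead of repeating the query. A minimal sketch, assuming the loki:requests_error_rate:ratio5m rule above is loaded; the alert name and the 5% threshold are hypothetical.

```yaml
- alert: ExampleLokiHighErrorRate          # hypothetical alert name
  expr: loki:requests_error_rate:ratio5m > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    # job and namespace survive because the recording rule aggregates by them
    summary: "Loki error rate above 5% for {{ $labels.job }} in {{ $labels.namespace }}"
    description: "Error rate is {{ $value | humanizePercentage }}"
```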

ServiceMonitor and PodMonitor

Via Helm Values (Preferred)

Most charts support enabling ServiceMonitor through values. Always prefer this over manual resources:

# kubernetes/platform/charts/<app-name>.yaml
serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s

Manual ServiceMonitor

When a chart does not support ServiceMonitor creation, create one manually. The resource lives in the monitoring namespace and uses namespaceSelector to reach across namespaces.

---
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/servicemonitor_v1.json
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack    # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace>            # Namespace where the service lives
  selector:
    matchLabels:
      app.kubernetes.io/name: <component>   # Must match service labels
  endpoints:
    - port: http-monitoring           # Must match service port name
      path: /metrics
      interval: 30s
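
For reference, here is a sketch of a Service that the ServiceMonitor above would select, with the component placeholder filled in. All names, the namespace, and the port number are hypothetical; the points to check are the label and the port name.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-app                         # hypothetical
  namespace: example                        # must be listed in namespaceSelector.matchNames
  labels:
    app.kubernetes.io/name: example-app     # matched by spec.selector.matchLabels
spec:
  selector:
    app.kubernetes.io/name: example-app
  ports:
    - name: http-monitoring                 # matched by endpoints[].port (a name, not a number)
      port: 9090
      targetPort: 9090
```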

Manual PodMonitor

Use PodMonitor when pods expose metrics but don’t have a Service (e.g., DaemonSets, sidecars):

---
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/podmonitor_v1.json
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack    # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace>
  selector:
    matchLabels:
      app: <component>
  podMetricsEndpoints:
    - port: "15020"                   # Port name or number (quoted if numeric)
      path: /stats/prometheus
      interval: 30s

Cross-Namespace Pattern

All ServiceMonitors and PodMonitors in this repo live in the monitoring namespace and use namespaceSelector to reach pods in other namespaces. This centralizes monitoring configuration and avoids needing release: kube-prometheus-stack labels on resources in app namespaces.

Advanced: matchExpressions

To select pods whose label matches any of several values (e.g., all Flux controllers):

selector:
  matchExpressions:
    - key: app
      operator: In
      values:
        - helm-controller
        - source-controller
        - kustomize-controller
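
Placed in a complete resource, the selector might look like the sketch below. The resource name, the flux-system namespace, and the http-prom port name are assumptions based on a typical Flux installation; check the controller pod specs before relying on them.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-controllers                    # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack          # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - flux-system                         # assumed Flux namespace
  selector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - helm-controller
          - source-controller
          - kustomize-controller
  podMetricsEndpoints:
    - port: http-prom                       # assumed container port name
      interval: 30s
```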

AlertmanagerConfig

The platform Alertmanager configuration lives in config/monitoring/alertmanager-config.yaml. It defines routing and receivers for the entire platform.

Current Routing Architecture

All alerts
  ├── InfoInhibitor → null receiver (silenced)
  ├── Watchdog → heartbeat receiver (webhook to healthchecks.io, every 2m)
  ├── severity=critical → discord receiver
  └── (default) → discord receiver

Receivers

Receiver	Type	Purpose
`"null"`	None	Silences matched alerts (e.g., InfoInhibitor)
`heartbeat`	Webhook	Sends Watchdog heartbeat to healthchecks.io
`discord`	Discord webhook	Sends alerts to Discord channel

Adding a New Route

To route specific alerts differently (e.g., to a different channel or receiver), add a route entry in the alertmanager-config.yaml:

routes:
  - receiver: "<receiver-name>"
    matchers:
      - name: alertname
        value: "<AlertName>"
        matchType: "="
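
In context, a route needs a receiver of the same name in the same resource. The sketch below shows how the pieces could fit together in an AlertmanagerConfig. The resource name, the second receiver, its secret, the secret key names, and the alert name are all hypothetical, and the repository's actual alertmanager-config.yaml may be structured differently.

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example-alertmanager-config         # hypothetical
  namespace: monitoring
spec:
  route:
    receiver: discord                       # default receiver
    routes:
      - receiver: discord-storage           # hypothetical receiver, defined below
        matchers:
          - name: alertname
            value: "ExamplePVCLow"          # hypothetical alert
            matchType: "="
  receivers:
    - name: discord
      discordConfigs:
        - apiURL:
            name: alertmanager-discord-webhook
            key: webhook-url                # assumed key name
    - name: discord-storage
      discordConfigs:
        - apiURL:
            name: alertmanager-discord-webhook-storage   # hypothetical secret
            key: webhook-url                # assumed key name
```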

Secrets for Alertmanager

Secret	Source	File
`alertmanager-discord-webhook`	ExternalSecret (AWS SSM)	`discord-secret.yaml`
`alertmanager-heartbeat-ping-url`	Replicated from `kube-system`	`heartbeat-secret.yaml`

Silence CRs (silence-operator)

Silences suppress known alerts declaratively. They are per-cluster resources because different clusters have different expected alert profiles.

Placement

kubernetes/clusters/<cluster>/config/silences/
  ├── kustomization.yaml
  └── <descriptive-name>.yaml

Template

---
# <Comment explaining WHY this alert is silenced>
apiVersion: observability.giantswarm.io/v1alpha2
kind: Silence
metadata:
  name: <descriptive-name>
  namespace: monitoring
spec:
  matchers:
    - name: alertname
      matchType: "=~"           # "=" exact, "=~" regex, "!=" negation, "!~" regex negation
      value: "Alert1|Alert2"
    - name: namespace
      matchType: "="
      value: <target-namespace>

Matcher Reference

matchType	Meaning	Example
`=`	Exact match	`value: "KubePodCrashLooping"`
`!=`	Not equal	`value: "Watchdog"`
`=~`	Regex match	`value: "KubePod.*|TargetDown"`
`!~`	Regex negation	`value: "Info.*"`
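
As an illustration of combining matcher types, the sketch below silences two alerts everywhere except at critical severity. The silence name, alert names, and the stated reason are hypothetical.

```yaml
---
# Hypothetical reason: this cluster runs a single node, so these
# replica-related alerts are expected and cannot be fixed here.
apiVersion: observability.giantswarm.io/v1alpha2
kind: Silence
metadata:
  name: example-single-node-expected        # hypothetical
  namespace: monitoring
spec:
  matchers:
    - name: alertname
      matchType: "=~"                       # regex: either alert name
      value: "ExampleReplicasMismatch|ExampleQuorumLow"
    - name: severity
      matchType: "!="                       # leave critical alerts unsilenced
      value: critical
```

A silence applies only to alerts that satisfy every matcher in the list.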

Requirements

Always include a comment explaining why the silence exists (architectural limitation, expected behavior, etc.)
Every cluster must maintain a zero firing alerts baseline (excluding Watchdog)
Silences are a LAST RESORT: every effort must be made to fix the root cause before resorting to a silence. Only silence when the alert genuinely cannot be fixed: architectural limitations (e.g., single-node Spegel), expected environmental behavior, or confirmed upstream bugs
Never leave alerts firing without action: either fix the cause or create a Silence CR. An ignored alert degrades trust in the entire monitoring system and leads to alert fatigue where real incidents get missed

Adding a Silence to a Cluster

Create config/silences/ directory if it does not exist
Add the Silence YAML file

Create or update config/silences/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - <silence-name>.yaml

Reference silences in config/kustomization.yaml
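
The final step can be sketched as follows, assuming the cluster's config/kustomization.yaml lists resources in the usual Kustomize form.

```yaml
# kubernetes/clusters/<cluster>/config/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # ...existing resources...
  - silences        # directory; Kustomize reads silences/kustomization.yaml
```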

Canary Health Checks

Canary resources provide synthetic monitoring using Flanksource canary-checker. They live in config/canary-checker/ for platform checks or alongside app config for app-specific checks.

HTTP Health Check

---
# yaml-language-server: $schema=https://kubernetes-schemas.pages.dev/canaries.flanksource.com/canary_v1.json
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: http-check-<component>
spec:
  schedule: "@every 1m"
  http:
    - name: <component>-health
      url: https://<component>.${internal_domain}/health
      responseCodes: [200]
      maxSSLExpiry: 7           # Alert if TLS cert expires within 7 days
      thresholdMillis: 5000     # Fail if response takes >5s

TCP Port Check

spec:
  schedule: "@every 1m"
  tcp:
    - name: <component>-port
      host: <service>.<namespace>.svc.cluster.local
      port: 8080
      timeout: 5000

Kubernetes Resource Check with CEL

Test that pods are actually healthy using CEL expressions (preferred over ready: true because the built-in flag penalizes pods with restart history):

spec:
  interval: 60
  kubernetes:
    - name: <component>-pods-healthy
      kind: Pod
      namespaceSelector:
        name: <namespace>
      resource:
        labelSelector: app.kubernetes.io/name=<component>
      test:
        expr: >
          dyn(results).all(pod,
            pod.Object.status.phase == "Running" &&
            pod.Object.status.conditions.exists(c, c.type == "Ready" && c.status == "True")
          )

Canary Metrics and Alerting

canary-checker exposes metrics that are already monitored by the platform:

canary_check == 1 triggers CanaryCheckFailure (critical, 2m)
High failure rates trigger CanaryCheckHighFailureRate (warning, 5m)

These alerts are defined in config/canary-checker/prometheus-rules.yaml — you do not need to create separate alerts for each canary.
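
For orientation, a rule matching the first description above might look like this sketch. It is reconstructed from the stated expression, severity, and duration only; the annotation text and the name label are assumptions, and the actual file may differ.

```yaml
# Sketch only; see config/canary-checker/prometheus-rules.yaml for the real rule.
- alert: CanaryCheckFailure
  expr: canary_check == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Canary check {{ $labels.name }} is failing"   # label name is an assumption
```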

Workflow: Adding Monitoring for a New Component

Step 1: Determine What Exists

Check if the Helm chart already provides monitoring:

# Search chart values for monitoring options
kubesearch <chart-name> serviceMonitor
kubesearch <chart-name> prometheusRule

Enable via Helm values if available (see deploy-app skill).

Step 2: Create Missing Resources

If the chart does not provide monitoring, create resources manually:

ServiceMonitor or PodMonitor for metrics scraping
PrometheusRule for alert rules
Canary for synthetic health checks (HTTP/TCP)

Step 3: Place Files Correctly

If the component has its own config subsystem (config/<component>/), add monitoring resources there alongside other config
If it is a standalone monitoring addition, add to config/monitoring/

Step 4: Register in Kustomization

Add new files to the appropriate kustomization.yaml.

Step 5: Validate

task k8s:validate
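
Beyond manifest validation, alert logic can be exercised offline with promtool unit tests. The source does not show this as part of the repository's workflow, so treat the following as an optional sketch. It assumes the spec.groups content of a PrometheusRule has been copied into a plain rule file, hypothetically named example-rules.yaml, containing an ExampleDown alert with expr up{job="example"} == 0, for: 5m, severity: critical, and the summary shown.

```yaml
# example-rules.test.yaml (hypothetical file name)
rule_files:
  - example-rules.yaml            # plain Prometheus rule file, not the PrometheusRule CR
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Target reports down (0) at every sample from 0m through 10m
      - series: 'up{job="example", instance="a"}'
        values: '0+0x10'
    alert_rule_test:
      - eval_time: 6m             # past the 5m "for" window, so the alert should be firing
        alertname: ExampleDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: example
              instance: a
            exp_annotations:
              summary: "Example is down on a"
```

Run it with promtool test rules example-rules.test.yaml. The test fails if the expected labels or annotations differ from what the rule produces.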

Step 6: Verify After Deployment

Prometheus is behind OAuth2 Proxy, so use kubectl exec or port-forward for API queries:

# Check ServiceMonitor is discovered
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/targets' | \
  jq '.data.activeTargets[] | select(.labels.job | contains("<component>"))'

# Check alert rules are loaded
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/rules' | \
  jq '.data.groups[] | select(.name | contains("<component>"))'

# Check canary status
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get canaries -A | grep <component>

Common Mistakes

Mistake	Impact	Fix
Missing `release: kube-prometheus-stack` label	Prometheus ignores the resource	Add the label to metadata.labels
PrometheusRule in a namespace that Prometheus does not watch	Prometheus does not discover it	Place in `monitoring` namespace or ensure the Prometheus ruleNamespaceSelector covers the target namespace
ServiceMonitor selector does not match any service	No metrics scraped, no error raised	Verify labels match with `kubectl get svc -n <ns> --show-labels`
Using `ready: true` in canary-checker Kubernetes checks	Healthy pods with restart history fail the check	Use CEL `test.expr` instead
Hardcoding domains in canary URLs	Breaks across clusters	Use `${internal_domain}` substitution variable
Very short `for` duration on flappy metrics	Alert noise	Use 10m+ for error rates and latencies
Creating alerts for metrics that do not exist yet	Expression returns no series, so the alert stays inactive and never fires	Verify metrics exist in Prometheus before writing rules

Reference: Existing Alert Files

File	Component	Alert Count	Subsystem
`monitoring/cilium-alerts.yaml`	Cilium	14	Agent, BPF, Policy, Network
`monitoring/istio-alerts.yaml`	Istio	~10	Control plane, mTLS, Gateway
`monitoring/cert-manager-alerts.yaml`	cert-manager	5	Expiry, Renewal, Issuance
`monitoring/network-policy-alerts.yaml`	Network Policy	2	Enforcement escape hatch
`monitoring/external-secrets-alerts.yaml`	External Secrets	3	Sync, Ready, Store health
`monitoring/grafana-alerts.yaml`	Grafana	4	Datasource, Errors, Plugins, Down
`monitoring/loki-mixin-alerts.yaml`	Loki	~5	Requests, Latency, Ingester
`monitoring/alloy-alerts.yaml`	Alloy	3	Dropped entries, Errors, Lag
`monitoring/hardware-monitoring-alerts.yaml`	Hardware	7	Temperature, Fans, Disks, Power
`dragonfly/prometheus-rules.yaml`	Dragonfly	2+	Down, Memory
`canary-checker/prometheus-rules.yaml`	canary-checker	2	Check failure, High failure rate

Keywords

PrometheusRule, ServiceMonitor, PodMonitor, ScrapeConfig, AlertmanagerConfig, Silence, silence-operator, canary-checker, Canary, recording rules, alert rules, monitoring, observability, scrape targets, prometheus, alertmanager, discord, heartbeat
