monitoring
Install command
npx skills add https://github.com/chaterm/terminal-skills --skill monitoring
Skill documentation
Monitoring and Alerting
Overview
Skills for configuring Prometheus, Grafana, and alerting rules.
Prometheus
Basic queries (PromQL)
# Instant vector
http_requests_total
http_requests_total{job="api", status="200"}
# Range vector
http_requests_total[5m]
# Offset
http_requests_total offset 1h
# Aggregation
sum(http_requests_total)
sum by (job) (http_requests_total)
sum without (instance) (http_requests_total)
# Rate (rate averages over the whole window; irate uses only the last two samples)
rate(http_requests_total[5m])
irate(http_requests_total[5m])
# Increase
increase(http_requests_total[1h])
# Histogram quantile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
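To make the difference between `rate`, `irate`, and `increase` concrete, here is a minimal Python sketch of the arithmetic. It is a simplification, not Prometheus's implementation: it handles counter resets but skips the extrapolation to the window boundaries that Prometheus applies. The function names and the sample window are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_increase(samples: List[Sample]) -> float:
    """Sum the growth between consecutive samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the whole new value counts as growth.
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second growth across the window (no boundary extrapolation)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


def simple_irate(samples: List[Sample]) -> float:
    """Per-second growth between the last two samples only."""
    return simple_rate(samples[-2:])


# Hypothetical 60-second window scraped every 15s, with a reset before t=45.
window = [(0.0, 100.0), (15.0, 130.0), (30.0, 160.0), (45.0, 10.0), (60.0, 70.0)]
```

For this window, `simple_rate` spreads the total growth over 60 seconds, while `simple_irate` reflects only the final 15-second step, which is why `irate` reacts faster but is noisier.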
Common queries
# CPU usage (%)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage (%)
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
# Network traffic (bytes per second)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
# HTTP request rate
sum(rate(http_requests_total[5m])) by (status)
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Latency P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
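The P99 query depends on how `histogram_quantile` interpolates inside cumulative `le` buckets. The sketch below approximates that calculation for classic histograms. It assumes linear interpolation within a bucket and a lower bound of 0 for the first bucket. The function name and bucket values are made up for illustration.

```python
import math
from typing import List, Tuple


def bucket_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper bound `le`, cumulative count) pairs.

    Buckets must be sorted by bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Linear interpolation between the bucket's lower and upper bounds.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request-duration buckets in seconds.
latency_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 99.0), (math.inf, 100.0)]
p95 = bucket_quantile(0.95, latency_buckets)
```

Because the result is interpolated, its accuracy is limited by the bucket boundaries: a quantile that lands in the `+Inf` bucket can only be reported as the largest finite bound.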
Configuration file
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - "rules/*.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
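The `keep` relabel action in the `kubernetes-pods` job can be read as a filter: join the values of `source_labels` with a separator, then drop every target whose joined value does not fully match `regex`. The Python sketch below illustrates that logic under two assumptions: Python's `re` stands in for the RE2 syntax Prometheus uses, and `keep_targets` and the sample pods are hypothetical.

```python
import re
from typing import Dict, List

Labels = Dict[str, str]


def keep_targets(targets: List[Labels], source_labels: List[str],
                 regex: str, separator: str = ";") -> List[Labels]:
    """Return only the targets whose joined source label values fully match regex."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # A missing label contributes an empty string to the joined value.
        value = separator.join(labels.get(name, "") for name in source_labels)
        # Relabel regexes are anchored at both ends, hence fullmatch.
        if pattern.fullmatch(value):
            kept.append(labels)
    return kept


SCRAPE_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
pods = [
    {"pod": "api-1", SCRAPE_LABEL: "true"},
    {"pod": "api-2", SCRAPE_LABEL: "false"},
    {"pod": "batch-1"},                       # annotation not set
    {"pod": "web-1", SCRAPE_LABEL: "truthy"},  # does not fully match "true"
]
scraped = keep_targets(pods, [SCRAPE_LABEL], "true")
```

Only `api-1` survives, which matches the intent of the config: pods opt in to scraping by setting the `prometheus.io/scrape: "true"` annotation.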
Alerting rules
# rules/alerts.yml
groups:
  - name: node
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency"
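Every rule here uses `for: 5m`, so an alert moves from inactive to pending and fires only after its expression has been true for five minutes without interruption. The following sketch models that state machine for a single series. It assumes evaluations happen exactly every `eval_interval_s` seconds, and `alert_states` is a hypothetical name.

```python
from typing import List, Optional


def alert_states(condition_results: List[bool], eval_interval_s: int, for_s: int) -> List[str]:
    """Return the alert state after each rule evaluation."""
    states = []
    active_since: Optional[int] = None  # evaluation time when the condition became true
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# A brief recovery at the fourth evaluation restarts the 60-second hold.
flapping = alert_states([True, True, True, False, True, True, True, True, True], 15, 60)

# With evaluation_interval: 15s and for: 5m, the 21st consecutive true result fires.
steady = alert_states([True] * 21, 15, 300)
```

This is why a flapping condition may never page anyone: each dip below the threshold restarts the clock, which is usually the desired behaviour for noisy signals such as CPU usage.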
Alertmanager
Configuration
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'
receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
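The routing tree and the inhibit rule in this config can be hard to picture. The sketch below mirrors both in plain Python, assuming exact-match label matchers, that the first matching child route wins (no `continue`), and that a warning is muted while a critical alert with the same `alertname` and `instance` is firing. All function names are hypothetical.

```python
from typing import Dict, List

Labels = Dict[str, str]

# Child routes from the config, in order; the root route falls back to 'default'.
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "slack"),
]
DEFAULT_RECEIVER = "default"


def matches(labels: Labels, matchers: Labels) -> bool:
    """True when every matcher label has exactly the required value."""
    return all(labels.get(key) == value for key, value in matchers.items())


def pick_receiver(labels: Labels) -> str:
    """Return the receiver of the first matching child route, else the root receiver."""
    for matchers, receiver in ROUTES:
        if matches(labels, matchers):
            return receiver
    return DEFAULT_RECEIVER


def is_inhibited(target: Labels, firing: List[Labels]) -> bool:
    """Apply the single inhibit rule: a critical alert mutes its matching warning."""
    if not matches(target, {"severity": "warning"}):
        return False
    return any(
        matches(source, {"severity": "critical"})
        and all(source.get(key) == target.get(key) for key in ("alertname", "instance"))
        for source in firing
    )


critical = {"alertname": "DiskSpaceLow", "instance": "node1", "severity": "critical"}
warning_same = {"alertname": "DiskSpaceLow", "instance": "node1", "severity": "warning"}
warning_other = {"alertname": "DiskSpaceLow", "instance": "node2", "severity": "warning"}
```

Note that `equal` includes `alertname`, so the rule only takes effect when the same alert name is defined at both severities.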
Grafana
Data source configuration
# provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
Dashboard JSON example
{
  "dashboard": {
    "title": "Node Metrics",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
          }
        ]
      }
    ]
  }
}
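Hand-writing dashboard JSON makes it easy to get the quote escaping inside PromQL wrong. One option is to build the structure in code and let a JSON library do the escaping. The sketch below reproduces the dashboard above using only the standard library. The helpers `panel` and `dashboard` are hypothetical and not part of any Grafana SDK.

```python
import json
from typing import Dict, List, Optional


def panel(title: str, panel_type: str, expr: str, legend: Optional[str] = None) -> Dict:
    """Build one panel with a single Prometheus query target."""
    target = {"expr": expr}
    if legend is not None:
        target["legendFormat"] = legend
    return {"title": title, "type": panel_type, "targets": [target]}


def dashboard(title: str, panels: List[Dict]) -> Dict:
    """Wrap panels in the same envelope as the JSON example."""
    return {"dashboard": {"title": title, "panels": list(panels)}}


CPU_EXPR = '100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
MEM_EXPR = "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"

node_metrics = dashboard("Node Metrics", [
    panel("CPU Usage", "graph", CPU_EXPR, legend="{{ instance }}"),
    panel("Memory Usage", "gauge", MEM_EXPR),
])

# json.dumps escapes the inner double quotes of the PromQL label matcher.
payload = json.dumps(node_metrics, indent=2)
```

The resulting `payload` can be saved to a file or sent to Grafana by whatever mechanism the deployment already uses.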
Common panel queries
# CPU usage (time series)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (gauge)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Request rate (bar chart)
sum(rate(http_requests_total[5m])) by (status)
# Latency heatmap
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
Common scenarios
Scenario 1: Kubernetes monitoring
# ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
Scenario 2: Custom metrics
# Python application
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency', ['method', 'endpoint'])

# Record the duration of every call in the latency histogram
@REQUEST_LATENCY.labels(method='GET', endpoint='/api').time()
def handle_request():
    REQUEST_COUNT.labels(method='GET', endpoint='/api', status='200').inc()
    # ...

# Serve the metrics over HTTP on port 8000
start_http_server(8000)
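It helps to know roughly what Prometheus receives when it scrapes port 8000. The sketch below renders a counter in the Prometheus text exposition format using only the standard library. It is a simplified illustration, not a replacement for `prometheus_client`, and `render_counter` and `escape_label_value` are hypothetical helpers.

```python
from typing import Dict, Iterable, Tuple


def escape_label_value(value: str) -> str:
    """Escape backslashes, newlines, and double quotes as the text format requires."""
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')


def render_counter(name: str, help_text: str,
                   samples: Iterable[Tuple[Dict[str, str], float]]) -> str:
    """Render HELP/TYPE metadata followed by one line per labelled sample."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        pairs = ",".join(f'{key}="{escape_label_value(val)}"' for key, val in labels.items())
        series = f"{name}{{{pairs}}}" if pairs else name
        lines.append(f"{series} {float(value)}")
    return "\n".join(lines) + "\n"


exposition = render_counter(
    "http_requests_total",
    "Total HTTP requests",
    [({"method": "GET", "endpoint": "/api", "status": "200"}, 3)],
)
```

Each distinct combination of label values becomes its own time series, so labels such as `endpoint` should stay low-cardinality.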
Scenario 3: SLO monitoring
# Availability over 30 days (SLO target: 99.9%)
1 - (sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d])))
# Error budget consumed over 7 days (1 means the whole budget for a 99.9% target is spent)
(sum(rate(http_requests_total{status=~"5.."}[7d])) / sum(rate(http_requests_total[7d]))) / (1 - 0.999)
# Latency SLO (P99 < 500ms)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[30d])) by (le)) < 0.5
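The error-budget arithmetic behind an availability SLO is easy to check by hand. A 99.9% target allows 0.1% of requests to fail, and the budget consumed is the observed error ratio divided by that allowance. Here is a small worked example with made-up request counts. The function names are hypothetical.

```python
def availability(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 1.0 - failed_requests / total_requests


def error_budget_consumed(total_requests: int, failed_requests: int,
                          slo_target: float = 0.999) -> float:
    """Share of the error budget used so far: 0.0 is untouched, 1.0 is fully spent."""
    allowed_error_ratio = 1.0 - slo_target
    observed_error_ratio = 1.0 - availability(total_requests, failed_requests)
    return observed_error_ratio / allowed_error_ratio


# Hypothetical window: 2,000,000 requests, 500 of them returned 5xx.
observed = availability(2_000_000, 500)            # about 0.99975
consumed = error_budget_consumed(2_000_000, 500)   # about a quarter of the budget
```

A value above 1.0 means the SLO has been violated for the window. Alerting on how fast this value grows is a common next step.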
Scenario 4: Alert silencing
# Create a silence
amtool silence add alertname=HighCPUUsage instance=node1 --duration=2h --comment="Maintenance"
# List silences
amtool silence query
# Expire a silence
amtool silence expire <silence-id>
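When silences are created from automation rather than from `amtool`, the same information is typically sent as a JSON body to Alertmanager's v2 silences endpoint. The sketch below only builds that body. The field names follow the v2 API as I understand it and should be checked against the OpenAPI spec of your Alertmanager version. `build_silence` and the author name are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def build_silence(matchers: Dict[str, str], duration: timedelta, comment: str,
                  created_by: str, now: Optional[datetime] = None) -> Dict:
    """Build a silence body with exact-match matchers and a fixed duration."""
    start = now or datetime.now(timezone.utc)
    end = start + duration
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Equivalent in intent to the `amtool silence add` example: two hours of maintenance on node1.
maintenance = build_silence(
    {"alertname": "HighCPUUsage", "instance": "node1"},
    duration=timedelta(hours=2),
    comment="Maintenance",
    created_by="ops-automation",  # hypothetical author
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Passing `now` explicitly keeps the function deterministic, which makes it straightforward to unit-test maintenance tooling.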
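Troubleshooting
| Problem | How to troubleshoot |
|---|---|
| Metrics missing | Check the scrape configuration and target status |
| Alert not firing | Check the rule syntax and the Alertmanager configuration |
| Slow queries | Optimize PromQL; increase the scrape interval |
| Storage full | Adjust retention; clean up old data |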
# Check Prometheus targets
curl http://prometheus:9090/api/v1/targets
# Check alerting rules
curl http://prometheus:9090/api/v1/rules
# Check Alertmanager status (the v1 API has been removed in recent releases; use v2)
curl http://alertmanager:9093/api/v2/status
# Test PromQL
curl 'http://prometheus:9090/api/v1/query?query=up'
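The raw output of `/api/v1/targets` is verbose, so a small filter helps when tracking down missing metrics. This sketch lists active targets whose health is not `up`, along with the last scrape error. The base URL is taken from the curl examples. `fetch_targets`, `unhealthy_targets`, and the sample payload are hypothetical, and only a few fields of the response are shown.

```python
import json
from typing import Dict, List, Tuple
from urllib.request import urlopen


def fetch_targets(base_url: str = "http://prometheus:9090") -> Dict:
    """Download and decode the targets API response."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def unhealthy_targets(payload: Dict) -> List[Tuple[str, str]]:
    """Return (scrape URL, last error) for every active target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", ""), target.get("lastError", ""))
        for target in targets
        if target.get("health") != "up"
    ]


# Trimmed-down sample response for illustration.
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://node1:9100/metrics", "health": "up", "lastError": ""},
            {"scrapeUrl": "http://node2:9100/metrics", "health": "down",
             "lastError": "connection refused"},
        ]
    },
}
problems = unhealthy_targets(sample_payload)
```

Against a live server, `unhealthy_targets(fetch_targets())` gives the same summary. An empty list means every active target was scraped successfully on its last attempt.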