datadog-observability
Install command
npx skills add https://github.com/ethanrcohen/datadog-agent-skill --skill datadog-observability
Datadog Observability Skill (via pup)
Requires: pup CLI, authenticated via pup auth login or DD_API_KEY + DD_APP_KEY env vars.
Choose Your Workflow
| Goal | Command |
|---|---|
| Find errors in a service | Search Logs |
| Count errors / compute metrics | Aggregate Logs |
| Query time-series metrics | Query Metrics |
| List APM services + perf stats | APM Services |
| View service dependencies | APM Dependencies |
Search Logs
Returns log entries matching a Datadog query.
# Errors in a service in the last hour
pup logs search --query="service:payment AND status:error" --from="1h"
# Filter by service + environment
pup logs search --query="service:user-service AND env:production" --from="15m"
# Advanced attribute filters
pup logs search --query="service:payment AND @duration:>5s" --from="1h"
# Control result count
pup logs search --query="service:payment AND status:error" --from="24h" --limit=200
# Sort oldest first
pup logs search --query="status:error" --from="1h" --sort="asc"
Search Flags
| Flag | Description |
|---|---|
| `--query` | Datadog query string (required) |
| `--from` | Start time: relative (1h, 30m, 7d) or Unix ms (required) |
| `--to` | End time (default: now) |
| `--limit` | Max results (default: 50, max: 1000) |
| `--sort` | asc or desc (default: desc) |
| `--index` | Comma-separated log indexes |
| `--output` / `-o` | json (default), table, yaml |
Aggregate Logs
Compute metrics from logs — counts, averages, percentiles. Useful for triage.
# How many errors per service in the last 24h?
pup logs aggregate --query="status:error" --from="24h" --compute="count" --group-by="service"
# Average request duration by service
pup logs aggregate --query="*" --from="1h" --compute="avg(@duration)" --group-by="service"
# 99th percentile latency
pup logs aggregate --query="service:api" --from="2h" --compute="percentile(@duration, 99)"
# Error count by HTTP status code
pup logs aggregate --query="status:error" --from="1d" --compute="count" --group-by="@http.status_code"
Compute Options
| Compute | Example | Description |
|---|---|---|
| `count` | `--compute="count"` | Count matching logs |
| `avg(metric)` | `--compute="avg(@duration)"` | Average of a numeric attribute |
| `sum(metric)` | `--compute="sum(@bytes)"` | Sum |
| `min(metric)` | `--compute="min(@latency)"` | Minimum |
| `max(metric)` | `--compute="max(@latency)"` | Maximum |
| `cardinality(field)` | `--compute="cardinality(@user.id)"` | Unique values |
| `percentile(metric, N)` | `--compute="percentile(@duration, 99)"` | Percentile |
Query Metrics
Query time-series metrics data.
# CPU usage across all hosts in the last hour
pup metrics query --query="avg:system.cpu.user{*}" --from="1h"
# Memory for a specific service in production
pup metrics query --query="avg:system.mem.used{service:web,env:prod}" --from="4h"
# Search for available metrics
pup metrics list --filter="system.cpu.*"
# Get metadata for a specific metric
pup metrics get system.cpu.user
Metrics Flags
| Flag | Description |
|---|---|
| `--query` | Datadog metrics query (required) |
| `--from` | Start time: relative (1h, 30m, 7d) or Unix ms (required) |
| `--to` | End time (default: now) |
| `--output` / `-o` | json (default), table, yaml |
APM Services
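List services and their performance statistics. Note: APM commands take Unix timestamps, not relative times. The examples below use the macOS form of date (date -v-1H +%s); on Linux, use date -d '1 hour ago' +%s instead.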
# List all APM services
pup apm services list
# Service performance stats (last hour)
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s)
# Filter by environment
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s) --env=prod
# List operations for a service
pup apm services operations web-server --start=$(date -v-1H +%s) --end=$(date +%s)
# List resources (endpoints) for a service operation
pup apm services resources web-server --operation="GET /api/users" --from=$(date -v-1H +%s) --to=$(date +%s)
APM Flags
| Flag | Description |
|---|---|
| `--start` | Start time as Unix timestamp (required for stats/operations) |
| `--end` | End time as Unix timestamp (required for stats/operations) |
| `--env` | Filter by environment |
| `--primary-tag` | Filter by primary tag (group:value) |
| `--output` / `-o` | json (default), table, yaml |
APM Dependencies
View service call relationships based on trace data.
# All service dependencies in production
pup apm dependencies list --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)
# Dependencies for a specific service
pup apm dependencies list web-server --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)
# Service flow map with performance metrics
pup apm flow-map --query="env:prod" --from=$(date -v-1H +%s) --to=$(date +%s)
Datadog Query Syntax
All query filters use Datadog’s standard search syntax:
| Syntax | Meaning |
|---|---|
| `service:my-service` | Filter by service |
| `status:error` | Filter by log status |
| `host:my-host` | Filter by host |
| `env:production` | Filter by environment |
| `@duration:>5s` | Numeric attribute filter |
| `"exact phrase"` | Exact match |
| `service:web AND status:error` | Boolean operators (AND, OR, NOT) |
| `service:web-*` | Wildcards |
| `-status:info` | Negation |
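Longer queries are easier to keep readable when the filter terms are joined programmatically. The following is a minimal sketch, assuming a POSIX shell; `and_query` is a hypothetical helper introduced here for illustration and is not part of pup.

```shell
# and_query: join filter terms with " AND " into one Datadog query string.
# Hypothetical helper for illustration; not provided by pup.
# The parenthesised body runs in a subshell, so its variables stay local.
and_query() (
  query=""
  for term in "$@"; do
    if [ -z "$query" ]; then
      query="$term"
    else
      query="$query AND $term"
    fi
  done
  printf '%s\n' "$query"
)

# Example use (requires pup to be installed and authenticated):
# pup logs search --query="$(and_query service:payment status:error)" --from="1h"
```

Terms are passed through unchanged, so a term containing spaces, such as an exact phrase, must be quoted by the caller.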
Output Formats
All commands support --output / -o:
| Format | Flag | Use when |
|---|---|---|
| JSON | `--output json` (default) | Piping to jq, programmatic analysis |
| Table | `--output table` | Human-readable overview |
| YAML | `--output yaml` | Configuration-style output |
# Pipe JSON to jq for field selection
pup logs search --query="status:error" --from="1h" | jq '.data[].attributes.message'
# Human-readable table
pup logs search --query="status:error" --from="1h" --output table
Common Investigation Patterns
# 1. Start broad: what services have errors?
pup logs aggregate --query="status:error" --from="1h" --compute="count" --group-by="service"
# 2. Drill into the top offender
pup logs search --query="service:payment AND status:error" --from="1h" --output table
# 3. Get full JSON details for a specific timeframe
pup logs search --query="service:payment AND status:error" --from="30m" --limit=10
# 4. Check if it's environment-specific
pup logs aggregate --query="service:payment AND status:error" --from="1h" --compute="count" --group-by="env"
# 5. Check APM service health
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s) --env=prod
# 6. View service dependencies
pup apm dependencies list payment --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)
# 7. Check a specific metric
pup metrics query --query="avg:trace.servlet.request.duration{service:payment}" --from="1h"
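The first two steps of this pattern can be wrapped in a small function so that the service name and time window are typed only once. This is a sketch rather than a supported workflow: `triage` is a hypothetical name, and the only pup commands it runs are the two shown in steps 1 and 2.

```shell
# triage SERVICE [WINDOW]: run steps 1 and 2 of the investigation pattern.
# Hypothetical wrapper for illustration; WINDOW defaults to 1h.
# The parenthesised body runs in a subshell, so its variables stay local.
triage() (
  service=$1
  window=${2:-1h}
  if [ -z "$service" ]; then
    echo "usage: triage SERVICE [WINDOW]" >&2
    exit 2
  fi
  # Step 1: error counts per service over the window
  pup logs aggregate --query="status:error" --from="$window" --compute="count" --group-by="service"
  # Step 2: recent error logs for the chosen service, as a table
  pup logs search --query="service:$service AND status:error" --from="$window" --output table
)

# Example use (requires pup to be installed and authenticated):
# triage payment 30m
```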
Time Ranges
Logs & Metrics accept relative durations:
| Input | Meaning |
|---|---|
| `1h` | 1 hour ago |
| `30m` | 30 minutes ago |
| `7d` | 7 days ago |
| `1w` | 1 week ago |
| `now` | Current time (default for `--to`) |
APM commands require Unix timestamps. Use date to compute them:
| Platform | 1 hour ago | Now |
|---|---|---|
| macOS | `$(date -v-1H +%s)` | `$(date +%s)` |
| Linux | `$(date -d '1 hour ago' +%s)` | `$(date +%s)` |
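Because the two date variants differ, a script that must run on both platforms can do the subtraction itself and rely only on `date +%s`. The sketch below assumes a POSIX shell; `rel_to_epoch` is a hypothetical helper, not a pup feature, and accepts only the m, h, d, and w suffixes listed above.

```shell
# rel_to_epoch SPEC: print the Unix timestamp (seconds) for a relative
# duration such as 30m, 1h, 7d, or 1w. Hypothetical helper, not part of pup.
# The parenthesised body runs in a subshell, so "exit" only ends the function.
rel_to_epoch() (
  spec=$1
  num=${spec%?}        # everything except the last character
  unit=${spec#"$num"}  # the last character (the unit suffix)
  case $num in
    ''|*[!0-9]*) echo "invalid duration: $spec" >&2; exit 1 ;;
  esac
  case $unit in
    m) secs=$((num * 60)) ;;
    h) secs=$((num * 3600)) ;;
    d) secs=$((num * 86400)) ;;
    w) secs=$((num * 604800)) ;;
    *) echo "invalid duration: $spec" >&2; exit 1 ;;
  esac
  echo $(( $(date +%s) - secs ))
)

# Example use (requires pup to be installed and authenticated):
# pup apm services stats --start="$(rel_to_epoch 1h)" --end="$(date +%s)" --env=prod
```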