datadog-observability

📁 ethanrcohen/datadog-agent-skill 📅 2 days ago

Install command

npx skills add https://github.com/ethanrcohen/datadog-agent-skill --skill datadog-observability

Skill documentation

Datadog Observability Skill (via pup)

Requires: the pup CLI, authenticated either via `pup auth login` or via the `DD_API_KEY` and `DD_APP_KEY` environment variables.
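A preflight check along these lines can catch a missing binary or missing credentials before the first query. This is only a sketch: `check_pup_prereqs` is a hypothetical helper name, and it treats unset environment variables as a warning rather than an error, on the assumption that `pup auth login` may have been used instead.

```shell
# Hypothetical helper (not part of pup): verify the prerequisites listed above.
# Returns 1 if pup is not on PATH; warns on stderr if the env-var credentials are absent.
check_pup_prereqs() {
  if ! command -v pup >/dev/null 2>&1; then
    echo "pup CLI not found on PATH" >&2
    return 1
  fi
  if [ -z "${DD_API_KEY:-}" ] || [ -z "${DD_APP_KEY:-}" ]; then
    # Not fatal: credentials may come from 'pup auth login' instead.
    echo "DD_API_KEY/DD_APP_KEY not both set; assuming 'pup auth login' was used" >&2
  fi
  return 0
}
```

Call `check_pup_prereqs || exit 1` at the top of a script that runs the commands below.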

Choose Your Workflow

Goal	Section
Find errors in a service	Search Logs
Count errors / compute metrics	Aggregate Logs
Query time-series metrics	Query Metrics
List APM services + perf stats	APM Services
View service dependencies	APM Dependencies

Search Logs

Returns log entries matching a Datadog query.

# Errors in a service in the last hour
pup logs search --query="service:payment AND status:error" --from="1h"

# Filter by service + environment
pup logs search --query="service:user-service AND env:production" --from="15m"

# Advanced attribute filters
pup logs search --query="service:payment AND @duration:>5s" --from="1h"

# Control result count
pup logs search --query="service:payment AND status:error" --from="24h" --limit=200

# Sort oldest first
pup logs search --query="status:error" --from="1h" --sort="asc"

Search Flags

Flag	Description
`--query`	Datadog query string (required)
`--from`	Start time: relative (`1h`, `30m`, `7d`) or Unix ms (required)
`--to`	End time (default: `now`)
`--limit`	Max results (default: 50, max: 1000)
`--sort`	`asc` or `desc` (default: desc)
`--index`	Comma-separated log indexes
`--output` / `-o`	`json` (default), `table`, `yaml`

Aggregate Logs

Compute metrics from logs: counts, averages, and percentiles. Useful for triage.

# How many errors per service in the last 24h?
pup logs aggregate --query="status:error" --from="24h" --compute="count" --group-by="service"

# Average request duration by service
pup logs aggregate --query="*" --from="1h" --compute="avg(@duration)" --group-by="service"

# 99th percentile latency
pup logs aggregate --query="service:api" --from="2h" --compute="percentile(@duration, 99)"

# Error count by HTTP status code
pup logs aggregate --query="status:error" --from="1d" --compute="count" --group-by="@http.status_code"

Compute Options

Compute	Example	Description
`count`	`--compute="count"`	Count matching logs
`avg(metric)`	`--compute="avg(@duration)"`	Average of a numeric attribute
`sum(metric)`	`--compute="sum(@bytes)"`	Sum
`min(metric)`	`--compute="min(@latency)"`	Minimum
`max(metric)`	`--compute="max(@latency)"`	Maximum
`cardinality(field)`	`--compute="cardinality(@user.id)"`	Unique values
`percentile(metric, N)`	`--compute="percentile(@duration, 99)"`	Percentile
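To compare several percentiles for one service, the `percentile(metric, N)` form can be run in a loop. The sketch below assumes the values 90, 95, and 99 are accepted for `N`; `latency_percentiles` is a hypothetical wrapper name, not a pup command.

```shell
# Hypothetical wrapper: run one aggregate per percentile for a service.
# Usage: latency_percentiles <service> [window]   (window defaults to 1h)
# The function body runs in a subshell so its variables stay local.
latency_percentiles() (
  svc="$1"
  window="${2:-1h}"
  for p in 90 95 99; do   # assumed to be valid N values
    pup logs aggregate --query="service:${svc}" --from="$window" \
      --compute="percentile(@duration, ${p})"
  done
)
```

For example, `latency_percentiles api 2h` issues three aggregate queries over the last two hours.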

Query Metrics

Query time-series metrics data.

# CPU usage across all hosts in the last hour
pup metrics query --query="avg:system.cpu.user{*}" --from="1h"

# Memory for a specific service in production
pup metrics query --query="avg:system.mem.used{service:web,env:prod}" --from="4h"

# Search for available metrics
pup metrics list --filter="system.cpu.*"

# Get metadata for a specific metric
pup metrics get system.cpu.user

Metrics Flags

Flag	Description
`--query`	Datadog metrics query (required)
`--from`	Start time: relative (`1h`, `30m`, `7d`) or Unix ms (required)
`--to`	End time (default: now)
`--output` / `-o`	`json` (default), `table`, `yaml`

APM Services

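List services and their performance statistics. Note: APM commands take Unix timestamps in seconds, not relative durations. The examples below use the macOS `date -v-1H` form; on Linux, use `date -d '1 hour ago' +%s` instead.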

# List all APM services
pup apm services list

# Service performance stats (last hour)
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s)

# Filter by environment
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s) --env=prod

# List operations for a service
pup apm services operations web-server --start=$(date -v-1H +%s) --end=$(date +%s)

# List resources (endpoints) for a service operation
pup apm services resources web-server --operation="GET /api/users" --from=$(date -v-1H +%s) --to=$(date +%s)

APM Flags

Flag	Description
`--start`	Start time as Unix timestamp (required for stats/operations)
`--end`	End time as Unix timestamp (required for stats/operations)
`--env`	Filter by environment
`--primary-tag`	Filter by primary tag (`group:value`)
`--output` / `-o`	`json` (default), `table`, `yaml`

APM Dependencies

View service call relationships based on trace data.

# All service dependencies in production
pup apm dependencies list --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)

# Dependencies for a specific service
pup apm dependencies list web-server --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)

# Service flow map with performance metrics
pup apm flow-map --query="env:prod" --from=$(date -v-1H +%s) --to=$(date +%s)

Datadog Query Syntax

All query filters use Datadog’s standard search syntax:

service:my-service              Filter by service
status:error                    Filter by log status
host:my-host                    Filter by host
env:production                  Filter by environment
@duration:>5s                   Numeric attribute filter
"exact phrase"                  Exact match
service:web AND status:error    Boolean operators (AND, OR, NOT)
service:web-*                   Wildcards
-status:info                    Negation
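When a script assembles filters dynamically, joining them with `AND` keeps the query explicit. A minimal sketch, where `and_query` is a hypothetical helper and the filter values are illustrative:

```shell
# Hypothetical helper: join any number of filter terms with " AND ".
# The function body runs in a subshell so its variables stay local.
and_query() (
  q=""
  for term in "$@"; do
    if [ -z "$q" ]; then
      q="$term"
    else
      q="$q AND $term"
    fi
  done
  printf '%s\n' "$q"
)

QUERY=$(and_query "service:web-*" "env:production" "-status:info")
echo "$QUERY"
```

The resulting string can be passed as `--query="$QUERY"`; keep the double quotes so the shell does not expand the wildcard.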

Output Formats

All commands support `--output` / `-o`:

Format	Flag	Use when
JSON	`--output json` (default)	Piping to `jq`, programmatic analysis
Table	`--output table`	Human-readable overview
YAML	`--output yaml`	Configuration-style output

# Pipe JSON to jq for field selection
pup logs search --query="status:error" --from="1h" | jq '.data[].attributes.message'

# Human-readable table
pup logs search --query="status:error" --from="1h" --output table

Common Investigation Patterns

# 1. Start broad: what services have errors?
pup logs aggregate --query="status:error" --from="1h" --compute="count" --group-by="service"

# 2. Drill into the top offender
pup logs search --query="service:payment AND status:error" --from="1h" --output table

# 3. Get full JSON details for a specific timeframe
pup logs search --query="service:payment AND status:error" --from="30m" --limit=10

# 4. Check if it's environment-specific
pup logs aggregate --query="service:payment AND status:error" --from="1h" --compute="count" --group-by="env"

# 5. Check APM service health
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s) --env=prod

# 6. View service dependencies
pup apm dependencies list payment --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)

# 7. Check a specific metric
pup metrics query --query="avg:trace.servlet.request.duration{service:payment}" --from="1h"
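The log-based steps above (1, 2, and 4) can be bundled into a single function when the same triage is repeated for different services. This is a sketch under assumptions: `triage_service` is a hypothetical name, and it reuses only flags already shown in this document.

```shell
# Hypothetical wrapper around steps 1, 2, and 4 of the investigation pattern.
# Usage: triage_service <service> [window]   (window defaults to 1h)
# The function body runs in a subshell so its variables stay local.
triage_service() (
  svc="$1"
  window="${2:-1h}"
  query="service:${svc} AND status:error"

  # Step 1: which services have errors?
  pup logs aggregate --query="status:error" --from="$window" \
    --compute="count" --group-by="service"

  # Step 2: drill into the chosen service
  pup logs search --query="$query" --from="$window" --output table

  # Step 4: is the problem environment-specific?
  pup logs aggregate --query="$query" --from="$window" \
    --compute="count" --group-by="env"
)
```

For example, `triage_service payment 30m` runs the three queries against the last 30 minutes.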

Time Ranges

Logs and metrics commands accept relative durations:

Input	Meaning
`1h`	1 hour ago
`30m`	30 minutes ago
`7d`	7 days ago
`1w`	1 week ago
`now`	Current time (default for `--to`)

APM commands require Unix timestamps. Use `date` to compute them:

Platform	1 hour ago	Now
macOS	`$(date -v-1H +%s)`	`$(date +%s)`
Linux	`$(date -d '1 hour ago' +%s)`	`$(date +%s)`
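One way to avoid the platform split is to subtract seconds from `date +%s` with shell arithmetic, which behaves the same with both the macOS and Linux `date`. A minimal sketch, where `unix_ago` is a hypothetical helper name:

```shell
# Hypothetical helper: print the Unix timestamp (seconds) for N seconds before now.
# Relies only on `date +%s` and shell arithmetic, so no -v or -d flag is needed.
unix_ago() {
  echo $(( $(date +%s) - $1 ))
}

START=$(unix_ago 3600)   # 1 hour ago
END=$(date +%s)          # now
echo "--start=$START --end=$END"
```

The printed flags can then be passed to the APM commands, for example `pup apm services stats --start="$START" --end="$END"`.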
