datadog

📁 bagelhole/devops-security-agent-skills 📅 9 days ago
Install command
npx skills add https://github.com/bagelhole/devops-security-agent-skills --skill datadog


Skill Documentation

Datadog

Monitor infrastructure and applications with Datadog’s unified observability platform.

When to Use This Skill

Use this skill when:

  • Implementing enterprise-grade monitoring
  • Setting up APM and distributed tracing
  • Creating unified dashboards for infrastructure and apps
  • Configuring intelligent alerting
  • Monitoring cloud infrastructure (AWS, Azure, GCP)

Prerequisites

  • Datadog account and API key
  • Agent installation access
  • Application code access for APM

Agent Installation

Linux

# Install agent
DD_API_KEY=<YOUR_API_KEY> DD_SITE="datadoghq.com" bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"

# Or via package manager (requires the Datadog apt repository to be configured first)
apt-get update && apt-get install -y datadog-agent

# Configure API key
echo "api_key: YOUR_API_KEY" >> /etc/datadog-agent/datadog.yaml

# Start agent
systemctl start datadog-agent
systemctl enable datadog-agent

Docker

# docker-compose.yml
version: '3.8'

services:
  datadog-agent:
    image: gcr.io/datadoghq/agent:7
    environment:
      - DD_API_KEY=${DD_API_KEY}
      - DD_SITE=datadoghq.com
      - DD_LOGS_ENABLED=true
      - DD_APM_ENABLED=true
      - DD_PROCESS_AGENT_ENABLED=true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    ports:
      - "8126:8126"  # APM
      - "8125:8125/udp"  # DogStatsD

Kubernetes

# Using Helm
helm repo add datadog https://helm.datadoghq.com

helm install datadog datadog/datadog \
  --set datadog.apiKey=${DD_API_KEY} \
  --set datadog.site=datadoghq.com \
  --set datadog.logs.enabled=true \
  --set datadog.apm.portEnabled=true \
  --set datadog.processAgent.enabled=true \
  --namespace datadog \
  --create-namespace

Agent Configuration

# /etc/datadog-agent/datadog.yaml
api_key: YOUR_API_KEY
site: datadoghq.com

# Hostname
hostname: myserver.example.com

# Tags applied to all metrics
tags:
  - env:production
  - service:myapp
  - team:platform

# Log collection
logs_enabled: true

# APM
apm_config:
  enabled: true
  apm_dd_url: https://trace.agent.datadoghq.com

# Process monitoring
process_config:
  enabled: true

# Container log collection
logs_config:
  container_collect_all: true

# Map Docker labels to tags
docker_labels_as_tags:
  app: service
  environment: env

Integration Configuration

MySQL

# /etc/datadog-agent/conf.d/mysql.d/conf.yaml
init_config:

instances:
  - host: localhost
    port: 3306
    username: datadog
    password: <PASSWORD>
    tags:
      - env:production
    options:
      replication: true
      extra_status_metrics: true

PostgreSQL

# /etc/datadog-agent/conf.d/postgres.d/conf.yaml
init_config:

instances:
  - host: localhost
    port: 5432
    username: datadog
    password: <PASSWORD>
    dbname: mydb
    collect_activity_metrics: true
    collect_database_size_metrics: true

NGINX

# /etc/datadog-agent/conf.d/nginx.d/conf.yaml
init_config:

instances:
  - nginx_status_url: http://localhost:80/nginx_status
    tags:
      - env:production

Log Collection

File-Based Logs

# /etc/datadog-agent/conf.d/myapp.d/conf.yaml
logs:
  - type: file
    path: /var/log/myapp/*.log
    service: myapp
    source: python
    sourcecategory: custom
    tags:
      - env:production

  - type: file
    path: /var/log/nginx/access.log
    service: nginx
    source: nginx
    log_processing_rules:
      - type: exclude_at_match
        name: exclude_healthchecks
        pattern: health_check
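An exclude_at_match rule drops any log line whose content matches the pattern anywhere in the line (the pattern is a regular expression). The behavior can be sketched in plain Python with illustrative sample lines:

```python
import re

# exclude_at_match drops lines matching the pattern anywhere in the line
pattern = re.compile("health_check")

lines = [
    "GET /health_check 200",
    "GET /api/orders 200",
]
kept = [line for line in lines if not pattern.search(line)]
print(kept)  # ['GET /api/orders 200']
```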

Docker Logs

# docker-compose.yml
services:
  myapp:
    labels:
      com.datadoghq.ad.logs: '[{"source": "python", "service": "myapp"}]'

Kubernetes Logs

# Pod annotation
apiVersion: v1
kind: Pod
metadata:
  annotations:
    ad.datadoghq.com/myapp.logs: |
      [{
        "source": "python",
        "service": "myapp",
        "log_processing_rules": [{
          "type": "multi_line",
          "name": "python_tracebacks",
          "pattern": "^Traceback"
        }]
      }]

APM Configuration

Python

# Install library
pip install ddtrace

# Run with auto-instrumentation
ddtrace-run python app.py

from ddtrace import patch_all, tracer

# Automatic instrumentation
patch_all()

# Configure tracer agent endpoint. Service, env, and version are
# typically set via environment variables:
# DD_SERVICE=myapp DD_ENV=production DD_VERSION=1.0.0
tracer.configure(
    hostname='localhost',
    port=8126,
)

# Manual instrumentation
@tracer.wrap(service='myapp', resource='process_order')
def process_order(order_id):
    with tracer.trace('validate_order') as span:
        span.set_tag('order_id', order_id)
        # Validation logic

    with tracer.trace('save_order'):
        # Save logic
        pass

Node.js

# Install library
npm install dd-trace

# Run with auto-instrumentation
DD_TRACE_ENABLED=true node --require dd-trace/init app.js

const tracer = require('dd-trace').init({
  service: 'myapp',
  env: 'production',
  version: '1.0.0',
  logInjection: true
});

// Manual instrumentation
const span = tracer.startSpan('custom_operation');
span.setTag('user_id', userId);
// ... operation
span.finish();

Go

import (
    "context"

    "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func main() {
    tracer.Start(
        tracer.WithService("myapp"),
        tracer.WithEnv("production"),
        tracer.WithServiceVersion("1.0.0"),
    )
    defer tracer.Stop()

    // Manual span; pass ctx to downstream calls so they join the trace
    span, ctx := tracer.StartSpanFromContext(context.Background(), "process_request")
    defer span.Finish()

    userID := "12345" // example value
    span.SetTag("user_id", userID)
    _ = ctx
}

Custom Metrics

DogStatsD

from datadog.dogstatsd import DogStatsd

statsd = DogStatsd(host='localhost', port=8125)

# Counter
statsd.increment('myapp.orders.count', tags=['env:production'])

# Gauge
statsd.gauge('myapp.queue.size', queue_size, tags=['queue:orders'])

# Histogram
statsd.histogram('myapp.request.duration', response_time)

# Distribution
statsd.distribution('myapp.response_time', duration, tags=['endpoint:/api/orders'])

API Submission

import time

from datadog_api_client import Configuration, ApiClient
from datadog_api_client.v2.api.metrics_api import MetricsApi
from datadog_api_client.v2.model.metric_intake_type import MetricIntakeType
from datadog_api_client.v2.model.metric_payload import MetricPayload
from datadog_api_client.v2.model.metric_series import MetricSeries
from datadog_api_client.v2.model.metric_point import MetricPoint

configuration = Configuration()
with ApiClient(configuration) as api_client:
    api = MetricsApi(api_client)

    payload = MetricPayload(
        series=[
            MetricSeries(
                metric="custom.metric.name",
                type=MetricIntakeType.GAUGE,
                points=[MetricPoint(value=42.0, timestamp=int(time.time()))],
                tags=["env:production"]
            )
        ]
    )
    api.submit_metrics(body=payload)

Dashboards

Dashboard JSON

{
  "title": "Application Overview",
  "widgets": [
    {
      "definition": {
        "type": "timeseries",
        "title": "Request Rate",
        "requests": [
          {
            "q": "sum:trace.http.request.hits{service:myapp}.as_rate()",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "query_value",
        "title": "Error Rate",
        "requests": [
          {
            "q": "sum:trace.http.request.errors{service:myapp}.as_rate() / sum:trace.http.request.hits{service:myapp}.as_rate() * 100"
          }
        ],
        "precision": 2
      }
    }
  ]
}
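Because dashboard definitions are plain JSON, a quick structural check before submitting them through the API can catch typos early. The sketch below is a hand-rolled illustration, not Datadog's actual schema validation:

```python
import json

def check_dashboard(raw: str) -> int:
    """Illustrative sanity check: every widget needs a typed definition."""
    dash = json.loads(raw)
    if not dash.get("title"):
        raise ValueError("dashboard needs a title")
    widgets = dash.get("widgets", [])
    for i, widget in enumerate(widgets):
        if "type" not in widget.get("definition", {}):
            raise ValueError(f"widget {i} is missing definition.type")
    return len(widgets)

sample = '{"title": "Application Overview", "widgets": [{"definition": {"type": "timeseries"}}]}'
print(check_dashboard(sample))  # 1
```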

Monitors (Alerts)

Metric Monitor

{
  "name": "High Error Rate",
  "type": "metric alert",
  "query": "sum(last_5m):sum:trace.http.request.errors{service:myapp}.as_count() / sum:trace.http.request.hits{service:myapp}.as_count() > 0.05",
  "message": "Error rate is {{value}}% for {{service.name}}. @slack-alerts",
  "tags": ["service:myapp", "env:production"],
  "options": {
    "thresholds": {
      "critical": 0.05,
      "warning": 0.02
    },
    "notify_no_data": true,
    "no_data_timeframe": 10
  }
}

APM Monitor

{
  "name": "High Latency Alert",
  "type": "trace-analytics alert",
  "query": "trace-analytics(\"service:myapp @http.status_code:2*\").rollup(\"avg\", \"@duration\").last(\"5m\") > 2000000000",
  "message": "Average latency is above 2 seconds. @pagerduty",
  "options": {
    "thresholds": {
      "critical": 2000000000
    }
  }
}

Common Issues

Issue: Agent Not Reporting

Problem: No data appearing in Datadog
Solution: Check the API key and verify agent health with datadog-agent status

Issue: Missing Traces

Problem: APM traces not appearing
Solution: Verify APM is enabled, check the tracer configuration, and confirm port 8126 is reachable

Issue: High Cardinality Tags

Problem: Custom metrics being dropped
Solution: Reduce the number of unique tag values, and use distributions instead of histograms
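One common way to reduce tag cardinality is to collapse an unbounded value (a raw latency, a user id) into a small fixed set of buckets before tagging. The bucket names and cutoffs below are illustrative, not a Datadog convention:

```python
# Collapse a raw millisecond latency into one of three bounded tag values
# so tag cardinality stays fixed regardless of traffic.
def latency_bucket(ms: float) -> str:
    if ms < 100:
        return "fast"
    if ms < 500:
        return "medium"
    return "slow"

# A bounded tag instead of e.g. the raw millisecond value
tag = f"latency_bucket:{latency_bucket(250)}"
print(tag)  # latency_bucket:medium
```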

Best Practices

  • Use consistent service and environment tags
  • Implement proper tag naming conventions
  • Use unified service tagging (service, env, version)
  • Set up service-level monitors
  • Create dashboards per service
  • Implement log correlation with traces
  • Use distributions for latency metrics
  • Configure proper alert escalation
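One way to keep service, env, and version tags consistent across custom metrics is to build them in a single place and extend them per call site. A minimal sketch (the helper name and values are illustrative):

```python
# Define the unified service tags once and reuse them for every metric or log.
SERVICE, ENV, VERSION = "myapp", "production", "1.0.0"

def base_tags(extra=None):
    """Return the standard tag set, optionally extended per call site."""
    tags = [f"service:{SERVICE}", f"env:{ENV}", f"version:{VERSION}"]
    return tags + list(extra or [])

print(base_tags(["endpoint:/api/orders"]))
```

A call like statsd.increment('myapp.orders.count', tags=base_tags()) then guarantees every emitted metric carries the same base tags.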

Related Skills