logging best practices
Install command
npx skills add https://github.com/nbbaier/agent-skills --skill logging-best-practices
Skill documentation
This skill is adapted from “Logging sucks. And here’s how to make it better.” by Boris Tane.
When helping with logging, observability, or debugging strategies, follow these principles:
Core Philosophy
- Logs are optimized for querying, not writing: always design with debugging in mind
- Context is everything: a log without correlation IDs is useless in distributed systems
- Logs are for humans during incidents, not just for compliance or “just in case”
- If you can’t filter and search your logs effectively, they provide zero value
Structured Logging Requirements
- Always use key-value pairs (JSON) instead of string interpolation
- Bad: “Payment failed for user 123”
- Good: {"event": "payment_failed", "user_id": "123", "reason": "insufficient_funds", "amount": 99.99}
- Structured logs are machine-parseable, enabling aggregation, alerting, and dashboards
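As a minimal sketch of the key-value approach above, the following uses Python's standard `logging` module with a small JSON formatter. The `JsonFormatter` class, the `build_logger` helper, and the `fields` key passed through `extra` are illustrative names chosen for this example, not part of the original article or of any particular library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Fields passed as extra={"fields": {...}} are merged in as key-value pairs.
        # The "fields" key is a convention of this sketch, not a logging built-in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=str)


def build_logger(name: str = "payments") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info(
    "payment_failed",
    extra={"fields": {"user_id": "123", "reason": "insufficient_funds", "amount": 99.99}},
)
```

The message argument carries only the event name; everything a query might filter on travels as a separate field rather than being interpolated into a sentence.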
Required Fields for Every Log Event
- timestamp: ISO 8601 with timezone (e.g., 2025-01-24T20:00:00Z)
- level: debug, info, warn, error (be consistent, don’t invent new levels)
- event: machine-readable event name, snake_case (e.g., user_login_success)
- request_id or trace_id: for correlating logs across a single request
- service: which service/application emitted this log
- environment: prod, staging, dev
Examples of High-Cardinality Fields (always include when available):
- user_id, org_id, account_id: who is affected
- request_id, trace_id, span_id: for distributed tracing
- order_id, transaction_id, job_id: domain-specific identifiers
These fields are what make logs actually queryable during incidents. Without them, you’re grepping through millions of lines blindly.
Look for opportunities for high-cardinality fields that can help you identify the root cause of an issue quickly.
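The required and high-cardinality fields listed above could be assembled by one small helper so that no call site forgets them. This is a sketch under assumptions: `make_event` and `missing_fields` are hypothetical helper names, and the example uses `request_id` (rather than `trace_id`) as the correlation field.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("timestamp", "level", "event", "request_id", "service", "environment")
ALLOWED_LEVELS = {"debug", "info", "warn", "error"}


def make_event(level, event, request_id, service, environment, **context):
    """Build one log event; `context` carries high-cardinality IDs such as user_id."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    record = dict(context)
    # Required fields are written last so a stray key in `context`
    # (for example a caller-supplied timestamp) cannot replace them.
    record.update(
        timestamp=datetime.now(timezone.utc).isoformat(),  # ISO 8601 with UTC offset
        level=level,
        event=event,
        request_id=request_id,
        service=service,
        environment=environment,
    )
    return record


def missing_fields(record):
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


event = make_event(
    "info", "order_paid", "req-1", "checkout", "dev",
    user_id="u-42", order_id="o-9001",
)
```

A check like `missing_fields` can run in tests or in a lint step over sample log output, which keeps the field contract from drifting between services.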
Context Propagation
- Pass trace/request IDs through all service boundaries (HTTP headers, message queues, etc.)
- Downstream services must inherit correlation IDs from upstream
- Use middleware or interceptors to automatically inject context into every log
- For async jobs, store and restore the original request context
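One way to sketch automatic context injection in Python is a `contextvars.ContextVar` combined with a `logging.Filter`. The header name `X-Request-ID` is a common convention assumed here, not something the original specifies, and `handle_request` and `outbound_headers` are hypothetical stand-ins for real middleware.

```python
import contextvars
import logging
import uuid

HEADER = "X-Request-ID"  # assumed header name; use whatever your stack standardizes on
request_id_var = contextvars.ContextVar("request_id", default=None)


class RequestContextFilter(logging.Filter):
    """Attach the current request ID to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


logger = logging.getLogger("context_demo")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s request_id=%(request_id)s"))
handler.addFilter(RequestContextFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def outbound_headers() -> dict:
    """Headers to attach to downstream calls so they inherit the same ID."""
    return {HEADER: request_id_var.get()}


def handle_request(headers: dict) -> dict:
    """Inherit the upstream ID if present, otherwise mint a new one."""
    # Plain dict lookup is case-sensitive; real frameworks normalize header names.
    request_id = headers.get(HEADER) or str(uuid.uuid4())
    token = request_id_var.set(request_id)
    try:
        logger.info("request_received")  # request_id is injected by the filter
        return outbound_headers()
    finally:
        request_id_var.reset(token)
```

For async jobs, the same idea applies: store the ID in the job payload when enqueueing, then call `request_id_var.set(...)` with it when the worker picks the job up.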
Log Levels (Use Them Correctly):
- debug: Verbose details for local development, usually disabled in production
- info: Normal operations worth recording (user actions, job completions, deploys)
- warn: Something unexpected happened but the system handled it (retries, fallbacks)
- error: Something failed and likely needs human attention (exceptions, failed requests)
Don’t log expected conditions at error level (e.g., a user entering the wrong password).
What to Log:
- Request entry and exit points (with duration)
- State transitions (order created → paid → shipped)
- External service calls (with latency and response codes)
- Authentication and authorization events
- Background job starts, completions, and failures
- Retry attempts and circuit breaker state changes
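Several items in the list above (entry and exit points, external calls, background jobs) share one shape: a start event, then either a completion or a failure event with a duration. A context manager is one way to sketch that. The names `timed`, `emit`, and `EMITTED` are hypothetical, and `emit` simply prints JSON where a real system would call its logger.

```python
import json
import time
from contextlib import contextmanager

EMITTED = []  # in-memory copy so the example can be inspected; not for production use


def emit(event, **fields):
    record = {"event": event, **fields}
    EMITTED.append(record)
    print(json.dumps(record))


@contextmanager
def timed(name, **fields):
    """Log <name>_started, then <name>_completed or <name>_failed with duration_ms."""
    start = time.perf_counter()
    emit(f"{name}_started", **fields)
    try:
        yield
    except Exception as exc:
        elapsed = round((time.perf_counter() - start) * 1000, 2)
        emit(f"{name}_failed", duration_ms=elapsed, error_type=type(exc).__name__, **fields)
        raise  # logging must not swallow the failure
    else:
        elapsed = round((time.perf_counter() - start) * 1000, 2)
        emit(f"{name}_completed", duration_ms=elapsed, **fields)


with timed("inventory_call", request_id="req-1"):
    time.sleep(0.01)  # stands in for the external service call
```

Because the exception is re-raised, the wrapper adds observability without changing control flow for the caller.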
What NOT to Log:
- Sensitive data (passwords, tokens, PII, credit card numbers)
- Logs inside tight loops (will generate millions of useless entries)
- Success cases that provide no debugging value
- Redundant information already captured by infrastructure (load balancer logs, etc.)
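To keep sensitive values out of log output, one option is to redact by field name just before serialization. The sketch below assumes a hypothetical `SENSITIVE_KEYS` set and `redact` helper; the key list is illustrative and would need to match your own schema.

```python
SENSITIVE_KEYS = {"password", "token", "authorization", "credit_card", "ssn"}


def redact(fields: dict) -> dict:
    """Return a copy of `fields` with sensitive values masked, recursing into nested dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


safe = redact({
    "user_id": "123",
    "password": "hunter2",
    "payment": {"credit_card": "4111111111111111", "amount": 99.99},
})
```

Name-based redaction only catches values stored under known keys. It does not detect sensitive data embedded in free-form strings, so it complements, rather than replaces, the habit of not passing such data to the logger at all.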
Naming Conventions:
- Be consistent across all services: agree on field names as a team
- Use snake_case for field names: user_id, not userId or user-id
- Use past-tense verbs for events: payment_completed, not complete_payment
- Prefix events by domain when helpful: auth.login_failed, billing.invoice_created
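Part of these conventions can be checked mechanically. The sketch below uses two regular expressions, `FIELD_NAME` and `EVENT_NAME` (both hypothetical names), to validate snake_case and an optional `domain.` prefix. It cannot tell whether a verb is in the past tense, so `complete_payment` would still pass; that part remains a review-time check.

```python
import re

# snake_case: lowercase words separated by single underscores
FIELD_NAME = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")
# optional "domain." prefix followed by a snake_case event name
EVENT_NAME = re.compile(r"(?:[a-z][a-z0-9_]*\.)?[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_valid_field(name: str) -> bool:
    return FIELD_NAME.fullmatch(name) is not None


def is_valid_event(name: str) -> bool:
    return EVENT_NAME.fullmatch(name) is not None


for candidate in ("user_id", "userId", "user-id"):
    print(candidate, is_valid_field(candidate))
```

A check like this fits naturally in a unit test that scans the event names a service emits.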
Performance Considerations:
- Use sampling for high-volume debug logs in production
- Avoid logging inside hot paths unless absolutely necessary
- Buffer and batch log writes to reduce I/O overhead
- Consider log levels that can be changed at runtime without redeploying
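The points above map onto facilities in Python's standard library, sketched here under assumptions. `DebugSampler`, `set_level`, and `build_buffered_handler` are hypothetical helpers; `logging.handlers.MemoryHandler` is the stdlib handler that buffers records and flushes them to a target when its capacity is reached or a record at `flushLevel` arrives. How `set_level` gets triggered (a signal, an admin endpoint, a config watcher) is left open.

```python
import logging
import logging.handlers
import random


class DebugSampler(logging.Filter):
    """Keep every record above DEBUG; keep DEBUG records with probability `rate`."""

    def __init__(self, rate: float, rng=None):
        super().__init__()
        self.rate = rate
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True
        return self.rng.random() < self.rate


def build_buffered_handler(target: logging.Handler, capacity: int = 100) -> logging.Handler:
    """Buffer up to `capacity` records; flush early when an ERROR arrives."""
    return logging.handlers.MemoryHandler(capacity, flushLevel=logging.ERROR, target=target)


def set_level(logger_name: str, level_name: str) -> None:
    """Change a logger's level at runtime, without a redeploy."""
    logging.getLogger(logger_name).setLevel(level_name.upper())


logger = logging.getLogger("sampling_demo")
buffered = build_buffered_handler(logging.StreamHandler())
buffered.addFilter(DebugSampler(rate=0.01))  # roughly 1% of debug records survive
logger.addHandler(buffered)
logger.setLevel(logging.DEBUG)
```

Buffering trades durability for throughput: records still in memory are lost if the process dies before a flush, which is why flushing on error-level records is a reasonable default.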
During Incidents:
- Your logs should answer: Who was affected? What failed? When? Why?
- If you can’t answer these within 5 minutes of querying, your logging strategy needs work
- Post-incident: add the logs you wished you had
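To illustrate how structured logs answer the incident questions above, here is a sketch that summarizes one event type from JSON-lines log output. `summarize_failures` is a hypothetical helper, the sample lines are made up for the example, and it assumes every timestamp uses the same ISO 8601 format so that string comparison orders them correctly. In practice this query would run in your log platform rather than in a script.

```python
import json
from collections import Counter


def summarize_failures(lines, event):
    """Answer who / what / when / why for one event type."""
    affected, reasons, timestamps = set(), Counter(), []
    for line in lines:
        record = json.loads(line)
        if record.get("event") != event:
            continue
        affected.add(record.get("user_id"))
        reasons[record.get("reason", "unknown")] += 1
        timestamps.append(record["timestamp"])
    return {
        "affected_users": sorted(u for u in affected if u is not None),
        "reasons": dict(reasons),
        "first_seen": min(timestamps, default=None),
        "last_seen": max(timestamps, default=None),
    }


# Made-up sample data, for illustration only.
sample = [
    '{"timestamp": "2025-01-24T20:00:00Z", "event": "payment_failed", "user_id": "123", "reason": "insufficient_funds"}',
    '{"timestamp": "2025-01-24T20:00:05Z", "event": "payment_completed", "user_id": "456"}',
    '{"timestamp": "2025-01-24T20:01:00Z", "event": "payment_failed", "user_id": "789", "reason": "card_expired"}',
]
summary = summarize_failures(sample, "payment_failed")
```

If a summary like this cannot be produced from your logs, the missing field is usually the first candidate for the post-incident additions mentioned above.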