log-analysis
Install command
npx skills add https://github.com/incidentfox/incidentfox --skill log-analysis
Skill documentation
Log Analysis Methodology
Core Philosophy: Partition-First
NEVER start by reading raw log samples.
Raw logs are high-volume and mostly irrelevant to any single incident. Partitioning them first, by time, service, and error type, prevents:
- Missing the forest for the trees
- Wasting time on irrelevant data
- Overwhelming context with noise
The 4-Step Process
Step 1: Get Statistics
Before ANY log search, understand the landscape:
CloudWatch Insights:
# How many errors?
filter @message like /ERROR/
| stats count(*) as total
# Error rate over time
filter @message like /ERROR/
| stats count(*) by bin(5m)
# What types of errors?
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) as error_count by error_type
| sort error_count desc
Datadog:
# Error distribution by service
service:* status:error | stats count by service
# Error types
service:myapp status:error | stats count by @error.kind
Questions to answer:
- What’s the total error volume?
- Is it increasing, stable, or decreasing?
- What are the unique error types?
- Which services/hosts are affected?
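When the logs are only available as exported text rather than through a query engine, the same statistics can be approximated locally. The following is a minimal sketch, not part of the skill itself: the sample lines and the `<ISO timestamp> <LEVEL> <message>` layout are assumptions, and the helper names `bucket_start` and `summarize` are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Hypothetical sample lines; the "<timestamp> <LEVEL> <message>" layout is an assumption.
LINES = [
    "2024-05-01T10:00:12Z ERROR java.net.SocketTimeoutException: read timed out",
    "2024-05-01T10:01:40Z INFO request ok",
    "2024-05-01T10:03:05Z ERROR java.lang.NullPointerException at Foo.bar",
    "2024-05-01T10:06:30Z ERROR java.net.SocketTimeoutException: read timed out",
    "2024-05-01T10:07:02Z ERROR upstream returned 502",
]

# Same pattern as the parse step in the CloudWatch query above.
ERROR_TYPE = re.compile(r"([\w.]+Exception)")


def bucket_start(ts: str, minutes: int = 5) -> str:
    """Floor an ISO-8601 UTC timestamp to the start of its N-minute bucket."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    floored = dt.replace(minute=dt.minute - dt.minute % minutes, second=0)
    return floored.strftime("%H:%M")


def summarize(lines):
    """Return (total errors, errors per time bucket, errors per type)."""
    total = 0
    per_bucket = Counter()
    per_type = Counter()
    for line in lines:
        ts, level, message = line.split(" ", 2)
        if level != "ERROR":
            continue
        total += 1
        per_bucket[bucket_start(ts)] += 1
        match = ERROR_TYPE.search(message)
        per_type[match.group(1) if match else "unclassified"] += 1
    return total, dict(per_bucket), dict(per_type)


total, per_bucket, per_type = summarize(LINES)
print(total, per_bucket, per_type)
```

The three return values map onto the first three questions above: total volume, rate over time, and unique error types.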
Step 2: Identify Patterns
Look for correlations:
Temporal patterns:
- Did errors start at a specific time?
- Is there periodicity (every hour, every day)?
- Correlation with deployments or traffic spikes?
Service patterns:
- Is one service the source?
- Is the error propagating across services?
Error patterns:
- What’s the most frequent error?
- Are errors clustered or distributed?
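Two of the temporal questions above (is the rate rising, and when did errors start) can be answered mechanically from per-bucket counts. This is an illustrative sketch only: the function names, the 20% tolerance, the 3x onset factor, and the sample counts are assumptions chosen for the example, not thresholds the skill prescribes.

```python
def classify_trend(counts, tolerance=0.2):
    """Compare the mean of the second half of the window with the first half."""
    if len(counts) < 2:
        return "stable"
    mid = len(counts) // 2
    first = sum(counts[:mid]) / mid
    second = sum(counts[mid:]) / (len(counts) - mid)
    if second > first * (1 + tolerance):
        return "increasing"
    if second < first * (1 - tolerance):
        return "decreasing"
    return "stable"


def find_onset(buckets, factor=3.0):
    """Return the first bucket whose count exceeds `factor` times the mean of
    all earlier buckets (floored at 1.0), or None if no bucket does."""
    seen = []
    for label, count in buckets:
        if seen:
            baseline = sum(seen) / len(seen)
            if count > factor * max(baseline, 1.0):
                return label
        seen.append(count)
    return None


# Hypothetical 5-minute error counts.
BUCKETS = [("10:00", 2), ("10:05", 3), ("10:10", 2),
           ("10:15", 14), ("10:20", 19), ("10:25", 17)]

trend = classify_trend([count for _, count in BUCKETS])
onset = find_onset(BUCKETS)
print(trend, onset)
```

The onset bucket is the timestamp to line up against deployments and traffic changes in Step 4.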
Step 3: Sample Strategically
Only NOW read actual log samples:
Sample from anomalies:
- Get logs from the peak error time
- Get logs from normal time for comparison
Sample by error type:
- Get examples of each distinct error type
- Limit to 5-10 per type
Sample around events:
- Logs before/after a deployment
- Logs around a specific incident timestamp
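The per-type cap can be enforced in code so that one noisy error type cannot crowd out the rest. A minimal sketch, assuming events have already been reduced to `(error_type, message)` pairs; `sample_by_type` and the sample data are hypothetical.

```python
from collections import defaultdict


def sample_by_type(events, per_type_limit=5):
    """Keep at most `per_type_limit` messages for each distinct error type,
    in the order they were seen."""
    samples = defaultdict(list)
    for error_type, message in events:
        if len(samples[error_type]) < per_type_limit:
            samples[error_type].append(message)
    return dict(samples)


# Hypothetical events: one frequent error type and one rare one.
EVENTS = [("TimeoutException", f"timeout calling payments (attempt {i})") for i in range(8)]
EVENTS += [
    ("NullPointerException", "NPE in OrderMapper"),
    ("NullPointerException", "NPE in CartMapper"),
]

samples = sample_by_type(EVENTS, per_type_limit=5)
for error_type, messages in samples.items():
    print(error_type, len(messages))
```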
Step 4: Correlate with Events
Connect logs to system changes:
# Use git_log to find recent deployments
git_log --since="2 hours ago"
# Use get_deployment_history for K8s
get_deployment_history deployment=api-server
# Compare log patterns before/after changes
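The before/after comparison in the last comment can be made concrete by counting errors in equal windows on either side of a change. This is a sketch under assumptions: the deployment time and error timestamps are invented for illustration, and `compare_around` is a hypothetical helper rather than one of the skill's tools.

```python
from datetime import datetime, timedelta


def compare_around(event_time, error_times, window=timedelta(minutes=30)):
    """Count errors in the window before and the window after an event."""
    before = sum(1 for t in error_times if event_time - window <= t < event_time)
    after = sum(1 for t in error_times if event_time <= t < event_time + window)
    return before, after


# Hypothetical deployment time and error timestamps.
DEPLOYED_AT = datetime(2024, 5, 1, 12, 0)
ERROR_TIMES = [
    datetime(2024, 5, 1, 11, 40),
    datetime(2024, 5, 1, 11, 55),
    datetime(2024, 5, 1, 12, 2),
    datetime(2024, 5, 1, 12, 5),
    datetime(2024, 5, 1, 12, 10),
    datetime(2024, 5, 1, 12, 29),
    datetime(2024, 5, 1, 12, 45),  # outside the 30-minute window
]

before, after = compare_around(DEPLOYED_AT, ERROR_TIMES)
print(f"before={before} after={after}")
```

A jump in the "after" count is a correlation, not proof; it tells you which change to inspect first.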
Platform-Specific Tips
CloudWatch Insights
Best practices:
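# Always bound the time window. Logs Insights takes it from the query's time range
# (console time picker, or startTime/endTime on StartQuery); ago() is Kusto syntax, not a Logs Insights function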
# Use parse for structured extraction
parse @message /status=(?<status>\d+)/
# Aggregate before displaying
stats count(*) as hits by status | sort hits desc | limit 10
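When a Logs Insights query is submitted programmatically, the time window and result limit are passed as request parameters alongside the query string. The sketch below only builds those parameters; `build_start_query_params` and the log group name are hypothetical, while the keys are the ones accepted by the CloudWatch Logs `StartQuery` operation (times in epoch seconds).

```python
import time


def build_start_query_params(log_group, query, lookback_seconds=3600, limit=10, now=None):
    """Build keyword arguments for CloudWatch Logs StartQuery with a bounded window."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - lookback_seconds,  # epoch seconds
        "endTime": end,                       # epoch seconds
        "queryString": query,
        "limit": limit,
    }


QUERY = (
    "parse @message /status=(?<status>\\d+)/ "
    "| stats count(*) as hits by status "
    "| sort hits desc"
)

# The log group name is a placeholder; `now` is pinned so the output is reproducible.
params = build_start_query_params("/aws/lambda/example-fn", QUERY, now=1_700_000_000)
print(params)

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("logs").start_query(**params)
```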
Common queries:
# Latency distribution
filter @type = "REPORT"
| stats avg(@duration) as avg,
pct(@duration, 95) as p95,
pct(@duration, 99) as p99
# Error messages with context
filter @message like /ERROR/
| fields @timestamp, @message
| sort @timestamp desc
| limit 20
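For intuition about what the latency query reports, the same summary can be computed over a local list of durations. This sketch uses the nearest-rank percentile definition, which is an assumption and may differ slightly from how CloudWatch computes `pct`; the durations are synthetic.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the
    data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic durations in milliseconds: 1, 2, ..., 100.
durations = list(range(1, 101))

avg = sum(durations) / len(durations)
p95 = percentile(durations, 95)
p99 = percentile(durations, 99)
print(avg, p95, p99)
```

Comparing the average with p95 and p99 shows whether slowness affects most requests or only a tail.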
Datadog Logs
Query syntax:
# Filter by service and status
service:api-gateway status:error
# Field queries
@http.status_code:>=500
# Wildcard
@error.message:*timeout*
# Time comparison
service:api (now-1h TO now) vs (now-25h TO now-24h)
Kubernetes Logs
Use get_pod_logs wisely:
- Always specify tail_lines (default: 100)
- Filter to specific containers in multi-container pods
- Use get_pod_events first for crashes/restarts
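Outside the agent's `get_pod_logs` tool, the same constraints can be applied to plain `kubectl logs`. The sketch below is an assumed equivalent, not the tool's implementation: it only builds the command line, using the standard `--tail`, `--container`, and `--namespace` flags, and the pod and namespace names are placeholders.

```python
def kubectl_logs_command(pod, namespace, container=None, tail_lines=100):
    """Build a bounded `kubectl logs` invocation as an argument list."""
    if tail_lines <= 0:
        raise ValueError("tail_lines must be positive; unbounded reads are an anti-pattern")
    cmd = ["kubectl", "logs", pod, "--namespace", namespace, f"--tail={tail_lines}"]
    if container:
        # Needed for multi-container pods, otherwise kubectl picks a default container.
        cmd += ["--container", container]
    return cmd


# Placeholder pod, namespace, and container names.
cmd = kubectl_logs_command("api-server-abc123", "prod", container="app", tail_lines=200)
print(" ".join(cmd))

# To execute it: subprocess.run(cmd, capture_output=True, text=True, check=True)
```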
Anti-Patterns to Avoid
- Dumping all logs – Never request unbounded log queries
- Starting with samples – Always get statistics first
- Ignoring time windows – Narrow to incident window
- Missing correlation – Always connect to deployments/changes
- Single-service focus – Check upstream/downstream services
Investigation Template
## Log Analysis Report
### Statistics
- Time window: [start] to [end]
- Total log volume: X events
- Error count: Y events (Z%)
- Error rate trend: [increasing/stable/decreasing]
### Top Error Types
1. [ErrorType1]: N occurrences - [description]
2. [ErrorType2]: M occurrences - [description]
### Temporal Pattern
- Errors started at: [timestamp]
- Correlation: [deployment X / traffic spike / external event]
### Sample Errors
[Quote 2-3 representative error messages]
### Root Cause Hypothesis
[Based on patterns, what's the likely cause?]
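The Statistics section of the template can be filled from the numbers gathered in Step 1, which avoids hand-computing the error percentage. A minimal sketch; `render_statistics` and the figures passed to it are hypothetical.

```python
def render_statistics(start, end, total_events, error_events, trend):
    """Render the Statistics section of the report template as Markdown."""
    error_pct = 100 * error_events / total_events if total_events else 0.0
    return "\n".join([
        "### Statistics",
        f"- Time window: {start} to {end}",
        f"- Total log volume: {total_events} events",
        f"- Error count: {error_events} events ({error_pct:.1f}%)",
        f"- Error rate trend: {trend}",
    ])


# Hypothetical figures for illustration.
section = render_statistics("10:00", "11:00", 12000, 300, "increasing")
print(section)
```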