observability-logging

Install command

npx skills add https://github.com/wojons/skills --skill observability-logging


Observability Logging

Use logs as a core component of a comprehensive observability strategy, integrating them with metrics, traces, alerts, and dashboards to understand system behavior in depth and support day-to-day operations.

When to use me

Use this skill when:

Building comprehensive observability platforms
Integrating logs with metrics and tracing for full observability
Designing alerting and monitoring systems based on log patterns
Creating dashboards that combine log-derived insights with other telemetry
Implementing SLO/SLA monitoring using log data
Building incident response workflows based on log analysis
Establishing operational excellence practices
Designing on-call procedures and runbooks
Implementing predictive maintenance using log patterns
Building self-healing systems based on observability signals

What I do

1. Log-Driven Metrics

Extract metrics from logs (error rates, latency percentiles, throughput)
Create derived metrics from log patterns and correlations
Implement log-based counters for business and operational events
Calculate Service Level Indicators (SLIs) from log data
Monitor Service Level Objectives (SLOs) using log-derived metrics
Implement burn rate alerts for error budget consumption
Create trend analysis from historical log patterns
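
A minimal sketch of deriving metrics from structured logs, assuming each entry is a dict with a `level` field and an optional `duration_ms` field (both field names are illustrative, not defined by this skill). It uses the nearest-rank method for the percentile, which is one of several common definitions.

```python
import math
from collections import Counter


def derive_metrics(log_entries, window_seconds):
    """Derive error rate, p95 latency, and throughput from structured logs."""
    total = len(log_entries)
    levels = Counter(entry.get("level", "INFO") for entry in log_entries)
    errors = levels["ERROR"]
    latencies = sorted(e["duration_ms"] for e in log_entries if "duration_ms" in e)

    p95 = None
    if latencies:
        # Nearest-rank percentile: smallest sample with at least 95% of samples at or below it
        rank = math.ceil(0.95 * len(latencies))
        p95 = latencies[rank - 1]

    return {
        "error_rate": errors / total if total else 0.0,
        "p95_latency_ms": p95,
        "throughput_rps": total / window_seconds if window_seconds > 0 else 0.0,
    }


# Illustrative sample covering a 2-second window
sample = [
    {"level": "INFO", "duration_ms": 40},
    {"level": "INFO", "duration_ms": 55},
    {"level": "ERROR", "duration_ms": 900},
    {"level": "INFO", "duration_ms": 60},
]
metrics = derive_metrics(sample, window_seconds=2)
```

With so few samples the p95 is simply the slowest request; real windows need far more entries for the percentile to be meaningful.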

2. Log-Enhanced Tracing

Enrich traces with log context for deeper insights
Correlate trace spans with log events for complete request understanding
Implement log-based span creation for legacy or untraced systems
Use logs to fill tracing gaps in distributed systems
Create unified observability views combining logs and traces
Implement log-to-trace linking for seamless investigation
Use trace context in logs for correlation and analysis
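
One way to put trace context into logs, sketched with Python's standard `logging` module. The `current_trace` context variable is a hypothetical stand-in for whatever a tracing SDK exposes, and the logger name and field values are made up for illustration.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace context; a tracing SDK would normally supply this.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace_id/span_id onto every log record."""

    def filter(self, record):
        ctx = current_trace.get() or {}
        record.trace_id = ctx.get("trace_id")
        record.span_id = ctx.get("span_id")
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including the trace context."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


# Write to an in-memory stream so the example is easy to inspect
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("user-service-example")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

current_trace.set({"trace_id": "abc123", "span_id": "span-1"})
logger.info("Cache miss for user:123")

entry = json.loads(stream.getvalue())
```

Because every line carries `trace_id`, a log search on that value returns the same request a trace viewer shows.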

3. Alerting & Monitoring

Design log-based alerts for critical patterns and anomalies
Implement alert deduplication and correlation across log sources
Create escalation policies based on log pattern severity
Design alert routing to appropriate teams and individuals
Implement alert enrichment with log context for faster diagnosis
Create suppression rules for known issues and maintenance windows
Monitor alert effectiveness and adjust thresholds based on historical data
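
Deduplication and suppression can be sketched as a small gate in front of the notifier. The class name `AlertGate`, the 300-second window, and the `(service, pattern)` key are assumptions chosen for illustration; timestamps are plain seconds.

```python
from dataclasses import dataclass, field


@dataclass
class AlertGate:
    """Decide whether an alert should notify, given dedup and suppression rules."""

    dedup_window: float = 300.0
    suppressed: set = field(default_factory=set)
    _last_fired: dict = field(default_factory=dict)

    def should_fire(self, service, pattern, now):
        key = (service, pattern)
        # Suppression rule: known issue or maintenance window
        if key in self.suppressed:
            return False
        # Deduplication: same alert fired recently
        last = self._last_fired.get(key)
        if last is not None and now - last < self.dedup_window:
            return False
        self._last_fired[key] = now
        return True


gate = AlertGate(dedup_window=300)
gate.suppressed.add(("db", "connection_timeout"))

results = [
    gate.should_fire("api", "http_500", now=0),            # first occurrence
    gate.should_fire("api", "http_500", now=120),          # inside the dedup window
    gate.should_fire("api", "http_500", now=400),          # window elapsed
    gate.should_fire("db", "connection_timeout", now=0),   # suppressed
]
```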

4. Dashboard & Visualization

Create operational dashboards combining logs, metrics, and traces
Design service health dashboards with log-derived health indicators
Implement real-time log streaming visualizations
Create trend dashboards showing log pattern evolution
Design incident investigation dashboards with correlated data
Implement customizable views for different stakeholder needs
Create predictive dashboards using machine learning on log data

5. Incident Response

Design log-driven runbooks for common issues
Implement automated remediation based on log patterns
Create incident timelines from log correlation
Design post-mortem analysis using comprehensive log data
Implement blameless retrospectives with observability data
Create knowledge bases from resolved incidents and log patterns
Design escalation procedures based on observability signals
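
A rough sketch of building an incident timeline from logs. It assumes entries carry ISO 8601 `timestamp`, `level`, `service`, and `message` fields; the `lookback` parameter and the sample entries are illustrative.

```python
from datetime import datetime, timedelta


def build_timeline(log_entries, incident_start, lookback=timedelta(minutes=5)):
    """Return WARN/ERROR entries from shortly before the incident onward, oldest first."""
    window_start = incident_start - lookback
    relevant = [
        e for e in log_entries
        if e["level"] in ("WARN", "ERROR")
        and datetime.fromisoformat(e["timestamp"]) >= window_start
    ]
    relevant.sort(key=lambda e: datetime.fromisoformat(e["timestamp"]))
    return [f'{e["timestamp"]} [{e["service"]}] {e["message"]}' for e in relevant]


# Illustrative entries from several services, deliberately out of order
logs = [
    {"timestamp": "2026-02-26T10:02:10", "level": "ERROR", "service": "api",
     "message": "API call to /api/users failed with 500 error"},
    {"timestamp": "2026-02-26T10:01:55", "level": "ERROR", "service": "db",
     "message": "Database connection timeout after 5000ms"},
    {"timestamp": "2026-02-26T10:01:50", "level": "INFO", "service": "cache",
     "message": "Cache miss for user:123"},
    {"timestamp": "2026-02-26T09:30:00", "level": "WARN", "service": "api",
     "message": "Slow response"},
]
timeline = build_timeline(logs, incident_start=datetime(2026, 2, 26, 10, 2, 0))
```

Here the database timeout sorts ahead of the API failure, which is the ordering a post-mortem needs to separate cause from symptom.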

Observability Pillars Integration

Logs + Metrics + Traces = Full Observability

Example: API Service Observability

Logs (What happened):
- "API call to /api/users failed with 500 error"
- "Database connection timeout after 5000ms"
- "Cache miss for user:123"

Metrics (How much/how often):
- Error rate: 5.2%
- P95 latency: 245ms
- Throughput: 1,234 requests/second

Traces (Where in the flow):
- Request flow: API Gateway → Auth Service → User Service → Database
- Time spent: 45ms in Auth, 120ms in User Service, 80ms in Database
- Bottleneck identified: Database query in User Service

Unified Data Model

observability_data:
  logs:
    source: application, infrastructure, audit
    format: structured (JSON)
    fields: [timestamp, level, service, message, context]
    
  metrics:
    source: logs (derived), application (direct), infrastructure
    types: counter, gauge, histogram, summary
    dimensions: [service, endpoint, status_code, user_type]
    
  traces:
    source: instrumentation, log-derived
    context: trace_id, span_id, parent_span_id
    attributes: [service.name, operation.name, duration, status]
    
  correlations:
    log_to_metric: "error logs → error rate metric"
    log_to_trace: "trace_id field links logs to traces"
    metric_to_trace: "high latency metric → trace analysis"
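
The `log_to_trace` correlation above can be sketched as a join on `trace_id`. The record shapes and the `attach_logs_to_traces` helper are assumptions for illustration; a real platform would usually do this join in its query layer.

```python
from collections import defaultdict


def attach_logs_to_traces(traces, logs):
    """Group log entries by trace_id and attach them to the matching trace."""
    logs_by_trace = defaultdict(list)
    for entry in logs:
        trace_id = entry.get("trace_id")
        if trace_id:  # logs without trace context cannot be correlated
            logs_by_trace[trace_id].append(entry)
    return [
        {**trace, "logs": logs_by_trace.get(trace["trace_id"], [])}
        for trace in traces
    ]


traces = [
    {"trace_id": "t1", "duration_ms": 245},
    {"trace_id": "t2", "duration_ms": 30},
]
logs = [
    {"trace_id": "t1", "level": "ERROR", "message": "Database connection timeout after 5000ms"},
    {"trace_id": "t1", "level": "INFO", "message": "Cache miss for user:123"},
    {"level": "INFO", "message": "entry without trace context"},
]
unified = attach_logs_to_traces(traces, logs)
```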

Log-Driven SLO Monitoring

Error Budget Calculation from Logs

def calculate_error_budget_from_logs(logs, slo_target, time_window):
    """
    Calculate error budget consumption from log data.

    Args:
        logs: List of log entries (dicts); an entry counts as a failure
            when its 'status' field equals 'error'
        slo_target: SLO target as a fraction (e.g., 0.999 for 99.9%)
        time_window: Length in seconds of the window the logs cover; the
            caller is expected to pass only logs from this window

    Returns:
        dict with request counts, rates, and error budget figures.
        'error_budget_consumption' is a fraction (1.0 = budget fully spent).
    """
    total_requests = len(logs)
    actual_errors = sum(1 for log in logs if log.get('status') == 'error')
    successful_requests = total_requests - actual_errors

    success_rate = successful_requests / total_requests if total_requests > 0 else 1.0
    error_rate = 1.0 - success_rate

    # Error budget: the number of failures the SLO tolerates in this window
    allowed_errors = (1.0 - slo_target) * total_requests
    if allowed_errors > 0:
        error_budget_consumption = actual_errors / allowed_errors
    else:
        # No budget (no traffic, or a 100% target): only real errors overspend it
        error_budget_consumption = float('inf') if actual_errors > 0 else 0.0

    return {
        'time_window': time_window,
        'total_requests': total_requests,
        'successful_requests': successful_requests,
        'success_rate': success_rate,
        'error_rate': error_rate,
        'slo_target': slo_target,
        'allowed_errors': allowed_errors,
        'actual_errors': actual_errors,
        'error_budget_consumption': error_budget_consumption,
        'error_budget_remaining': max(0.0, 1.0 - error_budget_consumption)
    }

Burn Rate Alerting

alerting:
  error_budget_burn_rate:
    # Alert when burning error budget too quickly
    - name: "high_error_budget_burn_rate"
      condition: "error_budget_burn_rate > 10"
      # Burning 10x faster than allowed
      window: "1h"
      severity: "critical"
    
    - name: "medium_error_budget_burn_rate"
      condition: "error_budget_burn_rate > 2"
      # Burning 2x faster than allowed
      window: "6h"
      severity: "warning"
    
  slo_violation:
    - name: "slo_violation_imminent"
      condition: "error_budget_remaining < 0.1"
      # Less than 10% error budget remaining
      window: "7d"
      severity: "warning"
    
    - name: "slo_violation_occurred"
      condition: "success_rate < slo_target"
      # Actual violation occurring
      window: "1h"
      severity: "critical"
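
The `error_budget_burn_rate` value used in these conditions is not defined in this document. A common definition, assumed here, is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO period. The sketch below applies the two thresholds from the rules above; the function names are illustrative.

```python
def burn_rate(error_count, request_count, slo_target):
    """Ratio of the observed error rate to the error rate the SLO allows."""
    if request_count == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0:
        # A 100% target leaves no budget: any error burns it infinitely fast
        return float("inf") if error_count else 0.0
    return (error_count / request_count) / allowed_error_rate


def classify(burn_1h, burn_6h):
    """Apply the thresholds from the rules above: >10 over 1h, >2 over 6h."""
    if burn_1h > 10:
        return "critical"
    if burn_6h > 2:
        return "warning"
    return "ok"


# 150 errors in 10,000 requests over the last hour against a 99.9% SLO
fast = burn_rate(150, 10_000, 0.999)
# 30 errors in 60,000 requests over the last six hours
slow = burn_rate(30, 60_000, 0.999)
severity = classify(fast, slow)
```

In this sample the one-hour error rate of 1.5% is about 15 times the allowed 0.1%, so the critical rule fires even though the six-hour window looks healthy.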

Examples

# Extract metrics from logs for SLO monitoring
npm run observability:extract-metrics -- --slo-target 0.999 --window 7d --output slo-metrics.json

# Create unified observability dashboard
npm run observability:create-dashboard -- --services "api,auth,db" --data-sources "logs,metrics,traces"

# Design log-based alerting rules
npm run observability:design-alerts -- --patterns "error_rate > 5%,latency_p95 > 1000ms" --escalation-policy "team-rotation"

# Implement incident response workflow
npm run observability:incident-workflow -- --trigger "error_spike" --actions "page,create-incident,notify-slack"

# Correlate logs with metrics and traces
npm run observability:correlate -- --time-range "last-1h" --output correlation-analysis.json

Output format

Observability Platform Configuration:

observability:
  data_sources:
    logs:
      collection: [filebeat, fluentd, otel-collector]
      processing: [parsing, enrichment, correlation]
      storage: [elasticsearch, s3]
      
    metrics:
      collection: [prometheus, otel-collector]
      processing: [aggregation, derivation]
      storage: [prometheus, thanos]
      
    traces:
      collection: [otel-collector, jaeger-agent]
      processing: [sampling, enrichment]
      storage: [jaeger, tempo]
  
  correlation:
    fields:
      trace_id: ["trace_id", "trace.id", "X-Trace-Id"]
      service: ["service", "service.name", "component"]
      user_id: ["user_id", "user.id", "userId"]
    
    rules:
      - when: "log.level == 'ERROR'"
        then: "increment_metric('errors_total', labels=log.labels)"
      - when: "trace.duration > 1000"
        then: "log.warning('slow_trace', trace_attributes)"
      - when: "metric.name == 'error_rate' and metric.value > 0.05"
        then: "create_alert('high_error_rate', severity='warning')"
  
  dashboards:
    - name: "Service Health"
      panels:
        - type: "timeseries"
          title: "Error Rate"
          query: "rate(error_logs_total[5m])"
        - type: "histogram"
          title: "Request Latency"
          query: "histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))"
        - type: "log_stream"
          title: "Recent Errors"
          query: "level:ERROR"
    
    - name: "Business Metrics"
      panels:
        - type: "counter"
          title: "User Signups"
          query: "log.message:'User signed up'"
        - type: "timeseries"
          title: "Payment Success Rate"
          query: "successful_payments / total_payments"
  
  alerting:
    rules:
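      - alert: "HighErrorRate"
        # rate() alone yields errors per second; dividing by the request rate gives
        # a ratio comparable to 0.05. The denominator assumes the _count series of
        # the request_duration_seconds histogram used in the dashboard above.
        expr: "sum(rate(error_logs_total[5m])) / sum(rate(request_duration_seconds_count[5m])) > 0.05"
        for: "5m"
        labels:
          severity: "critical"
        annotations:
          summary: "High error rate detected"
          description: "Error ratio is {{ $value | humanizePercentage }}, above the 5% threshold"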
        
      - alert: "SLOErrorBudgetBurn"
        expr: "error_budget_burn_rate > 10"
        for: "1h"
        labels:
          severity: "critical"
        annotations:
          summary: "Error budget burning too fast"
          description: "Error budget burn rate is {{ $value }}x faster than allowed"
  
  incident_response:
    workflows:
      - trigger: "alert.severity == 'critical'"
        actions:
          - "create_incident"
          - "page_on_call"
          - "notify_slack('#alerts')"
          - "start_zoom_war_room"
      
      - trigger: "incident.created"
        actions:
          - "gather_observability_data"
          - "correlate_logs_metrics_traces"
          - "suggest_runbooks"
          - "update_status_page"
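
The `correlation.fields` mapping in the configuration above lists several aliases per field. A minimal sketch of applying it, assuming records are flat dicts and that dotted names such as `trace.id` are literal keys rather than nested paths:

```python
# Canonical field name -> accepted aliases, copied from the configuration above
CORRELATION_FIELDS = {
    "trace_id": ["trace_id", "trace.id", "X-Trace-Id"],
    "service": ["service", "service.name", "component"],
    "user_id": ["user_id", "user.id", "userId"],
}


def normalize(record):
    """Return a copy with the first matching alias copied to its canonical field name."""
    normalized = dict(record)
    for canonical, aliases in CORRELATION_FIELDS.items():
        for alias in aliases:
            if alias in record:
                normalized[canonical] = record[alias]
                break
    return normalized


# Illustrative record using non-canonical names for all three fields
raw = {"X-Trace-Id": "abc123", "component": "auth", "userId": 42, "message": "login ok"}
clean = normalize(raw)
```

Normalizing at ingestion keeps later joins and dashboard queries on a single field name per concept.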

Observability Maturity Assessment:

Observability Maturity Assessment
═════════════════════════════════
Organization: Example Corp
Assessment Date: 2026-02-26
Overall Score: 72/100

Pillar Scores:
- Logging: 85/100 (Structured, correlated, well-managed)
- Metrics: 65/100 (Basic metrics, limited derivation)
- Tracing: 56/100 (Partial implementation, gaps in coverage)
- Alerting: 70/100 (Effective but could be smarter)
- Visualization: 74/100 (Good dashboards, could be more unified)

Integration Assessment:
✅ Logs include trace context (trace_id, span_id)
✅ Metrics derived from logs (error rates, throughput)
⚠️ Traces not fully correlated with logs (60% coverage)
⚠️ Alerting not using derived SLO metrics
✅ Dashboards combine multiple data sources

Gap Analysis:
1. Missing: Unified observability data model
2. Missing: Automated correlation across pillars
3. Missing: Predictive analytics on observability data
4. Missing: Self-healing based on observability signals
5. Missing: Comprehensive SLO monitoring

Observability ROI Analysis:
- Current MTTR (Mean Time To Resolution): 45 minutes
- Target MTTR with improved observability: 15 minutes
- Estimated reduction in incident impact: $15,000/month
- Estimated improvement in developer productivity: 20%
- Estimated reduction in on-call burden: 30%

Recommendations:
1. HIGH PRIORITY: Implement unified observability data model
2. HIGH PRIORITY: Improve trace coverage and correlation
3. MEDIUM PRIORITY: Implement SLO-based alerting
4. MEDIUM PRIORITY: Add predictive analytics capabilities
5. LOW PRIORITY: Explore self-healing automation

Implementation Roadmap:
- Phase 1 (1 month): Unified data model and correlation
- Phase 2 (2 months): SLO monitoring and alerting
- Phase 3 (3 months): Predictive analytics
- Phase 4 (6 months): Self-healing capabilities
- Ongoing: Continuous improvement and optimization

Notes

Observability is a journey, not a destination – continuous improvement is essential
Start with the questions you need to answer – design observability around those
Correlation across data sources is more valuable than individual source depth
Consider the cost of observability – balance value with expense
Involve all stakeholders – developers, operators, business teams, executives
Measure observability effectiveness – track MTTR, incident frequency, etc.
Document observability practices – runbooks, dashboards, alert definitions
Regularly review and refine – observability needs evolve with the system
Balance automation with human insight – don’t automate away necessary human judgment
Account for security and compliance – observability data may contain sensitive information
