data-exploration

📁 anthropics/knowledge-work-plugins 📅 13 days ago

Total installs: 48
Weekly installs: 49
Site-wide rank: #4404

Install command:
npx skills add https://github.com/anthropics/knowledge-work-plugins --skill data-exploration

Agent install distribution

opencode: 41
claude-code: 39
codex: 35
github-copilot: 26
replit: 22

Skill Documentation

Data Exploration Skill

Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas.

Data Profiling Methodology

Phase 1: Structural Understanding

Before analyzing any data, understand its structure:

Table-level questions:

  • How many rows and columns?
  • What is the grain (one row per what)?
  • What is the primary key? Is it unique?
  • When was the data last updated?
  • How far back does the data go?
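A minimal sketch of these checks in PostgreSQL, assuming a hypothetical table events with candidate primary key event_id and timestamp created_at:

-- Row count, PK uniqueness, and date range in one pass
-- (events, event_id, created_at are placeholder names)
SELECT
  COUNT(*)                 AS row_count,
  COUNT(DISTINCT event_id) AS distinct_ids,  -- equals row_count only if event_id is unique
  MIN(created_at)          AS earliest_record,
  MAX(created_at)          AS latest_record
FROM events;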

Column classification: Categorize each column as one of:

  • Identifier: Unique keys, foreign keys, entity IDs
  • Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
  • Metric: Quantitative values for measurement (revenue, count, duration, score)
  • Temporal: Dates and timestamps (created_at, updated_at, event_date)
  • Text: Free-form text fields (description, notes, name)
  • Boolean: True/false flags
  • Structural: JSON, arrays, nested structures

Phase 2: Column-Level Profiling

For each column, compute:

All columns:

  • Null count and null rate
  • Distinct count and cardinality ratio (distinct / total)
  • Most common values (top 5-10 with frequencies)
  • Least common values (bottom 5 to spot anomalies)
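A sketch of these per-column metrics, again assuming the hypothetical events table and a dimension column event_type (PostgreSQL):

-- Null rate and cardinality ratio for one column
SELECT
  COUNT(*) - COUNT(event_type)                                 AS null_count,
  ROUND(100.0 * (COUNT(*) - COUNT(event_type)) / COUNT(*), 2)  AS null_rate_pct,
  COUNT(DISTINCT event_type)                                   AS distinct_count,
  ROUND(COUNT(DISTINCT event_type)::numeric / COUNT(*), 4)     AS cardinality_ratio
FROM events;

-- Top 10 most common values with frequencies
SELECT event_type, COUNT(*) AS freq
FROM events
GROUP BY event_type
ORDER BY freq DESC
LIMIT 10;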

Numeric columns (metrics):

  • min, max, mean, median (p50)
  • standard deviation
  • percentiles: p1, p5, p25, p75, p95, p99
  • zero count
  • negative count (if unexpected)
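One way to compute these in a single pass, assuming a hypothetical numeric column revenue (PostgreSQL supports an array form of PERCENTILE_CONT):

-- Summary statistics and percentiles for a numeric column
SELECT
  MIN(revenue)    AS min_val,
  MAX(revenue)    AS max_val,
  AVG(revenue)    AS mean_val,
  STDDEV(revenue) AS std_dev,
  PERCENTILE_CONT(ARRAY[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99])
    WITHIN GROUP (ORDER BY revenue) AS percentiles,  -- includes the median (0.5)
  COUNT(*) FILTER (WHERE revenue = 0) AS zero_count,
  COUNT(*) FILTER (WHERE revenue < 0) AS negative_count
FROM events;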

String columns (dimensions, text):

  • min length, max length, avg length
  • empty string count
  • pattern analysis (do values follow a format?)
  • case consistency (all upper, all lower, mixed?)
  • leading/trailing whitespace count
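A sketch for a hypothetical text column description; note the case checks are heuristics (strings without letters match both):

-- Length stats, empty strings, whitespace, and case consistency
SELECT
  MIN(LENGTH(description))                                  AS min_len,
  MAX(LENGTH(description))                                  AS max_len,
  ROUND(AVG(LENGTH(description)), 1)                        AS avg_len,
  COUNT(*) FILTER (WHERE description = '')                  AS empty_count,
  COUNT(*) FILTER (WHERE description <> TRIM(description))  AS whitespace_count,
  COUNT(*) FILTER (WHERE description = UPPER(description))  AS all_upper_count,
  COUNT(*) FILTER (WHERE description = LOWER(description))  AS all_lower_count
FROM events;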

Date/timestamp columns:

  • min date, max date
  • null dates
  • future dates (if unexpected)
  • distribution by month/week
  • gaps in time series
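A sketch against the hypothetical created_at column; a month missing from the second result indicates a gap in the series:

-- Range, null, and future-date checks
SELECT
  MIN(created_at)                            AS min_date,
  MAX(created_at)                            AS max_date,
  COUNT(*) FILTER (WHERE created_at IS NULL) AS null_dates,
  COUNT(*) FILTER (WHERE created_at > NOW()) AS future_dates
FROM events;

-- Distribution by month
SELECT DATE_TRUNC('month', created_at) AS month, COUNT(*) AS rows_in_month
FROM events
WHERE created_at IS NOT NULL
GROUP BY 1
ORDER BY 1;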

Boolean columns:

  • true count, false count, null count
  • true rate
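For a hypothetical flag is_active:

-- True/false/null breakdown; AVG over the flag cast to int gives the true rate among non-nulls
SELECT
  COUNT(*) FILTER (WHERE is_active)         AS true_count,
  COUNT(*) FILTER (WHERE NOT is_active)     AS false_count,
  COUNT(*) FILTER (WHERE is_active IS NULL) AS null_count,
  ROUND(AVG(is_active::int), 4)             AS true_rate
FROM events;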

Phase 3: Relationship Discovery

After profiling individual columns:

  • Foreign key candidates: ID columns that might link to other tables
  • Hierarchies: Columns that form natural drill-down paths (country > state > city)
  • Correlations: Numeric columns that move together
  • Derived columns: Columns that appear to be computed from others
  • Redundant columns: Columns with identical or near-identical information
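To test a foreign key candidate, an anti-join sketch (events.user_id against a hypothetical users table; an orphan rate near zero supports the FK hypothesis):

-- Orphan check for a suspected FK
SELECT
  COUNT(*) FILTER (WHERE e.user_id IS NOT NULL AND u.id IS NULL) AS orphan_rows,
  ROUND(100.0 * COUNT(*) FILTER (WHERE e.user_id IS NOT NULL AND u.id IS NULL)
        / COUNT(*), 2)                                           AS orphan_rate_pct
FROM events e
LEFT JOIN users u ON u.id = e.user_id;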

Quality Assessment Framework

Completeness Score

Rate each column:

  • Complete (>99% non-null): Green
  • Mostly complete (95-99%): Yellow — investigate the nulls
  • Incomplete (80-95%): Orange — understand why and whether it matters
  • Sparse (<80%): Red — may not be usable without imputation
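A sketch that applies this rating in SQL; the column list is hardcoded for two hypothetical columns and would normally be generated from information_schema.columns:

-- Completeness rating per column
SELECT col,
       ROUND(100.0 * non_null / total, 2) AS pct_complete,
       CASE
         WHEN non_null::numeric / total >  0.99 THEN 'green'
         WHEN non_null::numeric / total >= 0.95 THEN 'yellow'
         WHEN non_null::numeric / total >= 0.80 THEN 'orange'
         ELSE 'red'
       END AS rating
FROM (
  SELECT 'revenue' AS col, COUNT(revenue) AS non_null, COUNT(*) AS total FROM events
  UNION ALL
  SELECT 'user_id', COUNT(user_id), COUNT(*) FROM events
) per_column;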

Consistency Checks

Look for:

  • Value format inconsistency: Same concept represented differently (“USA”, “US”, “United States”, “us”)
  • Type inconsistency: Numbers stored as strings, dates in various formats
  • Referential integrity: Foreign keys that don’t match any parent record
  • Business rule violations: Negative quantities, end dates before start dates, percentages > 100
  • Cross-column consistency: Status = “completed” but completed_at is null
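Several of these checks can run in one scan; a sketch over a hypothetical orders table with quantity, start_date/end_date, and status/completed_at columns:

-- Business-rule and cross-column violation counts
SELECT
  COUNT(*) FILTER (WHERE quantity < 0)          AS negative_quantities,
  COUNT(*) FILTER (WHERE end_date < start_date) AS inverted_date_ranges,
  COUNT(*) FILTER (WHERE status = 'completed' AND completed_at IS NULL)
    AS completed_without_timestamp
FROM orders;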

Accuracy Indicators

Red flags that suggest accuracy issues:

  • Placeholder values: 0, -1, 999999, “N/A”, “TBD”, “test”, “xxx”
  • Default values: Suspiciously high frequency of a single value
  • Stale data: updated_at shows no recent changes in an active system
  • Impossible values: Ages > 150, dates in the far future, negative durations
  • Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)
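Two quick screens, assuming the hypothetical revenue column: a frequency check for suspicious defaults, and a round-number-bias estimate:

-- Values whose frequency dwarfs everything else suggest a default, not a measurement
SELECT revenue, COUNT(*) AS freq,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_rows
FROM events
GROUP BY revenue
ORDER BY freq DESC
LIMIT 5;

-- Share of values divisible by 5; a share near 100% suggests estimation
SELECT ROUND(100.0 * COUNT(*) FILTER (WHERE revenue % 5 = 0) / COUNT(*), 2)
         AS pct_round_numbers
FROM events
WHERE revenue IS NOT NULL;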

Timeliness Assessment

  • When was the table last updated?
  • What is the expected update frequency?
  • Is there a lag between event time and load time?
  • Are there gaps in the time series?

Pattern Discovery Techniques

Distribution Analysis

For numeric columns, characterize the distribution:

  • Normal: Mean and median are close, bell-shaped
  • Skewed right: Long tail of high values (common for revenue, session duration)
  • Skewed left: Long tail of low values (less common)
  • Bimodal: Two peaks (suggests two distinct populations)
  • Power law: Few very large values, many small ones (common for user activity)
  • Uniform: Roughly equal frequency across range (often synthetic or random)
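A quick equal-width histogram sketch with WIDTH_BUCKET (hypothetical revenue column; assumes min < max, and values equal to the max land in overflow bucket 11):

-- 10-bin histogram over the observed range
WITH bounds AS (
  SELECT MIN(revenue) AS lo, MAX(revenue) AS hi FROM events
)
SELECT WIDTH_BUCKET(revenue, lo, hi, 10) AS bucket, COUNT(*) AS freq
FROM events, bounds
WHERE revenue IS NOT NULL
GROUP BY bucket
ORDER BY bucket;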

Temporal Patterns

For time series data, look for:

  • Trend: Sustained upward or downward movement
  • Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
  • Day-of-week effects: Weekday vs. weekend differences
  • Holiday effects: Drops or spikes around known holidays
  • Change points: Sudden shifts in level or trend
  • Anomalies: Individual data points that break the pattern
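A day-of-week sketch against the hypothetical created_at column:

-- Volume by weekday; large weekday/weekend gaps show up immediately
SELECT TO_CHAR(created_at, 'Dy')        AS day_of_week,
       EXTRACT(ISODOW FROM created_at) AS dow_num,  -- 1 = Monday ... 7 = Sunday
       COUNT(*)                        AS events
FROM events
WHERE created_at IS NOT NULL
GROUP BY 1, 2
ORDER BY dow_num;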

Segmentation Discovery

Identify natural segments by:

  • Finding categorical columns with 3-20 distinct values
  • Comparing metric distributions across segment values
  • Looking for segments with significantly different behavior
  • Testing whether segments are homogeneous or contain sub-segments
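A sketch comparing a metric across a hypothetical region dimension:

-- Segment comparison: size, mean, and median per segment value
SELECT region,
       COUNT(*)     AS n,
       AVG(revenue) AS mean_revenue,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) AS median_revenue
FROM events
GROUP BY region
ORDER BY mean_revenue DESC;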

Correlation Exploration

Between numeric columns:

  • Compute correlation matrix for all metric pairs
  • Flag strong correlations (|r| > 0.7) for investigation
  • Note: Correlation does not imply causation — flag this explicitly
  • Check for non-linear relationships (e.g., quadratic, logarithmic)
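PostgreSQL ships a Pearson correlation aggregate, CORR; a sketch over hypothetical metric pairs, including a log transform as a cheap non-linearity check:

-- Pairwise correlations; |r| > 0.7 warrants a closer look
SELECT
  CORR(revenue, duration) AS r_revenue_duration,
  CORR(revenue, score)    AS r_revenue_score
FROM events;

-- Correlating against LN(duration) probes a logarithmic relationship
SELECT CORR(revenue, LN(duration)) AS r_revenue_log_duration
FROM events
WHERE duration > 0;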

Schema Understanding and Documentation

Schema Documentation Template

When documenting a dataset for team use:

## Table: [schema.table_name]

**Description**: [What this table represents]
**Grain**: [One row per...]
**Primary Key**: [column(s)]
**Row Count**: [approximate, with date]
**Update Frequency**: [real-time / hourly / daily / weekly]
**Owner**: [team or person responsible]

### Key Columns

| Column | Type | Description | Example Values | Notes |
|--------|------|-------------|----------------|-------|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

### Relationships
- Joins to `users` on `user_id`
- Joins to `products` on `product_id`
- Parent of `event_details` (1:many on event_id)

### Known Issues
- [List any known data quality issues]
- [Note any gotchas for analysts]

### Common Query Patterns
- [Typical use cases for this table]

Schema Exploration Queries

When connected to a data warehouse, use these patterns to discover schema:

-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;

-- Table sizes (PostgreSQL)
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name

Lineage and Dependencies

When exploring an unfamiliar data environment:

  1. Start with the “output” tables (what reports or dashboards consume)
  2. Trace upstream: What tables feed into them?
  3. Identify raw/staging/mart layers
  4. Map the transformation chain from raw data to analytical tables
  5. Note where data is enriched, filtered, or aggregated