data-validation
npx skills add https://github.com/anthropics/knowledge-work-plugins --skill data-validation
Data Validation Skill
Pre-delivery QA checklist, common data analysis pitfalls, result sanity checking, and documentation standards for reproducibility.
Pre-Delivery QA Checklist
Run through this checklist before sharing any analysis with stakeholders.
Data Quality Checks
- Source verification: Confirmed which tables/data sources were used. Are they the right ones for this question?
- Freshness: Data is current enough for the analysis. Noted the “as of” date.
- Completeness: No unexpected gaps in time series or missing segments.
- Null handling: Checked null rates in key columns. Nulls are handled appropriately (excluded, imputed, or flagged).
- Deduplication: Confirmed no double-counting from bad joins or duplicate source records.
- Filter verification: All WHERE clauses and filters are correct. No unintended exclusions.
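Several of these checks can be run as quick queries before any deeper work. A minimal sketch, assuming a hypothetical orders table with order_id, customer_id, amount, and order_date columns:

```sql
-- Freshness: when does the data actually end?
SELECT MAX(order_date) AS latest_order_date
FROM orders;

-- Null rates in key columns.
SELECT
  COUNT(*)                                             AS total_rows,
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_ids,
  SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END)      AS null_amounts
FROM orders;

-- Duplicates: more than one row per supposed primary key?
SELECT order_id, COUNT(*) AS copies
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```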
Calculation Checks
- Aggregation logic: GROUP BY includes all non-aggregated columns. Aggregation level matches the analysis grain.
- Denominator correctness: Rate and percentage calculations use the right denominator. Denominators are non-zero.
- Date alignment: Comparisons use the same time period length. Partial periods are excluded or noted.
- Join correctness: JOIN types are appropriate (INNER vs LEFT). Many-to-many joins haven’t inflated counts.
- Metric definitions: Metrics match how stakeholders define them. Any deviations are noted.
- Subtotals sum: Parts add up to the whole where expected. If they don’t, explain why (e.g., overlap).
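Two of these checks are easy to script: that subtotals reconcile with the overall total, and that rate denominators can never be zero. A sketch using the same hypothetical orders table plus a visits table with a boolean converted column:

```sql
-- Subtotals vs. the whole: segment revenue should reconcile with total revenue.
SELECT
  (SELECT SUM(amount) FROM orders)                   AS overall_total,
  (SELECT SUM(segment_total)
     FROM (SELECT segment, SUM(amount) AS segment_total
             FROM orders
            GROUP BY segment) s)                     AS sum_of_segments;

-- Rates with a guarded denominator.
SELECT
  SUM(CASE WHEN converted THEN 1 ELSE 0 END) * 1.0
    / NULLIF(COUNT(*), 0) AS conversion_rate
FROM visits;
```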
Reasonableness Checks
- Magnitude: Numbers are in a plausible range. Revenue isn’t negative. Percentages fall between 0% and 100%.
- Trend continuity: No unexplained jumps or drops in time series.
- Cross-reference: Key numbers match other known sources (dashboards, previous reports, finance data).
- Order of magnitude: Total revenue is in the right ballpark. User counts match known figures.
- Edge cases: What happens at the boundaries? Empty segments, zero-activity periods, new entities.
Presentation Checks
- Chart accuracy: Bar charts start at zero. Axes are labeled. Scales are consistent across panels.
- Number formatting: Appropriate precision. Consistent currency/percentage formatting. Thousands separators where needed.
- Title clarity: Titles state the insight, not just the metric. Date ranges are specified.
- Caveat transparency: Known limitations and assumptions are stated explicitly.
- Reproducibility: Someone else could recreate this analysis from the documentation provided.
Common Data Analysis Pitfalls
Join Explosion
The problem: A many-to-many join silently multiplies rows, inflating counts and sums.
How to detect:
-- Check row count before and after join
SELECT COUNT(*) FROM table_a; -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id; -- 3,500 (uh oh)
How to prevent:
- Always check row counts after joins
- If counts increase, investigate the join relationship (is it really 1:1 or 1:many?)
- Use COUNT(DISTINCT a.id) instead of COUNT(*) when counting entities through joins (see the sketch below)
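A sketch of those checks on the same hypothetical table_a / table_b pair — it quantifies the fan-out and shows which keys cause it:

```sql
-- How bad is the fan-out, and what are you actually counting?
SELECT
  COUNT(*)             AS joined_rows,       -- inflated when table_b has many rows per a.id
  COUNT(DISTINCT a.id) AS distinct_entities  -- usually what you mean to report
FROM table_a a
JOIN table_b b ON a.id = b.a_id;

-- Which keys in table_b explode the join?
SELECT a_id, COUNT(*) AS rows_per_key
FROM table_b
GROUP BY a_id
HAVING COUNT(*) > 1
ORDER BY rows_per_key DESC;
```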
Survivorship Bias
The problem: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.
Examples:
- Analyzing user behavior of “current users” misses churned users
- Looking at “companies using our product” ignores those who evaluated and left
- Studying properties of “successful” outcomes without “unsuccessful” ones
How to prevent: Ask “who is NOT in this dataset?” before drawing conclusions.
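One way to keep the full population in view is to start from a historical table and left-join the survivors. A sketch, assuming hypothetical signups and current_users tables:

```sql
-- Count everyone who ever signed up, then see how many survive today.
SELECT
  s.signup_month,
  COUNT(*)                     AS all_signups,
  COUNT(cu.user_id)            AS still_active,        -- counts only rows with a match
  COUNT(*) - COUNT(cu.user_id) AS churned_or_deleted
FROM signups s
LEFT JOIN current_users cu ON cu.user_id = s.user_id
GROUP BY s.signup_month
ORDER BY s.signup_month;
```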
Incomplete Period Comparison
The problem: Comparing a partial period to a full period.
Examples:
- “January revenue is $500K vs. December’s $800K” — but January isn’t over yet
- “This week’s signups are down” — checked on Wednesday, comparing to a full prior week
How to prevent: Always filter to complete periods, or compare same-day-of-month / same-number-of-days.
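A like-for-like sketch: compare the first N days of the current month with the first N days of the prior month (Postgres-style date functions; hypothetical orders table):

```sql
-- N = today's day-of-month, applied to both months.
SELECT
  DATE_TRUNC('month', order_date) AS month,
  SUM(amount)                     AS revenue_first_n_days
FROM orders
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'
  AND EXTRACT(DAY FROM order_date) <= EXTRACT(DAY FROM CURRENT_DATE)
GROUP BY 1
ORDER BY 1;
```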
Denominator Shifting
The problem: The denominator changes between periods, making rates incomparable.
Examples:
- Conversion rate improves because you changed how you count “eligible” users
- Churn rate changes because the definition of “active” was updated
How to prevent: Use consistent definitions across all compared periods. Note any definition changes.
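Writing the eligibility rule once, in a single CTE, and reusing it for every period is a simple way to keep the denominator stable. A sketch with hypothetical signups and conversions tables:

```sql
WITH eligible AS (
  -- The one documented eligibility definition lives here and only here.
  SELECT user_id, DATE_TRUNC('month', signup_date) AS cohort_month
  FROM signups
  WHERE is_test_account = FALSE
)
SELECT
  e.cohort_month,
  COUNT(DISTINCT e.user_id) AS eligible_users,
  COUNT(DISTINCT c.user_id) AS converted_users,
  COUNT(DISTINCT c.user_id) * 1.0
    / NULLIF(COUNT(DISTINCT e.user_id), 0) AS conversion_rate
FROM eligible e
LEFT JOIN conversions c ON c.user_id = e.user_id
GROUP BY e.cohort_month
ORDER BY e.cohort_month;
```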
Average of Averages
The problem: Averaging pre-computed averages gives wrong results when group sizes differ.
Example:
- Group A: 100 users, average revenue $50
- Group B: 10 users, average revenue $200
- Wrong: Average of averages = ($50 + $200) / 2 = $125
- Right: Weighted average = (100*$50 + 10*$200) / 110 = $63.64
How to prevent: Always aggregate from raw data. Never average pre-aggregated averages.
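A sketch of both situations, using a hypothetical user_revenue table (one row per user) and a group_summary table that only holds per-group aggregates:

```sql
-- Best case: aggregate from raw rows, which weights every user equally.
SELECT AVG(revenue) AS overall_avg
FROM user_revenue;

-- If only per-group averages exist, weight them by group size.
SELECT SUM(group_users * avg_revenue) * 1.0 / SUM(group_users) AS weighted_avg
FROM group_summary;
```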
Timezone Mismatches
The problem: Different data sources use different timezones, causing misalignment.
Examples:
- Event timestamps in UTC vs. user-facing dates in local time
- Daily rollups that use different cutoff times
How to prevent: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.
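A sketch of the standardization step, assuming a hypothetical events table with a timezone-aware event_ts column (Postgres-style AT TIME ZONE; other warehouses have equivalents):

```sql
-- Roll up by UTC date, not by whatever the session timezone happens to be.
SELECT
  (event_ts AT TIME ZONE 'UTC')::date AS event_date_utc,
  COUNT(*)                            AS events
FROM events
GROUP BY 1
ORDER BY 1;
```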
Selection Bias in Segmentation
The problem: Segments are defined by the outcome you’re measuring, creating circular logic.
Examples:
- “Users who completed onboarding have higher retention” — obviously, they self-selected
- “Power users generate more revenue” — they became power users BY generating revenue
How to prevent: Define segments based on pre-treatment characteristics, not outcomes.
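One way to keep the segment definition out of the outcome window is to build segments from a pre-period and measure the outcome afterwards. A sketch with hypothetical events and user_revenue tables:

```sql
WITH pre_period_usage AS (
  -- Segment membership is decided from January activity only.
  SELECT user_id, COUNT(*) AS january_events
  FROM events
  WHERE event_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
  GROUP BY user_id
),
segments AS (
  SELECT user_id,
         CASE WHEN january_events >= 20 THEN 'heavy' ELSE 'light' END AS segment
  FROM pre_period_usage
)
-- The outcome is measured in a later window, so it cannot define the segment.
SELECT s.segment, AVG(r.february_revenue) AS avg_february_revenue
FROM segments s
JOIN user_revenue r ON r.user_id = s.user_id
GROUP BY s.segment;
```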
Result Sanity Checking
Magnitude Checks
For any key number in your analysis, verify it passes the “smell test”:
| Metric Type | Sanity Check |
|---|---|
| User counts | Does this match known MAU/DAU figures? |
| Revenue | Is this in the right order of magnitude vs. known ARR? |
| Conversion rates | Is this between 0% and 100%? Does it match dashboard figures? |
| Growth rates | Is 50%+ MoM growth realistic, or is there a data issue? |
| Averages | Is the average reasonable given what you know about the distribution? |
| Percentages | Do segment percentages sum to ~100%? |
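The last row of the table is easy to automate: compute each segment's share and confirm the shares add up to roughly 100%. A sketch on the hypothetical orders table:

```sql
WITH shares AS (
  SELECT
    segment,
    SUM(amount) * 100.0 / SUM(SUM(amount)) OVER () AS pct_of_revenue
  FROM orders
  GROUP BY segment
)
SELECT SUM(pct_of_revenue) AS total_pct  -- should land near 100, barring overlap or rounding
FROM shares;
```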
Cross-Validation Techniques
- Calculate the same metric two different ways and verify they match
- Spot-check individual records — pick a few specific entities and trace their data manually
- Compare to known benchmarks — match against published dashboards, finance reports, or prior analyses
- Reverse engineer — if total revenue is X, does per-user revenue times user count approximately equal X?
- Boundary checks — what happens when you filter to a single day, a single user, or a single category? Are those micro-results sensible?
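The first technique in this list can look like the sketch below: derive the same revenue figure along two independent paths and compare them (hypothetical orders and order_items tables):

```sql
-- Two independent routes to "total revenue"; they should agree within rounding.
SELECT
  (SELECT SUM(amount) FROM orders)                     AS revenue_from_orders,
  (SELECT SUM(quantity * unit_price) FROM order_items) AS revenue_from_line_items;
-- A gap usually means a missing filter, a refund-handling difference, or a bad join.
```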
Red Flags That Warrant Investigation
- Any metric that changed by more than 50% period-over-period without an obvious cause
- Counts or sums that are exact round numbers (suggests a filter or default value issue)
- Rates exactly at 0% or 100% (may indicate incomplete data)
- Results that perfectly confirm the hypothesis (reality is usually messier)
- Identical values across time periods or segments (suggests the query is ignoring a dimension)
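The first red flag can be scanned for automatically. A sketch assuming a hypothetical monthly_metrics table with month and metric_value columns:

```sql
WITH with_prev AS (
  SELECT
    month,
    metric_value,
    LAG(metric_value) OVER (ORDER BY month) AS prev_value
  FROM monthly_metrics
)
SELECT
  month,
  metric_value,
  prev_value,
  metric_value * 1.0 / NULLIF(prev_value, 0) - 1 AS pct_change
FROM with_prev
WHERE ABS(metric_value * 1.0 / NULLIF(prev_value, 0) - 1) > 0.5;  -- more than a 50% move
```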
Documentation Standards for Reproducibility
Analysis Documentation Template
Every non-trivial analysis should include:
## Analysis: [Title]
### Question
[The specific question being answered]
### Data Sources
- Table: [schema.table_name] (as of [date])
- Table: [schema.other_table] (as of [date])
- File: [filename] (source: [where it came from])
### Definitions
- [Metric A]: [Exactly how it's calculated]
- [Segment X]: [Exactly how membership is determined]
- [Time period]: [Start date] to [end date], [timezone]
### Methodology
1. [Step 1 of the analysis approach]
2. [Step 2]
3. [Step 3]
### Assumptions and Limitations
- [Assumption 1 and why it's reasonable]
- [Limitation 1 and its potential impact on conclusions]
### Key Findings
1. [Finding 1 with supporting evidence]
2. [Finding 2 with supporting evidence]
### SQL Queries
[All queries used, with comments]
### Caveats
- [Things the reader should know before acting on this]
Code Documentation
For any code (SQL, Python) that may be reused:
"""
Analysis: Monthly Cohort Retention
Author: [Name]
Date: [Date]
Data Source: events table, users table
Last Validated: [Date] -- results matched dashboard within 2%
Purpose:
Calculate monthly user retention cohorts based on first activity date.
Assumptions:
- "Active" means at least one event in the month
- Excludes test/internal accounts (user_type != 'internal')
- Uses UTC dates throughout
Output:
Cohort retention matrix with cohort_month rows and months_since_signup columns.
Values are retention rates (0-100%).
"""
Version Control for Analyses
- Save queries and code in version control (git) or a shared docs system
- Note the date of the data snapshot used
- If an analysis is re-run with updated data, document what changed and why
- Link to prior versions of recurring analyses for trend comparison