data-model-design
npx skills add https://github.com/hollandkevint/data-product-operator --skill data-model-design
Agent 安装分布
Skill 文档
Star Schema as Default
For analytics data products, default to star schema with strategic denormalization:
Fact tables contain events at the lowest useful grain:
- One row per event (transaction, visit, measurement, interaction)
- Foreign keys to dimension tables
- Numeric measures (amount, count, duration)
- Timestamps at the grain of the analysis
Dimension tables contain context:
- Descriptive attributes for filtering and grouping
- Human-readable labels alongside codes
- Hierarchies for drill-down (region -> state -> city)
NEVER fully denormalize into One Big Table. Many-to-many relationships cause exponential row growth. A patient with 10 conditions and 5 medications creates 50 rows instead of 15.
ALWAYS start with the query patterns. What questions will consumers ask? Design the schema to make those queries simple. If 80% of queries filter by date and group by category, those should be the primary dimensions.
Slowly Changing Dimensions (SCD)
Type 1 – Overwrite the old value. Use when history doesn’t matter (correcting a typo in a name).
Type 2 – Add a new row with effective dates. Use when you need to track what was true at a point in time (patient address at time of visit, product price at time of sale). Add effective_start_date, effective_end_date, and is_current flag.
Type 3 – Add a column for the previous value. Use when you only need one level of history (current_category, previous_category).
Default to Type 2 for any dimension where the business asks “what was it at the time of X?” Start with Type 1 for everything else and upgrade when the need emerges.
Architecture Decision Records
For every non-obvious modeling decision, write a lightweight ADR:
- Context: What situation prompted the decision?
- Options: What alternatives were considered?
- Decision: What was chosen and why?
- Consequences: What are the tradeoffs?
Keep ADRs in the repo alongside the schema. Future team members will ask “why is it modeled this way?” The ADR answers before they have to ask.
Design Validation
Before building, validate with limited data:
- Load a representative sample (1-5% of volume)
- Run the top 10 expected queries
- Verify results match expectations
- Check query performance against SLA
- Confirm grain is correct (no unexpected row multiplication)
Source Schema Evaluation
You probably won’t design the OLTP schema. But you’ll need to evaluate whether it’s a reliable source for your data product.
Quick checklist before building on a source system:
- Primary keys defined and enforced (not just implied by convention)
- Foreign keys constrained at the database level (not just in the application)
- Indexes on FK columns (missing indexes = slow extracts)
- NOT NULL on required fields (nullable everything = garbage in)
- Migrations are reversible (no destructive ALTER TABLE without a rollback plan)
Anti-patterns to flag immediately:
VARCHAR(255)on every string column (signals no thought about data types)FLOATorDOUBLEfor money (useDECIMALor integer cents)- Missing FK constraints (“the app handles it” means orphaned rows)
- Dates stored as strings (“2024-01-15” in a
VARCHARbreaks sorting and comparison)
See schema-patterns.md for common dimensional modeling patterns.