scalable-data-schema
npx skills add https://github.com/sunnypatneedi/claude-starter-kit --skill scalable-data-schema
Scalable Data Schema Design
Design database schemas that grow with your application: from prototype to millions of users.
When to Use
Use this skill when:
- Starting a new project and designing initial data models
- Existing schema is causing performance issues
- Planning to scale from thousands to millions of records
- Migrating between database systems
- Data model needs to evolve without breaking changes
Core Principles
1. Start Simple, Plan for Complex
Prototype Phase (< 10K records):
- Optimize for development speed
- Denormalize for convenience
- Use simple indexes
- Single database instance
Growth Phase (10K – 1M records):
- Normalize to reduce redundancy
- Add strategic indexes
- Introduce read replicas
- Monitor query performance
Scale Phase (> 1M records):
- Partition large tables
- Shard by logical boundaries
- Add caching layers
- Consider polyglot persistence
2. Query Patterns Drive Schema
Anti-pattern: Design schema, then write queries.
Best practice: Identify queries, then design schema.
Workflow:
1. List top 10 most frequent queries
2. List top 5 most expensive queries
3. Design indexes to support both
4. Benchmark with realistic data volumes
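Step 4 can start on a laptop: before trusting a candidate index, check that it actually changes the query plan. A minimal sketch in Python with SQLite standing in for PostgreSQL's `EXPLAIN`; the table and index names are illustrative:

```python
import sqlite3

# Illustrative benchmark harness: confirm a candidate index changes the plan
# before trusting it. Table/index names are assumptions for the demo.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, status TEXT)")
con.executemany(
    "INSERT INTO orders (user_id, status) VALUES (?, ?)",
    [(i % 100, "pending" if i % 5 else "shipped") for i in range(10_000)],
)

def plan(sql: str) -> str:
    # The last column of each EXPLAIN QUERY PLAN row describes the step
    return " | ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE user_id = 42"
before = plan(query)  # full table scan: no usable index yet
con.execute("CREATE INDEX idx_orders_user_id ON orders(user_id)")
after = plan(query)   # now an index search
print(before)
print(after)
```

The same before/after comparison with `EXPLAIN ANALYZE` on realistic data volumes is the PostgreSQL equivalent.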
3. Plan for Evolution
Schema versioning strategy:
- Additive changes only (new columns, new tables)
- Deprecate old fields instead of deleting
- Use database migration tools (Flyway, Liquibase, Alembic)
- Version your schema in git
Design Patterns
Pattern 1: Event Sourcing for Audit Trails
Use when: You need complete history of changes
-- Anti-pattern: Update in place
UPDATE users SET email = 'new@example.com' WHERE id = 123;
-- Lost: Previous email, who changed it, when
-- Better: Event log
CREATE TABLE user_events (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL,
event_type VARCHAR(50) NOT NULL, -- 'email_changed', 'profile_updated'
event_data JSONB NOT NULL, -- {old: 'old@...', new: 'new@...'}
created_at TIMESTAMPTZ DEFAULT NOW(),
created_by BIGINT -- Who made the change
);
CREATE INDEX idx_user_events_user_id ON user_events(user_id);
CREATE INDEX idx_user_events_created_at ON user_events(created_at DESC);
-- Current state is derived from events
CREATE VIEW current_users AS
SELECT DISTINCT ON (user_id)
user_id,
event_data->>'email' as email,
created_at as last_updated
FROM user_events
WHERE event_type = 'email_changed'
ORDER BY user_id, created_at DESC;
Trade-offs:
- ✅ Complete audit trail
- ✅ Time-travel queries
- ❌ More storage
- ❌ Queries more complex
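The derive-state-from-events idea can be sketched end to end. This is a portable demo, not the PostgreSQL DDL above: it uses SQLite with JSON text instead of JSONB, and orders by `id` so same-second timestamps cannot produce ties:

```python
import json
import sqlite3

# Event-sourcing sketch: append-only log, current state derived by replay.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE user_events (
        id INTEGER PRIMARY KEY,
        user_id INT NOT NULL,
        event_type TEXT NOT NULL,
        event_data TEXT NOT NULL,               -- JSON payload
        created_at TEXT DEFAULT (datetime('now'))
    )
""")

def record_email_change(user_id, old, new):
    # Append-only: never UPDATE, always insert a new event
    con.execute(
        "INSERT INTO user_events (user_id, event_type, event_data) "
        "VALUES (?, 'email_changed', ?)",
        (user_id, json.dumps({"old": old, "new": new})),
    )

record_email_change(123, None, "first@example.com")
record_email_change(123, "first@example.com", "new@example.com")

# Current state is derived: the latest email_changed event per user wins
row = con.execute(
    """
    SELECT json_extract(event_data, '$.new')
    FROM user_events
    WHERE user_id = ? AND event_type = 'email_changed'
    ORDER BY id DESC LIMIT 1
    """,
    (123,),
).fetchone()
print(row[0])  # new@example.com
```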
Pattern 2: Soft Deletes for Recovery
Use when: Data is valuable or deletion is rare
-- Basic soft delete
CREATE TABLE products (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
price DECIMAL(10,2),
deleted_at TIMESTAMPTZ, -- NULL = active, NOT NULL = deleted
deleted_by BIGINT -- Who deleted it
);
-- Index for active records only (partial index)
CREATE INDEX idx_products_active ON products(id) WHERE deleted_at IS NULL;
-- Queries
SELECT * FROM products WHERE deleted_at IS NULL; -- Active only
SELECT * FROM products WHERE deleted_at IS NOT NULL; -- Deleted only
Trade-offs:
- ✅ Easy recovery
- ✅ Audit compliance
- ❌ Every query needs WHERE deleted_at IS NULL
- ❌ Unique constraints more complex
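The full lifecycle (soft delete, filtered reads, recovery) can be run in a few lines. SQLite supports the same partial-index trick; the schema below is a trimmed, illustrative version of the products table above:

```python
import sqlite3

# Soft-delete lifecycle sketch.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        deleted_at TEXT                          -- NULL = active
    )
""")
con.execute("CREATE INDEX idx_products_active ON products(id) WHERE deleted_at IS NULL")
con.executemany("INSERT INTO products (name) VALUES (?)", [("widget",), ("gadget",)])

# "Delete" stamps the row instead of removing it
con.execute("UPDATE products SET deleted_at = datetime('now') WHERE name = 'widget'")
active = [r[0] for r in con.execute("SELECT name FROM products WHERE deleted_at IS NULL")]
print(active)  # ['gadget']

# Recovery is a single UPDATE
con.execute("UPDATE products SET deleted_at = NULL WHERE name = 'widget'")
restored = con.execute(
    "SELECT COUNT(*) FROM products WHERE deleted_at IS NULL").fetchone()[0]
print(restored)  # 2
```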
Pattern 3: Polymorphic Associations
Use when: Multiple entity types share a relationship
-- Anti-pattern: Nullable foreign keys
CREATE TABLE comments (
id BIGSERIAL PRIMARY KEY,
text TEXT NOT NULL,
post_id BIGINT, -- Comment on post
photo_id BIGINT, -- OR comment on photo
video_id BIGINT, -- OR comment on video
-- Only one should be set, but DB can't enforce this
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Better: Polymorphic with type
CREATE TABLE comments (
id BIGSERIAL PRIMARY KEY,
text TEXT NOT NULL,
commentable_type VARCHAR(50) NOT NULL, -- 'Post', 'Photo', 'Video'
commentable_id BIGINT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
-- Ensure only valid types
CONSTRAINT valid_commentable_type
CHECK (commentable_type IN ('Post', 'Photo', 'Video'))
);
CREATE INDEX idx_comments_commentable
ON comments(commentable_type, commentable_id);
-- Query
SELECT * FROM comments
WHERE commentable_type = 'Post'
AND commentable_id = 123;
Trade-offs:
- ✅ Flexible, extensible
- ❌ Can’t use foreign key constraints
- ❌ Joins more complex
When to avoid: Use separate tables if types have very different fields.
Pattern 4: Hierarchical Data (Adjacency List vs Nested Sets)
Adjacency List (simple, common):
CREATE TABLE categories (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
parent_id BIGINT REFERENCES categories(id),
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_categories_parent ON categories(parent_id);
-- Get immediate children (fast)
SELECT * FROM categories WHERE parent_id = 5;
-- Get all descendants (requires recursive CTE - slower)
WITH RECURSIVE subcategories AS (
SELECT * FROM categories WHERE id = 5
UNION ALL
SELECT c.* FROM categories c
JOIN subcategories s ON c.parent_id = s.id
)
SELECT * FROM subcategories;
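The recursive CTE above runs unchanged on SQLite, which makes it easy to try locally; a small sketch with hypothetical category data:

```python
import sqlite3

# Adjacency-list tree plus the recursive descendants query from above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT, parent_id INT)")
con.executemany("INSERT INTO categories VALUES (?, ?, ?)", [
    (1, "Electronics", None),
    (2, "Computers", 1),
    (3, "Laptops", 2),
    (4, "Phones", 1),
])
rows = con.execute("""
    WITH RECURSIVE subcategories AS (
        SELECT * FROM categories WHERE id = 1          -- root of the subtree
        UNION ALL
        SELECT c.* FROM categories c
        JOIN subcategories s ON c.parent_id = s.id
    )
    SELECT name FROM subcategories
""").fetchall()
print(sorted(r[0] for r in rows))  # ['Computers', 'Electronics', 'Laptops', 'Phones']
```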
Nested Sets (fast reads, slow writes):
CREATE TABLE categories (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
lft INT NOT NULL, -- Left boundary
rgt INT NOT NULL, -- Right boundary
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_categories_lft_rgt ON categories(lft, rgt);
-- Get all descendants (fast!)
SELECT * FROM categories
WHERE lft > 10 AND rgt < 20;
-- Get path to root (fast!)
SELECT * FROM categories c1
JOIN categories c2 ON c1.lft BETWEEN c2.lft AND c2.rgt
WHERE c1.id = 15
ORDER BY c2.lft;
Choose:
- Adjacency List: Frequent updates, shallow hierarchies
- Nested Sets: Mostly reads, deep hierarchies
- Materialized Path: Mix of both (store the path as a string, e.g. /1/5/12/)
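The Materialized Path option is not shown in SQL above; a minimal sketch with illustrative data, where "all descendants" becomes a single prefix match (index the path column on large tables):

```python
import sqlite3

# Materialized-path sketch: each row stores its full ancestor path.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT, path TEXT NOT NULL)")
con.executemany("INSERT INTO categories VALUES (?, ?, ?)", [
    (1,  "Electronics", "/1/"),
    (5,  "Computers",   "/1/5/"),
    (12, "Laptops",     "/1/5/12/"),
    (7,  "Phones",      "/1/7/"),
])
# Descendants of Computers (id 5): prefix scan, no recursion needed
desc = [r[0] for r in con.execute(
    "SELECT name FROM categories WHERE path LIKE '/1/5/%' AND id != 5")]
print(desc)  # ['Laptops']
```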
Pattern 5: Time-Series Data (Partitioning)
Use when: Data grows continuously over time
-- Basic time-series table
CREATE TABLE metrics (
id BIGSERIAL,
metric_name VARCHAR(100) NOT NULL,
value DOUBLE PRECISION NOT NULL,
recorded_at TIMESTAMPTZ NOT NULL,
tags JSONB,
PRIMARY KEY (id, recorded_at) -- Include partition key
) PARTITION BY RANGE (recorded_at);
-- Create partitions (monthly)
CREATE TABLE metrics_2026_01 PARTITION OF metrics
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE metrics_2026_02 PARTITION OF metrics
FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
-- Index metric_name on each partition (in PostgreSQL 11+, creating the index on the parent table propagates it to every partition)
CREATE INDEX idx_metrics_2026_01_name ON metrics_2026_01(metric_name);
CREATE INDEX idx_metrics_2026_02_name ON metrics_2026_02(metric_name);
-- Queries automatically use correct partition
SELECT AVG(value) FROM metrics
WHERE recorded_at BETWEEN '2026-01-15' AND '2026-01-20'
AND metric_name = 'cpu_usage';
Benefits:
- ✅ Query only relevant partitions (partition pruning)
- ✅ Drop old partitions easily (instant delete)
- ✅ Parallel queries across partitions
- ✅ Manage storage per partition
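PostgreSQL routes rows to partitions natively; the mechanics are easier to see done by hand. A sketch that routes each write to a per-month table and implements retention as a table drop, with SQLite standing in for a database without declarative partitioning (names are illustrative):

```python
import sqlite3

# Manual monthly partitioning: one table per month, pruning by construction.
con = sqlite3.connect(":memory:")

def partition_for(recorded_at: str) -> str:
    # '2026-01-15T10:00' -> 'metrics_2026_01'
    return f"metrics_{recorded_at[:4]}_{recorded_at[5:7]}"

def insert_metric(name, value, recorded_at):
    table = partition_for(recorded_at)
    con.execute(f"CREATE TABLE IF NOT EXISTS {table} "
                "(metric_name TEXT, value REAL, recorded_at TEXT)")
    con.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (name, value, recorded_at))

insert_metric("cpu_usage", 0.42, "2026-01-15T10:00")
insert_metric("cpu_usage", 0.58, "2026-02-01T09:00")

# A January query touches only the January partition
avg = con.execute("SELECT AVG(value) FROM metrics_2026_01").fetchone()[0]
print(avg)  # 0.42

# Retention: drop the whole partition instead of a row-by-row DELETE
con.execute("DROP TABLE metrics_2026_01")
```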
Indexing Strategy
When to Index
Index when:
- Column used in WHERE clause frequently
- Column used in JOIN conditions
- Column used in ORDER BY
- Foreign keys (even if DB doesn’t auto-create)
Don’t index when:
- Table is small (< 1000 rows)
- Column has low cardinality (few distinct values)
- Column is updated frequently
- Index would be larger than table
Index Types
B-Tree (default, most common):
CREATE INDEX idx_users_email ON users(email);
-- Use for: =, <, >, <=, >=, BETWEEN, IN, LIKE 'prefix%'
Hash (exact matches only):
CREATE INDEX idx_users_email ON users USING HASH(email);
-- Use for: = only (faster than B-Tree for equality)
GIN (Generalized Inverted Index – for arrays/JSONB):
CREATE INDEX idx_products_tags ON products USING GIN(tags);
-- Use for: JSONB @>, arrays && or @> operators
Partial Index (index subset of rows):
CREATE INDEX idx_active_users ON users(email) WHERE deleted_at IS NULL;
-- Smaller, faster for common queries
Composite Index (multiple columns):
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
-- Order matters! Works for:
-- WHERE user_id = X
-- WHERE user_id = X ORDER BY created_at DESC
-- Doesn't help:
-- WHERE created_at = Y (user_id not in query)
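The leftmost-prefix rule above can be verified directly from the query plan. A sketch using SQLite's `EXPLAIN QUERY PLAN` (PostgreSQL's `EXPLAIN` shows the same effect; the schema is illustrative):

```python
import sqlite3

# Composite-index demo: serves user_id lookups, not a created_at-only filter.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, "
            "created_at TEXT, status TEXT)")
con.execute("CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC)")

def plan(sql: str) -> str:
    return " | ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

uses_index = plan("SELECT * FROM orders WHERE user_id = 1 ORDER BY created_at DESC")
no_index = plan("SELECT * FROM orders WHERE created_at = '2026-01-01'")
print(uses_index)  # index search on idx_orders_user_created
print(no_index)    # full table scan: leading column user_id is not constrained
```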
Index Maintenance
-- Find unused indexes (PostgreSQL)
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexname NOT LIKE '%_pkey';
-- Find duplicate indexes
SELECT tablename, array_agg(indexname)
FROM pg_indexes
GROUP BY tablename, indexdef
HAVING COUNT(*) > 1;
-- Rebuild bloated indexes (REINDEX ... CONCURRENTLY requires PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_users_email;
Schema Evolution
Additive-Only Migrations
Safe (zero downtime):
-- Add new nullable column (adding one with a DEFAULT is also fast in PostgreSQL 11+)
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- Add new table
CREATE TABLE user_preferences (...);
-- Add new index concurrently (PostgreSQL)
CREATE INDEX CONCURRENTLY idx_users_created ON users(created_at);
Risky (requires downtime or careful planning):
-- Drop column (breaks old code)
ALTER TABLE users DROP COLUMN old_field;
-- Rename column (breaks old code)
ALTER TABLE users RENAME COLUMN old_name TO new_name;
-- Change column type (may lock table)
ALTER TABLE users ALTER COLUMN age TYPE BIGINT;
Safe Migration Pattern
3-Step Deploy:
-- Step 1: Add new column (deploy, code ignores it)
ALTER TABLE users ADD COLUMN email_new VARCHAR(255);
-- Step 2: Backfill + dual write (deploy, code writes both)
UPDATE users SET email_new = email WHERE email_new IS NULL;
-- App code now writes to both email and email_new
-- Step 3: Switch over (deploy, code uses only email_new)
ALTER TABLE users DROP COLUMN email;
ALTER TABLE users RENAME COLUMN email_new TO email;
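The detail plain SQL cannot show is the application-side dual write in step 2. A compressed sketch of all three steps in one script (in production each step is a separate deploy); uses SQLite 3.35+ for `DROP COLUMN`, and the names are illustrative:

```python
import sqlite3

# Expand/contract (3-step) migration sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
con.execute("INSERT INTO users (email) VALUES ('old@example.com')")

# Step 1: additive change; currently deployed code ignores the new column
con.execute("ALTER TABLE users ADD COLUMN email_new TEXT")

# Step 2: backfill, then have the app write BOTH columns
con.execute("UPDATE users SET email_new = email WHERE email_new IS NULL")

def update_email(user_id, email):
    # Dual write during the transition window
    con.execute("UPDATE users SET email = ?, email_new = ? WHERE id = ?",
                (email, email, user_id))

update_email(1, "new@example.com")

# Step 3 (a later deploy, once no code reads the old column)
con.execute("ALTER TABLE users DROP COLUMN email")
con.execute("ALTER TABLE users RENAME COLUMN email_new TO email")
row = con.execute("SELECT email FROM users WHERE id = 1").fetchone()
print(row[0])  # new@example.com
```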
Versioned Schema
-- Track migrations
CREATE TABLE schema_migrations (
version VARCHAR(50) PRIMARY KEY,
applied_at TIMESTAMPTZ DEFAULT NOW()
);
-- Example migration file: 20260121_add_user_phone.sql
BEGIN;
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
INSERT INTO schema_migrations (version)
VALUES ('20260121_add_user_phone');
COMMIT;
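At their core, tools like Flyway and Alembic are a loop over this table. A minimal runner sketch (the `MIGRATIONS` dict is a stand-in for reading versioned .sql files from disk, not any tool's real API):

```python
import sqlite3

# Minimal migration runner: apply each migration at most once.
MIGRATIONS = {
    "20260101_create_users": "CREATE TABLE users (id INTEGER PRIMARY KEY)",
    "20260121_add_user_phone": "ALTER TABLE users ADD COLUMN phone TEXT",
}

def migrate(con):
    con.execute("""
        CREATE TABLE IF NOT EXISTS schema_migrations (
            version TEXT PRIMARY KEY,
            applied_at TEXT DEFAULT (datetime('now'))
        )
    """)
    applied = {r[0] for r in con.execute("SELECT version FROM schema_migrations")}
    for version, sql in sorted(MIGRATIONS.items()):   # timestamp prefix = order
        if version in applied:
            continue
        con.execute(sql)
        con.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))

con = sqlite3.connect(":memory:")
migrate(con)
migrate(con)  # second run is a no-op: already-applied versions are skipped
count = con.execute("SELECT COUNT(*) FROM schema_migrations").fetchone()[0]
print(count)  # 2
```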
Database Selection Guide
SQL (Relational)
PostgreSQL:
- ✅ Best for: Complex queries, ACID transactions, JSONB
- ✅ Advanced features: Full-text search, geospatial (PostGIS)
- Use when: You need strong consistency and rich query capabilities
MySQL/MariaDB:
- ✅ Best for: Read-heavy workloads, replication
- ✅ Widely supported, mature ecosystem
- Use when: You need simple replication or hosting constraints
SQLite:
- ✅ Best for: Embedded databases, single-user apps
- Use when: No separate DB server needed
NoSQL
MongoDB (Document):
- ✅ Best for: Flexible schema, rapid iteration
- ✅ Schema-less (actually schema-on-read)
- Use when: Schema changes frequently, hierarchical data
Redis (Key-Value):
- ✅ Best for: Caching, sessions, pub/sub
- Use when: Ultra-fast reads, temporary data
DynamoDB (Key-Value):
- ✅ Best for: Serverless, predictable performance
- Use when: AWS ecosystem, pay-per-request pricing
Cassandra (Wide-Column):
- ✅ Best for: Time-series, write-heavy workloads
- Use when: Multi-datacenter, massive scale
Decision Matrix
| Use Case | Database | Why |
|---|---|---|
| User accounts, transactions | PostgreSQL | ACID, relations |
| Product catalog (e-commerce) | PostgreSQL | Complex queries, inventory |
| Session storage | Redis | Fast, ephemeral |
| Activity feed | Cassandra | Write-heavy, time-series |
| Analytics events | ClickHouse | Columnar, OLAP |
| Document storage | MongoDB | Flexible schema |
| Geospatial queries | PostgreSQL + PostGIS | Best geo support |
Performance Optimization
Query Analysis
-- Analyze query plan (PostgreSQL)
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2026-01-01'
GROUP BY u.id, u.name;
-- Look for:
-- "Seq Scan" on large tables → add an index
-- "Hash Join" instead of "Nested Loop" → may need a different index
-- High "actual time" → slow step
Common Fixes
Problem: Slow JOIN on large tables
-- Before
SELECT * FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status = 'pending';
-- Fix: Index foreign key
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
Problem: Slow COUNT(*) on large table
-- Before (scans entire table)
SELECT COUNT(*) FROM orders;
-- Fix: Use estimate for large tables
SELECT reltuples::BIGINT AS estimate
FROM pg_class
WHERE relname = 'orders';
-- Or: Maintain counter
CREATE TABLE table_stats (
table_name VARCHAR(50) PRIMARY KEY,
row_count BIGINT,
updated_at TIMESTAMPTZ
);
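One way to keep that counter table current is a pair of triggers, so `COUNT(*)` never has to scan the big table. A sketch in SQLite (PostgreSQL triggers work similarly); note that the single counter row becomes a write hotspot under heavy concurrency:

```python
import sqlite3

# Trigger-maintained counter table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE table_stats (table_name TEXT PRIMARY KEY, row_count INTEGER);
    INSERT INTO table_stats VALUES ('orders', 0);

    CREATE TRIGGER orders_count_ins AFTER INSERT ON orders BEGIN
        UPDATE table_stats SET row_count = row_count + 1 WHERE table_name = 'orders';
    END;
    CREATE TRIGGER orders_count_del AFTER DELETE ON orders BEGIN
        UPDATE table_stats SET row_count = row_count - 1 WHERE table_name = 'orders';
    END;
""")
con.executemany("INSERT INTO orders (id) VALUES (?)", [(i,) for i in range(5)])
con.execute("DELETE FROM orders WHERE id = 0")
count = con.execute(
    "SELECT row_count FROM table_stats WHERE table_name = 'orders'").fetchone()[0]
print(count)  # 4
```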
Problem: N+1 queries
-- Before (pseudocode: 1 query, then 1 more query per user)
users = query("SELECT * FROM users LIMIT 10")
for user in users:
    orders = query("SELECT * FROM orders WHERE user_id = ?", user.id)
-- Fix: Eager load with JOIN or IN
SELECT u.*, o.* FROM users u
LEFT JOIN orders o ON u.id = o.user_id
LIMIT 10;
-- Or use IN for large sets
user_ids = [1, 2, 3, ...]
SELECT * FROM orders WHERE user_id IN (?, ?, ?)
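The IN-list fix in runnable form: one query for all users, with the placeholder list built safely rather than string-interpolating ids (Python with SQLite; schema and data are illustrative):

```python
import sqlite3

# Eager loading: replace N per-user queries with one IN query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT)")
con.executemany("INSERT INTO users (id) VALUES (?)", [(i,) for i in range(1, 4)])
con.executemany("INSERT INTO orders (user_id) VALUES (?)", [(1,), (1,), (3,)])

user_ids = [r[0] for r in con.execute("SELECT id FROM users LIMIT 10")]
placeholders = ",".join("?" * len(user_ids))       # "?,?,?"
orders_by_user = {}
for oid, uid in con.execute(
        f"SELECT id, user_id FROM orders WHERE user_id IN ({placeholders})",
        user_ids):
    orders_by_user.setdefault(uid, []).append(oid)
print({uid: len(v) for uid, v in orders_by_user.items()})  # {1: 2, 3: 1}
```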
Anti-Patterns to Avoid
| Anti-Pattern | Why It’s Bad | Better Approach |
|---|---|---|
| ENUM in column type | Hard to change | Lookup table with foreign key |
| Storing arrays as strings | Can’t query efficiently | Array column or join table |
| GUID/UUID as primary key | Fragmented indexes, 16 bytes | BIGSERIAL (8 bytes, sequential) |
| Premature optimization | Complexity without benefit | Start simple, optimize when needed |
| Missing foreign keys | Data integrity issues | Always use FK constraints |
| No created_at/updated_at | Can’t audit or debug | Add to all tables |
| Over-normalization | Too many JOINs | Denormalize for read patterns |
| Under-normalization | Data duplication, update anomalies | Normalize until 3NF, then denormalize strategically |
Output Format
When helping with schema design:
## Schema Analysis
### Current Pain Points
- [Specific issue 1]
- [Specific issue 2]
### Recommended Schema
[SQL DDL for new tables/changes]
### Indexes
[Recommended indexes with rationale]
### Migration Strategy
[Step-by-step migration plan]
### Trade-offs
- ✅ Advantages
- ❌ Disadvantages
### Expected Performance Impact
- Reads: [faster/slower, by how much]
- Writes: [faster/slower, by how much]
- Storage: [increase/decrease]
Integration
Works with:
- database-schema – Initial schema design
- api-design – Align schema with API needs
- systems-decompose – Feature-driven schema design
- data-infrastructure-at-scale – Infrastructure for scaled schemas