clickhouse-materialized-views
npx skills add https://github.com/dawiddutoit/custom-claude --skill clickhouse-materialized-views
Table of Contents
- Purpose
- Quick Start
- Core Concepts
- Instructions
- Step 1: Choose Your Aggregation Pattern
- Step 2: Select the Right Destination Engine
- Step 3: Design Your Aggregation Query
- Step 4: Implement State/Merge Functions
- Step 5: Chain Views for Multi-Level Aggregation
- Step 6: Handle Schema Evolution
- Step 7: Monitor Performance
- Step 8: Backfill Historical Data
- Examples & Patterns
- References
- Requirements
ClickHouse Materialized Views Skill
Purpose
Materialized views in ClickHouse are incremental triggers that automatically transform and aggregate incoming data in real-time. Unlike traditional databases, ClickHouse views process only new data as it arrives, enabling efficient streaming analytics, real-time dashboards, and continuous aggregation without scheduled batch jobs.
When to Use This Skill
Use when asked to:
- Build real-time data aggregation pipelines (“create real-time metrics”)
- Implement streaming analytics or continuous aggregation
- Create pre-aggregated tables for fast dashboard queries
- Set up automatic data transformation as events arrive
- Chain multi-level aggregations (hourly → daily → monthly)
Do NOT use when:
- Data is batch-processed infrequently (use scheduled jobs instead)
- Aggregation logic changes often (materialized views are harder to modify)
- Source data is small and queries are already fast
Quick Start
Create a simple real-time aggregation pipeline:
-- Step 1: Create source table (raw events)
CREATE TABLE events (
user_id UInt32,
event_type String,
timestamp DateTime,
revenue Decimal(10, 2)
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);
-- Step 2: Create destination table (aggregated)
CREATE TABLE daily_revenue (
date Date,
total_revenue Decimal(18, 2),
event_count UInt64
) ENGINE = SummingMergeTree()
ORDER BY date;
-- Step 3: Create materialized view (automatic aggregation)
CREATE MATERIALIZED VIEW daily_revenue_mv TO daily_revenue AS
SELECT
toDate(timestamp) as date,
SUM(revenue) as total_revenue,
COUNT() as event_count
FROM events
GROUP BY date;
-- Now every INSERT into events automatically updates daily_revenue
INSERT INTO events VALUES (12345, 'purchase', '2024-01-15 10:30:00', 99.99);
SELECT * FROM daily_revenue; -- Result: 2024-01-15 | 99.99 | 1
Instructions
Step 1: Choose Your Aggregation Pattern
Materialized views support three main patterns depending on your use case:
Incremental Views (Most Common)
- Process only new data as it arrives
- Best for streaming event data and continuous updates
- Use with SummingMergeTree, which collapses rows sharing a sorting key by summing numeric columns during background merges
Refreshable Views
- Recalculate entire view on a schedule (e.g., hourly)
- Best for complex queries that need full recalculation
- Syntax:
CREATE MATERIALIZED VIEW ... REFRESH EVERY 1 HOUR AS ...
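A fuller sketch of a refreshable view (table and column names are illustrative; refreshable views require ClickHouse 23.12+, where they were introduced as experimental):

```sql
-- Recomputed from scratch every hour, replacing the previous result
CREATE MATERIALIZED VIEW top_products_mv
REFRESH EVERY 1 HOUR
ENGINE = MergeTree()
ORDER BY total_revenue
AS SELECT
    product_id,
    SUM(revenue) AS total_revenue
FROM events
GROUP BY product_id;
```

Because the whole result is rebuilt on each refresh, the SELECT can freely use joins and other constructs that incremental views handle poorly.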
Populate-Based Views
- Backfill existing data when view is created
- Warning: Can be expensive on large tables
- Syntax:
CREATE MATERIALIZED VIEW ... POPULATE AS ...
See examples/aggregation-patterns.md for complete pattern examples.
Step 2: Select the Right Destination Engine
The destination table engine determines how your aggregated data behaves:
SummingMergeTree (Simple Aggregates)
- Use for additive aggregates such as COUNT and SUM (derive AVG as SUM/COUNT at query time, since averages are not summable)
- Automatically sums columns during background merges
- No duplicate rows in final results
- Example: Hourly sales totals, daily revenue
AggregatingMergeTree (Complex Aggregates)
- Use with State/Merge functions for uniq, quantile, topK
- Stores intermediate aggregation states
- Combine with uniqState(), quantileState(), topKState() in views
- Query with uniqMerge(), quantileMerge(), topKMerge() functions
- Example: Unique user counts, percentile calculations
ReplacingMergeTree (Deduplication)
- Use when you need to replace/update existing rows
- Maintains version column to determine latest record
- Example: User profile updates, entity state snapshots
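A minimal sketch of the deduplication pattern (names are illustrative). ReplacingMergeTree keeps the row with the highest value in the version column among rows sharing a sorting key; collapsing happens during background merges, so use FINAL (or argMax) at query time for exact results:

```sql
CREATE TABLE user_profiles (
    user_id UInt32,
    email String,
    updated_at DateTime  -- version column: the latest row wins
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

CREATE MATERIALIZED VIEW user_profiles_mv TO user_profiles AS
SELECT user_id, email, updated_at
FROM profile_events;

-- FINAL forces collapsing at read time (slower, but exact)
SELECT * FROM user_profiles FINAL WHERE user_id = 12345;
```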
MergeTree (Raw Events)
- Use when storing raw events without aggregation
- No automatic deduplication
- Example: Event audit trail, raw clickstream data
See references/engine-comparison.md for detailed comparison.
Step 3: Design Your Aggregation Query
Follow these principles:
GROUP BY Cardinality
- Keep cardinality moderate (not millions of unique combinations)
- Example: Good = GROUP BY date, product_id | Bad = GROUP BY user_id, timestamp
ORDER BY for Query Patterns
- Design ORDER BY for your most common query filters
- If queries filter by date first: ORDER BY (date, product_id)
- If queries filter by product first: ORDER BY (product_id, date)
Include Necessary Columns
- Include columns needed for post-aggregation filtering
- Store raw values alongside aggregates when needed for flexibility
- Example: Store timestamp and revenue alongside hourly aggregates
Use Conditional Aggregation
- Use countIf(), sumIf(), avgIf() for event filtering
- Filter within the aggregation rather than creating a separate view per condition
- Example: countIf(event_type = 'purchase') as purchase_count
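Putting these principles together, a hypothetical view (table and column names assumed) with moderate GROUP BY cardinality and conditional aggregation:

```sql
CREATE MATERIALIZED VIEW product_daily_mv TO product_daily AS
SELECT
    toDate(timestamp) AS date,
    product_id,
    COUNT() AS total_events,
    countIf(event_type = 'purchase') AS purchase_count,
    sumIf(revenue, event_type = 'purchase') AS purchase_revenue
FROM events
GROUP BY date, product_id;
```

One view with countIf/sumIf replaces several single-purpose views over the same source.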
See examples/aggregation-patterns.md for SQL patterns.
Step 4: Implement State/Merge Functions for Complex Aggregates
When using AggregatingMergeTree, use special State/Merge function pairs:
In Materialized View (State Functions)
CREATE MATERIALIZED VIEW user_stats_mv TO user_stats AS
SELECT
toDate(timestamp) as date,
uniqState(user_id) as unique_users,
quantileState(0.95)(response_time) as p95_latency
FROM events
GROUP BY date;
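The destination table must declare matching AggregateFunction columns; a sketch consistent with the view above, assuming user_id is UInt32 and response_time is Float64 (each column type must name the same function and argument types used in the State call):

```sql
CREATE TABLE user_stats (
    date Date,
    unique_users AggregateFunction(uniq, UInt32),
    p95_latency AggregateFunction(quantile(0.95), Float64)
) ENGINE = AggregatingMergeTree()
ORDER BY date;
```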
In Queries (Merge Functions)
SELECT
date,
uniqMerge(unique_users) as total_unique_users,
quantileMerge(0.95)(p95_latency) as latency_p95
FROM user_stats
GROUP BY date;
Common function pairs:
- uniq() → uniqState() / uniqMerge()
- sum() → sumState() / sumMerge()
- avg() → avgState() / avgMerge()
- quantile(0.95) → quantileState(0.95) / quantileMerge(0.95)
- topK(10) → topKState(10) / topKMerge(10)
See references/state-merge-reference.md for all function pairs.
Step 5: Chain Views for Multi-Level Aggregation
Build hierarchical aggregations to reduce redundant computation:
-- Level 1: Raw events (source)
CREATE TABLE events (...) ENGINE = MergeTree() ORDER BY (user_id, timestamp);
-- Level 2: Hourly aggregation
CREATE TABLE hourly_stats (...) ENGINE = SummingMergeTree() ORDER BY (hour, product_id);
CREATE MATERIALIZED VIEW hourly_stats_mv TO hourly_stats AS
SELECT
toStartOfHour(timestamp) as hour,
product_id,
COUNT() as event_count,
SUM(revenue) as total_revenue
FROM events
GROUP BY hour, product_id;
-- Level 3: Daily aggregation (chain from hourly, not raw events)
CREATE TABLE daily_stats (...) ENGINE = SummingMergeTree() ORDER BY (date, product_id);
CREATE MATERIALIZED VIEW daily_stats_mv TO daily_stats AS
SELECT
toDate(hour) as date,
product_id,
SUM(event_count) as event_count,
SUM(total_revenue) as total_revenue
FROM hourly_stats
GROUP BY date, product_id;
Benefits:
- Reduces redundant computation (don’t re-process raw events for daily stats)
- Each level optimized for different query patterns
- Easier to debug and modify intermediate aggregations
Step 6: Handle Schema Evolution and Versioning
When modifying aggregation logic:
-- Create new versioned view with updated schema
CREATE MATERIALIZED VIEW hourly_stats_v2_mv TO hourly_stats_v2 AS
SELECT
toStartOfHour(timestamp) as hour,
product_id,
COUNT() as sales_count,
SUM(revenue) as total_revenue,
COUNT(DISTINCT user_id) as unique_customers -- New column
FROM orders
GROUP BY hour, product_id;
-- Migrate queries gradually to v2
-- Then drop old view:
DROP VIEW hourly_stats_v1_mv;
DROP TABLE hourly_stats_v1;
Avoid ALTER TABLE on materialized view destinations when possible: if the destination schema drifts out of sync with the view's SELECT, inserts through the view can fail or silently drop data.
Step 7: Monitor and Optimize View Performance
Check View Execution Performance
-- system.query_views_log records one row per materialized view execution
SELECT
    view_name,
    view_duration_ms,
    read_rows,
    peak_memory_usage
FROM system.query_views_log
WHERE view_duration_ms > 1000
ORDER BY view_duration_ms DESC;
Common Performance Issues and Solutions
- High memory usage: Reduce GROUP BY cardinality or add WHERE filters
- Slow execution: Simplify aggregation logic or chain views instead of complex single view
- Data lag: Increase kafka_max_block_size if consuming from Kafka
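Failed view executions can also be spotted in system.query_views_log (one row per view per triggering insert; this assumes view logging is enabled, which it is by default when query logging is on):

```sql
SELECT view_name, status, exception, event_time
FROM system.query_views_log
WHERE status != 'QueryFinish'
ORDER BY event_time DESC
LIMIT 20;
```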
See references/troubleshooting.md for debugging guide.
Step 8: Backfill Historical Data
When adding new views to existing data:
Option 1: Direct Insert (Safest)
-- Backfill historical data directly to destination
INSERT INTO hourly_stats
SELECT
toStartOfHour(timestamp) as hour,
product_id,
COUNT() as sales_count,
SUM(revenue) as total_revenue
FROM orders
WHERE date < '2024-01-01'
GROUP BY hour, product_id;
Option 2: Use POPULATE (for empty tables)
-- Only use if destination table is empty
CREATE MATERIALIZED VIEW hourly_stats_mv TO hourly_stats
POPULATE AS
SELECT ... FROM orders;
⚠️ Avoid POPULATE on large source tables: it can be very slow, and rows inserted while the view is being created are missed.
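A third, commonly used pattern avoids both pitfalls: create the view with a time cutoff so it only handles new data, then backfill everything before the cutoff with a direct INSERT (the cutoff value here is illustrative):

```sql
-- View covers data from the cutoff onward
CREATE MATERIALIZED VIEW hourly_stats_mv TO hourly_stats AS
SELECT
    toStartOfHour(timestamp) AS hour,
    product_id,
    COUNT() AS sales_count,
    SUM(revenue) AS total_revenue
FROM orders
WHERE timestamp >= '2024-01-01 00:00:00'
GROUP BY hour, product_id;

-- One-off backfill covers everything before the cutoff
INSERT INTO hourly_stats
SELECT
    toStartOfHour(timestamp) AS hour,
    product_id,
    COUNT() AS sales_count,
    SUM(revenue) AS total_revenue
FROM orders
WHERE timestamp < '2024-01-01 00:00:00'
GROUP BY hour, product_id;
```

Because the two WHERE clauses partition the data exactly, no row is counted twice.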
Examples & Patterns
The following patterns are common use cases for materialized views:
Real-Time Dashboard Metrics
Pre-aggregate metrics to enable instant dashboard queries:
CREATE MATERIALIZED VIEW dashboard_metrics_mv TO dashboard_metrics AS
SELECT
toStartOfHour(timestamp) as hour,
COUNT() as total_events,
uniq(user_id) as unique_users,
countIf(event_type = 'purchase') as purchases
FROM events
GROUP BY hour;
User Segmentation
Automatically segment users based on behavior (note: the view aggregates each insert block independently, so per-user totals accumulate across batches only if the destination engine merges rows or you re-aggregate at query time):
CREATE MATERIALIZED VIEW user_segments_mv TO user_segments AS
SELECT
user_id,
COUNT() as event_count,
SUM(revenue) as lifetime_value,
CASE
WHEN lifetime_value > 1000 THEN 'vip'
WHEN lifetime_value > 100 THEN 'regular'
ELSE 'casual'
END as segment
FROM events
GROUP BY user_id;
Multi-Destination Routing
Route data to multiple aggregations:
-- High-value orders
CREATE MATERIALIZED VIEW high_value_orders_mv TO high_value_orders AS
SELECT * FROM orders WHERE total_amount > 1000;
-- All orders
CREATE MATERIALIZED VIEW all_orders_mv TO all_orders AS
SELECT * FROM orders;
Complete Examples
For comprehensive, production-ready examples covering real-world scenarios, see the examples/ directory:
- Real-time dashboards and metrics aggregation
- User segmentation and cohort analysis
- Funnel analysis and conversion tracking
- Time-series downsampling and rollups
- Kafka streaming pipelines with end-to-end examples
Requirements
- ClickHouse: Incremental materialized views and State/Merge functions are available in all currently supported releases
- ClickHouse 23.12+: Refreshable materialized views (introduced as experimental)
- Python client (for querying views): pip install clickhouse-driver
- Kafka/Redpanda: For streaming pipelines (optional)
References
Engine Comparison
references/engine-comparison.md – Detailed comparison of ClickHouse table engines for materialized views:
- SummingMergeTree vs AggregatingMergeTree: When to use each for aggregation
- ReplacingMergeTree: For deduplication and entity updates
- Engine characteristics: Memory usage, merge behavior, query performance
- Decision matrix: Choosing the right engine for your use case
State/Merge Functions Reference
references/state-merge-reference.md – Complete reference for all State/Merge function pairs:
- Function pair mapping (State → Merge)
- Supported aggregation functions
- Parameter specifications and usage
- Performance considerations
- Examples for each function
Kafka Streaming Pipeline
examples/kafka-streaming-pipeline.md – Real-world Kafka integration patterns:
- Consuming from Kafka with Kafka table engine
- Chaining materialized views with Kafka sources
- Error handling and data validation
- Complete end-to-end pipeline example
- Monitoring Kafka consumption in ClickHouse
Aggregation Patterns
examples/aggregation-patterns.md – SQL pattern examples and common scenarios:
- Real-time dashboards and metrics
- User segmentation and cohort analysis
- Funnel analysis and conversion tracking
- Time-series downsampling and rollups
- Copy-paste ready SQL examples
Complex Aggregates with State/Merge
examples/state-merge-aggregates.md – Advanced aggregation techniques:
- Using uniq, quantile, and topK in views
- Multi-stage aggregation with State/Merge
- Accuracy vs performance tradeoffs
- Real-world performance tuning examples
- Working with approximate functions
Troubleshooting Guide
references/troubleshooting.md – Comprehensive debugging guide for common issues:
- Performance problems and optimization strategies
- Data correctness and consistency issues
- Schema evolution and migration challenges
- Monitoring and health checks
- Common error messages and solutions