clickhouse-materialized-views
npx skills add https://github.com/obsessiondb/clickhouse-skills --skill clickhouse-materialized-views
Agent 安装分布
Skill 文档
ClickHouse Materialized Views
Load when creating Materialized Views for real-time aggregation, ETL pipelines, or query optimization.
Prerequisite: Understand MergeTree engine variants in clickhouse-schema-design skill.
Reference Documentation
Search terms: materialized view, MV, real-time aggregation, AggregateFunction, -State, -Merge, SummingMergeTree, pre-aggregation, incremental
Critical Rules
[CRITICAL]
- MVs are triggers, not caches. They process INSERT data, not query results.
- Use correct engine. AggregatingMergeTree for complex aggregates, SummingMergeTree for simple counters.
- Query with -Merge functions or argMax. Aggregation completes at query time, not insert time.
- Chained MVs see the pre-aggregated block, NOT the merged table state. A downstream MV (AâBâC) receives the block just inserted into B’s target table â already grouped by the upstream MV’s SELECT, but not merged with existing data. This means each insert batch is processed independently and
-MergeStateis required when chaining AggregatingMergeTree MVs.
[HIGH]
- MV sees INSERT only. No backfill; existing data must be inserted manually.
- ORDER BY in target must match GROUP BY in MV. Otherwise aggregation won’t work properly.
[MEDIUM]
- Avoid too many MVs on one source table when possible. Each MV adds overhead to every INSERT.
How Materialized Views Work
Source Table ââINSERTââ⺠MV Transform ââ⺠Target Table
â
ââ MV executes SELECT for each inserted block
Key insight: The MV’s SELECT query runs on each INSERT batch. Results go to the target table. The MV does NOT query historical data.
Pattern 1: Real-Time Counters (SummingMergeTree)
Best for simple sums and counts that need real-time updates.
-- Source: raw events
CREATE TABLE events (
event_time DateTime,
tenant_id UInt32,
event_type String,
user_id UInt64
) ENGINE = MergeTree()
ORDER BY (tenant_id, event_time);
-- Target: daily counters
CREATE TABLE daily_event_counts (
date Date,
tenant_id UInt32,
event_type LowCardinality(String),
event_count UInt64,
unique_users UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (tenant_id, date, event_type);
-- MV: transform inserts
CREATE MATERIALIZED VIEW daily_event_counts_mv
TO daily_event_counts AS
SELECT
toDate(event_time) AS date,
tenant_id,
event_type,
count() AS event_count,
uniq(user_id) AS unique_users -- WARNING: Not additive!
FROM events
GROUP BY date, tenant_id, event_type;
Warning: uniq() in SummingMergeTree is not accurateâsums don’t equal unique counts. Use AggregatingMergeTree for unique counts.
Pattern 2: Complex Aggregates (AggregatingMergeTree)
For accurate uniq, quantiles, or any non-additive aggregate.
-- Target: uses AggregateFunction types
CREATE TABLE user_metrics_agg (
date Date,
tenant_id UInt32,
total_events AggregateFunction(sum, UInt64),
unique_users AggregateFunction(uniq, UInt64),
p95_duration AggregateFunction(quantile(0.95), Float32)
)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, date);
-- MV: use -State functions
CREATE MATERIALIZED VIEW user_metrics_mv
TO user_metrics_agg AS
SELECT
toDate(event_time) AS date,
tenant_id,
sumState(1) AS total_events,
uniqState(user_id) AS unique_users,
quantileState(0.95)(duration_ms) AS p95_duration
FROM events
GROUP BY date, tenant_id;
-- Query: use -Merge functions
SELECT
date,
tenant_id,
sumMerge(total_events) AS events,
uniqMerge(unique_users) AS users,
quantileMerge(0.95)(p95_duration) AS p95
FROM user_metrics_agg
WHERE tenant_id = 1
GROUP BY date, tenant_id;
Key pattern: -State to insert, -Merge to query.
Pattern 3: Data Transformation Pipeline
Transform/enrich data as it arrives.
-- Source: raw JSON logs
CREATE TABLE raw_logs (
timestamp DateTime,
raw_json String
) ENGINE = MergeTree()
ORDER BY timestamp;
-- Target: parsed structured data
CREATE TABLE parsed_logs (
timestamp DateTime,
level LowCardinality(String),
service LowCardinality(String),
message String,
trace_id String
)
ENGINE = MergeTree()
ORDER BY (service, timestamp);
-- MV: parse JSON on insert
CREATE MATERIALIZED VIEW parsed_logs_mv
TO parsed_logs AS
SELECT
timestamp,
JSONExtractString(raw_json, 'level') AS level,
JSONExtractString(raw_json, 'service') AS service,
JSONExtractString(raw_json, 'message') AS message,
JSONExtractString(raw_json, 'trace_id') AS trace_id
FROM raw_logs;
Pattern 4: Last Value Tracking (ReplacingMergeTree)
Track latest state per entity.
-- Target: latest user state
CREATE TABLE user_latest_state (
user_id UInt64,
last_seen DateTime,
last_action LowCardinality(String),
total_actions UInt64
)
ENGINE = ReplacingMergeTree(last_seen)
ORDER BY user_id;
-- MV: update on each event
CREATE MATERIALIZED VIEW user_state_mv
TO user_latest_state AS
SELECT
user_id,
max(event_time) AS last_seen,
argMax(event_type, event_time) AS last_action,
count() AS total_actions
FROM events
GROUP BY user_id;
Query with argMax pattern (avoid FINAL on large tables):
SELECT
user_id,
argMax(last_seen, last_seen) AS last_seen,
argMax(last_action, last_seen) AS last_action,
argMax(total_actions, last_seen) AS total_actions
FROM user_latest_state
WHERE user_id = 123
GROUP BY user_id;
Backfilling Existing Data
MVs don’t process existing data. Backfill manually:
-- Insert historical data into target table
INSERT INTO daily_event_counts
SELECT
toDate(event_time) AS date,
tenant_id,
event_type,
count() AS event_count,
uniq(user_id) AS unique_users
FROM events
WHERE event_time < '2024-01-01' -- Before MV was created
GROUP BY date, tenant_id, event_type;
MV Management
Check MV Status
-- List all MVs
SELECT name, engine, create_table_query
FROM system.tables
WHERE engine = 'MaterializedView';
-- Check target table size
SELECT
table,
formatReadableSize(sum(bytes_on_disk)) AS size,
sum(rows) AS rows
FROM system.parts
WHERE active AND table = 'daily_event_counts'
GROUP BY table;
Pause/Resume MV
-- Pause (stop processing inserts)
ALTER TABLE events DETACH MATERIALIZED VIEW daily_event_counts_mv;
-- Resume
ALTER TABLE events ATTACH MATERIALIZED VIEW daily_event_counts_mv;
Modify MV
MVs cannot be altered. Drop and recreate:
DROP VIEW daily_event_counts_mv;
-- Optionally truncate target: TRUNCATE TABLE daily_event_counts;
CREATE MATERIALIZED VIEW daily_event_counts_mv TO daily_event_counts AS ...;
-- Backfill if needed
Common Pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
Using uniq() with SummingMergeTree |
Sums don’t equal uniques | Use AggregatingMergeTree with uniqState/uniqMerge |
| Forgetting argMax or -Merge | Incomplete aggregation results | Use argMax pattern for Replacing/Collapsing, -Merge for Aggregating |
| No backfill after MV creation | Missing historical data | Manually INSERT historical aggregates |
| MV on wrong table | Inserts to wrong source ignored | Ensure MV is on the table receiving INSERTs |
| Too many MVs on one source | Slow inserts | Consider fewer MVs or async processing |
| Assuming chain is cheaper than parallel | Wrong topology choice | Chain and parallel have nearly identical resource cost â choose topology based on query granularity needs, not performance |
| Chained MV reads final table state | Wrong aggregation logic, missing data | Downstream MVs see the pre-aggregated block, not merged state â use -MergeState when chaining AggregatingMergeTree |
Decision Tree
Need to aggregate data at query time?
â
ââ Yes, and data changes frequently â Query raw data
â
ââ Yes, but queries are slow â Create MV
â â
â ââ Simple sums/counts only?
â â ââ SummingMergeTree
â â
â ââ Need uniq, quantile, or complex aggregates?
â â ââ AggregatingMergeTree with -State/-Merge
â â
â ââ Need latest value per key?
â ââ ReplacingMergeTree with argMax pattern
â
ââ Need to transform/parse data on insert?
ââ MV with regular MergeTree target
Complete Example: Multi-Level Aggregation
Chain MVs when each level serves different query granularity (hourly â daily). The downstream MV sees the pre-aggregated block (output of the upstream MV’s SELECT), not the merged table state â use -MergeState to handle partial aggregation states.
Cost note: Chain (AâBâC) vs parallel (AâB, AâC) topologies have nearly identical resource consumption. The bottleneck is total data written to target tables, not the topology. Choose based on query needs, not performance assumptions.
-- Level 1: Raw events (source)
CREATE TABLE events (...) ENGINE = MergeTree() ORDER BY ...;
-- Level 2: Hourly aggregates
CREATE TABLE hourly_stats (
hour DateTime,
tenant_id UInt32,
events AggregateFunction(sum, UInt64),
users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree() ORDER BY (tenant_id, hour);
CREATE MATERIALIZED VIEW hourly_mv TO hourly_stats AS
SELECT
toStartOfHour(event_time) AS hour,
tenant_id,
sumState(1) AS events,
uniqState(user_id) AS users
FROM events GROUP BY hour, tenant_id;
-- Level 3: Daily aggregates (from hourly)
-- daily_mv receives the pre-aggregated block inserted into hourly_stats
-- (already grouped by hour/tenant with -State values), but NOT merged
-- with existing hourly_stats data. sumMergeState/uniqMergeState handles
-- this: merges partial -State values within the block, then re-wraps as -State.
CREATE TABLE daily_stats (
date Date,
tenant_id UInt32,
events AggregateFunction(sum, UInt64),
users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree() ORDER BY (tenant_id, date);
CREATE MATERIALIZED VIEW daily_mv TO daily_stats AS
SELECT
toDate(hour) AS date,
tenant_id,
sumMergeState(events) AS events, -- Merge partial states from block, then re-State
uniqMergeState(users) AS users
FROM hourly_stats GROUP BY date, tenant_id;
Query any level:
-- Fast daily query
SELECT date, sumMerge(events), uniqMerge(users)
FROM daily_stats WHERE tenant_id = 1 GROUP BY date;
-- Drill down to hourly
SELECT hour, sumMerge(events), uniqMerge(users)
FROM hourly_stats WHERE tenant_id = 1 AND toDate(hour) = today() GROUP BY hour;
Troubleshooting
Always ask for user confirmation before creating/modifying MVs or target tables.
Wrong Aggregation Results
Problem: Counts are too high, uniq values don’t match raw data, aggregates seem doubled
Diagnose:
-- Compare MV result vs raw query
SELECT count() FROM target_table;
SELECT count() FROM source_table WHERE <same_filters>;
-- Check for duplicate keys in target
SELECT date, tenant_id, count() AS rows
FROM target_table
GROUP BY date, tenant_id
HAVING rows > 1;
Solutions:
| Cause | Fix |
|---|---|
Using uniq() with SummingMergeTree |
Switch to AggregatingMergeTree with uniqState/uniqMerge |
Forgetting -Merge in query |
Always use sumMerge(), uniqMerge() for AggregatingMergeTree |
Forgetting argMax for ReplacingMergeTree |
Use argMax pattern: SELECT key, argMax(col, version) ... GROUP BY key |
| Duplicate inserts to source | Deduplicate source or use ReplacingMergeTree for target |
-- BAD: uniq in SummingMergeTree (sums don't work)
ENGINE = SummingMergeTree()
-- SELECT uniq(user_id) AS users -- Wrong!
-- GOOD: AggregatingMergeTree with State/Merge
ENGINE = AggregatingMergeTree()
-- MV: uniqState(user_id) AS users
-- Query: uniqMerge(users)
MV Not Updating / Missing Data
Problem: Target table not receiving new data, counts stuck at old values
Diagnose:
-- Check if MV is attached
SELECT name, engine FROM system.tables WHERE engine = 'MaterializedView';
-- Check target table recent data
SELECT max(date), count() FROM target_table;
-- Verify source table is receiving inserts
SELECT max(event_time), count() FROM source_table WHERE event_time > now() - INTERVAL 1 HOUR;
Solutions:
| Cause | Fix |
|---|---|
| MV detached | ALTER TABLE source ATTACH MATERIALIZED VIEW mv_name |
| MV on wrong source table | Drop MV, recreate with correct source |
| Historical data not backfilled | Manually INSERT aggregated historical data |
| Inserts going to different table | Ensure app inserts to the MV’s source table |
-- Backfill historical data
INSERT INTO target_table
SELECT
toDate(event_time) AS date,
tenant_id,
sumState(1) AS events,
uniqState(user_id) AS users
FROM source_table
WHERE event_time < '2024-01-01' -- Before MV existed
GROUP BY date, tenant_id;
Slow Inserts After Adding MV
Problem: INSERT performance degraded after creating MV, insert latency increased
Diagnose:
-- Check MVs on this table
SELECT name, as_select FROM system.tables
WHERE engine = 'MaterializedView' AND as_select LIKE '%source_table%';
-- Check MV query complexity
EXPLAIN SELECT ... FROM source_table ...; -- Use MV's SELECT query
Solutions:
| Cause | Fix |
|---|---|
| Too many MVs on one source | Consolidate MVs or use async inserts |
| Complex MV query (JOINs, heavy transforms) | Simplify MV, move complexity to query time |
| MV target table has wrong ORDER BY | Match target ORDER BY to MV’s GROUP BY |
-- Example: MV groups by (tenant_id, date)
-- Target table ORDER BY should match
CREATE TABLE target (...)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, date); -- Matches GROUP BY in MV
Target Table Growing Too Large
Problem: MV target table larger than expected, not aggregating properly
Diagnose:
-- Check rows per key (should be 1 after merge for same GROUP BY)
SELECT date, tenant_id, count() AS rows
FROM target_table
GROUP BY date, tenant_id
ORDER BY rows DESC
LIMIT 10;
Solutions:
| Cause | Fix |
|---|---|
| ORDER BY doesn’t match GROUP BY | Recreate target with ORDER BY matching MV’s GROUP BY |
| Background merges haven’t run | Wait for automatic merge, or use argMax in query |
| Wrong engine (MergeTree instead of Summing/Aggregating) | Recreate with correct engine |
-- ORDER BY must match GROUP BY columns for proper aggregation
-- MV: GROUP BY (tenant_id, date, event_type)
-- Target: ORDER BY (tenant_id, date, event_type) -- Must match!