clickhouse-materialized-views

📁 obsessiondb/clickhouse-plugin 📅 14 days ago
1
总安装量
1
周安装量
#54694
全站排名
安装命令
npx skills add https://github.com/obsessiondb/clickhouse-plugin --skill clickhouse-materialized-views

Agent 安装分布

mcpjam 1
claude-code 1
kilo 1
windsurf 1
zencoder 1

Skill 文档

ClickHouse Materialized Views

Load when creating Materialized Views for real-time aggregation, ETL pipelines, or query optimization.

Prerequisite: Understand MergeTree engine variants in clickhouse-schema-design skill.

Reference Documentation

Search terms: materialized view, MV, real-time aggregation, AggregateFunction, -State, -Merge, SummingMergeTree, pre-aggregation, incremental

Critical Rules

  1. MVs are triggers, not caches – They process INSERT data, not query results
  2. Use correct engine – AggregatingMergeTree for complex aggregates, SummingMergeTree for counters
  3. Query with argMax or -Merge – Aggregation completes at query time, not insert time
  4. MV sees INSERT only – No backfill; existing data must be inserted manually

How Materialized Views Work

Source Table ──INSERT──► MV Transform ──► Target Table
                │
                └─ MV executes SELECT for each inserted block

Key insight: The MV’s SELECT query runs on each INSERT batch. Results go to the target table. The MV does NOT query historical data.

Pattern 1: Real-Time Counters (SummingMergeTree)

Best for simple sums and counts that need real-time updates.

-- Source: raw events
CREATE TABLE events (
    event_time DateTime,
    tenant_id UInt32,
    event_type String,
    user_id UInt64
) ENGINE = MergeTree()
ORDER BY (tenant_id, event_time);

-- Target: daily counters
CREATE TABLE daily_event_counts (
    date Date,
    tenant_id UInt32,
    event_type LowCardinality(String),
    event_count UInt64,
    unique_users UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (tenant_id, date, event_type);

-- MV: transform inserts
CREATE MATERIALIZED VIEW daily_event_counts_mv
TO daily_event_counts AS
SELECT
    toDate(event_time) AS date,
    tenant_id,
    event_type,
    count() AS event_count,
    uniq(user_id) AS unique_users  -- WARNING: Not additive!
FROM events
GROUP BY date, tenant_id, event_type;

Warning: uniq() in SummingMergeTree is not accurate—sums don’t equal unique counts. Use AggregatingMergeTree for unique counts.

Pattern 2: Complex Aggregates (AggregatingMergeTree)

For accurate uniq, quantiles, or any non-additive aggregate.

-- Target: uses AggregateFunction types
CREATE TABLE user_metrics_agg (
    date Date,
    tenant_id UInt32,
    total_events AggregateFunction(sum, UInt64),
    unique_users AggregateFunction(uniq, UInt64),
    p95_duration AggregateFunction(quantile(0.95), Float32)
)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, date);

-- MV: use -State functions
CREATE MATERIALIZED VIEW user_metrics_mv
TO user_metrics_agg AS
SELECT
    toDate(event_time) AS date,
    tenant_id,
    sumState(1) AS total_events,
    uniqState(user_id) AS unique_users,
    quantileState(0.95)(duration_ms) AS p95_duration
FROM events
GROUP BY date, tenant_id;

-- Query: use -Merge functions
SELECT
    date,
    tenant_id,
    sumMerge(total_events) AS events,
    uniqMerge(unique_users) AS users,
    quantileMerge(0.95)(p95_duration) AS p95
FROM user_metrics_agg
WHERE tenant_id = 1
GROUP BY date, tenant_id;

Key pattern: -State to insert, -Merge to query.

Pattern 3: Data Transformation Pipeline

Transform/enrich data as it arrives.

-- Source: raw JSON logs
CREATE TABLE raw_logs (
    timestamp DateTime,
    raw_json String
) ENGINE = MergeTree()
ORDER BY timestamp;

-- Target: parsed structured data
CREATE TABLE parsed_logs (
    timestamp DateTime,
    level LowCardinality(String),
    service LowCardinality(String),
    message String,
    trace_id String
)
ENGINE = MergeTree()
ORDER BY (service, timestamp);

-- MV: parse JSON on insert
CREATE MATERIALIZED VIEW parsed_logs_mv
TO parsed_logs AS
SELECT
    timestamp,
    JSONExtractString(raw_json, 'level') AS level,
    JSONExtractString(raw_json, 'service') AS service,
    JSONExtractString(raw_json, 'message') AS message,
    JSONExtractString(raw_json, 'trace_id') AS trace_id
FROM raw_logs;

Pattern 4: Last Value Tracking (ReplacingMergeTree)

Track latest state per entity.

-- Target: latest user state
CREATE TABLE user_latest_state (
    user_id UInt64,
    last_seen DateTime,
    last_action LowCardinality(String),
    total_actions UInt64
)
ENGINE = ReplacingMergeTree(last_seen)
ORDER BY user_id;

-- MV: update on each event
CREATE MATERIALIZED VIEW user_state_mv
TO user_latest_state AS
SELECT
    user_id,
    max(event_time) AS last_seen,
    argMax(event_type, event_time) AS last_action,
    count() AS total_actions
FROM events
GROUP BY user_id;

Query with argMax pattern (avoid FINAL on large tables):

SELECT
    user_id,
    argMax(last_seen, last_seen) AS last_seen,
    argMax(last_action, last_seen) AS last_action,
    argMax(total_actions, last_seen) AS total_actions
FROM user_latest_state
WHERE user_id = 123
GROUP BY user_id;

Backfilling Existing Data

MVs don’t process existing data. Backfill manually:

-- Insert historical data into target table
INSERT INTO daily_event_counts
SELECT
    toDate(event_time) AS date,
    tenant_id,
    event_type,
    count() AS event_count,
    uniq(user_id) AS unique_users
FROM events
WHERE event_time < '2024-01-01'  -- Before MV was created
GROUP BY date, tenant_id, event_type;

MV Management

Check MV Status

-- List all MVs
SELECT name, engine, create_table_query
FROM system.tables
WHERE engine = 'MaterializedView';

-- Check target table size
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    sum(rows) AS rows
FROM system.parts
WHERE active AND table = 'daily_event_counts'
GROUP BY table;

Pause/Resume MV

-- Pause (stop processing inserts)
ALTER TABLE events DETACH MATERIALIZED VIEW daily_event_counts_mv;

-- Resume
ALTER TABLE events ATTACH MATERIALIZED VIEW daily_event_counts_mv;

Modify MV

MVs cannot be altered. Drop and recreate:

DROP VIEW daily_event_counts_mv;
-- Optionally truncate target: TRUNCATE TABLE daily_event_counts;
CREATE MATERIALIZED VIEW daily_event_counts_mv TO daily_event_counts AS ...;
-- Backfill if needed

Common Pitfalls

Pitfall Problem Solution
Using uniq() with SummingMergeTree Sums don’t equal uniques Use AggregatingMergeTree with uniqState/uniqMerge
Forgetting argMax or -Merge Incomplete aggregation results Use argMax pattern for Replacing/Collapsing, -Merge for Aggregating
No backfill after MV creation Missing historical data Manually INSERT historical aggregates
MV on wrong table Inserts to wrong source ignored Ensure MV is on the table receiving INSERTs
Too many MVs on one source Slow inserts Consider fewer MVs or async processing

Decision Tree

Need to aggregate data at query time?
│
├─ Yes, and data changes frequently → Query raw data
│
├─ Yes, but queries are slow → Create MV
│  │
│  ├─ Simple sums/counts only?
│  │  └─ SummingMergeTree
│  │
│  ├─ Need uniq, quantile, or complex aggregates?
│  │  └─ AggregatingMergeTree with -State/-Merge
│  │
│  └─ Need latest value per key?
│     └─ ReplacingMergeTree with argMax pattern
│
└─ Need to transform/parse data on insert?
   └─ MV with regular MergeTree target

Complete Example: Multi-Level Aggregation

-- Level 1: Raw events (source)
CREATE TABLE events (...) ENGINE = MergeTree() ORDER BY ...;

-- Level 2: Hourly aggregates
CREATE TABLE hourly_stats (
    hour DateTime,
    tenant_id UInt32,
    events AggregateFunction(sum, UInt64),
    users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree() ORDER BY (tenant_id, hour);

CREATE MATERIALIZED VIEW hourly_mv TO hourly_stats AS
SELECT
    toStartOfHour(event_time) AS hour,
    tenant_id,
    sumState(1) AS events,
    uniqState(user_id) AS users
FROM events GROUP BY hour, tenant_id;

-- Level 3: Daily aggregates (from hourly)
CREATE TABLE daily_stats (
    date Date,
    tenant_id UInt32,
    events AggregateFunction(sum, UInt64),
    users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree() ORDER BY (tenant_id, date);

CREATE MATERIALIZED VIEW daily_mv TO daily_stats AS
SELECT
    toDate(hour) AS date,
    tenant_id,
    sumMergeState(events) AS events,   -- Merge then re-State
    uniqMergeState(users) AS users
FROM hourly_stats GROUP BY date, tenant_id;

Query any level:

-- Fast daily query
SELECT date, sumMerge(events), uniqMerge(users)
FROM daily_stats WHERE tenant_id = 1 GROUP BY date;

-- Drill down to hourly
SELECT hour, sumMerge(events), uniqMerge(users)
FROM hourly_stats WHERE tenant_id = 1 AND toDate(hour) = today() GROUP BY hour;

Troubleshooting

Always ask for user confirmation before creating/modifying MVs or target tables.

Wrong Aggregation Results

Problem: Counts are too high, uniq values don’t match raw data, aggregates seem doubled

Diagnose:

-- Compare MV result vs raw query
SELECT count() FROM target_table;
SELECT count() FROM source_table WHERE <same_filters>;

-- Check for duplicate keys in target
SELECT date, tenant_id, count() AS rows
FROM target_table
GROUP BY date, tenant_id
HAVING rows > 1;

Solutions:

Cause Fix
Using uniq() with SummingMergeTree Switch to AggregatingMergeTree with uniqState/uniqMerge
Forgetting -Merge in query Always use sumMerge(), uniqMerge() for AggregatingMergeTree
Forgetting argMax for ReplacingMergeTree Use argMax pattern: SELECT key, argMax(col, version) ... GROUP BY key
Duplicate inserts to source Deduplicate source or use ReplacingMergeTree for target
-- BAD: uniq in SummingMergeTree (sums don't work)
ENGINE = SummingMergeTree()
-- SELECT uniq(user_id) AS users  -- Wrong!

-- GOOD: AggregatingMergeTree with State/Merge
ENGINE = AggregatingMergeTree()
-- MV: uniqState(user_id) AS users
-- Query: uniqMerge(users)

MV Not Updating / Missing Data

Problem: Target table not receiving new data, counts stuck at old values

Diagnose:

-- Check if MV is attached
SELECT name, engine FROM system.tables WHERE engine = 'MaterializedView';

-- Check target table recent data
SELECT max(date), count() FROM target_table;

-- Verify source table is receiving inserts
SELECT max(event_time), count() FROM source_table WHERE event_time > now() - INTERVAL 1 HOUR;

Solutions:

Cause Fix
MV detached ALTER TABLE source ATTACH MATERIALIZED VIEW mv_name
MV on wrong source table Drop MV, recreate with correct source
Historical data not backfilled Manually INSERT aggregated historical data
Inserts going to different table Ensure app inserts to the MV’s source table
-- Backfill historical data
INSERT INTO target_table
SELECT
    toDate(event_time) AS date,
    tenant_id,
    sumState(1) AS events,
    uniqState(user_id) AS users
FROM source_table
WHERE event_time < '2024-01-01'  -- Before MV existed
GROUP BY date, tenant_id;

Slow Inserts After Adding MV

Problem: INSERT performance degraded after creating MV, insert latency increased

Diagnose:

-- Check MVs on this table
SELECT name, as_select FROM system.tables
WHERE engine = 'MaterializedView' AND as_select LIKE '%source_table%';

-- Check MV query complexity
EXPLAIN SELECT ... FROM source_table ...;  -- Use MV's SELECT query

Solutions:

Cause Fix
Too many MVs on one source Consolidate MVs or use async inserts
Complex MV query (JOINs, heavy transforms) Simplify MV, move complexity to query time
MV target table has wrong ORDER BY Match target ORDER BY to MV’s GROUP BY
-- Example: MV groups by (tenant_id, date)
-- Target table ORDER BY should match
CREATE TABLE target (...)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, date);  -- Matches GROUP BY in MV

Target Table Growing Too Large

Problem: MV target table larger than expected, not aggregating properly

Diagnose:

-- Check rows per key (should be 1 after merge for same GROUP BY)
SELECT date, tenant_id, count() AS rows
FROM target_table
GROUP BY date, tenant_id
ORDER BY rows DESC
LIMIT 10;

Solutions:

Cause Fix
ORDER BY doesn’t match GROUP BY Recreate target with ORDER BY matching MV’s GROUP BY
Background merges haven’t run Wait for automatic merge, or use argMax in query
Wrong engine (MergeTree instead of Summing/Aggregating) Recreate with correct engine
-- ORDER BY must match GROUP BY columns for proper aggregation
-- MV: GROUP BY (tenant_id, date, event_type)
-- Target: ORDER BY (tenant_id, date, event_type)  -- Must match!