clickhouse-materialized-views

📁 dawiddutoit/custom-claude 📅 Jan 26, 2026
Install command
npx skills add https://github.com/dawiddutoit/custom-claude --skill clickhouse-materialized-views


Skill Documentation

Table of Contents

  1. Purpose
  2. Quick Start
  3. Core Concepts
  4. Instructions
  5. Examples & Patterns
  6. References
  7. Requirements

ClickHouse Materialized Views Skill

Purpose

Materialized views in ClickHouse are incremental insert triggers that automatically transform and aggregate incoming data in real time. Unlike materialized views in traditional databases, which are typically recomputed on demand or on a schedule, ClickHouse materialized views process each block of newly inserted data as it arrives, enabling efficient streaming analytics, real-time dashboards, and continuous aggregation without scheduled batch jobs.

When to Use This Skill

Use when asked to:

  • Build real-time data aggregation pipelines (“create real-time metrics”)
  • Implement streaming analytics or continuous aggregation
  • Create pre-aggregated tables for fast dashboard queries
  • Set up automatic data transformation as events arrive
  • Chain multi-level aggregations (hourly → daily → monthly)

Do NOT use when:

  • Data is batch-processed infrequently (use scheduled jobs instead)
  • Aggregation logic changes often (materialized views are harder to modify)
  • Source data is small and queries are already fast

Quick Start

Create a simple real-time aggregation pipeline:

-- Step 1: Create source table (raw events)
CREATE TABLE events (
    user_id UInt32,
    event_type String,
    timestamp DateTime,
    revenue Decimal(10, 2)
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);

-- Step 2: Create destination table (aggregated)
CREATE TABLE daily_revenue (
    date Date,
    total_revenue Decimal(18, 2),
    event_count UInt64
) ENGINE = SummingMergeTree()
ORDER BY date;

-- Step 3: Create materialized view (automatic aggregation)
CREATE MATERIALIZED VIEW daily_revenue_mv TO daily_revenue AS
SELECT
    toDate(timestamp) as date,
    SUM(revenue) as total_revenue,
    COUNT() as event_count
FROM events
GROUP BY date;

-- Now every INSERT into events automatically updates daily_revenue
INSERT INTO events VALUES (12345, 'purchase', '2024-01-15 10:30:00', 99.99);

-- SummingMergeTree combines rows only during background merges, so
-- aggregate at query time for guaranteed-complete totals:
SELECT date, sum(total_revenue) AS total_revenue, sum(event_count) AS event_count
FROM daily_revenue
GROUP BY date;
-- Result: 2024-01-15 | 99.99 | 1

Instructions

Step 1: Choose Your Aggregation Pattern

Materialized views support three main patterns depending on your use case:

Incremental Views (Most Common)

  • Process only new data as it arrives
  • Best for streaming event data and continuous updates
  • Use with SummingMergeTree so partial aggregates are combined during background merges

Refreshable Views

  • Recalculate the entire view on a schedule (e.g., hourly)
  • Best for complex queries (joins, full-table rollups) that need complete recalculation
  • Syntax: CREATE MATERIALIZED VIEW ... REFRESH EVERY 1 HOUR ... AS ... (available from ClickHouse 23.12, initially as an experimental feature)
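
A fuller sketch of a refreshable view, under the syntax available from ClickHouse 23.12 onward (the table names and schedule here are illustrative):

-- Recomputes the whole result every hour and replaces the view's contents
CREATE MATERIALIZED VIEW top_products_mv
REFRESH EVERY 1 HOUR
TO top_products AS
SELECT product_id, sum(revenue) AS revenue
FROM orders
GROUP BY product_id
ORDER BY revenue DESC
LIMIT 100;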

Populate-Based Views

  • Backfill existing data when view is created
  • Warning: Can be expensive on large tables
  • Syntax: CREATE MATERIALIZED VIEW ... POPULATE AS ...

See examples/aggregation-patterns.md for complete pattern examples.

Step 2: Select the Right Destination Engine

The destination table engine determines how your aggregated data behaves:

SummingMergeTree (Simple Aggregates)

  • Use when aggregating with COUNT or SUM (store sums and counts separately and derive averages at query time; AVG itself is not summable)
  • Automatically sums columns during background merges
  • No duplicate rows in final results
  • Example: Hourly sales totals, daily revenue

AggregatingMergeTree (Complex Aggregates)

  • Use with State/Merge functions for uniq, quantile, topK
  • Stores intermediate aggregation states in columns declared with the AggregateFunction(...) type
  • Combine with uniqState(), quantileState(), topKState() in views
  • Query with uniqMerge(), quantileMerge(), topKMerge() functions
  • Example: Unique user counts, percentile calculations

ReplacingMergeTree (Deduplication)

  • Use when you need to replace/update existing rows
  • Maintains version column to determine latest record
  • Example: User profile updates, entity state snapshots

MergeTree (Raw Events)

  • Use when storing raw events without aggregation
  • No automatic deduplication
  • Example: Event audit trail, raw clickstream data

See references/engine-comparison.md for detailed comparison.

Step 3: Design Your Aggregation Query

Follow these principles:

GROUP BY Cardinality

  • Keep cardinality moderate (not millions of unique combinations)
  • Example: Good = GROUP BY date, product_id | Bad = GROUP BY user_id, timestamp

ORDER BY for Query Patterns

  • Design ORDER BY for your most common query filters
  • If queries filter by date first: ORDER BY (date, product_id)
  • If queries filter by product first: ORDER BY (product_id, date)

Include Necessary Columns

  • Include columns needed for post-aggregation filtering
  • Store raw values alongside aggregates when needed for flexibility
  • Example: Store timestamp and revenue along with hourly aggregates

Use Conditional Aggregation

  • Use countIf(), sumIf(), avgIf() for event filtering
  • Filter inside the aggregate function rather than creating a separate view per filter
  • Example: countIf(event_type = 'purchase') as purchase_count
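
Taken together, the principles above yield a destination/view pair like this sketch (the table and column names are illustrative):

-- Destination: moderate cardinality (hour × product), ordered for
-- the most common filter (by hour first)
CREATE TABLE hourly_product_stats (
    hour DateTime,
    product_id UInt32,
    view_count UInt64,
    purchase_count UInt64,
    total_revenue Decimal(18, 2)
) ENGINE = SummingMergeTree()
ORDER BY (hour, product_id);

-- View: conditional aggregation keeps one view serving several metrics
CREATE MATERIALIZED VIEW hourly_product_stats_mv TO hourly_product_stats AS
SELECT
    toStartOfHour(timestamp) as hour,
    product_id,
    countIf(event_type = 'view') as view_count,
    countIf(event_type = 'purchase') as purchase_count,
    sumIf(revenue, event_type = 'purchase') as total_revenue
FROM events
GROUP BY hour, product_id;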

See examples/aggregation-patterns.md for SQL patterns.

Step 4: Implement State/Merge Functions for Complex Aggregates

When using AggregatingMergeTree, use special State/Merge function pairs:

In Materialized View (State Functions)

CREATE MATERIALIZED VIEW user_stats_mv TO user_stats AS
SELECT
    toDate(timestamp) as date,
    uniqState(user_id) as unique_users,
    quantileState(0.95)(response_time) as p95_latency
FROM events
GROUP BY date;
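
The destination table for this view must declare matching AggregateFunction columns; a sketch assuming user_id is UInt32 and response_time is Float64:

-- Each column's type names the aggregate it stores the state for
CREATE TABLE user_stats (
    date Date,
    unique_users AggregateFunction(uniq, UInt32),
    p95_latency AggregateFunction(quantile(0.95), Float64)
) ENGINE = AggregatingMergeTree()
ORDER BY date;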

In Queries (Merge Functions)

SELECT
    date,
    uniqMerge(unique_users) as total_unique_users,
    quantileMerge(0.95)(p95_latency) as latency_p95
FROM user_stats
GROUP BY date;

Common function pairs:

  • uniq() ↔ uniqState() / uniqMerge()
  • sum() ↔ sumState() / sumMerge()
  • avg() ↔ avgState() / avgMerge()
  • quantile(0.95)() ↔ quantileState(0.95)() / quantileMerge(0.95)()
  • topK(10)() ↔ topKState(10)() / topKMerge(10)()

See references/state-merge-reference.md for all function pairs.

Step 5: Chain Views for Multi-Level Aggregation

Build hierarchical aggregations to reduce redundant computation:

-- Level 1: Raw events (source)
CREATE TABLE events (...) ENGINE = MergeTree() ORDER BY (user_id, timestamp);

-- Level 2: Hourly aggregation
CREATE TABLE hourly_stats (...) ENGINE = SummingMergeTree() ORDER BY (hour, product_id);
CREATE MATERIALIZED VIEW hourly_stats_mv TO hourly_stats AS
SELECT
    toStartOfHour(timestamp) as hour,
    product_id,
    COUNT() as event_count,
    SUM(revenue) as total_revenue
FROM events
GROUP BY hour, product_id;

-- Level 3: Daily aggregation (chain from hourly, not raw events)
CREATE TABLE daily_stats (...) ENGINE = SummingMergeTree() ORDER BY (date, product_id);
CREATE MATERIALIZED VIEW daily_stats_mv TO daily_stats AS
SELECT
    toDate(hour) as date,
    product_id,
    SUM(event_count) as event_count,
    SUM(total_revenue) as total_revenue
FROM hourly_stats
GROUP BY date, product_id;

Benefits:

  • Reduces redundant computation (don’t re-process raw events for daily stats)
  • Each level optimized for different query patterns
  • Easier to debug and modify intermediate aggregations

Step 6: Handle Schema Evolution and Versioning

When modifying aggregation logic:

-- Create new versioned view with updated schema
CREATE MATERIALIZED VIEW hourly_stats_v2_mv TO hourly_stats_v2 AS
SELECT
    toStartOfHour(timestamp) as hour,
    product_id,
    COUNT() as sales_count,
    SUM(revenue) as total_revenue,
    COUNT(DISTINCT user_id) as unique_customers  -- New column
FROM orders
GROUP BY hour, product_id;

-- Migrate queries gradually to v2
-- Then drop old view:
DROP VIEW hourly_stats_v1_mv;
DROP TABLE hourly_stats_v1;

Prefer versioned recreation over ALTERing a view's destination in place: a materialized view's SELECT cannot be modified without dropping and recreating it, and a mismatch between the view query and the destination schema will break inserts.
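
An alternative to migrating queries is to keep a stable table name and atomically swap the new destination in once it has caught up. This requires the Atomic database engine (the default in recent releases); the names below follow the hypothetical v2 example above:

-- Queries keep reading hourly_stats; the freshly built table takes its place
EXCHANGE TABLES hourly_stats AND hourly_stats_v2;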

Step 7: Monitor and Optimize View Performance

Check View Execution Performance

-- Requires the system.query_views_log table to be enabled
-- (log_query_views setting plus the query_views_log server config)
SELECT
    view_name,
    view_duration_ms,
    read_rows,
    peak_memory_usage
FROM system.query_views_log
WHERE view_duration_ms > 1000
ORDER BY view_duration_ms DESC;

Common Performance Issues and Solutions

  • High memory usage: Reduce GROUP BY cardinality or add WHERE filters
  • Slow execution: Simplify aggregation logic or chain views instead of complex single view
  • Data lag: Increase kafka_max_block_size if consuming from Kafka

See references/troubleshooting.md for debugging guide.

Step 8: Backfill Historical Data

When adding new views to existing data:

Option 1: Direct Insert (Safest)

-- Backfill historical data directly to destination
INSERT INTO hourly_stats
SELECT
    toStartOfHour(timestamp) as hour,
    product_id,
    COUNT() as sales_count,
    SUM(revenue) as total_revenue
FROM orders
WHERE date < '2024-01-01'
GROUP BY hour, product_id;

Option 2: Use POPULATE (implicit destination only)

-- POPULATE cannot be combined with the TO clause;
-- the view owns its own implicit destination table
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree() ORDER BY (hour, product_id)
POPULATE AS
SELECT ... FROM orders;

⚠️ Avoid POPULATE on large source tables (it can be very slow), and note that rows inserted while the backfill runs are not captured by it.

Examples & Patterns

The following patterns are common use cases for materialized views:

Real-Time Dashboard Metrics

Pre-aggregate metrics to enable instant dashboard queries. One caveat: uniq(user_id) below is computed per insert block, and block-level unique counts cannot simply be summed across blocks; when exact hourly uniques matter, use uniqState() with an AggregatingMergeTree destination as described in Step 4:

CREATE MATERIALIZED VIEW dashboard_metrics_mv TO dashboard_metrics AS
SELECT
    toStartOfHour(timestamp) as hour,
    COUNT() as total_events,
    uniq(user_id) as unique_users,
    countIf(event_type = 'purchase') as purchases
FROM events
GROUP BY hour;

User Segmentation

Accumulate per-user totals and derive the segment at query time. A materialized view aggregates only the rows in each insert block, and SummingMergeTree keeps an arbitrary value for non-summed columns, so a segment computed inside the view would reflect a single batch rather than the accumulated total:

CREATE MATERIALIZED VIEW user_segments_mv TO user_segments AS
SELECT
    user_id,
    COUNT() as event_count,
    SUM(revenue) as lifetime_value
FROM events
GROUP BY user_id;

-- Segment at query time, once totals have accumulated:
SELECT
    user_id,
    sum(lifetime_value) as total_value,
    CASE
        WHEN total_value > 1000 THEN 'vip'
        WHEN total_value > 100 THEN 'regular'
        ELSE 'casual'
    END as segment
FROM user_segments
GROUP BY user_id;

Multi-Destination Routing

Route data to multiple aggregations:

-- High-value orders
CREATE MATERIALIZED VIEW high_value_orders_mv TO high_value_orders AS
SELECT * FROM orders WHERE total_amount > 1000;

-- All orders
CREATE MATERIALIZED VIEW all_orders_mv TO all_orders AS
SELECT * FROM orders;
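
The same fan-out works when the source is a Kafka topic consumed through the Kafka table engine; a minimal sketch assuming a JSONEachRow topic named orders (the broker address and schema are illustrative):

-- Kafka engine table: a streaming source, not queried directly
CREATE TABLE orders_kafka (
    order_id UInt64,
    total_amount Decimal(10, 2)
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
         kafka_topic_list = 'orders',
         kafka_group_name = 'clickhouse-orders',
         kafka_format = 'JSONEachRow';

-- Attaching a materialized view starts consumption and drains rows
-- into a normal MergeTree destination
CREATE MATERIALIZED VIEW orders_consumer_mv TO all_orders AS
SELECT order_id, total_amount FROM orders_kafka;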

Complete Examples

For comprehensive, production-ready examples covering real-world scenarios, see the examples/ directory:

  • Real-time dashboards and metrics aggregation
  • User segmentation and cohort analysis
  • Funnel analysis and conversion tracking
  • Time-series downsampling and rollups
  • Kafka streaming pipelines with end-to-end examples

Requirements

  • ClickHouse (any maintained release): Incremental materialized views and State/Merge functions have been core features for many years
  • ClickHouse 23.12+: Refreshable materialized views (initially experimental)
  • Python client (for querying views): pip install clickhouse-driver
  • Kafka/Redpanda: For streaming pipelines (optional)

References

Engine Comparison

references/engine-comparison.md – Detailed comparison of ClickHouse table engines for materialized views:

  • SummingMergeTree vs AggregatingMergeTree: When to use each for aggregation
  • ReplacingMergeTree: For deduplication and entity updates
  • Engine characteristics: Memory usage, merge behavior, query performance
  • Decision matrix: Choosing the right engine for your use case

State/Merge Functions Reference

references/state-merge-reference.md – Complete reference for all State/Merge function pairs:

  • Function pair mapping (State ↔ Merge)
  • Supported aggregation functions
  • Parameter specifications and usage
  • Performance considerations
  • Examples for each function

Kafka Streaming Pipeline

examples/kafka-streaming-pipeline.md – Real-world Kafka integration patterns:

  • Consuming from Kafka with Kafka table engine
  • Chaining materialized views with Kafka sources
  • Error handling and data validation
  • Complete end-to-end pipeline example
  • Monitoring Kafka consumption in ClickHouse

Aggregation Patterns

examples/aggregation-patterns.md – SQL pattern examples and common scenarios:

  • Real-time dashboards and metrics
  • User segmentation and cohort analysis
  • Funnel analysis and conversion tracking
  • Time-series downsampling and rollups
  • Copy-paste ready SQL examples

Complex Aggregates with State/Merge

examples/state-merge-aggregates.md – Advanced aggregation techniques:

  • Using uniq, quantile, and topK in views
  • Multi-stage aggregation with State/Merge
  • Accuracy vs performance tradeoffs
  • Real-world performance tuning examples
  • Working with approximate functions

Troubleshooting Guide

references/troubleshooting.md – Comprehensive debugging guide for common issues:

  • Performance problems and optimization strategies
  • Data correctness and consistency issues
  • Schema evolution and migration challenges
  • Monitoring and health checks
  • Common error messages and solutions