clickhouse-query-optimization

📁 dawiddutoit/custom-claude 📅 Jan 26, 2026
4
总安装量
4
周安装量
#49057
全站排名
安装命令
npx skills add https://github.com/dawiddutoit/custom-claude --skill clickhouse-query-optimization

Agent 安装分布

mcpjam 4
neovate 4
gemini-cli 4
antigravity 4
windsurf 4
zencoder 4

Skill 文档

ClickHouse Query Optimization

Quick Start

Check your query plan:

EXPLAIN
SELECT user_id, COUNT()
FROM events
WHERE timestamp >= '2024-01-01'
GROUP BY user_id;

This shows which parts of the index are used, how many partitions are read, and the aggregation strategy.

When to Use

  • Write fast ClickHouse queries
  • Design table schemas
  • Analyze slow queries
  • Add data skipping indexes
  • Implement partitioning strategies
  • Use projections for multiple access patterns

Core Principles

1. Primary Key Design

The primary key defines sort order (not uniqueness). Order columns by low → high cardinality.

-- Good: country (low) → user_id → timestamp (high)
CREATE TABLE events (
    user_id UInt32,
    timestamp DateTime,
    country String
)
ENGINE = MergeTree()
ORDER BY (country, user_id, timestamp);

Key principle: Queries must filter on primary key prefix to use index.

-- ✅ Fast: Uses index (country first)
SELECT * FROM events WHERE country = 'US';

-- ❌ Slow: Skips index (missing country)
SELECT * FROM events WHERE user_id = 12345;

2. Data Skipping Indexes

For non-primary-key columns:

-- Numeric ranges
ALTER TABLE events ADD INDEX idx_duration session_duration TYPE minmax GRANULARITY 4;

-- Categorical (low cardinality)
ALTER TABLE events ADD INDEX idx_event_type event_type TYPE set(100) GRANULARITY 4;

-- String equality
ALTER TABLE events ADD INDEX idx_url url TYPE bloom_filter(0.01) GRANULARITY 4;

-- Substring search
ALTER TABLE logs ADD INDEX idx_message message TYPE ngrambf_v1(4, 512, 3, 0) GRANULARITY 1;

3. Partitioning for Lifecycle Management

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp);

-- Drop old data instantly
ALTER TABLE events DROP PARTITION '202401';

-- Or use TTL
ALTER TABLE events MODIFY TTL timestamp + INTERVAL 90 DAY;

4. Projections for Multiple Access Patterns

-- Main table sorted by user_id
CREATE TABLE events (
    user_id UInt32,
    product_id UInt32,
    timestamp DateTime
)
ENGINE = MergeTree()
ORDER BY (user_id, timestamp);

-- Add projection for product queries
ALTER TABLE events ADD PROJECTION proj_by_product (
    SELECT *
    ORDER BY (product_id, timestamp)
);

ALTER TABLE events MATERIALIZE PROJECTION proj_by_product;

-- Both queries now fast:
SELECT * FROM events WHERE user_id = 12345;    -- Uses main table
SELECT * FROM events WHERE product_id = 789;   -- Uses projection

5. Query Optimization

PREWHERE for Early Filtering:

SELECT user_id, event_type, properties
FROM events
PREWHERE timestamp >= '2024-01-01' AND country = 'US'  -- Small columns first
WHERE event_type IN ('purchase', 'signup');             -- Complex logic

Approximate Functions:

-- 10-100x faster, ~2% error
SELECT uniq(user_id) FROM events;                   -- vs COUNT(DISTINCT)
SELECT topK(10)(product_id) FROM events;            -- Approximate top-K
SELECT quantile(0.95)(response_time) FROM events;   -- Approximate percentile

Select Only Needed Columns:

-- Bad: Reads all columns
SELECT * FROM events WHERE user_id = 12345;

-- Good: Columnar advantage
SELECT user_id, timestamp, event_type FROM events WHERE user_id = 12345;

6. Profile and Debug

-- View execution plan
EXPLAIN SELECT COUNT() FROM events WHERE country = 'US';

-- Check performance
SELECT
    query,
    query_duration_ms,
    read_rows,
    read_bytes
FROM system.query_log
WHERE query LIKE '%events%'
ORDER BY event_time DESC
LIMIT 1;

Common Patterns

Technique Problem Solved Impact When to Use
Primary Key Design Index doesn’t cover queries Foundation Always (design first)
Data Skipping Indexes Non-primary filtering slow 10-100x After primary key
Partitioning Need to delete old data Instant deletion Time-series with retention
Projections Multiple query patterns 100-1000x Different sort orders
Query Syntax Large columns read unnecessarily 2-10x Per-query optimization
Profiling Don’t know why slow Insight When optimization unclear

Supporting Files

File Purpose
examples/examples.md Real-world optimization scenarios with metrics
references/reference.md Technical guides and decision trees

Requirements

  • ClickHouse 21.4+
  • Understanding of SQL and aggregation
  • Knowledge of query patterns

Integration Tips

  1. Design tables first (use EXPLAIN before/after)
  2. Monitor query_log (alert on > 100M rows read)
  3. Profile inserts (more indexes = slower writes)
  4. Test projections (use EXPLAIN to confirm optimizer choice)

See Also