starrocks-expert

📁 tiansuyu/agent-skills 📅 11 days ago

总安装量

周安装量

#50186

全站排名

安装命令

npx skills add https://github.com/tiansuyu/agent-skills --skill starrocks-expert

Agent 安装分布

opencode 4

github-copilot 4

codex 4

kimi-cli 4

gemini-cli 4

amp 4

Skill 文档

StarRocks Expert

StarRocks is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.

Quick Reference

Common Operations

Check table structure:

SHOW CREATE TABLE database_name.table_name;
DESC database_name.table_name;

View partitions:

SHOW PARTITIONS FROM database_name.table_name;

Check data loading status:

SHOW LOAD WHERE label = 'your_load_label';
SHOW STREAM LOAD;

Query execution analysis:

EXPLAIN VERBOSE SELECT ...;
SHOW QUERY PROFILE '<query_id>';

Key Concepts

Table Types:

Duplicate Key: No unique constraint, best for raw data logging
Aggregate Key: Pre-aggregates data during ingestion
Unique Key: Enforces uniqueness, supports upserts
Primary Key: Similar to Unique Key with better update performance

Data Loading Methods:

Stream Load: HTTP-based, real-time loading
Broker Load: Batch loading from external storage (S3, HDFS)
Routine Load: Continuous loading from Kafka
INSERT INTO: Direct SQL insertion

Materialized Views:

Asynchronous MV: Scheduled refresh, complex transformations
Synchronous Rollup: Auto-maintained aggregations

When to Use This Skill

Use this skill when:

Creating or modifying StarRocks table schemas
Setting up data loading pipelines from S3, HDFS, or Kafka
Optimizing slow queries or analyzing query plans
Configuring partitioning and bucketing strategies
Troubleshooting data loading failures
Setting up materialized views or rollups
Managing cluster resources and configurations

Core Sections

1. Table Design

For detailed table design guidance, see TABLE.md:

Table types (Duplicate, Aggregate, Unique, Primary Key)
Partitioning strategies with date_trunc() expressions
Bucketing and distribution
Indexes (prefix, bloom filter, bitmap, ngram)

2. Infrastructure Configuration and Management

For detailed infrastructure guidance, see INFRA.md:

Cluster setup and configuration (FE/BE/Broker)
Resource management and resource groups
Monitoring and maintenance
Backup and recovery

3. Query and Task Skills

For detailed query optimization guidance, see QUERY.md:

Data loading patterns (Stream Load, Broker Load, Routine Load)
Query optimization techniques
Materialized view design
ETL pipeline patterns
Performance troubleshooting
Common anti-patterns

Table Creation Pattern

CREATE TABLE IF NOT EXISTS database_name.table_name (
    event_time DATETIME NOT NULL COMMENT 'Event timestamp',
    id BIGINT NOT NULL COMMENT 'Business ID',
    amount DECIMAL(18, 2) COMMENT 'Transaction amount',
    status VARCHAR(50) COMMENT 'Status code',
    created_at DATETIME COMMENT 'Creation timestamp'
)
DUPLICATE KEY (event_time, id)
PARTITION BY date_trunc('day', event_time)  -- Expression-based partitioning
DISTRIBUTED BY HASH(id) BUCKETS 16
PROPERTIES (
    "replication_num" = "3",
    "storage_medium" = "SSD",
    "storage_cooldown_time" = "2024-12-31 23:59:59"
);

See TABLE.md for detailed guidance on table types, partitioning strategies, and index selection.

Data Loading Pattern

Stream Load (HTTP-based)

curl --location-trusted -u user:password \
    -H "label:load_label_$(date +%s)" \
    -H "column_separator:," \
    -H "columns:date_col,id,amount,status,created_at" \
    -T data.csv \
    http://fe_host:8030/api/database_name/table_name/_stream_load

Broker Load (S3)

LOAD LABEL database_name.load_label_20240101
(
    DATA INFILE("s3://bucket/path/data_*.parquet")
    INTO TABLE table_name
    FORMAT AS "parquet"
)
WITH BROKER
(
    "aws.s3.access_key" = "your_access_key",
    "aws.s3.secret_key" = "your_secret_key",
    "aws.s3.region" = "us-east-1"
)
PROPERTIES (
    "timeout" = "3600"
);

Query Optimization Checklist

When optimizing queries:

Check if partition pruning is working (verify in EXPLAIN)
Ensure appropriate indexes exist (prefix bloom filter, bitmap)
Verify JOIN order (small table on right for broadcast join)
Check if materialized views can be used
Analyze data distribution (avoid data skew)
Review bucket count (optimal: 2-8 per BE core)
Consider using colocate groups for frequent JOINs

Partition Management

Expression-based partitioning (Modern, Recommended)

Partitions are created automatically based on data:

-- Partitions auto-created on INSERT
CREATE TABLE events (
    event_time DATETIME,
    user_id BIGINT
)
DUPLICATE KEY (event_time)
PARTITION BY date_trunc('day', event_time)  -- Auto-creates daily partitions
DISTRIBUTED BY HASH(user_id) BUCKETS 16;

-- Hourly granularity for high-frequency data
PARTITION BY date_trunc('hour', event_time)

-- Monthly for long-term storage
PARTITION BY date_trunc('month', event_time)

Composite partitioning (Multi-tenant)

CREATE TABLE metrics (
    tenant_id INT,
    dt DATETIME,
    metric_name STRING,
    value DOUBLE
)
PRIMARY KEY (tenant_id, dt, metric_name)
PARTITION BY tenant_id, date_trunc('day', dt)
DISTRIBUTED BY HASH(tenant_id) BUCKETS 16;

Drop old partitions

-- Show partitions first
SHOW PARTITIONS FROM table_name;

-- Drop specific partition by name
ALTER TABLE table_name DROP PARTITION p20230101_10001;

Materialized View Pattern

CREATE MATERIALIZED VIEW mv_daily_summary
BUILD IMMEDIATE
REFRESH ASYNC START('2024-01-01 00:00:00') EVERY(INTERVAL 1 DAY)
AS
SELECT
    date_col,
    status,
    COUNT(*) as total_count,
    SUM(amount) as total_amount,
    AVG(amount) as avg_amount
FROM database_name.table_name
GROUP BY date_col, status;

Troubleshooting Quick Guide

Data loading fails

Check error message: SHOW LOAD WHERE label = 'your_label'
Verify data format matches column definitions
Check broker connectivity (for Broker Load)
Verify resource availability: SHOW PROC '/backends'

Query is slow

Run EXPLAIN VERBOSE to check execution plan
Verify partition pruning is active
Check for data skew in distribution
Review JOIN strategy (broadcast vs shuffle)
Consider creating materialized views

Compaction issues

-- Check compaction status
SHOW PROC '/backends';
SHOW TABLET STATUS FROM table_name;

-- Trigger manual compaction if needed
ALTER TABLE table_name COMPACT;

Best Practices

Table Design

Choose appropriate table type based on use case
Use date_trunc() expressions for time-based partitioning
Set replication_num based on cluster size (typically 3)
Bucket count = 2-8 Ã (number of BE cores)
Keep total partition count under 100k (watch composite partitions)

Query Performance

Always filter on partition key when possible
Use prefix bloom filter for high-cardinality string columns
Avoid SELECT * in production queries
Use appropriate JOIN types (prefer INNER JOIN when possible)

Data Loading

Use Stream Load for real-time, small batches (<10GB)
Use Broker Load for large batch loads from object storage
Set appropriate timeout based on data volume
Monitor load job status and handle failures

Maintenance

Enable dynamic partitioning for time-series tables
Regularly drop old partitions to manage storage
Monitor tablet distribution across BEs
Set up alerts for load failures and query timeouts

Common Anti-Patterns to Avoid

â Don’t use VARCHAR for time columns that will be partitioned â Don’t apply functions to partition columns in WHERE (breaks pruning) â Don’t create too many buckets (causes small files) â Don’t create too few buckets (causes data skew) â Don’t skip partition pruning predicates â Don’t use UNIQUE KEY if you only need deduplication (use DUPLICATE + GROUP BY) â Don’t load data without labels (prevents duplicate detection) â Don’t create composite partitions that exceed 100k total partitions

Additional Resources

Infrastructure details: See INFRA.md for cluster setup, table design patterns, and resource management
Query optimization: See QUERY.md for advanced query tuning, loading patterns, and ETL workflows

Key Configuration Properties

Table Properties

"replication_num" = "3"                      # Number of replicas
"storage_medium" = "SSD"                     # Storage type (SSD/HDD)
"enable_persistent_index" = "true"           # For primary key tables
"bloom_filter_columns" = "id,email"          # Bloom filter index
"colocate_with" = "group_name"              # Colocate tables

Session Variables

SET enable_pipeline_engine = true;           # Enable vectorized execution
SET pipeline_dop = 0;                        # Parallelism (0 = auto)
SET query_timeout = 300;                     # Query timeout in seconds
SET enable_spill = true;                     # Enable disk spill for large joins

Getting Started Workflow

When working with a new StarRocks project:

Understand the use case
- OLAP analytics vs real-time reporting
- Query patterns (aggregations, joins, filters)
- Data volume and retention requirements
Design table schema
- Choose appropriate key type (Duplicate/Aggregate/Unique/Primary)
- Define partition strategy (range by date)
- Calculate bucket count based on data volume
- Read INFRA.md for detailed guidance
Set up data pipeline
- Select loading method (Stream/Broker/Routine Load)
- Configure error handling and retry logic
- Read QUERY.md for loading patterns
Optimize queries
- Analyze query patterns
- Create materialized views for common aggregations
- Tune partition and bucket strategies
- Read QUERY.md for optimization techniques
Monitor and maintain
- Set up monitoring for load jobs and query performance
- Configure automatic partition management
- Plan for data retention and archival

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台