data-lake-platform
Install Command
npx skills add https://github.com/vasilyu1983/ai-agents-public --skill data-lake-platform
Skill Documentation
Data Lake Platform
Build and operate production data lakes and lakehouses: ingest, transform, store in open formats, and serve analytics reliably.
When to Use
- Design data lake/lakehouse architecture
- Set up ingestion pipelines (batch, incremental, CDC)
- Build SQL transformation layers (SQLMesh, dbt)
- Choose table formats and catalogs (Iceberg, Delta, Hudi)
- Deploy query/serving engines (Trino, ClickHouse, DuckDB)
- Implement streaming pipelines (Kafka, Flink)
- Set up orchestration (Dagster, Airflow, Prefect)
- Add governance, lineage, data quality, and cost controls
Triage Questions
- Batch, streaming, or hybrid? What is the freshness SLO?
- Append-only vs upserts/deletes (CDC)? Is time travel required?
- Primary query pattern: BI dashboards (high concurrency), ad-hoc joins, embedded analytics?
- PII/compliance: row/column-level access, retention, audit logging?
- Platform constraints: self-hosted vs cloud, preferred engines, team strengths?
Default Baseline (Good Starting Point)
- Storage: object storage + open table format (usually Iceberg)
- Catalog: REST/Hive/Glue/Nessie/Unity (match your platform)
- Transforms: SQLMesh or dbt (pick one and standardize)
- Lake query: Trino (or Spark for heavy compute/ML workloads)
- Serving (optional): ClickHouse/StarRocks/Doris for low-latency BI
- Governance: DataHub/OpenMetadata + OpenLineage
- Orchestration: Dagster/Airflow/Prefect
Workflow
- Pick table format + catalog: references/storage-formats.md (use assets/cross-platform/template-schema-evolution.md and assets/cross-platform/template-partitioning-strategy.md)
- Design ingestion (batch/incremental/CDC): references/ingestion-patterns.md (use assets/cross-platform/template-ingestion-governance-checklist.md and assets/cross-platform/template-incremental-loading.md)
- Design transformations (bronze/silver/gold or data products): references/transformation-patterns.md (use assets/cross-platform/template-data-pipeline.md)
- Choose lake query vs serving engines: references/query-engine-patterns.md
- Add governance, lineage, and quality gates: references/governance-catalog.md (use assets/cross-platform/template-data-quality-governance.md and assets/cross-platform/template-data-quality.md)
- Plan operations + cost controls: references/operational-playbook.md and references/cost-optimization.md (use assets/cross-platform/template-data-quality-backfill-runbook.md and assets/cross-platform/template-cost-optimization.md)
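The incremental/CDC ingestion step above usually reduces to two mechanics: load only rows past a stored watermark, and write them as key-based upserts so a re-run or backfill is a no-op. A minimal sketch in plain Python (all names hypothetical, standing in for whatever the templates specify):

```python
from datetime import datetime, timezone

def incremental_load(source_rows, state, target):
    """Load rows newer than the stored watermark, upserting by key.

    Replaying the same inputs loads nothing new, so retries and
    backfills are safe (idempotent).
    """
    watermark = state.get("watermark", datetime.min.replace(tzinfo=timezone.utc))
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    for row in new_rows:
        target[row["id"]] = row  # upsert: last write wins per key
    if new_rows:
        state["watermark"] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)
```

Running the same batch twice loads it once; only rows with a newer `updated_at` move on the second pass.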
Architecture Patterns
- Medallion (bronze/silver/gold): references/architecture-patterns.md
- Data mesh (domain-owned data products): references/architecture-patterns.md
- Streaming-first (Kappa): references/streaming-patterns.md
- Diagrams/mermaid snippets: references/overview.md
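As a toy illustration of the medallion flow referenced above — raw landing, schema enforcement, then a business aggregate — here is a pure-Python sketch (field names are made up; in practice each tier is a table in your lake format, not a function):

```python
import json

def bronze(raw_lines):
    # Bronze: land raw events as-is, parsed but unvalidated.
    return [json.loads(line) for line in raw_lines]

def silver(bronze_rows):
    # Silver: enforce schema, drop malformed rows, normalize types.
    out = []
    for r in bronze_rows:
        if "user_id" in r and "amount" in r:
            out.append({"user_id": str(r["user_id"]), "amount": float(r["amount"])})
    return out

def gold(silver_rows):
    # Gold: business-level aggregate, ready for BI dashboards.
    totals = {}
    for r in silver_rows:
        totals[r["user_id"]] = totals.get(r["user_id"], 0.0) + r["amount"]
    return totals
```

The key property: each tier is derived from the one below, so any tier can be rebuilt from bronze after a logic change.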
Quick Start
dlt + ClickHouse

```shell
pip install "dlt[clickhouse]"
dlt init rest_api clickhouse
python pipeline.py
```

SQLMesh + DuckDB

```shell
pip install sqlmesh
sqlmesh init duckdb
sqlmesh plan && sqlmesh run
```
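All of the engines above speak SQL, and CDC loads ultimately rest on a merge/upsert that can be replayed safely. As an engine-agnostic sketch of that pattern, using stdlib sqlite3 (needs SQLite ≥ 3.24 for `ON CONFLICT`) in place of a real lakehouse engine — table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")

def merge_batch(conn, batch):
    # MERGE-style upsert: insert new keys, update existing ones.
    # Replaying the same batch leaves the table unchanged (idempotent).
    conn.executemany(
        """INSERT INTO orders (order_id, status) VALUES (?, ?)
           ON CONFLICT(order_id) DO UPDATE SET status = excluded.status""",
        batch,
    )

merge_batch(conn, [(1, "placed"), (2, "placed")])
merge_batch(conn, [(1, "shipped")])  # CDC update, replayable safely
rows = dict(conn.execute("SELECT order_id, status FROM orders"))
```

In Iceberg/Delta terms this is `MERGE INTO`; the pipeline-level requirement is the same: re-delivering a batch must not duplicate or corrupt rows.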
Reliability and Safety
Do
- Define data contracts and owners up front
- Add quality gates (freshness, volume, schema, distribution) per tier
- Make every pipeline idempotent and re-runnable (backfills are normal)
- Treat access control and audit logging as first-class requirements
Avoid
- Skipping validation to “move fast”
- Storing PII without access controls
- Pipelines that can’t be re-run safely
- Manual schema changes without version control
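Per-tier quality gates can start as simple pre-publish assertions; a minimal sketch covering the volume and schema checks (thresholds and field names are illustrative — freshness and distribution gates follow the same shape):

```python
def check_batch(rows, expected_schema, min_rows=1):
    """Return a list of gate failures; an empty list means the batch may publish."""
    failures = []
    if len(rows) < min_rows:  # volume gate
        failures.append(f"volume: {len(rows)} < {min_rows}")
    for i, row in enumerate(rows):
        for field, typ in expected_schema.items():  # schema gate
            if not isinstance(row.get(field), typ):
                failures.append(f"schema: row {i} field {field!r}")
                break
    return failures
```

Wiring this into the orchestrator as a blocking step (fail the run, don't publish the tier) is what turns a check into a gate.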
Resources
| Resource | Purpose |
|---|---|
| references/overview.md | Diagrams and decision flows |
| references/architecture-patterns.md | Medallion, data mesh |
| references/ingestion-patterns.md | dlt vs Airbyte, CDC |
| references/transformation-patterns.md | SQLMesh vs dbt |
| references/storage-formats.md | Iceberg vs Delta |
| references/query-engine-patterns.md | ClickHouse, DuckDB |
| references/streaming-patterns.md | Kafka, Flink |
| references/orchestration-patterns.md | Dagster, Airflow |
| references/bi-visualization-patterns.md | Metabase, Superset |
| references/cost-optimization.md | Cost levers and maintenance |
| references/operational-playbook.md | Monitoring and incident response |
| references/governance-catalog.md | Catalog, lineage, access control |
Templates
| Template | Purpose |
|---|---|
| assets/cross-platform/template-medallion-architecture.md | Baseline bronze/silver/gold plan |
| assets/cross-platform/template-data-pipeline.md | End-to-end pipeline skeleton |
| assets/cross-platform/template-ingestion-governance-checklist.md | Source onboarding checklist |
| assets/cross-platform/template-incremental-loading.md | Incremental + backfill plan |
| assets/cross-platform/template-schema-evolution.md | Schema change rules |
| assets/cross-platform/template-cost-optimization.md | Cost control checklist |
| assets/cross-platform/template-data-quality-governance.md | Quality contracts + SLOs |
| assets/cross-platform/template-data-quality-backfill-runbook.md | Backfill incident/runbook |
Related Skills
| Skill | Purpose |
|---|---|
| ai-mlops | ML deployment |
| ai-ml-data-science | Feature engineering |
| data-sql-optimization | OLTP optimization |