data-engineering
npx skills add https://github.com/legout/data-agent-skills --skill data-engineering
Data Engineering Hub
Welcome to the comprehensive data engineering skill suite. This hub organizes all data engineering knowledge into logical, non-overlapping domains.
Skill Map
| Domain | Skills | When to Use |
|---|---|---|
| Core | @data-engineering-core | Polars, DuckDB, PyArrow fundamentals; ETL patterns; error handling; performance optimization |
| Storage | @data-engineering-storage-lakehouse | Delta Lake, Apache Iceberg, Apache Hudi |
| | @data-engineering-storage-remote-access | fsspec, pyarrow.fs, obstore; cloud access patterns |
| | @data-engineering-storage-authentication | AWS, GCP, Azure auth – IAM roles, managed identity, secrets management |
| | @data-engineering-storage-formats | Parquet optimizations, Lance, Zarr, Avro, ORC |
| Orchestration | @data-engineering-orchestration | Prefect, Dagster, dbt, workflow scheduling |
| Streaming | @data-engineering-streaming | Kafka, MQTT, NATS JetStream for real-time data |
| Quality | @data-engineering-quality | Great Expectations, Pandera for data validation |
| Observability | @data-engineering-observability | OpenTelemetry, Prometheus for pipeline monitoring |
| AI/ML | @data-engineering-ai-ml | Embeddings, vector databases, RAG pipelines |
| Best Practices | @data-engineering-best-practices | Medallion architecture, partitioning, file sizing, incremental loads, schema evolution, testing |
| Catalogs | @data-engineering-catalogs | Data catalog systems: Iceberg catalogs, DuckDB multi-source, Amundsen/DataHub/OpenMetadata |
Quick Reference: Core Stack
| Task | Recommended Tool |
|---|---|
| DataFrame operations | Polars (often 10-50x faster than pandas, depending on workload) |
| SQL analytics | DuckDB (embedded OLAP, zero-copy Arrow integration) |
| Data interchange | PyArrow (Arrow format, zero-copy transfers) |
| Cloud storage access | fsspec (universal), pyarrow.fs (Arrow-native), obstore (high-performance) |
| Lakehouse format | Delta Lake (Spark ecosystem), Iceberg (engine-agnostic), Hudi (streaming CDC) |
| Orchestration | Prefect (Pythonic flows), Dagster (asset-based), dbt (SQL transformations) |
| Validation | Pandera (lightweight), Great Expectations (enterprise) |
Getting Started
New to Data Engineering?
Start with @data-engineering-core to learn the foundational libraries and patterns.
Working with Cloud Storage?
Go to @data-engineering-storage-remote-access for fsspec, pyarrow.fs, and obstore.
Building Data Lakes?
Explore @data-engineering-storage-lakehouse for ACID table formats.
Choosing a Data Catalog?
Check @data-engineering-catalogs for Iceberg catalogs, DuckDB multi-source patterns, and tool comparisons.
Production-Grade Pipelines?
Read @data-engineering-best-practices for medallion architecture, partitioning, schema evolution, and testing strategies.
Orchestrating Pipelines?
Check @data-engineering-orchestration for Prefect, Dagster, and dbt.
Production Monitoring?
See @data-engineering-observability for tracing and metrics.
AI/ML Data Pipelines?
Visit @data-engineering-ai-ml for embeddings, vector databases, and RAG.
Principles
- Lazy evaluation: Use Polars lazy frames and DuckDB query planning for performance
- Zero-copy data transfer: Leverage Arrow format for memory efficiency
- Pushdown optimization: Filter at storage layer to minimize data transfer
- Type safety: Use explicit schemas and type hints
- Resilience: Implement retries, circuit breakers, and proper error handling
- Observability: Instrument pipelines with traces and metrics
- Security: Never hardcode credentials; use IAM roles and environment variables
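The resilience and security principles above can be sketched with the standard library alone. This is a minimal illustration, not the implementation used by any skill in this suite: `TransientError`, `retry`, `read_secret`, and the variable name `STORAGE_ACCESS_KEY` are hypothetical names introduced for the example, and a real pipeline would more likely rely on an orchestrator's built-in retry support and on IAM roles.

```python
import os
import random
import time
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def retry(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying TransientError with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError("unreachable")


def read_secret(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Read a credential from the environment instead of hardcoding it."""
    source = os.environ if env is None else env
    value = source.get(name)
    if not value:
        raise KeyError(f"required environment variable {name!r} is not set")
    return value
```

A caller would wrap a flaky step as `retry(lambda: fetch_batch(read_secret("STORAGE_ACCESS_KEY")))`, where `fetch_batch` stands in for whatever I/O the pipeline performs. Only errors explicitly marked as transient are retried, so genuine bugs fail fast rather than being masked by repeated attempts.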
Migration from Legacy Skills
This restructured suite replaces the previous split organization (data-engineering-* and remote-filesystems-*). All content has been consolidated to eliminate duplication and clarify ownership.
Legacy skill replacements:
- data-engineering-core → @data-engineering-core (plus specific integrations)
- data-engineering-lakehouse → @data-engineering-storage-lakehouse
- data-engineering-orchestration → @data-engineering-orchestration
- data-engineering-streaming → @data-engineering-streaming
- data-engineering-quality → @data-engineering-quality
- data-engineering-observability → @data-engineering-observability
- data-engineering-llm-pipelines → @data-engineering-ai-ml
- remote-filesystems-* → @data-engineering-storage-remote-access and integrations
All legacy skills remain functional but are deprecated. New content should be added to the new structure only.