principal-data-engineer

Install:
npx skills add https://github.com/rory-data/copilot --skill principal-data-engineer


Principal Data Engineer

Overview

This skill provides the strategic and technical depth expected of a Principal Data Engineer. It moves beyond “making it work” to “making it scale, endure, and deliver value.” Use this skill for architectural decisions, high-stakes code reviews, and establishing robust engineering patterns.

Core Capabilities

1. Data Platform Architecture

Focus on the “-ilities”: Scalability, Reliability, Maintainability, and Observability.

  • Design for Failure: Assume every component will fail; build in retries, dead-letter queues, and circuit breakers.
  • Idempotency: All pipelines must be re-runnable without side effects.
  • Decoupling: Separate compute from storage; separate orchestration (Airflow) from execution (Spark/Snowflake/dbt).
  • Cost Awareness: Design schemas and compute usage (e.g., partition strategies) to minimize cost at scale.
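The design-for-failure bullets above can be sketched in plain Python. This is a minimal, framework-free illustration of retries with exponential backoff falling through to a dead-letter sink; the `handler` and `dlq` callables and the retry parameters are illustrative assumptions, not any specific library's API:

```python
import time

def process_with_retries(record, handler, dlq, max_attempts=3, base_delay=0.1):
    """Run handler(record); retry with exponential backoff, then dead-letter.

    `handler` and `dlq` are illustrative stand-ins for a real queue/sink.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(record)
        except Exception as exc:
            if attempt == max_attempts:
                # Retries exhausted: park the record for inspection and replay
                # instead of failing the whole pipeline run.
                dlq.append({"record": record, "error": str(exc)})
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

In a production system the dead-letter sink would be a durable queue or table, and replaying it is only safe if the downstream load is idempotent, which is why the two bullets appear together.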

2. Pipeline Engineering Standards

Enforce strict standards for Airflow and Python code.

  • No Top-Level Code: Strictly adhere to Airflow best practices to prevent scheduler overload.
  • Idempotency: All DAG tasks must be re-runnable without side effects or data duplication.
  • Atomic Tasks: Each task should do one thing. If it fails, it should be clear what failed.
  • Functional Patterns: Prefer clear inputs and outputs over shared global state.
  • TaskFlow API: Use Airflow 2.0+ decorator syntax (@dag, @task) for clarity and type safety.
  • Testing:
    • Unit: Test transform logic in isolation.
    • Integration: Test DAG integrity and component connectivity with DagBag.
    • Data Quality: Validate data “in-flight” (pre-condition/post-condition checks).
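The unit-testing guidance above amounts to keeping transform logic in plain, decorator-free functions so it can be exercised without a scheduler or database. A minimal sketch, where `normalize_order` is a hypothetical transform (in a real DAG it would be wrapped with Airflow's `@task`):

```python
def normalize_order(raw: dict) -> dict:
    """Pure transform: explicit inputs and outputs, no shared global state."""
    return {
        "order_id": int(raw["order_id"]),
        "amount_cents": round(float(raw["amount"]) * 100),
        "currency": raw.get("currency", "USD").upper(),
    }

def test_normalize_order():
    # Runs anywhere Python runs -- no Airflow, no connections, no fixtures.
    out = normalize_order({"order_id": "42", "amount": "19.99", "currency": "usd"})
    assert out == {"order_id": 42, "amount_cents": 1999, "currency": "USD"}
```

Because the function is pure, the same code is reusable from a backfill script, a notebook, or a different orchestrator, which is the functional-patterns bullet in practice.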

Critical Anti-Patterns to Avoid:

  • ❌ Top-level code execution (runs on every scheduler loop)
  • ❌ Non-idempotent operations (append without deduplication)
  • ❌ Direct metadata database access (use Airflow’s public API)
  • ❌ Hardcoded credentials (use Airflow Connections and Variables)
  • ❌ Excessive dynamic DAG generation (use dynamic task mapping instead)

See airflow-best-practices.md for comprehensive patterns and examples.
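The "append without deduplication" anti-pattern has a standard idempotent alternative: delete-then-insert scoped to the run's partition, inside one transaction, so a retried or backfilled task cannot duplicate rows. A minimal sketch using stdlib `sqlite3` (the `events` table and `ds` partition column are illustrative names):

```python
import sqlite3

def load_partition(conn, ds, rows):
    """Idempotent load: replace the partition for execution date `ds`.

    Running the task N times leaves the table in the same state as running it once.
    """
    with conn:  # single transaction: delete + insert commit (or roll back) together
        conn.execute("DELETE FROM events WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO events (ds, user_id) VALUES (?, ?)",
            [(ds, r) for r in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ds TEXT, user_id INTEGER)")
load_partition(conn, "2024-01-01", [1, 2, 3])
load_partition(conn, "2024-01-01", [1, 2, 3])  # re-run after a retry: no duplicates
```

In a warehouse the same idea appears as `INSERT OVERWRITE`, `MERGE`, or dbt incremental models with a unique key; the delete-and-insert form is simply the most portable way to state it.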

3. Data Quality & Observability

Quality is not an afterthought; it is a pipeline dependency.

  • Data Contracts: Use datacontract-cli with ODCS to define explicit contracts between producers and consumers.
  • Data Quality Checks: Use Soda for declarative data quality validation integrated into pipelines.
  • SLA/SLO Monitoring: Alert not just on failure, but on lateness (missing SLAs).
  • Data Lineage: Ensure transformations are traceable from source to sink (OpenLineage, dbt docs).
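"In-flight" validation means a check sits between extract and load and fails the task before bad data propagates. Tools like Soda express this declaratively; as a minimal hand-rolled stand-in (the `min_rows` threshold and `required` column names are illustrative assumptions):

```python
def check_batch(rows, min_rows=1, required=("order_id", "amount_cents")):
    """Pre/post-condition check: raise early rather than load bad data downstream."""
    if len(rows) < min_rows:
        raise ValueError(f"expected >= {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        missing = [c for c in required if row.get(c) is None]
        if missing:
            raise ValueError(f"row {i} missing required fields: {missing}")
    return rows  # pass-through, so the check composes into a pipeline
```

Returning the rows unchanged lets the check be inserted between any two pipeline steps without altering the data flow; a declarative tool adds the same gate plus reporting and alerting.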

4. Composable Data Stack

Leverage the composable data stack—swap any component without rewriting the entire pipeline.

  • Ingestion: Prefer dlt for robust, schema-aware ELT. Use dlt init <source> <destination> to scaffold pipelines.
  • Processing: Default to DuckDB, Polars, and Apache Arrow for single-node processing (faster/cheaper than Spark for small/medium data).
  • Embedded OLAP: Use DuckDB for local development, testing, and file-based querying (S3/Parquet).
  • Portable Code: Use Ibis to decouple transformation logic from execution engines (run on DuckDB locally, Snowflake in prod).
  • Open Table Format: Apache Iceberg for lakehouse architectures—schema evolution, time-travel, partition evolution.
  • Transformation: dbt for SQL-first transformations with built-in testing and documentation.
  • Data Contracts: datacontract-cli with Open Data Contract Standard (ODCS) for producer/consumer agreements.
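The "portable code" idea above is to keep transformation logic engine-agnostic and bind the engine late. Ibis does this at the expression level, compiling per-dialect SQL; as a minimal stdlib stand-in, the same principle can be shown with the DB-API: the transform depends only on a connection object, demonstrated here on SQLite (table and column names are illustrative, and a real multi-engine setup would still need dialect handling, which is exactly what Ibis provides):

```python
import sqlite3

def daily_revenue(conn):
    """Engine-agnostic transform: the execution engine is an injection point,
    not a hardcoded dependency of the business logic."""
    return conn.execute(
        "SELECT ds, SUM(amount_cents) AS revenue "
        "FROM orders GROUP BY ds ORDER BY ds"
    ).fetchall()

# Local development: embedded SQLite. In prod, hand the same function a
# connection from another engine's driver.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (ds TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2024-01-01", 500), ("2024-01-01", 250), ("2024-01-02", 100)],
)
```

Swapping SQLite for DuckDB in development and a warehouse in production then changes only the connection construction, which is the composability property the stack above is built around.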

Usage Guidelines

When to use

  • Architectural Reviews: “Review this proposed architecture for the new streaming platform.”
  • Complex Debugging: “The scheduler is lagging, and tasks are getting stuck. Help diagnose.”
  • Standard Setting: “Create a template for a standardized ingestion pipeline.”

Key Questions to Ask

  • “Is this pipeline idempotent? What happens if I run it twice?”
  • “How do we backfill historical data with this design?”
  • “What is the recovery time objective (RTO) for this dataset?”

Resources

references/

scripts/