principal-data-engineer

Install:
npx skills add https://github.com/rory-data/copilot --skill principal-data-engineer


Principal Data Engineer

Overview

This skill provides the strategic and technical depth expected of a Principal Data Engineer. It moves beyond “making it work” to “making it scale, endure, and deliver value.” Use this skill for architectural decisions, high-stakes code reviews, and establishing robust engineering patterns.

Core Capabilities

1. Data Platform Architecture

Focus on the “-ilities”: Scalability, Reliability, Maintainability, and Observability.

  • Design for Failure: Assume every component will fail; build in retries, dead-letter queues, and circuit breakers.
  • Idempotency: All pipelines must be re-runnable without side effects.
  • Decoupling: Separate compute from storage; separate orchestration (Airflow) from execution (Spark/Snowflake/dbt).
  • Cost Awareness: Design schemas and compute usage (e.g., partition strategies) to minimize cost at scale.
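The design-for-failure bullets above can be sketched in plain Python. This is a minimal, framework-free illustration of retries with exponential backoff falling through to a dead-letter sink; the `handler` and `dlq` callables and the retry parameters are illustrative assumptions, not any specific library's API:

```python
import time

def process_with_retries(record, handler, dlq, max_attempts=3, base_delay=0.1):
    """Run handler(record); retry with exponential backoff, then dead-letter.

    `handler` and `dlq` are illustrative stand-ins for a real queue/sink.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(record)
        except Exception as exc:
            if attempt == max_attempts:
                # Retries exhausted: park the record for inspection and replay
                # instead of failing the whole pipeline run.
                dlq.append({"record": record, "error": str(exc)})
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

In a production system the dead-letter sink would be a durable queue or table, and replaying it is only safe if the downstream load is idempotent, which is why the two bullets appear together.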

2. Pipeline Engineering Standards

Enforce strict standards for Airflow and Python code.

  • No Top-Level Code: Strictly adhere to Airflow best practices to prevent scheduler overload.
  • Idempotency: All DAG tasks must be re-runnable without side effects or data duplication.
  • Atomic Tasks: Each task should do one thing. If it fails, it should be clear what failed.
  • Functional Patterns: Prefer clear inputs and outputs over shared global state.
  • TaskFlow API: Use Airflow 2.0+ decorator syntax (@dag, @task) for clarity and type safety.
  • Testing:
    • Unit: Test transform logic in isolation.
    • Integration: Test DAG integrity and component connectivity with DagBag.
    • Data Quality: Validate data “in-flight” (pre-condition/post-condition checks).
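The unit-testing guidance above amounts to keeping transform logic in plain, decorator-free functions so it can be exercised without a scheduler or database. A minimal sketch, where `normalize_order` is a hypothetical transform (in a real DAG it would be wrapped with Airflow's `@task`):

```python
def normalize_order(raw: dict) -> dict:
    """Pure transform: explicit inputs and outputs, no shared global state."""
    return {
        "order_id": int(raw["order_id"]),
        "amount_cents": round(float(raw["amount"]) * 100),
        "currency": raw.get("currency", "USD").upper(),
    }

def test_normalize_order():
    # Runs anywhere Python runs -- no Airflow, no connections, no fixtures.
    out = normalize_order({"order_id": "42", "amount": "19.99", "currency": "usd"})
    assert out == {"order_id": 42, "amount_cents": 1999, "currency": "USD"}
```

Because the function is pure, the same code is reusable from a backfill script, a notebook, or a different orchestrator, which is the functional-patterns bullet in practice.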

Critical Anti-Patterns to Avoid:

  • ❌ Top-level code execution (runs on every scheduler loop)
  • ❌ Non-idempotent operations (append without deduplication)
  • ❌ Direct metadata database access (use Airflow’s public API)
  • ❌ Hardcoded credentials (use Airflow Connections and Variables)
  • ❌ Excessive dynamic DAG generation (use dynamic task mapping instead)

See airflow-best-practices.md for comprehensive patterns and examples.
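The "append without deduplication" anti-pattern has a standard idempotent alternative: delete-then-insert scoped to the run's partition, inside one transaction, so a retried or backfilled task cannot duplicate rows. A minimal sketch using stdlib `sqlite3` (the `events` table and `ds` partition column are illustrative names):

```python
import sqlite3

def load_partition(conn, ds, rows):
    """Idempotent load: replace the partition for execution date `ds`.

    Running the task N times leaves the table in the same state as running it once.
    """
    with conn:  # single transaction: delete + insert commit (or roll back) together
        conn.execute("DELETE FROM events WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO events (ds, user_id) VALUES (?, ?)",
            [(ds, r) for r in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ds TEXT, user_id INTEGER)")
load_partition(conn, "2024-01-01", [1, 2, 3])
load_partition(conn, "2024-01-01", [1, 2, 3])  # re-run after a retry: no duplicates
```

In a warehouse the same idea appears as `INSERT OVERWRITE`, `MERGE`, or dbt incremental models with a unique key; the delete-and-insert form is simply the most portable way to state it.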

3. Data Quality & Observability

Quality is not an afterthought; it is a pipeline dependency.

  • Data Contracts: Use datacontract-cli with ODCS to define explicit contracts between producers and consumers.
  • Data Quality Checks: Use Soda for declarative data quality validation integrated into pipelines.
  • SLA/SLO Monitoring: Alert not just on failure, but on lateness (missing SLAs).
  • Data Lineage: Ensure transformations are traceable from source to sink (OpenLineage, dbt docs).
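"In-flight" validation means a check sits between extract and load and fails the task before bad data propagates. Tools like Soda express this declaratively; as a minimal hand-rolled stand-in (the `min_rows` threshold and `required` column names are illustrative assumptions):

```python
def check_batch(rows, min_rows=1, required=("order_id", "amount_cents")):
    """Pre/post-condition check: raise early rather than load bad data downstream."""
    if len(rows) < min_rows:
        raise ValueError(f"expected >= {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        missing = [c for c in required if row.get(c) is None]
        if missing:
            raise ValueError(f"row {i} missing required fields: {missing}")
    return rows  # pass-through, so the check composes into a pipeline
```

Returning the rows unchanged lets the check be inserted between any two pipeline steps without altering the data flow; a declarative tool adds the same gate plus reporting and alerting.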

4. Composable Data Stack

Leverage the composable data stack—swap any component without rewriting the entire pipeline.

  • Ingestion: Prefer dlt for robust, schema-aware ELT. Use dlt init <source> <destination> to scaffold pipelines.
  • Processing: Default to DuckDB, Polars, and Apache Arrow for single-node processing (faster/cheaper than Spark for small/medium data).
  • Embedded OLAP: Use DuckDB for local development, testing, and file-based querying (S3/Parquet).
  • Portable Code: Use Ibis to decouple transformation logic from execution engines (run on DuckDB locally, Snowflake in prod).
  • Open Table Format: Apache Iceberg for lakehouse architectures—schema evolution, time-travel, partition evolution.
  • Transformation: dbt for SQL-first transformations with built-in testing and documentation.
  • Data Contracts: datacontract-cli with Open Data Contract Standard (ODCS) for producer/consumer agreements.
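The "portable code" idea above is to keep transformation logic engine-agnostic and bind the engine late. Ibis does this at the expression level, compiling per-dialect SQL; as a minimal stdlib stand-in, the same principle can be shown with the DB-API: the transform depends only on a connection object, demonstrated here on SQLite (table and column names are illustrative, and a real multi-engine setup would still need dialect handling, which is exactly what Ibis provides):

```python
import sqlite3

def daily_revenue(conn):
    """Engine-agnostic transform: the execution engine is an injection point,
    not a hardcoded dependency of the business logic."""
    return conn.execute(
        "SELECT ds, SUM(amount_cents) AS revenue "
        "FROM orders GROUP BY ds ORDER BY ds"
    ).fetchall()

# Local development: embedded SQLite. In prod, hand the same function a
# connection from another engine's driver.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (ds TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2024-01-01", 500), ("2024-01-01", 250), ("2024-01-02", 100)],
)
```

Swapping SQLite for DuckDB in development and a warehouse in production then changes only the connection construction, which is the composability property the stack above is built around.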

Usage Guidelines

When to use

  • Architectural Reviews: “Review this proposed architecture for the new streaming platform.”
  • Complex Debugging: “The scheduler is lagging, and tasks are getting stuck. Help diagnose.”
  • Standard Setting: “Create a template for a standardized ingestion pipeline.”

Key Questions to Ask

  • “Is this pipeline idempotent? What happens if I run it twice?”
  • “How do we backfill historical data with this design?”
  • “What is the recovery time objective (RTO) for this dataset?”

Resources

references/

scripts/