data engineering

📁 majiayu000/claude-skill-registry 📅 Jan 1, 1970

Site rank: #76758

Install command

npx skills add https://github.com/majiayu000/claude-skill-registry --skill "Data Engineering"

Skill Documentation

Data Engineering Skill

Quick Reference

Role	Focus	Timeline	Entry From
Data Engineer	Pipelines, Infra	12-24 mo	Backend Dev
ML Engineer	Models, Features	12-24 mo	Data Scientist
AI Engineer	LLMs, Agents	6-12 mo	Any Developer

Learning Paths

Data Engineer

[1] SQL Mastery (4-6 wk)
 │  └─ Window functions, CTEs, optimization
 │
 ▼
[2] Python for Data (4-6 wk)
 │  └─ Pandas, file formats, scripting
 │
 ▼
[3] ETL/ELT Pipelines (6-8 wk)
 │  └─ Extract, transform, load patterns
 │
 ▼
[4] Big Data: Spark (8-12 wk)
 │  └─ PySpark, DataFrames, partitioning
 │
 ▼
[5] Data Warehouse (4-6 wk)
 │  └─ Star schema, dbt, Snowflake/BQ
 │
 ▼
[6] Orchestration (4-6 wk)
    └─ Airflow/Prefect, scheduling, monitoring

2025 Stack: Python + Spark + Airflow + dbt + Snowflake/BigQuery
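As a small illustration of steps [1] to [3], the sketch below loads a few rows into an in-memory SQLite database and transforms them with a CTE and a window function. The table, the sample rows, and the function name `top_order_per_customer` are hypothetical, and SQLite stands in for a real warehouse; it assumes a Python build whose bundled SQLite supports window functions (3.25 or later).

```python
import sqlite3

# Hypothetical sample data: (order_id, customer, amount)
ORDERS = [
    (1, "alice", 30.0),
    (2, "alice", 50.0),
    (3, "bob", 20.0),
    (4, "bob", 70.0),
    (5, "bob", 10.0),
]

QUERY = """
WITH ranked AS (                      -- CTE: name an intermediate result
    SELECT
        customer,
        amount,
        ROW_NUMBER() OVER (           -- window function: rank rows per customer
            PARTITION BY customer
            ORDER BY amount DESC
        ) AS rn
    FROM orders
)
SELECT customer, amount
FROM ranked
WHERE rn = 1
ORDER BY customer;
"""


def top_order_per_customer(rows):
    """Load rows into SQLite, then return each customer's largest order."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(top_order_per_customer(ORDERS))
```

The same query shape (rank inside a partition, then filter on the rank) carries over to Snowflake or BigQuery with little change, which is one reason window functions come first in the path.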

ML Engineer

[1] Python + NumPy (4-6 wk)
 │
 ▼
[2] Math Foundations (6-8 wk)
 │  └─ Linear algebra, calculus, statistics
 │
 ▼
[3] Classical ML (8-12 wk)
 │  └─ scikit-learn, XGBoost, evaluation
 │
 ▼
[4] Deep Learning (8-12 wk)
 │  └─ PyTorch, CNNs, Transformers
 │
 ▼
[5] MLOps (6-8 wk)
    └─ MLflow, model serving, monitoring

2025 Stack: Python + PyTorch + scikit-learn + MLflow + W&B

AI Engineer (2025 Hot Path)

[1] LLM Fundamentals (2-3 wk)
 │  └─ Tokens, embeddings, context windows
 │
 ▼
[2] Prompt Engineering (2-3 wk)
 │  └─ Few-shot, CoT, structured output
 │
 ▼
[3] RAG Systems (3-4 wk)
 │  └─ Embeddings, vector DBs, retrieval
 │
 ▼
[4] AI Agents (4-6 wk)
 │  └─ Tool calling, agent loops, memory
 │
 ▼
[5] Production Deploy (ongoing)
    └─ Evaluation, guardrails, monitoring

2025 Stack: Python + LangChain/LlamaIndex + OpenAI/Anthropic + ChromaDB
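To make step [3] concrete, here is a minimal sketch of the retrieval half of a RAG system using only the standard library. The bag-of-words `embed` function is a deliberately crude stand-in for a real embedding model, the in-memory list stands in for a vector database such as ChromaDB, and every name here (`embed`, `cosine`, `retrieve`, `build_prompt`, `DOCS`) is an assumption made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Put the retrieved context in front of the question for the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical mini corpus
DOCS = [
    "Airflow schedules and monitors batch pipelines.",
    "ChromaDB stores embeddings for vector search.",
    "dbt runs SQL transformations in the warehouse.",
]

if __name__ == "__main__":
    print(build_prompt("Which tool stores embeddings for vector search?", DOCS, k=1))
```

In a real system the `embed` call would go to an embedding model and the sort would be replaced by a vector-store query, but the flow (embed, rank, assemble prompt) stays the same.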

2025 Tool Matrix

Data Processing

Tool	Scale	Use Case
Pandas	<10GB	Prototyping, small data
Polars	<100GB	Fast local processing
Spark	>100GB	Distributed processing
dbt	Any	Transformations, testing

ML Frameworks

Framework	Best For	Complexity
scikit-learn	Classical ML	Low
XGBoost	Tabular data	Low
PyTorch	Research, flexibility	Medium
TensorFlow	Production, mobile	Medium

LLM/AI Tools

Tool	Use Case
LangChain	LLM orchestration
LlamaIndex	RAG systems
Claude/OpenAI	LLM APIs
ChromaDB	Vector storage

Algorithm Reference

Classical ML

Type	Algorithms
Regression	Linear, Ridge, Lasso, ElasticNet
Classification	Logistic, SVM, Decision Tree
Ensemble	Random Forest, XGBoost, LightGBM
Clustering	K-Means, DBSCAN, Hierarchical

Deep Learning

Architecture	Use Case
CNN	Images, vision
RNN/LSTM	Sequences
Transformer	NLP, LLMs
Diffusion	Image generation

AI Agent Architecture (2025)

┌──────────────────────────────────────────┐
│            AGENTIC LOOP                  │
├──────────────────────────────────────────┤
│  PERCEIVE → REASON → ACT → REFLECT       │
│      │         │       │       │         │
│      │         │       │       └─► Loop  │
│      │         │       └─► Execute tools │
│      │         └─► LLM decides action    │
│      └─► Gather context, observations    │
└──────────────────────────────────────────┘

Design Patterns (Anthropic 2025):
• Prompt Chaining - Sequential fixed steps
• Routing - Classify and dispatch
• Parallelization - Concurrent subtasks
• Orchestrator-Workers - Central delegation
• Evaluator-Optimizer - Generate + critique
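The perceive, reason, act, reflect cycle described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not any framework's API: `fake_llm` is a hard-coded stand-in for a real model call, `calculator` is a toy tool, and `run_agent`, `TOOLS`, and `max_iterations` are hypothetical names.

```python
def fake_llm(observations):
    """Stand-in for a real LLM call: pick the next action from what was observed."""
    if not any(o.startswith("calculator:") for o in observations):
        return {"tool": "calculator", "input": "6 * 7"}
    # A tool result is available, so finish with it as the answer.
    return {"tool": "finish", "input": observations[-1].split(":", 1)[1].strip()}


def calculator(expression):
    """Toy tool: multiply two integers written as 'a * b'."""
    left, right = expression.split("*")
    return str(int(left) * int(right))


TOOLS = {"calculator": calculator}


def run_agent(task, decide=fake_llm, max_iterations=5):
    """Run the loop until the model finishes or the iteration cap is hit."""
    observations = [f"task: {task}"]                  # PERCEIVE: gather context
    for _ in range(max_iterations):                   # hard cap is the exit condition
        action = decide(observations)                 # REASON: model decides action
        if action["tool"] == "finish":
            return action["input"]
        result = TOOLS[action["tool"]](action["input"])      # ACT: execute the tool
        observations.append(f"{action['tool']}: {result}")   # REFLECT: record, loop
    return None  # cap reached without a final answer


if __name__ == "__main__":
    print(run_agent("What is 6 times 7?"))
```

The `max_iterations` cap matters more than it looks: without it, a model that never emits a finishing action would keep the loop running indefinitely.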

Troubleshooting

Which path to choose?
├─► Love building infrastructure? → Data Engineer
├─► Love algorithms/math? → ML Engineer
├─► Want fastest AI entry? → AI Engineer
└─► Uncertain? → Start with Python + SQL

Model not performing well?
├─► Data quality issues? → Clean data first
├─► Feature engineering? → Create better features
├─► Wrong algorithm? → Try different models
├─► Overfitting? → More data, regularization
└─► Hyperparameters? → Grid/random search

LLM giving bad answers?
├─► Prompt too vague? → Be more specific
├─► Missing context? → Add relevant info
├─► Hallucinating? → Use RAG, verify facts
└─► Wrong tool? → Improve tool descriptions

Common Failure Modes

Symptom	Root Cause	Recovery
Model fails in prod	Data drift	Monitor distributions
Pipeline always late	Unoptimized queries	Profile, partition
RAG finds wrong docs	Bad chunking	Tune chunk size, overlap
Agent loops forever	No exit condition	Add max iterations
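For the "Bad chunking" row, the two knobs named in the table (chunk size and overlap) can be shown with a small word-based splitter. This is a sketch only: `chunk_words` is a hypothetical helper, it splits on whitespace rather than tokens, and the default values are placeholders rather than recommendations.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of chunk_size words, sharing `overlap` words between neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_words("a b c d e f g", chunk_size=4, overlap=2):
        print(chunk)
```

Larger overlap reduces the chance that a sentence relevant to a query is cut across two chunks, at the cost of storing and embedding more text.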

Portfolio Projects

Data Engineering

ETL Pipeline (Airflow + dbt)
Real-time Streaming (Kafka + Spark)
Data Warehouse Design

ML Engineering

Classification Model (scikit-learn)
Deep Learning Model (PyTorch)
ML Pipeline (MLflow)

AI Engineering

RAG Chatbot (LangChain + ChromaDB)
AI Agent with Tools
Multi-Agent System

Next Actions

Specify your target role for a detailed learning plan.
