data engineering
Install command
npx skills add https://github.com/majiayu000/claude-skill-registry --skill Data Engineering
Skill documentation
Data Engineering Skill
Quick Reference
| Role | Focus | Timeline | Entry From |
|---|---|---|---|
| Data Engineer | Pipelines, Infra | 12-24 mo | Backend Dev |
| ML Engineer | Models, Features | 12-24 mo | Data Scientist |
| AI Engineer | LLMs, Agents | 6-12 mo | Any Developer |
Learning Paths
Data Engineer
[1] SQL Mastery (4-6 wk)
 │  └─ Window functions, CTEs, optimization
 │
 ▼
[2] Python for Data (4-6 wk)
 │  └─ Pandas, file formats, scripting
 │
 ▼
[3] ETL/ELT Pipelines (6-8 wk)
 │  └─ Extract, transform, load patterns
 │
 ▼
[4] Big Data: Spark (8-12 wk)
 │  └─ PySpark, DataFrames, partitioning
 │
 ▼
[5] Data Warehouse (4-6 wk)
 │  └─ Star schema, dbt, Snowflake/BQ
 │
 ▼
[6] Orchestration (4-6 wk)
    └─ Airflow/Prefect, scheduling, monitoring
2025 Stack: Python + Spark + Airflow + dbt + Snowflake/BigQuery
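Steps [1] and [3] of this path can be tried without any infrastructure. The following is a minimal sketch, not a production pipeline: it uses Python's standard `sqlite3` module as a stand-in for a warehouse, and the CSV content, table name, and function names are all hypothetical. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions became available.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read a file or call an API.
RAW_CSV = """order_id,customer,amount
1,alice,30.0
2,bob,20.0
3,alice,50.0
4,bob,
5,carol,10.0
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and cast column types."""
    clean = []
    for row in rows:
        if row["amount"] == "":
            continue  # skip incomplete records
        clean.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return clean


def load(conn, records):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


# A CTE feeding a window function: running total per customer by order_id.
RUNNING_TOTAL_SQL = """
WITH valid_orders AS (
    SELECT order_id, customer, amount FROM orders WHERE amount > 0
)
SELECT customer,
       order_id,
       SUM(amount) OVER (PARTITION BY customer ORDER BY order_id) AS running_total
FROM valid_orders
ORDER BY customer, order_id
"""


def run_pipeline():
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(RAW_CSV)))
    result = conn.execute(RUNNING_TOTAL_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

Keeping extract, transform, and load as separate functions mirrors how an orchestrator such as Airflow or Prefect would schedule them as individual tasks.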
ML Engineer
[1] Python + NumPy (4-6 wk)
 │
 ▼
[2] Math Foundations (6-8 wk)
 │  └─ Linear algebra, calculus, statistics
 │
 ▼
[3] Classical ML (8-12 wk)
 │  └─ scikit-learn, XGBoost, evaluation
 │
 ▼
[4] Deep Learning (8-12 wk)
 │  └─ PyTorch, CNNs, Transformers
 │
 ▼
[5] MLOps (6-8 wk)
    └─ MLflow, model serving, monitoring
2025 Stack: Python + PyTorch + scikit-learn + MLflow + W&B
AI Engineer (2025 Hot Path)
[1] LLM Fundamentals (2-3 wk)
 │  └─ Tokens, embeddings, context windows
 │
 ▼
[2] Prompt Engineering (2-3 wk)
 │  └─ Few-shot, CoT, structured output
 │
 ▼
[3] RAG Systems (3-4 wk)
 │  └─ Embeddings, vector DBs, retrieval
 │
 ▼
[4] AI Agents (4-6 wk)
 │  └─ Tool calling, agent loops, memory
 │
 ▼
[5] Production Deploy (ongoing)
    └─ Evaluation, guardrails, monitoring
2025 Stack: Python + LangChain/LlamaIndex + OpenAI/Anthropic + ChromaDB
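The retrieval half of step [3] can be illustrated with nothing but the standard library. This is a toy sketch under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the in-memory list stands in for a vector database such as ChromaDB, and all function names and documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Assemble a grounded prompt from the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical mini corpus.
DOCS = [
    "Airflow schedules and monitors data pipelines.",
    "dbt runs SQL transformations inside the warehouse.",
    "ChromaDB stores embeddings for vector search.",
]

if __name__ == "__main__":
    print(build_prompt("Which tool stores embeddings?", DOCS, k=1))
```

Swapping `embed` for a real model and the list for a vector store changes the quality of the ranking, not the shape of the flow: embed, score, take the top k, then build the prompt.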
2025 Tool Matrix
Data Processing
| Tool | Scale | Use Case |
|---|---|---|
| Pandas | <10GB | Prototyping, small data |
| Polars | <100GB | Fast local processing |
| Spark | >100GB | Distributed processing |
| dbt | Any | Transformations, testing |
ML Frameworks
| Framework | Best For | Complexity |
|---|---|---|
| scikit-learn | Classical ML | Low |
| XGBoost | Tabular data | Low |
| PyTorch | Research, flexibility | Medium |
| TensorFlow | Production, mobile | Medium |
LLM/AI Tools
| Tool | Use Case |
|---|---|
| LangChain | LLM orchestration |
| LlamaIndex | RAG systems |
| Claude/OpenAI | LLM APIs |
| ChromaDB | Vector storage |
Algorithm Reference
Classical ML
| Type | Algorithms |
|---|---|
| Regression | Linear, Ridge, Lasso, ElasticNet |
| Classification | Logistic, SVM, Decision Tree |
| Ensemble | Random Forest, XGBoost, LightGBM |
| Clustering | K-Means, DBSCAN, Hierarchical |
Deep Learning
| Architecture | Use Case |
|---|---|
| CNN | Images, vision |
| RNN/LSTM | Sequences |
| Transformer | NLP, LLMs |
| Diffusion | Image generation |
AI Agent Architecture (2025)
┌─────────────────────────────────────────┐
│              AGENTIC LOOP               │
├─────────────────────────────────────────┤
│  PERCEIVE → REASON → ACT → REFLECT      │
│     │         │       │       │         │
│     │         │       │       └─► Loop  │
│     │         │       └─► Execute tools │
│     │         └─► LLM decides action    │
│     └─► Gather context, observations    │
└─────────────────────────────────────────┘
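The loop in the diagram can be sketched in a few lines. This is an illustrative skeleton, not a framework API: `decide` is a plain function standing in for the LLM call, the `add` tool is hypothetical, and the `max_iterations` cap is an assumption added so the loop always terminates.

```python
def run_agent(goal, decide, tools, max_iterations=5):
    """Perceive -> reason -> act -> reflect, with a hard iteration cap.

    `decide` stands in for an LLM call. It receives the goal and the
    observations so far and returns (tool_name, argument), or
    ("finish", answer) when it is done.
    """
    observations = []  # PERCEIVE: context gathered so far
    for step in range(1, max_iterations + 1):
        action, arg = decide(goal, observations)  # REASON
        if action == "finish":
            return {"answer": arg, "steps": step, "stopped": "finished"}
        if action not in tools:
            observations.append(f"error: unknown tool {action!r}")
            continue
        result = tools[action](arg)  # ACT
        observations.append(f"{action}({arg!r}) -> {result!r}")  # REFLECT
    return {"answer": None, "steps": max_iterations, "stopped": "max_iterations"}


# Hypothetical tool plus two scripted policies used in place of a real model.
def add_numbers(expr):
    return sum(int(x) for x in expr.split("+"))


def scripted_decide(goal, observations):
    if not observations:
        return ("add", "2+3")
    return ("finish", observations[-1].split("-> ")[-1])


def stubborn_decide(goal, observations):
    return ("add", "1+1")  # never finishes; only the cap stops it


if __name__ == "__main__":
    tools = {"add": add_numbers}
    print(run_agent("sum 2 and 3", scripted_decide, tools))
    print(run_agent("sum forever", stubborn_decide, tools, max_iterations=3))
```

Returning a `stopped` reason alongside the answer lets the caller tell a finished run from one that hit the cap.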
Design Patterns (Anthropic 2025):
⢠Prompt Chaining - Sequential fixed steps
⢠Routing - Classify and dispatch
⢠Parallelization - Concurrent subtasks
⢠Orchestrator-Workers - Central delegation
⢠Evaluator-Optimizer - Generate + critique
Troubleshooting
Which path to choose?
├─► Love building infrastructure? → Data Engineer
├─► Love algorithms/math? → ML Engineer
├─► Want fastest AI entry? → AI Engineer
└─► Uncertain? → Start with Python + SQL
Model not performing well?
├─► Data quality issues? → Clean data first
├─► Feature engineering? → Create better features
├─► Wrong algorithm? → Try different models
├─► Overfitting? → More data, regularization
└─► Hyperparameters? → Grid/random search
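The last two branches, regularization and grid search, can be seen working together on a toy problem. This is a minimal sketch, assuming a one-feature ridge regression without an intercept so the fit has a closed form. The data points and the grid of λ values are made up for illustration; in practice scikit-learn's search utilities would typically do this job.

```python
def fit_ridge(xs, ys, lam):
    """Closed-form 1-D ridge regression, no intercept: w = Σxy / (Σx² + λ)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, val, lambdas):
    """Return (best_lambda, best_val_mse), judged on held-out data only."""
    best = None
    for lam in lambdas:
        w = fit_ridge(*train, lam)
        score = mse(w, *val)
        if best is None or score < best[1]:
            best = (lam, score)
    return best


# Hypothetical data: noisy samples of y = 2x for training, clean ones for validation.
TRAIN = ([1.0, 2.0, 3.0, 4.0], [2.2, 3.8, 6.3, 7.9])
VAL = ([5.0, 6.0], [10.0, 12.0])

if __name__ == "__main__":
    best_lambda, best_score = grid_search(TRAIN, VAL, [0.0, 0.1, 1.0, 10.0])
    print(best_lambda, best_score)
```

The important design choice is that each candidate is scored on validation data the fit never saw; scoring on the training set would always favor λ = 0.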
LLM giving bad answers?
├─► Prompt too vague? → Be more specific
├─► Missing context? → Add relevant info
├─► Hallucinating? → Use RAG, verify facts
└─► Wrong tool? → Improve tool descriptions
Common Failure Modes
| Symptom | Root Cause | Recovery |
|---|---|---|
| Model fails in prod | Data drift | Monitor distributions |
| Pipeline always late | Unoptimized queries | Profile, partition |
| RAG finds wrong docs | Bad chunking | Tune chunk size, overlap |
| Agent loops forever | No exit condition | Add max iterations |
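For the RAG row, "chunk size" and "overlap" are the two knobs being tuned. The following sketch shows one simple way to implement them, assuming whitespace-separated words as the unit. Real systems often chunk by tokens or sentences instead, and the function name and defaults here are hypothetical.

```python
def chunk_words(text, chunk_size=5, overlap=2):
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_words("a b c d e f g h", chunk_size=4, overlap=2):
        print(chunk)
```

A larger overlap reduces the chance that a relevant sentence is split across two chunks, at the cost of storing and searching more near-duplicate text.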
Portfolio Projects
Data Engineering
- ETL Pipeline (Airflow + dbt)
- Real-time Streaming (Kafka + Spark)
- Data Warehouse Design
ML Engineering
- Classification Model (scikit-learn)
- Deep Learning Model (PyTorch)
- ML Pipeline (MLflow)
AI Engineering
- RAG Chatbot (LangChain + ChromaDB)
- AI Agent with Tools
- Multi-Agent System
Next Actions
Specify your target role for a detailed learning plan.