ai-ml-data-science

📁 vasilyu1983/ai-agents-public 📅 Jan 23, 2026
Install command
npx skills add https://github.com/vasilyu1983/ai-agents-public --skill ai-ml-data-science


Skill Documentation

Data Science Engineering Suite – Quick Reference

This skill turns raw data and questions into validated, documented models ready for production:

  • EDA workflows: Structured exploration with drift detection
  • Feature engineering: Reproducible feature pipelines with leakage prevention and train/serve parity
  • Model selection: Baselines first; strong tabular defaults; escalate complexity only when justified
  • Evaluation & reporting: Slice analysis, uncertainty, model cards, production metrics
  • SQL transformation: SQLMesh for staging/intermediate/marts layers
  • MLOps: CI/CD, CT (continuous training), CM (continuous monitoring)
  • Production patterns: Data contracts, lineage, feedback loops, streaming features

Modern emphasis (2026): Feature stores, automated retraining, drift monitoring (Evidently), train-serve parity, and agentic ML loops (plan -> execute -> evaluate -> improve). Tools: LightGBM, CatBoost, scikit-learn, PyTorch, Polars (lazy eval for larger-than-RAM datasets), lakeFS for data versioning.
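
A minimal illustration of the Polars lazy-evaluation point: build a query plan against a Parquet file and only materialize the aggregated result. The file path and column names are hypothetical; on recent Polars versions a streaming-engine flag on collect() lets the same plan run over larger-than-RAM inputs.

```python
import polars as pl

# Lazily scan a (hypothetical) Parquet file; nothing is read into memory yet.
events = pl.scan_parquet("events.parquet")

# Build a query plan: filter, derive a feature, aggregate per user.
plan = (
    events
    .filter(pl.col("amount") > 0)
    .with_columns(pl.col("amount").log1p().alias("log_amount"))
    .group_by("user_id")
    .agg(pl.col("log_amount").mean().alias("avg_log_amount"))
)

# collect() runs the optimized plan; enable the streaming engine (flag name varies
# by Polars version) to process inputs that do not fit in RAM.
features = plan.collect()
```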


Quick Reference

| Task | Tool/Framework | Command | When to Use |
| --- | --- | --- | --- |
| EDA & Profiling | Pandas, Great Expectations | df.describe(), ge.validate() | Initial data exploration and quality checks |
| Feature Engineering | Pandas, Polars, Feature Stores | df.transform(), Feast materialization | Creating lag, rolling, categorical features |
| Model Training | Gradient boosting, linear models, scikit-learn | lgb.train(), model.fit() | Strong baselines for tabular ML |
| Hyperparameter Tuning | Optuna, Ray Tune | optuna.create_study(), tune.run() | Optimizing model parameters |
| SQL Transformation | SQLMesh | sqlmesh plan, sqlmesh run | Building staging/intermediate/marts layers |
| Experiment Tracking | MLflow, W&B | mlflow.log_metric(), wandb.log() | Versioning experiments and models |
| Model Evaluation | scikit-learn, custom metrics | metrics.roc_auc_score(), slice analysis | Validating model performance |
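
For the EDA & Profiling row, a minimal pandas-only first pass (the file and columns are hypothetical); tools like Great Expectations then turn checks like these into declarative, versioned expectations.

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical input

# Schema and summary statistics.
print(df.dtypes)
print(df.describe(include="all").T)

# Missingness, duplicates, and categorical cardinality: common quality/leakage tells.
print(df.isna().mean().sort_values(ascending=False).head(10))
print("duplicate rows:", df.duplicated().sum())
for col in df.select_dtypes(include="object"):
    print(col, "unique values:", df[col].nunique())
```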

Data Lake & Lakehouse

For comprehensive data lake/lakehouse patterns (beyond SQLMesh transformation), see data-lake-platform:

  • Table formats: Apache Iceberg, Delta Lake, Apache Hudi
  • Query engines: ClickHouse, DuckDB, Apache Doris, StarRocks
  • Alternative transformation: dbt (in place of SQLMesh)
  • Ingestion: dlt, Airbyte (connectors)
  • Streaming: Apache Kafka patterns
  • Orchestration: Dagster, Airflow

This skill focuses on ML feature engineering and modeling. Use data-lake-platform for general-purpose data infrastructure.


Related Skills

For adjacent topics, reference:

  • ai-ml-timeseries: time series with seasonality or long-term dependencies (transformer-based approaches)
  • ai-llm: text and mixed-modality problems (LLMs/Transformers)
  • ai-mlops: data ingestion (dlt templates for REST APIs, databases, warehouses)
  • data-lake-platform: general-purpose data infrastructure (table formats, query engines, ingestion, orchestration)

Decision Tree: Choosing Data Science Approach

User needs ML for: [Problem Type]
  - Tabular data?
    - Small-medium (<1M rows)? -> LightGBM (fast, efficient)
    - Large and complex (>1M rows)? -> LightGBM first, then NN if needed
    - High-dim sparse (text, counts)? -> Linear models, then shallow NN

  - Time series?
    - Seasonality? -> LightGBM, then see ai-ml-timeseries
    - Long-term dependencies? -> Transformers (see ai-ml-timeseries)

  - Text or mixed modalities?
    - LLMs/Transformers -> See ai-llm

  - SQL transformations?
    - SQLMesh (staging/intermediate/marts layers)

Rule of thumb: For tabular data, tree-based gradient boosting is a strong baseline, but must be validated against alternatives and constraints.
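
A minimal sketch of the gradient-boosting-first rule on synthetic tabular data, using LightGBM's scikit-learn interface with early stopping; the dataset and hyperparameters are placeholders, not recommendations.

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular problem.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = lgb.LGBMClassifier(n_estimators=2000, learning_rate=0.05)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)

print("validation ROC-AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```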


Core Concepts (Vendor-Agnostic)

  • Problem framing: define success metrics, baselines, and decision thresholds before modeling.
  • Leakage prevention: ensure all features are available at prediction time; split by time/group when appropriate (see the split sketch after this list).
  • Uncertainty: report confidence intervals and stability (fold variance, bootstrap) rather than single-point metrics.
  • Reproducibility: version code/data/features, fix seeds, and record the environment.
  • Operational handoff: define monitoring, retraining triggers, and rollback criteria with MLOps.
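
A minimal sketch of the time- and group-aware splits mentioned in the leakage bullet, using scikit-learn's built-in splitters on synthetic arrays; in real projects the rows would be sorted by event time and grouped by the leakage unit (user, account, device).

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Synthetic features, target, and per-row group ids (e.g. user_id).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)
groups = rng.integers(0, 100, size=1000)
# Rows are assumed to be sorted by event time for the time-based split.

# Time-based split: each fold trains on the past and validates on the future.
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert train_idx.max() < val_idx.min()  # no future rows in training

# Group-based split: all rows of a group land on one side, preventing user leakage.
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```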

Implementation Practices (Tooling Examples)

  • Track experiments and artifacts (run id, commit hash, data version); see the tracking sketch after this list.
  • Add data validation gates in pipelines (schema + distribution + freshness).
  • Prefer reproducible, testable feature code (shared transforms, point-in-time correctness).
  • Use datasheets/model cards and eval reports as deployment prerequisites (Datasheets for Datasets: https://arxiv.org/abs/1803.09010; Model Cards: https://arxiv.org/abs/1810.03993).
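
A minimal MLflow sketch for the experiment-tracking practice; the parameter names, tags, and metric value are illustrative, and the commit hash and data version are assumed to come from git and your data-versioning tool.

```python
import mlflow

# Values below are placeholders; in practice they come from git, DVC/lakeFS, etc.
with mlflow.start_run(run_name="lgbm-baseline"):
    mlflow.set_tag("git_commit", "abc1234")
    mlflow.set_tag("data_version", "2026-01-15")
    mlflow.log_params({"learning_rate": 0.05, "n_estimators": 2000})
    mlflow.log_metric("val_auc", 0.873)
    # mlflow.log_artifact("reports/evaluation_report.md")  # attach reports/model cards (hypothetical path)
```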

Do / Avoid

Do

  • Do start with baselines and a simple model to expose leakage and data issues early.
  • Do run slice analysis and document failure modes before recommending deployment.
  • Do keep an immutable eval set; refresh training data without contaminating evaluation.

Avoid

  • Avoid random splits for temporal or user-correlated data.
  • Avoid “metric gaming” (optimizing the number without validating business impact).
  • Avoid training on labels created after the prediction timestamp (silent future leakage).
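
To make the last point concrete, a small pandas sketch of a point-in-time join: each event only picks up feature values observed at or before its own timestamp (merge_asof with direction="backward"). Tables and column names are hypothetical.

```python
import pandas as pd

# Prediction events and a slowly changing feature table (hypothetical data).
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2026-01-05", "2026-01-20", "2026-01-10"]),
})
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2026-01-01", "2026-01-15", "2026-01-12"]),
    "avg_spend_30d": [10.0, 25.0, 7.5],
})

# merge_asof requires sorting on the time keys; "by" keeps the join within each user.
joined = pd.merge_asof(
    events.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",  # only feature rows at or before the event time are eligible
)
print(joined)  # user 2's future feature row correctly comes back as NaN
```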

Core Patterns (Overview)

Pattern 1: End-to-End DS Project Lifecycle

Use when: Starting or restructuring any DS/ML project.

Stages:

  1. Problem framing – Business objective, success metrics, baseline
  2. Data & feasibility – Sources, coverage, granularity, label quality
  3. EDA & data quality – Schema, missingness, outliers, leakage checks
  4. Feature engineering – Per data type with feature store integration
  5. Modelling – Baselines first, then LightGBM, then complexity as needed
  6. Evaluation – Offline metrics, slice analysis, error analysis
  7. Reporting – Model evaluation report + model card
  8. MLOps – CI/CD, CT (continuous training), CM (continuous monitoring)

Detailed guide: EDA Best Practices


Pattern 2: Feature Engineering

Use when: Designing features before modelling or during model improvement.

By data type:

  • Numeric: Standardize, handle outliers, transform skew, scale
  • Categorical: One-hot/ordinal (low cardinality), target/frequency/hashing (high cardinality)
    • Feature Store Integration: Store encoders, mappings, statistics centrally
  • Text: Cleaning, TF-IDF, embeddings, simple stats
  • Time: Calendar features, recency, rolling/lag features
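
A minimal pandas sketch for the time bullet: per-user lag, rolling, recency, and calendar features, with shift(1) keeping the current row out of its own feature. The frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical per-user transactions, sorted by time within each user.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2026-01-01", "2026-01-02", "2026-01-05", "2026-01-01", "2026-01-03"]),
    "amount": [10.0, 12.0, 8.0, 5.0, 7.0],
}).sort_values(["user_id", "date"])

# Lag and rolling features per user; shift(1) prevents "today" leaking into its own feature.
df["amount_lag_1"] = df.groupby("user_id")["amount"].shift(1)
df["amount_roll_7_mean"] = df.groupby("user_id")["amount"].transform(
    lambda s: s.shift(1).rolling(window=7, min_periods=1).mean()
)

# Recency and calendar features.
df["days_since_last_txn"] = df.groupby("user_id")["date"].diff().dt.days
df["day_of_week"] = df["date"].dt.dayofweek
```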

Key Modern Practice: Use feature stores (Feast, Tecton, Databricks) for versioning, sharing, and train-serve parity.

Detailed guide: Feature Engineering Patterns


Pattern 3: Data Contracts & Lineage

Use when: Building production ML systems with data quality requirements.

Components:

  • Contracts: Schema + ranges/nullability + freshness SLAs
  • Lineage: Track source -> feature store -> train -> serve
  • Feature store hygiene: Materialization cadence, backfill/replay, encoder versioning
  • Schema evolution: Backward/forward-compatible migrations with shadow runs
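
A minimal contract sketch using Pandera (listed under data validation in External Resources); the columns, ranges, and freshness SLA are hypothetical examples of what a contract might encode.

```python
import pandas as pd
import pandera as pa

# Hypothetical contract for a feature table: schema, ranges/nullability, plus a freshness SLA.
contract = pa.DataFrameSchema(
    {
        "user_id": pa.Column(int, nullable=False),
        "avg_spend_30d": pa.Column(float, checks=pa.Check.ge(0), nullable=True),
        "country": pa.Column(str, checks=pa.Check.isin(["US", "DE", "FR"])),
        "event_time": pa.Column("datetime64[ns]", nullable=False),
    },
    strict=True,  # unexpected columns fail the contract; schema evolution must be explicit
)

def enforce_contract(df: pd.DataFrame, max_staleness: pd.Timedelta) -> pd.DataFrame:
    """Validate schema and ranges, then check the freshness SLA."""
    validated = contract.validate(df)
    staleness = pd.Timestamp.now() - validated["event_time"].max()
    if staleness > max_staleness:
        raise ValueError(f"feature table is stale by {staleness}")
    return validated
```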

Detailed guide: Data Contracts & Lineage


Pattern 4: Model Selection & Training

Use when: Picking model families and starting experiments.

Decision guide (modern benchmarks):

  • Tabular: Start with a strong baseline (linear/logistic, then gradient boosting) and iterate based on error analysis
  • Baselines: Always implement simple baselines first (majority class, mean, naive forecast)
  • Train/val/test splits: Time-based (forecasting), group-based (user/item leakage), or random (IID)
  • Hyperparameter tuning: Start manual, then Bayesian optimization (Optuna, Ray Tune)
  • Overfitting control: Regularization, early stopping, cross-validation
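
A minimal Optuna sketch for the Bayesian-optimization step, wrapping a LightGBM classifier in a cross-validated objective; the search space, trial count, and synthetic data are illustrative only.

```python
import lightgbm as lgb
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

def objective(trial: optuna.Trial) -> float:
    # Illustrative search space; tune it to your data and compute budget.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lgb.LGBMClassifier(**params)
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```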

Detailed guide: Modelling Patterns


Pattern 5: Evaluation & Reporting

Use when: Finalizing a model candidate or handing over to production.

Key components:

  • Metric selection: Primary (ROC-AUC, PR-AUC, RMSE) + guardrails (calibration, fairness)
  • Threshold selection: ROC/PR curves, cost-sensitive, F1 maximization
  • Slice analysis: Performance by geography, user segments, product categories (see the sketch after this list)
  • Error analysis: Collect high-error examples, cluster by error type, identify systematic failures
  • Uncertainty: Confidence intervals (bootstrap where appropriate), variance across folds, and stability checks
  • Evaluation report: 8-section report (objective, data, features, models, metrics, slices, risks, recommendation)
  • Model card: Documentation for stakeholders (intended use, data, performance, ethics, operations)
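
A minimal sketch of slice analysis plus a bootstrap confidence interval on the primary metric, using a synthetic evaluation frame; the slice column and sample counts are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Synthetic evaluation frame: true labels, predicted scores, and a slice column.
rng = np.random.default_rng(0)
eval_df = pd.DataFrame({
    "y_true": rng.integers(0, 2, 2000),
    "y_score": rng.random(2000),
    "country": rng.choice(["US", "DE", "FR"], 2000),
})

# Slice analysis: the same metric per segment, with support counts.
slices = eval_df.groupby("country").apply(
    lambda g: pd.Series({"n": len(g), "auc": roc_auc_score(g["y_true"], g["y_score"])})
)
print(slices)

# Bootstrap 95% CI for the overall metric instead of a single point estimate.
boot = []
for i in range(200):
    sample = eval_df.sample(frac=1.0, replace=True, random_state=i)
    boot.append(roc_auc_score(sample["y_true"], sample["y_score"]))
print("ROC-AUC 95% CI:", np.percentile(boot, [2.5, 97.5]))
```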

Detailed guide: Evaluation Patterns


Pattern 6: Reproducibility & MLOps

Use when: Ensuring experiments are reproducible and production-ready.

Modern MLOps (CI/CD/CT/CM):

  • CI (Continuous Integration): Automated testing, data validation, code quality
  • CD (Continuous Delivery): Environment-specific promotion (dev -> staging -> prod), canary deployment
  • CT (Continuous Training): Drift-triggered and scheduled retraining
  • CM (Continuous Monitoring): Real-time data drift, performance, system health
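
A minimal, tooling-agnostic sketch of a CT trigger: retrain when drift, metric decay, or model age crosses a threshold. The signal names and thresholds are illustrative; in practice the drift score would come from a monitoring tool such as Evidently.

```python
from dataclasses import dataclass

@dataclass
class MonitoringSnapshot:
    # Illustrative signals a CM system might report for the current window.
    feature_drift_psi: float       # e.g. max PSI across monitored features
    rolling_auc: float             # delayed-label performance estimate
    days_since_last_training: int

def should_retrain(snap: MonitoringSnapshot,
                   psi_threshold: float = 0.2,
                   min_auc: float = 0.80,
                   max_age_days: int = 30) -> bool:
    """Drift-triggered OR scheduled retraining, whichever fires first."""
    return (
        snap.feature_drift_psi > psi_threshold
        or snap.rolling_auc < min_auc
        or snap.days_since_last_training >= max_age_days
    )

print(should_retrain(MonitoringSnapshot(0.27, 0.86, 12)))  # True: the drift trigger fires
```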

Versioning:

  • Code (git commit), data (DVC, lakeFS), features (feature store), models (MLflow Registry)
  • Seeds (reproducibility), hyperparameters (experiment tracker)

Detailed guide: Reproducibility Checklist


Pattern 7: Feature Freshness & Streaming

Use when: Managing real-time features and streaming pipelines.

Components:

  • Freshness contracts: Define freshness SLAs per feature, monitor lag, alert on breaches
  • Batch + stream parity: Same feature logic across batch/stream, idempotent upserts
  • Schema evolution: Version schemas, add forward/backward-compatible parsers, backfill with rollback
  • Data quality gates: PII/format checks, range checks, distribution drift (KL, KS, PSI)
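
A minimal sketch of the distribution-drift gate: PSI computed by hand over reference-window quantile bins plus a two-sample KS test from SciPy. The synthetic data and alert thresholds are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index using quantile bins fit on the reference window."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    ref_counts = np.histogram(reference, bins=edges)[0]
    cur_counts = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0]
    ref_pct = np.clip(ref_counts / len(reference), 1e-6, None)  # avoid log(0)
    cur_pct = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1, 10_000)   # training-window feature values
current = rng.normal(0.5, 1, 10_000)     # shifted serving-window feature values

print("PSI:", psi(reference, current))                      # > 0.2 is a common alert level
print("KS p-value:", ks_2samp(reference, current).pvalue)   # small p-value flags drift
```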

Detailed guide: Feature Freshness & Streaming


Pattern 8: Production Feedback Loops

Use when: Capturing production signals and implementing continuous improvement.

Components:

  • Signal capture: Log predictions + user edits/acceptance/abandonment (scrub PII)
  • Labeling: Route failures/edge cases to human review, create balanced sets
  • Dataset refresh: Periodic refresh (weekly/monthly) with lineage, protect eval set
  • Online eval: Shadow/canary new models, track solve rate, calibration, cost, latency
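
A minimal sketch of the signal-capture step: log each prediction with a join key and versions so later feedback events can be attached, while pseudonymizing identifiers instead of storing raw PII. Field names, the salt, and the log path are hypothetical.

```python
import hashlib
import json
import time
import uuid

SALT = "rotate-me"  # hypothetical salt; manage as a secret in practice

def scrub(user_id: str) -> str:
    """Pseudonymize identifiers so logs can be joined without storing raw PII."""
    return hashlib.sha256((SALT + user_id).encode()).hexdigest()[:16]

def log_prediction(user_id: str, features_version: str, model_version: str,
                   score: float, path: str = "predictions.jsonl") -> str:
    record = {
        "prediction_id": str(uuid.uuid4()),  # join key for later feedback events
        "ts": time.time(),
        "user": scrub(user_id),
        "model_version": model_version,
        "features_version": features_version,
        "score": score,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["prediction_id"]

# Later, user edits/acceptance events are logged with the same prediction_id so that
# failures can be routed to labeling and datasets refreshed with lineage.
pid = log_prediction("user-123", features_version="fs-v12", model_version="lgbm-1.4.0", score=0.91)
```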

Detailed guide: Production Feedback Loops


Resources (Detailed Guides)

For comprehensive operational patterns and checklists, see the detailed guides referenced in the patterns above:

  • EDA Best Practices
  • Feature Engineering Patterns
  • Data Contracts & Lineage
  • Modelling Patterns
  • Evaluation Patterns
  • Reproducibility Checklist
  • Feature Freshness & Streaming
  • Production Feedback Loops

Templates

Use these as copy-paste starting points:

Project & Workflow Templates

  • Standard DS project template: assets/project/template-standard.md
  • Quick DS experiment template: assets/project/template-quick.md

Feature Engineering & EDA

  • Feature engineering template: assets/features/template-feature-engineering.md
  • EDA checklist & notebook template: assets/eda/template-eda.md

Evaluation & Reporting

  • Model evaluation report: assets/evaluation/template-evaluation-report.md
  • Model card: assets/evaluation/template-model-card.md
  • ML experiment review: assets/review/experiment-review-template.md

SQL Transformation (SQLMesh)

For SQL-based data transformation and feature engineering:

  • SQLMesh project setup: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-project.md
  • SQLMesh model types: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-model.md (FULL, INCREMENTAL, VIEW)
  • Incremental models: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-incremental.md
  • DAG and dependencies: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-dag.md
  • Testing and data quality: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-testing.md

Use SQLMesh when:

  • Building SQL-based feature pipelines
  • Managing incremental data transformations
  • Creating staging/intermediate/marts layers
  • Testing SQL logic with unit tests and audits

For data ingestion (loading raw data), use:

  • ai-mlops skill (dlt templates for REST APIs, databases, warehouses)


External Resources

See data/sources.json for curated foundational and implementation references:

  • Core ML/DL: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, JAX
  • Data processing: pandas, NumPy, Polars, DuckDB, Spark, Dask
  • SQL transformation: SQLMesh, dbt (staging/marts/incremental patterns)
  • Feature stores: Feast, Tecton, Databricks Feature Store (centralized feature management)
  • Data validation: Pydantic, Great Expectations, Pandera, Evidently (quality + drift)
  • Visualization: Matplotlib, Seaborn, Plotly, Streamlit, Dash
  • MLOps: MLflow, W&B, DVC, Neptune (experiment tracking + model registry)
  • Hyperparameter tuning: Optuna, Ray Tune, Hyperopt
  • Model serving: BentoML, FastAPI, TorchServe, Seldon, Ray Serve
  • Orchestration: Kubeflow, Metaflow, Prefect, Airflow, ZenML
  • Cloud platforms: AWS SageMaker, Google Vertex AI, Azure ML, Databricks, Snowflake

Use this skill to execute data science projects end to end with concrete checklists, patterns, and templates rather than theory.