implementing-mlops

📁 ancoleman/ai-design-components 📅 Jan 25, 2026
Install command
npx skills add https://github.com/ancoleman/ai-design-components --skill implementing-mlops


Skill Documentation

MLOps Patterns

Operationalize machine learning models from experimentation to production deployment and monitoring.

Purpose

Provide strategic guidance for ML engineers and platform teams to build production-grade ML infrastructure. Cover the complete lifecycle: experiment tracking, model registry, feature stores, deployment patterns, pipeline orchestration, and monitoring.

When to Use This Skill

Use this skill when:

  • Designing MLOps infrastructure for production ML systems
  • Selecting experiment tracking platforms (MLflow, Weights & Biases, Neptune)
  • Implementing feature stores for online/offline feature serving
  • Choosing model serving solutions (Seldon Core, KServe, BentoML, TorchServe)
  • Building ML pipelines for training, evaluation, and deployment
  • Setting up model monitoring and drift detection
  • Establishing model governance and compliance frameworks
  • Optimizing ML inference costs and performance
  • Migrating from notebooks to production ML systems
  • Implementing continuous training and automated retraining

Core Concepts

1. Experiment Tracking

Track experiments systematically to ensure reproducibility and collaboration.

Key Components:

  • Parameters: Hyperparameters logged for each training run
  • Metrics: Performance measures tracked over time (accuracy, loss, F1)
  • Artifacts: Model weights, plots, datasets, configuration files
  • Metadata: Tags, descriptions, Git commit SHA, environment details
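
A minimal sketch of logging these components with MLflow's Python API; the experiment name, hyperparameter values, metrics, and artifact path are illustrative:

```python
import mlflow

mlflow.set_experiment("churn-model")  # illustrative experiment name

with mlflow.start_run():
    # Parameters: hyperparameters for this training run
    mlflow.log_params({"learning_rate": 0.05, "max_depth": 6})

    # Metrics: performance measures (can also be logged per step/epoch)
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_metric("f1", 0.87)

    # Metadata: tags such as the Git commit SHA
    mlflow.set_tag("git_sha", "abc1234")

    # Artifacts: plots, configs, serialized models (file must exist locally)
    mlflow.log_artifact("confusion_matrix.png")
```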

Platform Comparison:

MLflow (Open-source standard):

  • Framework-agnostic (PyTorch, TensorFlow, scikit-learn, XGBoost)
  • Self-hosted or cloud-agnostic deployment
  • Integrated model registry
  • Basic UI, adequate for most use cases
  • Free, requires infrastructure management

Weights & Biases (SaaS, collaboration-focused):

  • Advanced visualization and dashboards
  • Integrated hyperparameter optimization (Sweeps)
  • Excellent team collaboration features
  • SaaS pricing scales with usage
  • Best-in-class UI

Neptune.ai (Enterprise-grade):

  • Enterprise features (RBAC, audit logs, compliance)
  • Integrated production monitoring
  • Higher cost than W&B
  • Good for regulated industries

Selection Criteria:

  • Open-source requirement → MLflow
  • Team collaboration critical → Weights & Biases
  • Enterprise compliance (RBAC, audits) → Neptune.ai
  • Hyperparameter optimization primary → Weights & Biases (Sweeps)

For detailed comparison and decision framework, see references/experiment-tracking.md.

2. Model Registry and Versioning

Centralize model artifacts with version control and stage management.

Model Registry Components:

  • Model artifacts (weights, serialized models)
  • Training metrics (accuracy, F1, AUC)
  • Hyperparameters used during training
  • Training dataset version
  • Feature schema (input/output signatures)
  • Model cards (documentation, use cases, limitations)

Stage Management:

  • None: Newly registered model
  • Staging: Testing in pre-production environment
  • Production: Serving live traffic
  • Archived: Deprecated, retained for compliance
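
A minimal sketch of registering a model and promoting it through these stages with MLflow's classic stage API (newer MLflow releases favor model aliases over stages, so adapt to your version); the model name and run ID are illustrative:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model artifact logged by a training run (run ID is illustrative)
result = mlflow.register_model("runs:/0a1b2c3d/model", "fraud-detector")

client = MlflowClient()

# None -> Staging: test in a pre-production environment
client.transition_model_version_stage(
    name="fraud-detector", version=result.version, stage="Staging"
)

# Staging -> Production once validation passes, archiving older versions
client.transition_model_version_stage(
    name="fraud-detector",
    version=result.version,
    stage="Production",
    archive_existing_versions=True,
)
```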

Versioning Strategies:

Semantic Versioning for Models:

  • Major version (v2.0.0): Breaking change in input/output schema
  • Minor version (v1.1.0): New feature, backward-compatible
  • Patch version (v1.0.1): Bug fix, model retrained on new data

Git-Based Versioning:

  • Model code in Git (training scripts, configuration)
  • Model weights in DVC (Data Version Control) or Git-LFS
  • Reproducibility via commit SHA + data version hash

For model lineage tracking and registry patterns, see references/model-registry.md.

3. Feature Stores

Centralize feature engineering to ensure consistency between training and inference.

Problem Addressed: Training/serving skew

  • Training: Features computed offline in batch, often via separate code and sometimes with accidental access to future data (leakage)
  • Inference: Features recomputed online with only the data available at request time
  • Result: Model performs well offline but degrades in production

Feature Store Solution:

Online Feature Store:

  • Purpose: Low-latency feature retrieval for real-time inference
  • Storage: Redis, DynamoDB, Cassandra (key-value stores)
  • Latency: Sub-10ms for feature lookup
  • Use Case: Real-time predictions (fraud detection, recommendations)

Offline Feature Store:

  • Purpose: Historical feature data for training and batch inference
  • Storage: Parquet files (S3/GCS), data warehouses (Snowflake, BigQuery)
  • Latency: Seconds to minutes (batch retrieval)
  • Use Case: Model training, backtesting, batch predictions

Point-in-Time Correctness:

  • Ensures no future data leakage during training
  • Feature values at time T only use data available before time T
  • Critical for avoiding overly optimistic training metrics
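
A minimal sketch with Feast showing the offline point-in-time join for training and the online lookup for inference; the feature view name, feature names, and entity values are illustrative, and the exact keyword arguments vary by Feast version:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a configured feature repo

# Offline: features as of each label's event_timestamp (point-in-time correct)
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": [datetime(2025, 1, 1), datetime(2025, 2, 1)],
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:txn_count_30d", "user_stats:avg_txn_amount_7d"],
).to_df()

# Online: latest feature values for a real-time prediction
online_features = store.get_online_features(
    features=["user_stats:txn_count_30d", "user_stats:avg_txn_amount_7d"],
    entity_rows=[{"user_id": 1001}],
).to_dict()
```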

Platform Comparison:

Feast (Open-source, cloud-agnostic):

  • Most popular open-source feature store
  • Supports Redis, DynamoDB, Datastore (online) and Parquet, BigQuery, Snowflake (offline)
  • Cloud-agnostic, no vendor lock-in
  • Active community, growing adoption

Tecton (Managed, production-grade):

  • Feast-compatible API
  • Fully managed service
  • Integrated monitoring and governance
  • Higher cost, enterprise-focused

SageMaker Feature Store (AWS):

  • Integrated with AWS ecosystem
  • Managed online/offline stores
  • AWS lock-in

Databricks Feature Store (Databricks):

  • Unity Catalog integration
  • Delta Lake for offline storage
  • Databricks ecosystem lock-in

Selection Criteria:

  • Open-source, cloud-agnostic → Feast
  • Managed solution, production-grade → Tecton
  • AWS ecosystem → SageMaker Feature Store
  • Databricks users → Databricks Feature Store

For feature engineering patterns and implementation, see references/feature-stores.md.

4. Model Serving Patterns

Deploy models for synchronous, asynchronous, batch, or streaming inference.

Serving Patterns:

REST API Deployment:

  • Pattern: HTTP endpoint for synchronous predictions
  • Latency: <100ms acceptable
  • Use Case: Request-response applications
  • Tools: Flask, FastAPI, BentoML, Seldon Core
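
A minimal sketch of a synchronous REST endpoint with FastAPI, assuming a scikit-learn-style classifier serialized with joblib; the model path and feature layout are placeholders:

```python
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # placeholder path to a trained classifier

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest):
    # Probability of the positive class for a single example
    proba = model.predict_proba(np.array([request.features]))[0, 1]
    return {"probability": float(proba)}
```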

gRPC Deployment:

  • Pattern: High-performance RPC for low-latency inference
  • Latency: <10ms target
  • Use Case: Microservices, latency-critical applications
  • Tools: TensorFlow Serving, TorchServe, Seldon Core

Batch Inference:

  • Pattern: Process large datasets offline
  • Latency: Minutes to hours acceptable
  • Use Case: Daily/hourly predictions for millions of records
  • Tools: Spark, Dask, Ray

Streaming Inference:

  • Pattern: Real-time predictions on streaming data
  • Latency: Milliseconds
  • Use Case: Fraud detection, anomaly detection, real-time recommendations
  • Tools: Kafka + Flink/Spark Streaming

Platform Comparison:

Seldon Core (Kubernetes-native, advanced):

  • Advanced deployment strategies (canary, A/B testing, multi-armed bandits)
  • Multi-framework support
  • Integrated explainability (Alibi)
  • High complexity, steep learning curve

KServe (CNCF standard):

  • Standardized InferenceService API
  • Serverless scaling (scale-to-zero with Knative)
  • Kubernetes-native
  • Growing adoption, CNCF backing

BentoML (Python-first, simplicity):

  • Easiest to get started
  • Excellent developer experience
  • Local testing → cloud deployment
  • Lower complexity than Seldon/KServe

TorchServe (PyTorch official):

  • PyTorch-specific serving
  • Production-grade, optimized for PyTorch models
  • Less flexible for multi-framework use

TensorFlow Serving (TensorFlow official):

  • TensorFlow-specific serving
  • Production-grade, optimized for TensorFlow models
  • Less flexible for multi-framework use

Selection Criteria:

  • Kubernetes, advanced deployments → Seldon Core or KServe
  • Python-first, simplicity → BentoML
  • PyTorch-specific → TorchServe
  • TensorFlow-specific → TensorFlow Serving
  • Managed solution → SageMaker/Vertex AI/Azure ML

For model optimization and serving infrastructure, see references/model-serving.md.

5. Deployment Strategies

Deploy models safely with rollback capabilities.

Blue-Green Deployment:

  • Two identical environments (Blue: current, Green: new)
  • Deploy to Green, test, switch 100% traffic instantly
  • Instant rollback (switch back to Blue)
  • Trade-off: Requires 2x infrastructure, all-or-nothing switch

Canary Deployment:

  • Gradual rollout to subset of traffic
  • Route 5% → 10% → 25% → 50% → 100% over time
  • Monitor metrics at each stage, rollback if degradation
  • Trade-off: Complex routing logic, longer deployment time

Shadow Deployment:

  • New model receives traffic but predictions not used
  • Compare new model vs old model offline
  • Zero risk to production
  • Trade-off: Requires 2x compute, delayed feedback

A/B Testing:

  • Split traffic between model versions
  • Measure business metrics (conversion rate, revenue)
  • Statistical significance testing
  • Use Case: Optimize for business outcomes, not just ML metrics

Multi-Armed Bandit (MAB):

  • Epsilon-greedy: Explore (try new models) vs Exploit (use best model)
  • Thompson Sampling: Bayesian approach to exploration
  • Use Case: Continuous optimization, faster convergence than A/B
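
A minimal sketch of Beta-Bernoulli Thompson sampling over two model variants; the success/failure counts are illustrative and would be updated from observed outcomes:

```python
import numpy as np

rng = np.random.default_rng()

# Observed outcomes per variant (e.g., conversions vs non-conversions)
successes = np.array([42.0, 51.0])
failures = np.array([958.0, 949.0])

def choose_variant() -> int:
    # Sample a plausible conversion rate for each variant, route to the best draw
    samples = rng.beta(successes + 1.0, failures + 1.0)
    return int(np.argmax(samples))

def record_outcome(variant: int, converted: bool) -> None:
    if converted:
        successes[variant] += 1.0
    else:
        failures[variant] += 1.0
```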

Selection Criteria:

  • Low-risk model → Blue-green (instant cutover)
  • Medium-risk model → Canary (gradual rollout)
  • High-risk model → Shadow (test in production, no impact)
  • Business optimization → A/B testing or MAB

For deployment architecture and examples, see references/deployment-strategies.md.

6. ML Pipeline Orchestration

Automate training, evaluation, and deployment workflows.

Training Pipeline Stages:

  1. Data Validation (Great Expectations, schema checks)
  2. Feature Engineering (transform raw data)
  3. Data Splitting (train/validation/test)
  4. Model Training (hyperparameter tuning)
  5. Model Evaluation (accuracy, fairness, explainability)
  6. Model Registration (push to registry if metrics pass thresholds)
  7. Deployment (promote to staging/production)

Continuous Training Pattern:

  • Monitor production data for drift
  • Detect data distribution changes (KS test, PSI)
  • Trigger automated retraining when drift detected
  • Validate new model before deployment
  • Deploy via canary or shadow strategy

Platform Comparison:

Kubeflow Pipelines (ML-native, Kubernetes):

  • ML-specific pipeline orchestration
  • Kubernetes-native (scales with K8s)
  • Component-based (reusable pipeline steps)
  • Integrated with Katib (hyperparameter tuning)

Apache Airflow (Mature, general-purpose):

  • Most mature orchestration platform
  • Large ecosystem, extensive integrations
  • Python-based DAGs
  • Not ML-specific but widely used for ML workflows

Metaflow (Netflix, data science-friendly):

  • Human-centric design, easy for data scientists
  • Excellent local development experience
  • Versioning built-in
  • Simpler than Kubeflow/Airflow

Prefect (Modern, Python-native):

  • Dynamic workflows, not static DAGs
  • Better error handling than Airflow
  • Modern UI and developer experience
  • Growing community

Dagster (Asset-based, testing-focused):

  • Asset-based thinking (not just task dependencies)
  • Strong testing and data quality features
  • Modern approach, good for data teams
  • Smaller community than Airflow

Selection Criteria:

  • ML-specific, Kubernetes → Kubeflow Pipelines
  • Mature, battle-tested → Apache Airflow
  • Data scientists, ease of use → Metaflow
  • Software engineers, testing → Dagster
  • Modern, simpler than Airflow → Prefect

For pipeline architecture and examples, see references/ml-pipelines.md.

7. Model Monitoring and Observability

Monitor production models for drift, performance, and quality.

Data Drift Detection:

  • Definition: Input feature distributions change over time
  • Impact: Model trained on old distribution, predictions degrade
  • Detection Methods:
    • Kolmogorov-Smirnov (KS) Test: Compare distributions
    • Population Stability Index (PSI): Measure distribution shift
    • Chi-Square Test: For categorical features
  • Action: Trigger automated retraining when drift detected
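
A minimal sketch of both checks for a single numeric feature, using common rule-of-thumb thresholds (p < 0.05 for the KS test, PSI > 0.2); the reference and current arrays here are synthetic stand-ins:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two numeric distributions."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

reference = np.random.normal(0.0, 1.0, 10_000)  # training-time feature values
current = np.random.normal(0.3, 1.0, 10_000)    # production feature values

ks_stat, p_value = ks_2samp(reference, current)
drift_detected = (p_value < 0.05) or (psi(reference, current) > 0.2)
```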

Model Drift Detection:

  • Definition: Model prediction quality degrades over time
  • Impact: Accuracy, precision, recall decrease
  • Detection Methods:
    • Ground truth accuracy (delayed labels)
    • Prediction distribution changes
    • Calibration drift (predicted probabilities vs actual outcomes)
  • Action: Alert team, trigger retraining

Performance Monitoring:

  • Metrics:
    • Latency: P50, P95, P99 inference time
    • Throughput: Predictions per second
    • Error Rate: Failed predictions / total predictions
    • Resource Utilization: CPU, memory, GPU usage
  • Alerting Thresholds:
    • P95 latency > 100ms → Alert
    • Error rate > 1% → Alert
    • Accuracy drop > 5% → Trigger retraining
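
A minimal sketch of exposing latency and error metrics with prometheus_client; the metric names, histogram buckets, and port are illustrative, with Grafana dashboards and alerts built on the scraped series:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds",
    "Model inference latency",
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5),
)
PREDICTION_ERRORS = Counter("prediction_errors_total", "Failed predictions")

start_http_server(9100)  # exposes /metrics for Prometheus to scrape

def predict_with_metrics(model, features):
    start = time.perf_counter()
    try:
        return model.predict(features)
    except Exception:
        PREDICTION_ERRORS.inc()
        raise
    finally:
        INFERENCE_LATENCY.observe(time.perf_counter() - start)
```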

Business Metrics Monitoring:

  • Downstream impact: Conversion rate, revenue, user satisfaction
  • Model predictions → business outcomes correlation
  • Use Case: Optimize models for business value, not just ML metrics

Tools:

  • Evidently AI: Data drift, model drift, data quality reports
  • Prometheus + Grafana: Performance metrics, custom dashboards
  • Arize AI: ML observability platform
  • Fiddler: Model monitoring and explainability

For monitoring architecture and implementation, see references/model-monitoring.md.

8. Model Optimization Techniques

Reduce model size and inference latency.

Quantization:

  • Convert model weights from float32 to int8
  • Model size reduction: 4x smaller
  • Inference speed: 2-3x faster
  • Accuracy impact: Minimal (<1% degradation typically)
  • Tools: PyTorch quantization, TensorFlow Lite, ONNX Runtime
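
A minimal sketch of post-training dynamic quantization in PyTorch, using a stand-in model; real deployments should re-validate accuracy after quantizing:

```python
import torch
import torch.nn as nn

# Stand-in for a trained model
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1)).eval()

# Dynamic quantization: Linear weights stored as int8,
# activations quantized on the fly at inference time
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    output = quantized(torch.randn(1, 128))
```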

Model Distillation:

  • Train small student model to mimic large teacher model
  • Transfer knowledge from a large teacher (e.g., BERT) to a compact student (e.g., DistilBERT)
  • Size reduction: 2-10x smaller
  • Speed improvement: 2-10x faster
  • Use Case: Deploy small model on edge devices, reduce inference cost

ONNX Conversion:

  • Convert models to Open Neural Network Exchange (ONNX) format
  • Cross-framework compatibility (PyTorch → ONNX → TensorFlow)
  • Optimized inference with ONNX Runtime
  • Speed improvement: 1.5-3x faster than native framework
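
A minimal sketch of exporting a PyTorch model to ONNX and running it with ONNX Runtime; the model is a stand-in and the input/output names and shapes are illustrative:

```python
import numpy as np
import onnxruntime as ort
import torch

model = torch.nn.Linear(16, 2).eval()  # stand-in for a trained model
dummy_input = torch.randn(1, 16)

# Trace the model and write an ONNX graph to disk
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["logits"],
)

# Load and run the exported graph with ONNX Runtime
session = ort.InferenceSession("model.onnx")
logits = session.run(None, {"input": np.random.randn(1, 16).astype(np.float32)})[0]
```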

Model Pruning:

  • Remove less important weights from neural networks
  • Sparsity: 30-90% of weights set to zero
  • Size reduction: 2-10x smaller
  • Accuracy impact: Minimal with structured pruning

For optimization techniques and examples, see references/model-serving.md.

9. LLMOps Patterns

Operationalize Large Language Models with specialized patterns.

LLM Fine-Tuning Pipelines:

  • LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning
  • QLoRA: Quantized LoRA (4-bit quantization)
  • Pipeline: Base model → Fine-tuning dataset → LoRA adapters → Merged model
  • Tools: Hugging Face PEFT, Axolotl
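
A minimal sketch of attaching LoRA adapters with Hugging Face PEFT; the base model ID, rank, and target modules are illustrative and depend on the architecture being fine-tuned:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the base
```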

Prompt Versioning:

  • Version control for prompts (Git, prompt management platforms)
  • A/B testing prompts for quality and cost optimization
  • Monitoring prompt effectiveness over time

RAG System Monitoring:

  • Retrieval quality: Relevance of retrieved documents
  • Generation quality: Answer accuracy, hallucination detection
  • End-to-end latency: Retrieval + generation time
  • Tools: LangSmith, Arize Phoenix

LLM Inference Optimization:

  • vLLM: High-throughput LLM serving
  • TensorRT-LLM: NVIDIA-optimized LLM inference
  • Text Generation Inference (TGI): Hugging Face serving
  • Batching: Dynamic batching for throughput
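
A minimal sketch of offline batched generation with vLLM; the model ID, prompts, and sampling parameters are illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [
    "Summarize the refund policy in two sentences.",
    "Draft a short reply to a delayed-shipment complaint.",
]

# vLLM batches requests internally (continuous batching) for high throughput
outputs = llm.generate(prompts, sampling)
for output in outputs:
    print(output.outputs[0].text)
```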

Embedding Model Management:

  • Version embeddings alongside models
  • Monitor embedding drift (distribution changes)
  • Update embeddings when underlying model changes

For LLMOps patterns and implementation, see references/llmops-patterns.md.

10. Model Governance and Compliance

Establish governance for model risk management and regulatory compliance.

Model Cards:

  • Documentation: Model purpose, training data, performance metrics
  • Limitations: Known biases, failure modes, out-of-scope use cases
  • Ethical considerations: Fairness, privacy, societal impact
  • Template: Model Card Toolkit (Google)

Bias and Fairness Detection:

  • Measure disparate impact across demographic groups
  • Tools: Fairlearn, AI Fairness 360 (IBM)
  • Metrics: Demographic parity, equalized odds, calibration
  • Mitigation: Reweighting, adversarial debiasing, threshold optimization
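
A minimal sketch of two group-fairness metrics with Fairlearn on illustrative evaluation data; the acceptable-disparity threshold is a policy decision, not a library default:

```python
import numpy as np
from fairlearn.metrics import (
    demographic_parity_difference,
    equalized_odds_difference,
)

# Illustrative held-out labels, predictions, and a sensitive attribute
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

dpd = demographic_parity_difference(y_true, y_pred, sensitive_features=group)
eod = equalized_odds_difference(y_true, y_pred, sensitive_features=group)

# Values near 0 indicate parity across groups; alert when a threshold is exceeded
print({"demographic_parity_diff": dpd, "equalized_odds_diff": eod})
```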

Regulatory Compliance:

  • EU AI Act: High-risk AI systems require documentation, monitoring
  • Model Risk Management (SR 11-7): Banking industry requirements
  • GDPR: Right to explanation for automated decisions
  • HIPAA: Healthcare data privacy

Audit Trails:

  • Log all model versions, training runs, deployments
  • Track who approved model transitions (staging → production)
  • Retain historical predictions for compliance audits
  • Tools: MLflow, Neptune.ai (audit logs)

For governance frameworks and compliance, see references/governance.md.

Decision Frameworks

Framework 1: Experiment Tracking Platform Selection

Decision Tree:

Start with primary requirement:

  • Open-source, self-hosted requirement → MLflow
  • Team collaboration, advanced visualization (budget available) → Weights & Biases
  • Team collaboration, advanced visualization (no budget) → MLflow
  • Enterprise compliance (audit logs, RBAC) → Neptune.ai
  • Hyperparameter optimization primary use case → Weights & Biases (Sweeps)

Detailed Criteria:

| Criteria | MLflow | Weights & Biases | Neptune.ai |
|---|---|---|---|
| Cost | Free | $200/user/month | $300/user/month |
| Collaboration | Basic | Excellent | Good |
| Visualization | Basic | Excellent | Good |
| Hyperparameter Tuning | External (Optuna) | Integrated (Sweeps) | Basic |
| Model Registry | Included | Add-on | Included |
| Self-Hosted | Yes | No (paid only) | Limited |
| Enterprise Features | No | Limited | Excellent |

Recommendation by Organization:

  • Startup (<50 people): MLflow (free, adequate) or W&B (if budget)
  • Growth (50-500 people): Weights & Biases (team collaboration)
  • Enterprise (>500 people): Neptune.ai (compliance) or MLflow (cost)

For detailed decision framework, see references/decision-frameworks.md.

Framework 2: Feature Store Selection

Decision Matrix:

Primary requirement:

  • Open-source, cloud-agnostic → Feast
  • Managed solution, production-grade, multi-cloud → Tecton
  • AWS ecosystem → SageMaker Feature Store
  • GCP ecosystem → Vertex AI Feature Store
  • Azure ecosystem → Azure ML Feature Store
  • Databricks users → Databricks Feature Store
  • Self-hosted with UI → Hopsworks

Criteria Comparison:

| Factor | Feast | Tecton | Hopsworks | SageMaker FS |
|---|---|---|---|---|
| Cost | Free | $$$$ | Free (self-host) | $$$ |
| Online Serving | Redis, DynamoDB | Managed | RonDB | Managed |
| Offline Store | Parquet, BigQuery, Snowflake | Managed | Hive, S3 | S3 |
| Point-in-Time | Yes | Yes | Yes | Yes |
| Monitoring | External | Integrated | Basic | External |
| Cloud Lock-in | No | No | No | AWS |

Recommendation:

  • Open-source, self-managed → Feast
  • Managed, production-grade → Tecton
  • AWS ecosystem → SageMaker Feature Store
  • Databricks users → Databricks Feature Store

For detailed decision framework, see references/decision-frameworks.md.

Framework 3: Model Serving Platform Selection

Decision Tree:

Infrastructure:

  • Kubernetes-based → Advanced deployment patterns needed?
    • Yes → Seldon Core (most features) or KServe (CNCF standard)
    • No → BentoML (simpler, Python-first)
  • Cloud-native (managed) → Cloud provider?
    • AWS → SageMaker Endpoints
    • GCP → Vertex AI Endpoints
    • Azure → Azure ML Endpoints
  • Framework-specific → Framework?
    • PyTorch → TorchServe
    • TensorFlow → TensorFlow Serving
  • Serverless / minimal infrastructure → BentoML or Cloud Functions

Detailed Criteria:

| Feature | Seldon Core | KServe | BentoML | TorchServe |
|---|---|---|---|---|
| Kubernetes-Native | Yes | Yes | Optional | No |
| Multi-Framework | Yes | Yes | Yes | PyTorch-only |
| Deployment Strategies | Excellent | Good | Basic | Basic |
| Explainability | Integrated | Integrated | External | No |
| Complexity | High | Medium | Low | Low |
| Learning Curve | Steep | Medium | Gentle | Gentle |

Recommendation:

  • Kubernetes, advanced deployments → Seldon Core or KServe
  • Python-first, simplicity → BentoML
  • PyTorch-specific → TorchServe
  • TensorFlow-specific → TensorFlow Serving
  • Managed solution → SageMaker/Vertex AI/Azure ML

For detailed decision framework, see references/decision-frameworks.md.

Framework 4: ML Pipeline Orchestration Selection

Decision Matrix:

Primary use case:

  • ML-specific pipelines, Kubernetes-native → Kubeflow Pipelines
  • General-purpose orchestration, mature ecosystem → Apache Airflow
  • Data science workflows, ease of use → Metaflow
  • Modern approach, asset-based thinking → Dagster
  • Dynamic workflows, Python-native → Prefect

Criteria Comparison:

| Factor | Kubeflow | Airflow | Metaflow | Dagster | Prefect |
|---|---|---|---|---|---|
| ML-Specific | Excellent | Good | Excellent | Good | Good |
| Kubernetes | Native | Compatible | Optional | Compatible | Compatible |
| Learning Curve | Steep | Steep | Gentle | Medium | Medium |
| Maturity | High | Very High | Medium | Medium | Medium |
| Community | Large | Very Large | Growing | Growing | Growing |

Recommendation:

  • ML-specific, Kubernetes → Kubeflow Pipelines
  • Mature, battle-tested → Apache Airflow
  • Data scientists → Metaflow
  • Software engineers → Dagster
  • Modern, simpler than Airflow → Prefect

For detailed decision framework, see references/decision-frameworks.md.

Implementation Patterns

Pattern 1: End-to-End ML Pipeline

Automate the complete ML workflow from data to deployment.

Pipeline Stages:

  1. Data Validation (Great Expectations)
  2. Feature Engineering (transform raw data)
  3. Data Splitting (train/validation/test)
  4. Model Training (with hyperparameter tuning)
  5. Model Evaluation (accuracy, fairness, explainability)
  6. Model Registration (push to MLflow registry)
  7. Deployment (promote to staging/production)

Architecture:

Data Lake → Data Validation → Feature Engineering → Training → Evaluation
    ↓
Model Registry (staging) → Testing → Production Deployment

For implementation details and code examples, see references/ml-pipelines.md.

Pattern 2: Continuous Training

Automate model retraining based on drift detection.

Workflow:

  1. Monitor production data for distribution changes
  2. Detect data drift (KS test, PSI)
  3. Trigger automated retraining pipeline
  4. Validate new model (accuracy, fairness)
  5. Deploy via canary strategy (5% → 100%)
  6. Monitor new model performance
  7. Rollback if metrics degrade

Trigger Conditions:

  • Scheduled: Daily/weekly retraining
  • Data drift: KS test p-value < 0.05
  • Model drift: Accuracy drop > 5%
  • Data volume: New training data exceeds threshold (10K samples)
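
A minimal sketch of combining these trigger conditions into a single retraining decision; the thresholds mirror the list above and should be tuned per model:

```python
def should_retrain(
    ks_p_value: float,
    accuracy_drop: float,
    new_samples: int,
    days_since_last_training: int,
) -> bool:
    return (
        ks_p_value < 0.05                  # data drift detected
        or accuracy_drop > 0.05            # model drift: accuracy down >5%
        or new_samples >= 10_000           # enough fresh labeled data
        or days_since_last_training >= 7   # scheduled weekly retrain
    )
```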

For implementation details, see references/ml-pipelines.md.

Pattern 3: Feature Store Integration

Ensure consistent features between training and inference.

Architecture:

Offline Store (Training):
  Parquet/BigQuery → Point-in-Time Join → Training Dataset

Online Store (Inference):
  Redis/DynamoDB → Low-Latency Lookup → Real-Time Prediction

Point-in-Time Correctness:

  • Training: Fetch features as of specific timestamps (no future data)
  • Inference: Fetch latest features (only past data)
  • Guarantee: Same feature logic in training and inference

For implementation details and code examples, see references/feature-stores.md.

Pattern 4: Shadow Deployment Testing

Test new models in production without risk.

Workflow:

  1. Deploy new model (v2) in shadow mode
  2. v2 receives copy of production traffic
  3. v1 predictions used for responses (no user impact)
  4. Compare v1 and v2 predictions offline
  5. Analyze differences, measure v2 accuracy
  6. Promote v2 to production if performance acceptable
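
A minimal sketch of a shadow proxy in FastAPI: the primary (v1) response is returned to the caller while a copy of the request is mirrored to v2 in the background and logged for offline comparison. Service URLs and payload shape are placeholders:

```python
import httpx
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
PRIMARY_URL = "http://model-v1/predict"  # placeholder service URLs
SHADOW_URL = "http://model-v2/predict"

def mirror_to_shadow(payload: dict) -> None:
    # Shadow failures must never affect the user-facing path
    try:
        shadow_prediction = httpx.post(SHADOW_URL, json=payload, timeout=1.0).json()
        print({"payload": payload, "shadow_prediction": shadow_prediction})
    except httpx.HTTPError:
        pass

@app.post("/predict")
def predict(payload: dict, background_tasks: BackgroundTasks):
    response = httpx.post(PRIMARY_URL, json=payload, timeout=1.0).json()
    background_tasks.add_task(mirror_to_shadow, payload)
    return response  # only v1's prediction reaches the user
```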

Use Cases:

  • High-risk models (financial, healthcare, safety-critical)
  • Need extensive testing before cutover
  • Compare model behavior on real production data

For deployment architecture, see references/deployment-strategies.md.

Tool Recommendations

Production-Ready Tools (High Adoption)

MLflow – Experiment Tracking & Model Registry

  • GitHub Stars: 20,000+
  • Trust Score: 95/100
  • Use Cases: Experiment tracking, model registry, model serving
  • Strengths: Open-source, framework-agnostic, self-hosted option
  • Getting Started: pip install mlflow && mlflow server

Feast – Feature Store

  • GitHub Stars: 5,000+
  • Trust Score: 85/100
  • Use Cases: Online/offline feature serving, point-in-time correctness
  • Strengths: Cloud-agnostic, most popular open-source feature store
  • Getting Started: pip install feast && feast init

Seldon Core – Model Serving (Advanced)

  • GitHub Stars: 4,000+
  • Trust Score: 85/100
  • Use Cases: Kubernetes-native serving, advanced deployment patterns
  • Strengths: Canary, A/B testing, MAB, explainability
  • Limitation: High complexity, steep learning curve

KServe – Model Serving (CNCF Standard)

  • GitHub Stars: 3,500+
  • Trust Score: 85/100
  • Use Cases: Standardized serving API, serverless scaling
  • Strengths: CNCF project, Knative integration, growing adoption
  • Limitation: Kubernetes required

BentoML – Model Serving (Simplicity)

  • GitHub Stars: 6,000+
  • Trust Score: 80/100
  • Use Cases: Easy packaging, Python-first deployment
  • Strengths: Lowest learning curve, excellent developer experience
  • Limitation: Fewer advanced features than Seldon/KServe

Kubeflow Pipelines – ML Orchestration

  • GitHub Stars: 14,000+ (Kubeflow project)
  • Trust Score: 90/100
  • Use Cases: ML-specific pipelines, Kubernetes-native workflows
  • Strengths: ML-native, component reusability, Katib integration
  • Limitation: Kubernetes required, steep learning curve

Weights & Biases – Experiment Tracking (SaaS)

  • Trust Score: 90/100
  • Use Cases: Team collaboration, advanced visualization, hyperparameter tuning
  • Strengths: Best-in-class UI, integrated Sweeps, strong community
  • Limitation: SaaS pricing, no self-hosted free tier

For detailed tool comparisons, see references/tool-recommendations.md.

Tool Stack Recommendations by Organization

Startup (Cost-Optimized, Simple):

  • Experiment Tracking: MLflow (free, self-hosted)
  • Feature Store: None initially → Feast when needed
  • Model Serving: BentoML (easy) or cloud functions
  • Orchestration: Prefect or cron jobs
  • Monitoring: Basic logging + Prometheus

Growth Company (Balanced):

  • Experiment Tracking: Weights & Biases or MLflow
  • Feature Store: Feast (open-source, production-ready)
  • Model Serving: BentoML or KServe (Kubernetes-based)
  • Orchestration: Kubeflow Pipelines or Airflow
  • Monitoring: Evidently + Prometheus + Grafana

Enterprise (Full Stack):

  • Experiment Tracking: MLflow (self-hosted) or Neptune.ai (compliance)
  • Feature Store: Tecton (managed) or Feast (self-hosted)
  • Model Serving: Seldon Core (advanced) or KServe
  • Orchestration: Kubeflow Pipelines or Airflow
  • Monitoring: Evidently + Prometheus + Grafana + PagerDuty

Cloud-Native (Managed Services):

  • AWS: SageMaker (end-to-end platform)
  • GCP: Vertex AI (end-to-end platform)
  • Azure: Azure ML (end-to-end platform)

For scenario-specific recommendations, see references/scenarios.md.

Common Scenarios

Scenario 1: Startup MLOps Stack

Context: 20-person startup, 5 data scientists, 3 models (fraud detection, recommendation, churn), limited budget.

Recommendation:

  • Experiment Tracking: MLflow (free, self-hosted)
  • Model Serving: BentoML (simple, fast iteration)
  • Orchestration: Prefect (simpler than Airflow)
  • Monitoring: Prometheus + basic drift detection
  • Feature Store: Skip initially, use database tables

Rationale:

  • Minimize cost (all open-source, self-hosted)
  • Fast iteration (BentoML easy to deploy)
  • Don’t over-engineer (no Kubeflow for 3 models)
  • Add feature store (Feast) when scaling to 10+ models

For detailed scenario, see references/scenarios.md.

Scenario 2: Enterprise ML Platform

Context: 500-person company, 50 data scientists, 100+ models, regulatory compliance, multi-cloud.

Recommendation:

  • Experiment Tracking: Neptune.ai (compliance, audit logs) or MLflow (cost)
  • Feature Store: Feast (self-hosted, cloud-agnostic)
  • Model Serving: Seldon Core (advanced deployment patterns)
  • Orchestration: Kubeflow Pipelines (ML-native, Kubernetes)
  • Monitoring: Evidently + Prometheus + Grafana + PagerDuty

Rationale:

  • Compliance required (Neptune audit logs, RBAC)
  • Multi-cloud (Feast cloud-agnostic)
  • Advanced deployments (Seldon canary, A/B testing)
  • Scale (Kubernetes for 100+ models)

For detailed scenario, see references/scenarios.md.

Scenario 3: LLM Fine-Tuning Pipeline

Context: Fine-tune LLM for domain-specific use case, deploy for production serving.

Recommendation:

  • Experiment Tracking: MLflow (track fine-tuning runs)
  • Pipeline Orchestration: Kubeflow Pipelines (GPU scheduling)
  • Model Serving: vLLM (high-throughput LLM serving)
  • Prompt Versioning: Git + LangSmith
  • Monitoring: Arize Phoenix (RAG monitoring)

Rationale:

  • Track fine-tuning experiments (LoRA adapters, hyperparameters)
  • GPU orchestration (Kubeflow on Kubernetes)
  • Efficient LLM serving (vLLM optimized for throughput)
  • Monitor RAG systems (retrieval + generation quality)

For detailed scenario, see references/scenarios.md.

Integration with Other Skills

Direct Dependencies:

  • ai-data-engineering: Feature engineering, ML algorithms, data preparation
  • kubernetes-operations: K8s cluster management, GPU scheduling for ML workloads
  • observability: Monitoring, alerting, distributed tracing for ML systems

Complementary Skills:

  • data-architecture: Data pipelines, data lakes feeding ML models
  • data-transformation: dbt for feature transformation pipelines
  • streaming-data: Kafka, Flink for real-time ML inference
  • designing-distributed-systems: Scalability patterns for ML workloads
  • api-design-principles: ML model APIs, REST/gRPC serving patterns

Downstream Skills:

  • building-ai-chat: LLM-powered applications consuming ML models
  • visualizing-data: Dashboards for ML metrics and monitoring

Best Practices

  1. Version Everything:

    • Code: Git commit SHA for reproducibility
    • Data: DVC or data version hash
    • Models: Semantic versioning (v1.2.3)
    • Features: Feature store versioning
  2. Automate Testing:

    • Unit tests: Model loads, accepts input, produces output
    • Integration tests: End-to-end pipeline execution
    • Model validation: Accuracy thresholds, fairness checks
  3. Monitor Continuously:

    • Data drift: Distribution changes over time
    • Model drift: Accuracy degradation
    • Performance: Latency, throughput, error rates
  4. Start Simple:

    • Begin with MLflow + basic serving (BentoML)
    • Add complexity as needed (feature store, Kubeflow)
    • Avoid over-engineering (don’t build Kubeflow for 2 models)
  5. Point-in-Time Correctness:

    • Use feature stores to avoid training/serving skew
    • Ensure no future data leakage in training
    • Consistent feature logic in training and inference
  6. Deployment Strategies:

    • Use canary for medium-risk models (gradual rollout)
    • Use shadow for high-risk models (zero production impact)
    • Always have rollback plan (instant switch to previous version)
  7. Governance:

    • Model cards: Document model purpose, limitations, biases
    • Audit trails: Track all model versions, deployments, approvals
    • Compliance: EU AI Act, model risk management (SR 11-7)
  8. Cost Optimization:

    • Quantization: Reduce model size 4x, inference speed 2-3x
    • Spot instances: Train on preemptible VMs (60-90% cost reduction)
    • Autoscaling: Scale inference endpoints based on load

Anti-Patterns

❌ Notebooks in Production:

  • Never deploy Jupyter notebooks to production
  • Use notebooks for experimentation only
  • Production: Use scripts, Docker containers, CI/CD pipelines

❌ Manual Model Deployment:

  • Automate deployment with CI/CD pipelines
  • Use model registry stage transitions (staging → production)
  • Eliminate human error, ensure reproducibility

❌ No Monitoring:

  • Production models without monitoring will degrade silently
  • Implement drift detection (data drift, model drift)
  • Set up alerting for accuracy drops, latency spikes

❌ Training/Serving Skew:

  • Different feature logic in training vs inference
  • Use feature stores to ensure consistency
  • Test feature parity before production deployment

❌ Ignoring Data Quality:

  • Garbage in, garbage out (GIGO)
  • Validate data schema, ranges, distributions
  • Use Great Expectations for data validation

❌ Over-Engineering:

  • Don’t build Kubeflow for 2 models
  • Start simple (MLflow + BentoML)
  • Add complexity only when necessary (10+ models)

❌ No Rollback Plan:

  • Always have ability to rollback to previous model version
  • Blue-green, canary, shadow deployments enable instant rollback
  • Test rollback procedure before production deployment

Further Reading

Reference Files:

  • references/experiment-tracking.md
  • references/model-registry.md
  • references/feature-stores.md
  • references/model-serving.md
  • references/deployment-strategies.md
  • references/ml-pipelines.md
  • references/model-monitoring.md
  • references/llmops-patterns.md
  • references/governance.md
  • references/decision-frameworks.md
  • references/tool-recommendations.md
  • references/scenarios.md

Example Projects:

Scripts: