machine-learning-engineer
npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill machine-learning-engineer
Skill Documentation
Machine Learning Engineer
Purpose
Provides ML engineering expertise in model deployment, production serving infrastructure, and real-time inference systems. Designs scalable ML platforms with model optimization, auto-scaling, and monitoring for reliable production machine learning workloads.
This skill provides expert ML engineering capabilities for deploying and serving machine learning models at scale. It focuses on model optimization, inference infrastructure, real-time serving, and edge deployment with emphasis on building reliable, performant ML systems for production workloads.
When to Use
User needs:
- ML model deployment to production
- Real-time inference API development
- Model optimization and compression
- Batch prediction systems
- Auto-scaling and load balancing
- Edge deployment for IoT/mobile
- Multi-model serving orchestration
- Performance tuning and latency optimization
What This Skill Does
This skill deploys ML models to production with comprehensive infrastructure. It optimizes models for inference, builds serving pipelines, configures auto-scaling, implements monitoring, and ensures models meet performance, reliability, and scalability requirements in production environments.
ML Deployment Components
- Model optimization and compression
- Serving infrastructure (REST/gRPC APIs, batch jobs)
- Load balancing and request routing
- Auto-scaling and resource management
- Real-time and batch prediction systems
- Monitoring, logging, and observability
- Edge deployment and model compression
- A/B testing and canary deployments
Core Capabilities
Model Deployment Pipelines
- CI/CD integration for ML models
- Automated testing and validation
- Model performance benchmarking
- Security scanning and vulnerability assessment
- Container building and registry management
- Progressive rollout and blue-green deployment
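A minimal sketch of the automated validation gate in such a pipeline is shown below; the file names, metric, and tolerance are illustrative assumptions rather than part of any specific CI system:
# ci_validate_model.py - promotion gate run by the deployment pipeline
import json
import sys
import numpy as np
import onnxruntime as ort
from sklearn.metrics import roc_auc_score
def main() -> int:
    data = np.load("validation_split.npz")  # assumed held-out split produced by the training job
    baseline_auc = json.load(open("baseline_metrics.json"))["auc"]
    session = ort.InferenceSession("candidate_model.onnx")
    scores = session.run(None, {"input": data["X"].astype(np.float32)})[0].ravel()
    candidate_auc = roc_auc_score(data["y"], scores)  # assumes one score per example
    print(f"candidate AUC={candidate_auc:.4f}, baseline AUC={baseline_auc:.4f}")
    # A non-zero exit code fails the CI job and blocks promotion on regression beyond tolerance
    return 0 if candidate_auc >= baseline_auc - 0.002 else 1
if __name__ == "__main__":
    sys.exit(main())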
Serving Infrastructure
- Load balancer configuration (NGINX, HAProxy)
- Request routing and model caching
- Connection pooling and health checking
- Graceful shutdown and resource allocation
- Multi-region deployment and failover
- Container orchestration (Kubernetes, ECS)
Model Optimization
- Quantization from FP32 baselines to FP16, INT8, or INT4
- Model pruning and sparsification
- Knowledge distillation techniques
- ONNX and TensorRT conversion
- Graph optimization and operator fusion
- Memory optimization and throughput tuning
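As an illustration of the export-and-quantize workflow above, the sketch below assumes a small PyTorch network and applies ONNX Runtime's dynamic INT8 quantization; the architecture, paths, and input shape are placeholders:
# Export a PyTorch model to ONNX, then quantize weights to INT8
import torch
import torch.nn as nn
from onnxruntime.quantization import QuantType, quantize_dynamic
# Stand-in for the trained network; in practice, load your own trained weights
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
model.eval()
dummy_input = torch.randn(1, 32)  # illustrative feature dimension
torch.onnx.export(
    model,
    dummy_input,
    "model_fp32.onnx",
    input_names=["input"],
    output_names=["score"],
    dynamic_axes={"input": {0: "batch"}},  # allow variable batch size at serving time
)
# Dynamic quantization stores weights as INT8 while activations stay floating point,
# so no calibration dataset is needed (static quantization would require one)
quantize_dynamic("model_fp32.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)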
Real-time Inference
- Request preprocessing and validation
- Model prediction execution
- Response formatting and error handling
- Timeout management and circuit breaking
- Request batching and response caching
- Streaming predictions and async processing
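The request-batching idea can be sketched with asyncio: concurrent requests wait for a short window so they are scored as one batch. The batch size, window, and input name are illustrative assumptions:
# Micro-batching sketch: amortize model calls across concurrent requests
import asyncio
import numpy as np
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
request_queue: asyncio.Queue = asyncio.Queue()
async def batch_worker(max_batch: int = 32, window_ms: float = 5.0) -> None:
    # Start once at application startup, e.g. asyncio.create_task(batch_worker())
    while True:
        batch = [await request_queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + window_ms / 1000
        while len(batch) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(request_queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        inputs = np.array([features for features, _ in batch], dtype=np.float32)
        outputs = session.run(None, {"input": inputs})[0]
        for (_, future), row in zip(batch, outputs):
            future.set_result(row.tolist())
async def predict(features: list) -> list:
    # Called per request; resolves once the worker has scored the batch
    future = asyncio.get_running_loop().create_future()
    await request_queue.put((features, future))
    return await future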
Batch Prediction Systems
- Job scheduling and orchestration
- Data partitioning and parallel processing
- Progress tracking and error handling
- Result aggregation and storage
- Cost optimization and resource management
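The partition-and-parallelize pattern can be sketched with pandas chunks and a process pool; the file names and chunk size are assumptions, and a production job would normally run under a scheduler or a distributed engine:
# Score a large CSV in partitions across worker processes, appending results incrementally
from concurrent.futures import ProcessPoolExecutor
import numpy as np
import pandas as pd
def score_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Each worker loads its own session so nothing non-picklable crosses process boundaries
    import onnxruntime as ort
    session = ort.InferenceSession("model.onnx")
    features = chunk.to_numpy(dtype=np.float32)  # assumes all columns are numeric features
    scored = chunk.copy()
    scored["score"] = session.run(None, {"input": features})[0].ravel()
    return scored
if __name__ == "__main__":
    chunks = pd.read_csv("inputs.csv", chunksize=50_000)  # partition the input data
    with ProcessPoolExecutor(max_workers=4) as pool:
        for i, result in enumerate(pool.map(score_chunk, chunks)):
            result.to_csv("predictions.csv", mode="a", header=(i == 0), index=False)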
Auto-scaling Strategies
- Metric-based scaling (CPU, GPU, request rate)
- Scale-up and scale-down policies
- Warm-up periods and predictive scaling
- Cost controls and regional distribution
- Traffic prediction and capacity planning
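Metric-based scaling follows the proportional rule used by Kubernetes' Horizontal Pod Autoscaler: desired replicas grow with the ratio of observed to target load. The bounds and tolerance below are illustrative:
import math
def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 2, max_replicas: int = 50, tolerance: float = 0.1) -> int:
    # Proportional scaling: replicas track metric pressure, clamped to configured bounds
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:  # ignore small fluctuations to avoid thrashing
        return current_replicas
    return max(min_replicas, min(max_replicas, math.ceil(current_replicas * ratio)))
# e.g. 4 pods serving 180 requests/s each against a 100 requests/s target -> 8 pods
print(desired_replicas(4, current_metric=180, target_metric=100))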
Multi-model Serving
- Model routing and version management
- A/B testing and traffic splitting
- Ensemble serving and model cascading
- Fallback strategies and performance isolation
- Shadow mode testing and validation
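A minimal sketch of version-aware routing with weighted traffic splitting and a fallback; the model names, weights, and lambda predictors are illustrative stand-ins for real serving backends:
import random
from typing import Callable, Dict, List, Tuple
class ModelRouter:
    # Route each request to a variant chosen by traffic weight, with graceful fallback
    def __init__(self) -> None:
        self.routes: Dict[str, List[Tuple[Callable, float]]] = {}
        self.fallback: Dict[str, Callable] = {}
    def register(self, task: str, model_fn: Callable, weight: float, fallback: bool = False) -> None:
        self.routes.setdefault(task, []).append((model_fn, weight))
        if fallback:
            self.fallback[task] = model_fn
    def predict(self, task: str, payload):
        variants = self.routes[task]
        chosen = random.choices([fn for fn, _ in variants], weights=[w for _, w in variants])[0]
        try:
            return chosen(payload)
        except Exception:
            return self.fallback[task](payload)  # degrade to the designated fallback variant
# 90/10 split between the current model and a challenger for fraud scoring
router = ModelRouter()
router.register("fraud", lambda x: {"model": "v3", "score": 0.12}, weight=0.9, fallback=True)
router.register("fraud", lambda x: {"model": "v4-candidate", "score": 0.08}, weight=0.1)
print(router.predict("fraud", {"amount": 120.0}))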
Edge Deployment
- Model compression for edge devices
- Hardware optimization and power efficiency
- Offline capability and update mechanisms
- Telemetry collection and security hardening
- Resource constraints and optimization
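As a sketch of the compression step for an edge target, assuming a TensorFlow SavedModel and full-integer quantization via TensorFlow Lite; the representative dataset and paths are placeholders:
# Convert a SavedModel to a fully INT8-quantized TFLite model for edge devices
import numpy as np
import tensorflow as tf
def representative_dataset():
    # Random data stands in for a few hundred real samples drawn from production traffic
    for _ in range(200):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8   # integer I/O for NPU/microcontroller targets
converter.inference_output_type = tf.uint8
with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())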
Tool Restrictions
- Read: Access model artifacts, infrastructure configs, and monitoring data
- Write/Edit: Create deployment configs, serving code, and optimization scripts
- Bash: Execute deployment commands, monitoring setup, and performance tests
- Glob/Grep: Search codebases for model integration and serving endpoints
Integration with Other Skills
- ml-engineer: Model optimization and training pipeline integration
- mlops-engineer: Infrastructure and platform setup
- data-engineer: Data pipelines and feature stores
- devops-engineer: CI/CD and deployment automation
- cloud-architect: Cloud infrastructure and architecture
- sre-engineer: Reliability and availability
- performance-engineer: Performance profiling and optimization
- ai-engineer: Model selection and integration
Example Interactions
Scenario 1: Real-time Inference API Deployment
User: “Deploy our ML model as a real-time API with auto-scaling”
Interaction:
- Skill analyzes model characteristics and requirements
- Implements serving infrastructure:
  - Optimizes model with ONNX conversion (60% size reduction)
  - Creates FastAPI/gRPC serving endpoints
  - Configures GPU auto-scaling based on request rate
  - Implements request batching for throughput
  - Sets up monitoring and alerting
- Deploys to Kubernetes with horizontal pod autoscaler
- Achieves <50ms P99 latency and 2,000+ RPS throughput
Scenario 2: Multi-model Serving Platform
User: “Build a platform to serve 50+ models with intelligent routing”
Interaction:
- Skill designs multi-model architecture:
  - Model registry and version management
  - Intelligent routing based on request type
  - Specialist models for different use cases
  - Fallback and circuit breaking
  - Cost optimization with smaller models for simple queries
- Implements serving framework with:
  - Model loading and unloading
  - Request queuing and load balancing
  - A/B testing and traffic splitting
  - Ensemble serving for critical paths
- Deploys with comprehensive monitoring and cost tracking
Scenario 3: Edge Deployment for IoT
User: “Deploy ML model to edge devices with limited resources”
Interaction:
- Skill analyzes device constraints and requirements
- Optimizes model for edge:
  - Quantizes to INT8 (4x size reduction)
  - Prunes and compresses model
  - Implements ONNX Runtime for efficient inference
  - Adds offline capability and local caching
- Creates deployment package:
  - Edge-optimized inference runtime
  - Update mechanism with delta updates
  - Telemetry collection and monitoring
  - Security hardening and encryption
- Tests on target hardware and validates performance
Best Practices
- Performance: Target <100ms P99 latency for real-time inference
- Reliability: Implement graceful degradation and fallback models
- Monitoring: Track latency, throughput, error rates, and resource usage
- Testing: Conduct load testing and validate against production traffic patterns
- Security: Implement authentication, encryption, and model security
- Documentation: Document all deployment configurations and operational procedures
- Cost: Optimize resource usage and implement auto-scaling for cost efficiency
Examples
Example 1: Real-Time Inference API for Production
Scenario: Deploy a fraud detection model as a real-time API with auto-scaling.
Deployment Approach:
- Model Optimization: Converted model to ONNX (60% size reduction)
- Serving Framework: Built FastAPI endpoints with async processing
- Infrastructure: Kubernetes deployment with Horizontal Pod Autoscaler
- Monitoring: Integrated Prometheus metrics and Grafana dashboards
Configuration:
# FastAPI serving with optimization
from typing import List
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
app = FastAPI()
session = ort.InferenceSession("model.onnx")
@app.post("/predict")
async def predict(features: List[float]):
    # Shape the request into the (batch, features) float32 tensor the ONNX model expects
    input_tensor = np.array([features], dtype=np.float32)
    outputs = session.run(None, {"input": input_tensor})
    return {"prediction": outputs[0].tolist()}
Performance Results:
| Metric | Value |
|---|---|
| P99 Latency | 45ms |
| Throughput | 2,500 RPS |
| Availability | 99.99% |
| Auto-scaling | 2-50 pods |
Example 2: Multi-Model Serving Platform
Scenario: Build a platform serving 50+ ML models for different prediction types.
Architecture Design:
- Model Registry: Central registry with versioning
- Router: Intelligent routing based on request type
- Resource Manager: Dynamic resource allocation per model
- Fallback System: Graceful degradation for unavailable models
Implementation:
- Model loading/unloading based on request patterns
- A/B testing framework for model comparisons
- Cost optimization with model prioritization
- Shadow mode testing for new models
Results:
- 50+ models deployed with 99.9% uptime
- 40% reduction in infrastructure costs
- Zero downtime during model updates
- 95% cache hit rate for frequent requests
Example 3: Edge Deployment for Mobile Devices
Scenario: Deploy image classification model to iOS and Android apps.
Edge Optimization:
- Model Compression: Quantized to INT8 (4x size reduction)
- Runtime Selection: CoreML for iOS, TFLite for Android
- On-Device Caching: Intelligent model caching and updates
- Privacy Compliance: All processing on-device
Performance Metrics:
| Model Variant | Model Size | Inference Time | Accuracy |
|---|---|---|---|
| Original | 25 MB | 150ms | 94.2% |
| Optimized | 6 MB | 35ms | 93.8% |
Results:
- 80% reduction in app download size
- 4x faster inference on device
- Offline capability with local inference
- GDPR compliant (no data leaves device)
Best Practices
Model Optimization
- Quantization: Start with FP16, move to INT8 for edge
- Pruning: Remove unnecessary weights for efficiency
- Distillation: Transfer knowledge to smaller models
- ONNX Export: Standard format for cross-platform deployment
- Benchmarking: Always test on target hardware
Production Serving
- Health Checks: Implement /health and /ready endpoints
- Graceful Degradation: Fallback to simpler models or heuristics
- Circuit Breakers: Prevent cascade failures
- Rate Limiting: Protect against abuse and overuse
- Caching: Cache predictions for identical inputs
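A sketch of the health-check and caching practices above on a FastAPI service; the readiness condition, cache size, and run_model stub are assumptions:
from functools import lru_cache
from fastapi import FastAPI, Response, status
app = FastAPI()
model_loaded = False  # flipped to True once the model artifact is loaded into memory
@app.get("/health")
async def health():
    # Liveness: the process is up and able to answer HTTP
    return {"status": "ok"}
@app.get("/ready")
async def ready(response: Response):
    # Readiness: only accept traffic once the model has actually loaded
    if not model_loaded:
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
        return {"status": "loading"}
    return {"status": "ready"}
def run_model(features: tuple) -> float:
    # Placeholder for the real inference call (e.g. an ONNX Runtime session)
    return sum(features) / max(len(features), 1)
@lru_cache(maxsize=10_000)
def cached_predict(features: tuple) -> float:
    # Identical inputs hit the cache instead of re-running the model
    return run_model(features)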
Monitoring and Observability
- Latency Tracking: Monitor P50, P95, P99 latencies
- Error Rates: Track failures and error types
- Prediction Distribution: Alert on distribution shifts
- Resource Usage: CPU, GPU, memory monitoring
- Business Metrics: Track model impact on KPIs
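Latency percentiles and error rates can be exported with the Prometheus Python client, for example; the bucket boundaries and label names below are illustrative:
import time
from prometheus_client import Counter, Histogram, start_http_server
# Buckets chosen around the latency SLO so P95/P99 can be derived at query time
LATENCY = Histogram("inference_latency_seconds", "Model inference latency", ["model_version"],
                    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0))
ERRORS = Counter("inference_errors_total", "Failed predictions", ["model_version", "error_type"])
def predict_with_metrics(model, features, version: str = "v3"):
    start = time.perf_counter()
    try:
        return model(features)
    except Exception as exc:
        ERRORS.labels(model_version=version, error_type=type(exc).__name__).inc()
        raise
    finally:
        LATENCY.labels(model_version=version).observe(time.perf_counter() - start)
if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for Prometheus to scrape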
Security and Compliance
- Model Security: Protect model weights and artifacts
- Input Validation: Sanitize all prediction inputs
- Output Filtering: Prevent sensitive data exposure
- Audit Logging: Log all prediction requests
- Compliance: Meet industry regulations (HIPAA, GDPR)
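Input validation can be enforced at the API boundary, for example with Pydantic v2 models; the field names and bounds below are illustrative for a fraud-scoring payload:
from pydantic import BaseModel, Field, field_validator
class FraudRequest(BaseModel):
    # Reject malformed or out-of-range payloads before they reach the model
    amount: float = Field(gt=0, lt=1_000_000)  # plausible transaction range
    merchant_id: str = Field(min_length=1, max_length=64)
    features: list[float] = Field(min_length=32, max_length=32)  # fixed-size feature vector
    @field_validator("features")
    @classmethod
    def finite_values(cls, v: list[float]) -> list[float]:
        if any(x != x or abs(x) > 1e6 for x in v):  # reject NaN or absurd magnitudes
            raise ValueError("features contain non-finite or out-of-range values")
        return v
# Declaring `def predict(req: FraudRequest)` in FastAPI returns 422 automatically on invalid input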
Anti-Patterns
Model Deployment Anti-Patterns
- Manual Deployment: Deploying models without automation – implement CI/CD for models
- No Versioning: Replacing models without tracking versions – maintain model version history
- Hotfix Culture: Making urgent model changes without testing – require validation before deployment
- Black Box Deployment: Deploying models without explainability – implement model interpretability
Performance Anti-Patterns
- No Baselines: Deploying without performance benchmarks – establish performance baselines
- Over-Optimization: Tuning beyond practical benefit – focus on customer-impacting metrics
- Ignore Latency: Focusing only on accuracy, ignoring latency – optimize for real-world use cases
- Resource Waste: Over-provisioning infrastructure – right-size resources based on actual load
Monitoring Anti-Patterns
- Silent Failures: Models failing without detection – implement comprehensive health checks
- Metric Overload: Monitoring too many metrics – focus on actionable metrics
- Data Drift Blindness: Not detecting model degradation – monitor input data distribution
- Alert Fatigue: Too many alerts causing ignored warnings – tune alert thresholds
Scalability Anti-Patterns
- No Load Testing: Deploying without performance testing – test with production-like traffic
- Single Point of Failure: No redundancy in serving infrastructure – implement failover
- No Autoscaling: Manual capacity management – implement automatic scaling
- Stateful Design: Inference that requires state – design stateless inference
Output Format
This skill delivers:
- Complete model serving infrastructure (Docker, Kubernetes configs)
- Production deployment pipelines and CI/CD workflows
- Real-time and batch prediction APIs
- Model optimization artifacts and configurations
- Auto-scaling policies and infrastructure as code
- Monitoring dashboards and alert configurations
- Performance benchmarks and load test reports
All outputs include:
- Detailed architecture documentation
- Deployment scripts and configurations
- Performance metrics and SLA validations
- Security hardening guidelines
- Operational runbooks and troubleshooting guides
- Cost analysis and optimization recommendations