machine-learning-engineer

📁 404kidwiz/claude-supercode-skills 📅 Jan 24, 2026
Total installs: 97
Weekly installs: 97
Site-wide rank: #2368

Install command
npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill machine-learning-engineer

Agent Install Distribution

opencode 72
gemini-cli 63
codex 60
github-copilot 48
cursor 40

Skill Documentation

Machine Learning Engineer

Purpose

Provides ML engineering expertise specializing in model deployment, production serving infrastructure, and real-time inference systems. Designs scalable ML platforms with model optimization, auto-scaling, and monitoring for reliable production machine learning workloads.

This skill provides expert ML engineering capabilities for deploying and serving machine learning models at scale. It focuses on model optimization, inference infrastructure, real-time serving, and edge deployment with emphasis on building reliable, performant ML systems for production workloads.

When to Use

User needs:

  • ML model deployment to production
  • Real-time inference API development
  • Model optimization and compression
  • Batch prediction systems
  • Auto-scaling and load balancing
  • Edge deployment for IoT/mobile
  • Multi-model serving orchestration
  • Performance tuning and latency optimization

What This Skill Does

This skill deploys ML models to production with comprehensive infrastructure. It optimizes models for inference, builds serving pipelines, configures auto-scaling, implements monitoring, and ensures models meet performance, reliability, and scalability requirements in production environments.

ML Deployment Components

  • Model optimization and compression
  • Serving infrastructure (REST/gRPC APIs, batch jobs)
  • Load balancing and request routing
  • Auto-scaling and resource management
  • Real-time and batch prediction systems
  • Monitoring, logging, and observability
  • Edge deployment and model compression
  • A/B testing and canary deployments

Core Capabilities

Model Deployment Pipelines

  • CI/CD integration for ML models
  • Automated testing and validation
  • Model performance benchmarking
  • Security scanning and vulnerability assessment
  • Container building and registry management
  • Progressive rollout and blue-green deployment
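
As a minimal sketch of the automated testing and validation step above, the gate below blocks promotion when a candidate model misses its thresholds; the metric names and threshold values are illustrative assumptions, not fixed outputs of this skill:

# Hypothetical pre-deployment gate used in a model CI/CD job.
def validate_candidate(metrics: dict, min_accuracy: float = 0.92,
                       max_p99_ms: float = 100.0) -> None:
    checks = {
        "accuracy": metrics["accuracy"] >= min_accuracy,
        "p99_latency_ms": metrics["p99_latency_ms"] <= max_p99_ms,
    }
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        # Fail the pipeline so the candidate never reaches production.
        raise SystemExit(f"Deployment blocked; failed checks: {failed}")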

Serving Infrastructure

  • Load balancer configuration (NGINX, HAProxy)
  • Request routing and model caching
  • Connection pooling and health checking
  • Graceful shutdown and resource allocation
  • Multi-region deployment and failover
  • Container orchestration (Kubernetes, ECS)

Model Optimization

  • Quantization from FP32 down to FP16, INT8, or INT4
  • Model pruning and sparsification
  • Knowledge distillation techniques
  • ONNX and TensorRT conversion
  • Graph optimization and operator fusion
  • Memory optimization and throughput tuning
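
For the quantization and ONNX items above, a minimal sketch using ONNX Runtime's post-training dynamic quantization; the file names are placeholders and the achievable size reduction depends on the model:

import os

from onnxruntime.quantization import QuantType, quantize_dynamic

# Weight-only INT8 quantization of an already-exported ONNX model.
quantize_dynamic("model_fp32.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)

# Compare on-disk sizes before and after quantization.
for path in ("model_fp32.onnx", "model_int8.onnx"):
    print(path, os.path.getsize(path) // 1024, "KB")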

Real-time Inference

  • Request preprocessing and validation
  • Model prediction execution
  • Response formatting and error handling
  • Timeout management and circuit breaking
  • Request batching and response caching
  • Streaming predictions and async processing
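
The request batching item above can be illustrated with a small asyncio micro-batcher; run_model, BATCH_SIZE, and MAX_WAIT_S are assumptions made for the sketch:

import asyncio

BATCH_SIZE = 16      # assumed maximum batch size
MAX_WAIT_S = 0.010   # assumed maximum wait before flushing a partial batch

queue: asyncio.Queue = asyncio.Queue()

async def batch_worker(run_model):
    # Collect queued requests into batches and run one forward pass per batch.
    while True:
        batch = [await queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < BATCH_SIZE and (timeout := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        outputs = run_model([req["features"] for req in batch])
        for req, out in zip(batch, outputs):
            req["future"].set_result(out)

async def predict(features):
    # Enqueue one request and await its batched result.
    future = asyncio.get_running_loop().create_future()
    await queue.put({"features": features, "future": future})
    return await future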

Batch Prediction Systems

  • Job scheduling and orchestration
  • Data partitioning and parallel processing
  • Progress tracking and error handling
  • Result aggregation and storage
  • Cost optimization and resource management
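
A minimal sketch of the partitioning idea above, scoring a large CSV in bounded-memory chunks with pandas; the paths and feature columns are placeholders:

import pandas as pd

def run_batch_job(model, input_path="events.csv", output_path="scores.csv",
                  chunk_size=100_000):
    # Stream the input in fixed-size chunks so memory stays bounded,
    # appending each scored chunk to the output file.
    first = True
    for chunk in pd.read_csv(input_path, chunksize=chunk_size):
        chunk["score"] = model.predict(chunk[["f1", "f2", "f3"]])  # assumed columns
        chunk.to_csv(output_path, mode="w" if first else "a",
                     header=first, index=False)
        first = False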

Auto-scaling Strategies

  • Metric-based scaling (CPU, GPU, request rate)
  • Scale-up and scale-down policies
  • Warm-up periods and predictive scaling
  • Cost controls and regional distribution
  • Traffic prediction and capacity planning
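
Metric-based scaling boils down to the target-tracking formula used by the Kubernetes Horizontal Pod Autoscaler, sketched below; the RPS numbers are illustrative:

import math

def desired_replicas(current_replicas: int, current_rps_per_pod: float,
                     target_rps_per_pod: float,
                     min_replicas: int = 2, max_replicas: int = 50) -> int:
    # desired = ceil(current * currentMetric / targetMetric), clamped to bounds.
    desired = math.ceil(current_replicas * current_rps_per_pod / target_rps_per_pod)
    return max(min_replicas, min(max_replicas, desired))

# Example: 10 pods at 300 RPS each against a 200 RPS target scales to 15 pods.
print(desired_replicas(10, 300, 200))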

Multi-model Serving

  • Model routing and version management
  • A/B testing and traffic splitting
  • Ensemble serving and model cascading
  • Fallback strategies and performance isolation
  • Shadow mode testing and validation
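
A/B testing and traffic splitting can be sketched as weighted routing over model versions; the version names and 90/10 split below are assumptions:

import random

ROUTES = {"fraud-v2": 0.9, "fraud-v3-canary": 0.1}  # assumed version weights

def pick_model_version(routes=ROUTES) -> str:
    # Weighted random choice implements a simple canary split.
    versions, weights = zip(*routes.items())
    return random.choices(versions, weights=weights, k=1)[0]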

Edge Deployment

  • Model compression for edge devices
  • Hardware optimization and power efficiency
  • Offline capability and update mechanisms
  • Telemetry collection and security hardening
  • Resource constraints and optimization
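
For the compression items above, a hedged sketch of post-training optimization with the TensorFlow Lite converter; the SavedModel path is a placeholder and the size/accuracy trade-off depends on the model:

import tensorflow as tf

# Convert a SavedModel to TensorFlow Lite with default post-training
# optimizations (weight quantization) for resource-constrained devices.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)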

Tool Restrictions

  • Read: Access model artifacts, infrastructure configs, and monitoring data
  • Write/Edit: Create deployment configs, serving code, and optimization scripts
  • Bash: Execute deployment commands, monitoring setup, and performance tests
  • Glob/Grep: Search codebases for model integration and serving endpoints

Integration with Other Skills

  • ml-engineer: Model optimization and training pipeline integration
  • mlops-engineer: Infrastructure and platform setup
  • data-engineer: Data pipelines and feature stores
  • devops-engineer: CI/CD and deployment automation
  • cloud-architect: Cloud infrastructure and architecture
  • sre-engineer: Reliability and availability
  • performance-engineer: Performance profiling and optimization
  • ai-engineer: Model selection and integration

Example Interactions

Scenario 1: Real-time Inference API Deployment

User: “Deploy our ML model as a real-time API with auto-scaling”

Interaction:

  1. Skill analyzes model characteristics and requirements
  2. Implements serving infrastructure:
    • Optimizes model with ONNX conversion (60% size reduction)
    • Creates FastAPI/gRPC serving endpoints
    • Configures GPU auto-scaling based on request rate
    • Implements request batching for throughput
    • Sets up monitoring and alerting
  3. Deploys to Kubernetes with horizontal pod autoscaler
  4. Achieves <50ms P99 latency and 2000+ RPS throughput

Scenario 2: Multi-model Serving Platform

User: “Build a platform to serve 50+ models with intelligent routing”

Interaction:

  1. Skill designs multi-model architecture:
    • Model registry and version management
    • Intelligent routing based on request type
    • Specialist models for different use cases
    • Fallback and circuit breaking
    • Cost optimization with smaller models for simple queries
  2. Implements serving framework with:
    • Model loading and unloading
    • Request queuing and load balancing
    • A/B testing and traffic splitting
    • Ensemble serving for critical paths
  3. Deploys with comprehensive monitoring and cost tracking

Scenario 3: Edge Deployment for IoT

User: “Deploy ML model to edge devices with limited resources”

Interaction:

  1. Skill analyzes device constraints and requirements
  2. Optimizes model for edge:
    • Quantizes to INT8 (4x size reduction)
    • Prunes and compresses model
    • Implements ONNX Runtime for efficient inference
    • Adds offline capability and local caching
  3. Creates deployment package:
    • Edge-optimized inference runtime
    • Update mechanism with delta updates
    • Telemetry collection and monitoring
    • Security hardening and encryption
  4. Tests on target hardware and validates performance

Best Practices

  • Performance: Target <100ms P99 latency for real-time inference
  • Reliability: Implement graceful degradation and fallback models
  • Monitoring: Track latency, throughput, error rates, and resource usage
  • Testing: Conduct load testing and validate against production traffic patterns
  • Security: Implement authentication, encryption, and model security
  • Documentation: Document all deployment configurations and operational procedures
  • Cost: Optimize resource usage and implement auto-scaling for cost efficiency

Examples

Example 1: Real-Time Inference API for Production

Scenario: Deploy a fraud detection model as a real-time API with auto-scaling.

Deployment Approach:

  1. Model Optimization: Converted model to ONNX (60% size reduction)
  2. Serving Framework: Built FastAPI endpoints with async processing
  3. Infrastructure: Kubernetes deployment with Horizontal Pod Autoscaler
  4. Monitoring: Integrated Prometheus metrics and Grafana dashboards

Configuration:

# FastAPI serving with an ONNX Runtime session (imports added for completeness)
from typing import List

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI

app = FastAPI()
session = ort.InferenceSession("model.onnx")

@app.post("/predict")
async def predict(features: List[float]):
    # ONNX Runtime expects float32 inputs keyed by the model's input name
    input_tensor = np.array([features], dtype=np.float32)
    outputs = session.run(None, {"input": input_tensor})
    return {"prediction": outputs[0].tolist()}

Performance Results:

  • P99 latency: 45 ms
  • Throughput: 2,500 RPS
  • Availability: 99.99%
  • Auto-scaling range: 2-50 pods

Example 2: Multi-Model Serving Platform

Scenario: Build a platform serving 50+ ML models for different prediction types.

Architecture Design:

  1. Model Registry: Central registry with versioning
  2. Router: Intelligent routing based on request type
  3. Resource Manager: Dynamic resource allocation per model
  4. Fallback System: Graceful degradation for unavailable models

Implementation:

  • Model loading/unloading based on request patterns
  • A/B testing framework for model comparisons
  • Cost optimization with model prioritization
  • Shadow mode testing for new models

Results:

  • 50+ models deployed with 99.9% uptime
  • 40% reduction in infrastructure costs
  • Zero downtime during model updates
  • 95% cache hit rate for frequent requests

Example 3: Edge Deployment for Mobile Devices

Scenario: Deploy image classification model to iOS and Android apps.

Edge Optimization:

  1. Model Compression: Quantized to INT8 (4x size reduction)
  2. Runtime Selection: CoreML for iOS, TFLite for Android
  3. On-Device Caching: Intelligent model caching and updates
  4. Privacy Compliance: All processing on-device

Performance Metrics:

  • Original model: 25 MB, 150 ms inference, 94.2% accuracy
  • Optimized model: 6 MB, 35 ms inference, 93.8% accuracy

Results:

  • 80% reduction in app download size
  • 4x faster inference on device
  • Offline capability with local inference
  • GDPR compliant (no data leaves device)

Best Practices

Model Optimization

  • Quantization: Start with FP16, move to INT8 for edge
  • Pruning: Remove unnecessary weights for efficiency
  • Distillation: Transfer knowledge to smaller models
  • ONNX Export: Standard format for cross-platform deployment
  • Benchmarking: Always test on target hardware

Production Serving

  • Health Checks: Implement /health and /ready endpoints
  • Graceful Degradation: Fallback to simpler models or heuristics
  • Circuit Breakers: Prevent cascade failures
  • Rate Limiting: Protect against abuse and overuse
  • Caching: Cache predictions for identical inputs
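
A minimal sketch of the /health and /ready endpoints above in FastAPI, assuming the model is loaded into a module-level variable at startup:

from fastapi import FastAPI, Response, status

app = FastAPI()
model = None  # assumed to be populated by a startup hook once loading finishes

@app.get("/health")
async def health():
    # Liveness: the process is up and able to respond.
    return {"status": "ok"}

@app.get("/ready")
async def ready(response: Response):
    # Readiness: only accept traffic once the model is loaded.
    if model is None:
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
        return {"status": "loading"}
    return {"status": "ready"}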

Monitoring and Observability

  • Latency Tracking: Monitor P50, P95, P99 latencies
  • Error Rates: Track failures and error types
  • Prediction Distribution: Alert on distribution shifts
  • Resource Usage: CPU, GPU, memory monitoring
  • Business Metrics: Track model impact on KPIs
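
Latency and error tracking can be wired up with prometheus_client as sketched below; the metric names, port, and model.predict call are assumptions:

from prometheus_client import Counter, Histogram, start_http_server

PREDICTION_LATENCY = Histogram("prediction_latency_seconds", "Prediction latency")
PREDICTION_ERRORS = Counter("prediction_errors_total", "Failed predictions")

start_http_server(9100)  # exposes /metrics for Prometheus to scrape

@PREDICTION_LATENCY.time()
def predict(model, features):
    try:
        return model.predict([features])
    except Exception:
        PREDICTION_ERRORS.inc()
        raise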

Security and Compliance

  • Model Security: Protect model weights and artifacts
  • Input Validation: Sanitize all prediction inputs
  • Output Filtering: Prevent sensitive data exposure
  • Audit Logging: Log all prediction requests
  • Compliance: Meet industry regulations (HIPAA, GDPR)
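
Input validation can be as simple as the plain-Python check below, run before anything reaches the model; the feature count and value bound are placeholders:

N_FEATURES = 32    # assumed model input width
VALUE_BOUND = 1e6  # assumed sanity bound on feature magnitude

def validate_input(features) -> list:
    # Reject malformed requests before they reach the model.
    if not isinstance(features, list) or len(features) != N_FEATURES:
        raise ValueError(f"expected a list of {N_FEATURES} numeric features")
    if not all(isinstance(x, (int, float)) and abs(x) < VALUE_BOUND for x in features):
        raise ValueError("feature values out of allowed range")
    return [float(x) for x in features]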

Anti-Patterns

Model Deployment Anti-Patterns

  • Manual Deployment: Deploying models without automation – implement CI/CD for models
  • No Versioning: Replacing models without tracking versions – maintain model version history
  • Hotfix Culture: Making urgent model changes without testing – require validation before deployment
  • Black Box Deployment: Deploying models without explainability – implement model interpretability

Performance Anti-Patterns

  • No Baselines: Deploying without performance benchmarks – establish performance baselines
  • Over-Optimization: Tuning beyond practical benefit – focus on customer-impacting metrics
  • Ignore Latency: Focusing only on accuracy, ignoring latency – optimize for real-world use cases
  • Resource Waste: Over-provisioning infrastructure – right-size resources based on actual load

Monitoring Anti-Patterns

  • Silent Failures: Models failing without detection – implement comprehensive health checks
  • Metric Overload: Monitoring too many metrics – focus on actionable metrics
  • Data Drift Blindness: Not detecting model degradation – monitor input data distribution
  • Alert Fatigue: Too many alerts causing ignored warnings – tune alert thresholds

Scalability Anti-Patterns

  • No Load Testing: Deploying without performance testing – test with production-like traffic
  • Single Point of Failure: No redundancy in serving infrastructure – implement failover
  • No Autoscaling: Manual capacity management – implement automatic scaling
  • Stateful Design: Inference that requires state – design stateless inference

Output Format

This skill delivers:

  • Complete model serving infrastructure (Docker, Kubernetes configs)
  • Production deployment pipelines and CI/CD workflows
  • Real-time and batch prediction APIs
  • Model optimization artifacts and configurations
  • Auto-scaling policies and infrastructure as code
  • Monitoring dashboards and alert configurations
  • Performance benchmarks and load test reports

All outputs include:

  • Detailed architecture documentation
  • Deployment scripts and configurations
  • Performance metrics and SLA validations
  • Security hardening guidelines
  • Operational runbooks and troubleshooting guides
  • Cost analysis and optimization recommendations