ml-inference-optimization

📁 melodic-software/claude-code-plugins 📅 Jan 24, 2026

Site-wide rank: #34187

Install command

npx skills add https://github.com/melodic-software/claude-code-plugins --skill ml-inference-optimization

Agent install distribution

antigravity 4

trae 3

windsurf 3

codex 3

gemini-cli 3

Skill documentation

ML Inference Optimization

When to Use This Skill

Use this skill when:

Optimizing ML inference latency
Reducing model size for deployment
Implementing model compression techniques
Designing inference caching strategies
Deploying models at the edge
Balancing accuracy vs. latency trade-offs

Keywords: inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration

Inference Optimization Overview

┌─────────────────────────────────────────────────────────────────────┐
│                 Inference Optimization Stack                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                    Model Level                                │  │
│  │  Distillation │ Pruning │ Quantization │ Architecture Search  │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                   Compiler Level                              │  │
│  │  Graph optimization │ Operator fusion │ Memory planning       │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                  Runtime Level                                │  │
│  │  Batching │ Caching │ Async execution │ Multi-threading       │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                  Hardware Level                               │  │
│  │  GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators             │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Model Compression Techniques

Technique Overview

Technique	Size Reduction	Speed Improvement	Accuracy Impact
Quantization	2-4x	2-4x	Low (1-2%)
Pruning	2-10x	1-3x	Low-Medium
Distillation	3-10x	3-10x	Medium
Low-rank factorization	2-5x	1.5-3x	Low-Medium
Weight sharing	10-100x	Variable	Medium-High

Knowledge Distillation

┌─────────────────────────────────────────────────────────────────────┐
│                    Knowledge Distillation                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐                                                   │
│  │ Teacher Model│ (Large, accurate, slow)                           │
│  │     BERT     │                                                   │
│  └──────────────┘                                                   │
│         │                                                           │
│         ▼ Soft labels (probability distributions)                   │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                    Training Process                           │  │
│  │  Loss = α × CrossEntropy(student, hard_labels)                │  │
│  │       + (1-α) × KL_Div(student, teacher_soft_labels)          │  │
│  └───────────────────────────────────────────────────────────────┘  │
│         │                                                           │
│         ▼                                                           │
│  ┌──────────────┐                                                   │
│  │Student Model │ (Small, nearly as accurate, fast)                 │
│  │  DistilBERT  │                                                   │
│  └──────────────┘                                                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Distillation Types:

Type	Description	Use Case
Response distillation	Match teacher outputs	General compression
Feature distillation	Match intermediate layers	Better transfer
Relation distillation	Match sample relationships	Structured data
Self-distillation	Model teaches itself	Regularization
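
The response-distillation loss from the diagram can be sketched in plain Python. This is a minimal illustration, not a training recipe: the `temperature` parameter and the function names are assumptions added here, and the T² scaling used in some formulations is omitted.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; subtracts the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, temperature=2.0):
    """Loss = alpha * CE(student, hard label) + (1 - alpha) * KL(teacher || student)."""
    # Hard-label term: cross-entropy of the student against the true class.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[hard_label])
    # Soft-label term: KL divergence on temperature-softened distributions
    # (temperature is an assumption; the diagram does not mention it).
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * ce + (1 - alpha) * kl

teacher = [2.0, 1.0, 0.1]
print(distillation_loss([1.5, 1.2, 0.3], teacher, hard_label=0))
```

A higher `alpha` weights the ground-truth labels more heavily; a lower one makes the student follow the teacher's soft labels more closely.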

Pruning Strategies

Unstructured Pruning (Weight-level):
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After:  [0.0, 0.8, 0.0, 0.9, 0.0, 0.7]  (50% sparse)
• Flexible, high sparsity possible
• Needs sparse hardware/libraries

Structured Pruning (Channel/Layer-level):
Before: ┌───┬───┬───┬───┐
        │ C1│ C2│ C3│ C4│
        └───┴───┴───┴───┘
After:  ┌───┬───┬───┐
        │ C1│ C3│ C4│  (Removed C2 entirely)
        └───┴───┴───┘
• Works with standard hardware
• Lower compression ratio

Pruning Decision Criteria:

Method	Description	Effectiveness
Magnitude-based	Remove smallest weights	Simple, effective
Gradient-based	Remove low-gradient weights	Better accuracy
Second-order	Use Hessian information	Best but expensive
Lottery ticket	Find winning subnetwork	Theoretical insight
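
As a small illustration of the magnitude-based criterion, the sketch below reproduces the unstructured-pruning example above. The helper name `magnitude_prune` is hypothetical, and a real pipeline would typically prune per layer and fine-tune afterwards.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]

before = [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
after = magnitude_prune(before, sparsity=0.5)
print(after)  # [0.0, 0.8, 0.0, 0.9, 0.0, 0.7]
```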

Quantization (Detailed)

Precision Hierarchy:

FP32 (32 bits): ████████████████████████████████
FP16 (16 bits): ████████████████
BF16 (16 bits): ████████████████  (8 exponent bits like FP32, 7 mantissa bits)
INT8 (8 bits):  ████████
INT4 (4 bits):  ████
Binary (1 bit): █

Memory footprint scales with bit width; compute speedups depend on hardware support for the lower-precision format

Quantization Approaches:

Approach	When Applied	Quality	Effort
Dynamic quantization	Runtime	Good	Low
Static quantization	Post-training with calibration	Better	Medium
QAT	During training	Best	High
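
To make the arithmetic behind INT8 quantization concrete, here is a hand-rolled sketch of affine (asymmetric) quantization and dequantization. The function names are hypothetical; production frameworks calibrate ranges per tensor or per channel and handle this internally.

```python
def quantize_int8(values):
    """Affine quantization of floats to signed 8-bit integers."""
    # The representable range must include 0 so that zero maps exactly.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0            # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)     # integer that represents 0.0
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]

weights = [-1.0, 0.0, 0.5, 2.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)
print(q, restored)
```

The round-trip error per value is bounded by roughly one quantization step (`scale`), which is why accuracy loss is usually small when the value range is well calibrated.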

Compiler-Level Optimization

Graph Optimization

Original Graph:
Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output

Optimized Graph (Operator Fusion):
Input → FusedConvBNReLU → FusedConvBNReLU → Output

Benefits:
• Fewer kernel launches
• Better memory locality
• Reduced memory bandwidth

Common Optimizations

Optimization	Description	Speedup
Operator fusion	Combine sequential ops	1.2-2x
Constant folding	Pre-compute constants	1.1-1.5x
Dead code elimination	Remove unused ops	Variable
Layout optimization	Optimize tensor memory layout	1.1-1.3x
Memory planning	Optimize buffer allocation	1.1-1.2x
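
Operator fusion of Conv + BatchNorm works because inference-mode batch normalization is an affine transform that can be folded into the preceding layer's weights. The sketch below shows the algebra on a single scalar channel (a stand-in for a 1x1 convolution); all names are illustrative, and compilers apply the same idea per output channel.

```python
import math

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold gamma * (w*x + b - mean) / sqrt(var + eps) + beta into one affine op."""
    factor = gamma / math.sqrt(var + eps)
    return weight * factor, (bias - mean) * factor + beta

def unfused(x, w, b, gamma, beta, mean, var, eps=1e-5):
    z = w * x + b                                         # "Conv" (scalar stand-in)
    z = gamma * (z - mean) / math.sqrt(var + eps) + beta  # BatchNorm, inference mode
    return max(z, 0.0)                                    # ReLU

def fused(x, w_f, b_f):
    return max(w_f * x + b_f, 0.0)                        # one multiply-add plus ReLU

params = dict(gamma=1.5, beta=0.2, mean=0.4, var=2.0)
w_f, b_f = fold_batchnorm(0.7, -0.1, **params)
print(unfused(3.0, 0.7, -0.1, **params), fused(3.0, w_f, b_f))
```

Both calls print the same value up to floating-point rounding, while the fused version needs one pass over the data instead of three.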

Optimization Frameworks

Framework	Vendor	Best For
TensorRT	NVIDIA	NVIDIA GPUs, lowest latency
ONNX Runtime	Microsoft	Cross-platform, broad support
OpenVINO	Intel	Intel CPUs/GPUs
Core ML	Apple	Apple devices
TFLite	Google	Mobile, embedded
Apache TVM	Open source	Custom hardware, research

Runtime Optimization

Batching Strategies

No Batching:
Request 1: [Process] → Response 1      10ms
Request 2: [Process] → Response 2      10ms
Request 3: [Process] → Response 3      10ms
Total: 30ms, GPU underutilized

Dynamic Batching:
Requests 1-3: [Wait 5ms] → [Process batch] → Responses
Total: 15ms, 2x throughput

Trade-off: Latency vs. Throughput
• Larger batch: Higher throughput, higher latency
• Smaller batch: Lower latency, lower throughput

Batching Parameters:

Parameter	Description	Trade-off
`batch_size`	Maximum batch size	Throughput vs. latency
`max_wait_time`	Wait time for batch fill	Latency vs. efficiency
`min_batch_size`	Minimum before processing	Latency predictability
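
How `batch_size` and `max_wait_time` interact can be sketched with a small simulation that groups request arrival times into batches. This is a simplified model under stated assumptions: the function name is hypothetical, `min_batch_size` is not modeled, and processing time is ignored.

```python
def form_batches(arrivals_ms, batch_size, max_wait_ms):
    """Group sorted arrival times into (dispatch_time, request_indices) batches.

    A batch is dispatched when it is full, or when max_wait_ms has elapsed
    since its first request arrived, whichever comes first.
    """
    batches = []
    i = 0
    while i < len(arrivals_ms):
        deadline = arrivals_ms[i] + max_wait_ms
        j = i
        while (j < len(arrivals_ms) and j - i < batch_size
               and arrivals_ms[j] <= deadline):
            j += 1
        full = (j - i == batch_size)
        # A full batch leaves as soon as its last request arrives.
        dispatch = arrivals_ms[j - 1] if full else deadline
        batches.append((dispatch, list(range(i, j))))
        i = j
    return batches

print(form_batches([0, 1, 2, 20, 21], batch_size=4, max_wait_ms=5))
# [(5, [0, 1, 2]), (25, [3, 4])]
```

With `batch_size=2` the first two requests would leave at 1 ms instead of waiting until 5 ms, which illustrates the latency side of the trade-off.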

Caching Strategies

┌─────────────────────────────────────────────────────────────────────┐
│                    Inference Caching Layers                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Layer 1: Input Cache                                               │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ Cache exact inputs → Return cached outputs                   │   │
│  │ Hit rate: Low (inputs rarely repeat exactly)                 │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 2: Embedding Cache                                           │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ Cache computed embeddings for repeated tokens/entities       │   │
│  │ Hit rate: Medium (common tokens repeat)                      │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 3: KV Cache (for transformers)                               │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ Cache key-value pairs for attention                          │   │
│  │ Hit rate: High (reuse across tokens in sequence)             │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 4: Result Cache                                              │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ Cache semantic equivalents (fuzzy matching)                  │   │
│  │ Hit rate: Variable (depends on query distribution)           │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Semantic Caching for LLMs:

Query: "What's the capital of France?"
       ↓
Hash + Embed query
       ↓
Search cache (similarity > threshold)
       ↓
├── Hit: Return cached response
└── Miss: Generate → Cache → Return
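
The hit/miss flow above can be sketched as a tiny in-memory cache. Everything here is illustrative: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, the linear scan stands in for a vector index, and the exact-match hash step is omitted for brevity.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (assumption; use a real model in practice)."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def lookup(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def get_or_generate(self, query, generate):
        cached = self.lookup(query)
        if cached is not None:
            return cached, True            # hit: skip the model entirely
        response = generate(query)         # miss: call the model
        self.entries.append((embed(query), response))
        return response, False

cache = SemanticCache(threshold=0.8)
print(cache.get_or_generate("What's the capital of France?", lambda q: "Paris"))
print(cache.get_or_generate("what's the capital of france", lambda q: "Paris"))
```

The threshold controls the trade-off: set too low, the cache returns answers to questions that only look similar; set too high, it rarely hits.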

Async and Parallel Execution

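Sequential:
┌─────┐ ┌─────┐ ┌─────┐
│Prep │→│Model│→│Post │  Total: 30ms
│10ms │ │15ms │ │5ms  │
└─────┘ └─────┘ └─────┘

Pipelined:
Request 1: │Prep│Model│Post│
Request 2:      │Prep│Model│Post│
Request 3:           │Prep│Model│Post│

Throughput: up to 2x higher in steady state with these stage times (one request completes every 15ms, the slowest stage, instead of every 30ms); 3x requires three equally long stages
Latency per request: Same (30ms)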
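
A pipeline's steady-state throughput is set by its slowest stage, so the gain depends on how evenly the stages are balanced. The arithmetic for the stage times in the diagram can be checked with a short sketch (the function name is hypothetical, and one worker per stage is assumed):

```python
def pipeline_metrics(stage_ms):
    """Return (per-request latency, throughput gain vs. sequential execution)
    for a pipeline with one worker per stage."""
    latency = sum(stage_ms)        # a request still passes through every stage
    bottleneck = max(stage_ms)     # one request completes per bottleneck interval
    return latency, latency / bottleneck

print(pipeline_metrics([10, 15, 5]))   # (30, 2.0)
print(pipeline_metrics([10, 10, 10]))  # (30, 3.0)
```

Rebalancing work between stages, or adding workers to the bottleneck stage, moves the gain toward the number of stages.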

Hardware Acceleration

Hardware Comparison

Hardware	Strengths	Limitations	Best For
GPU (NVIDIA)	High parallelism, mature ecosystem	Power, cost	Training, large batch inference
TPU (Google)	Matrix ops, cloud integration	Vendor lock-in	Google Cloud workloads
NPU (Apple/Qualcomm)	Power efficient, on-device	Limited models	Mobile, edge
CPU	Flexible, available	Slower for ML	Low-batch, CPU-bound
FPGA	Customizable, low latency	Development complexity	Specialized workloads

GPU Optimization

Optimization	Description	Impact
Tensor Cores	Use FP16/INT8 tensor operations	2-8x speedup
CUDA graphs	Reduce kernel launch overhead	1.5-2x for small models
Multi-stream	Parallel execution	Higher throughput
Memory pooling	Reduce allocation overhead	Lower latency variance

Edge Deployment

Edge Constraints

┌─────────────────────────────────────────────────────────────────────┐
│                      Edge Deployment Constraints                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Resource Constraints:                                              │
│  ├── Memory: 1-4 GB (vs. 64+ GB cloud)                              │
│  ├── Compute: 1-10 TOPS (vs. 100+ TFLOPS cloud)                     │
│  ├── Power: 5-15W (vs. 300W+ cloud)                                 │
│  └── Storage: 16-128 GB (vs. TB cloud)                              │
│                                                                     │
│  Operational Constraints:                                           │
│  ├── No network (offline operation)                                 │
│  ├── Variable ambient conditions                                    │
│  ├── Infrequent updates                                             │
│  └── Long deployment lifetime                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Edge Optimization Strategies

Strategy	Description	Use When
Model selection	Use edge-native models (MobileNet, EfficientNet)	Accuracy acceptable
Aggressive quantization	INT8 or lower	Memory/power constrained
On-device distillation	Distill to tiny model	Extreme constraints
Split inference	Edge preprocessing, cloud inference	Network available
Model caching	Cache results locally	Repeated queries

Edge ML Frameworks

Framework	Platform	Features
TensorFlow Lite	Android, iOS, embedded	Quantization, delegates
Core ML	iOS, macOS	Neural Engine optimization
ONNX Runtime Mobile	Cross-platform	Broad model support
PyTorch Mobile	Android, iOS	Familiar API
TensorRT	NVIDIA Jetson	Maximum performance

Latency Profiling

Profiling Methodology

┌─────────────────────────────────────────────────────────────────────┐
│                    Latency Breakdown Analysis                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Data Loading:          ███░░░░░░░░░░░░░░░  15%                  │
│  2. Preprocessing:         ██░░░░░░░░░░░░░░░░  10%                  │
│  3. Model Inference:       ███████████░░░░░░░  60%                  │
│  4. Postprocessing:        █░░░░░░░░░░░░░░░░░   8%                  │
│  5. Response Serialization:█░░░░░░░░░░░░░░░░░   7%                  │
│                                                                     │
│  Target: Model inference (60% = biggest optimization opportunity)   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Profiling Tools

Tool	Use For
PyTorch Profiler	PyTorch model profiling
TensorBoard	TensorFlow visualization
NVIDIA Nsight	GPU profiling
Chrome Tracing	General timeline visualization
perf	CPU profiling
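
Before reaching for a dedicated profiler, a per-stage latency breakdown like the one above can be collected with a few lines of standard-library Python. This is a rough sketch: `StageTimer` is a hypothetical helper, and the `time.sleep` calls are placeholders for real pipeline stages. For GPU work the device must be synchronized before reading the clock, which this sketch does not do.

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def breakdown(self):
        """Percentage of total measured time spent in each stage."""
        total = sum(self.totals.values())
        return {name: 100.0 * t / total for name, t in self.totals.items()}

timer = StageTimer()
with timer.stage("preprocessing"):
    time.sleep(0.002)   # placeholder for real preprocessing
with timer.stage("model_inference"):
    time.sleep(0.006)   # placeholder for the model call
print(timer.breakdown())
```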

Key Metrics

Metric	Description	Target
P50 latency	Median latency	< SLA
P99 latency	Tail latency	< 2x P50
Throughput	Requests/second	Meet demand
GPU utilization	Compute usage	> 80%
Memory bandwidth	Bytes moved to and from device memory per second	< hardware limit
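
The P50 and P99 targets can be checked directly from recorded latencies. The sketch below uses the nearest-rank percentile definition (other definitions interpolate and give slightly different values); the sample data is made up for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 11, 13, 12, 14, 11, 12, 95, 13, 12]  # illustrative data
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99)        # 12 95
print(p99 < 2 * p50)   # False: a single slow request breaks the "< 2x P50" target
```

The example shows why tail latency is tracked separately: the median looks healthy while one outlier dominates P99.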

Optimization Workflow

Systematic Approach

┌─────────────────────────────────────────────────────────────────────┐
│                  Optimization Workflow                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Baseline                                                        │
│     └── Measure current performance (latency, throughput, accuracy) │
│                                                                     │
│  2. Profile                                                         │
│     └── Identify bottlenecks (model, data, system)                  │
│                                                                     │
│  3. Optimize (in order of effort/impact):                           │
│     ├── Hardware: Use right accelerator                             │
│     ├── Compiler: Enable optimizations (TensorRT, ONNX)             │
│     ├── Runtime: Batching, caching, async                           │
│     ├── Model: Quantization, pruning                                │
│     └── Architecture: Distillation, model change                    │
│                                                                     │
│  4. Validate                                                        │
│     └── Verify accuracy maintained, latency improved                │
│                                                                     │
│  5. Deploy and Monitor                                              │
│     └── Track real-world performance                                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Optimization Priority Matrix

                    High Impact
                         │
    Compiler Opts    ◄───┼───► Quantization
    (easy win)           │     (best ROI)
                         │
Low Effort ──────────────┼─────────────── High Effort
                         │
    Batching         ◄───┼───► Distillation
    (quick win)          │     (major effort)
                         │
                    Low Impact

Common Patterns

Multi-Model Serving

┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  Request → ┌─────────┐                                              │
│            │ Router  │                                              │
│            └─────────┘                                              │
│               │   │   │                                             │
│      ┌────────┘   │   └────────┐                                    │
│      ▼            ▼            ▼                                    │
│  ┌───────┐   ┌───────┐   ┌───────┐                                  │
│  │ Tiny  │   │ Small │   │ Large │                                  │
│  │ <10ms │   │ <50ms │   │<500ms │                                  │
│  └───────┘   └───────┘   └───────┘                                  │
│                                                                     │
│  Routing strategies:                                                │
│  • Complexity-based: Simple→Tiny, Complex→Large                     │
│  • Confidence-based: Try Tiny, escalate if low confidence           │
│  • SLA-based: Route based on latency requirements                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
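
The confidence-based strategy can be sketched as a loop over models ordered from cheapest to most expensive. The function name, the `(answer, confidence)` return convention, and the stub models are all assumptions made for illustration.

```python
def route_with_escalation(query, models, min_confidence=0.8):
    """Try models from cheapest to most expensive; stop at the first confident answer.

    `models` is a list of (name, predict) pairs where predict(query) returns
    (answer, confidence). The last model's answer is always accepted.
    """
    for name, predict in models[:-1]:
        answer, confidence = predict(query)
        if confidence >= min_confidence:
            return name, answer
    name, predict = models[-1]
    return name, predict(query)[0]

# Stub models: the tiny one is only confident on short inputs.
def tiny(query):
    return ("positive", 0.95) if len(query.split()) <= 3 else ("unsure", 0.4)

def large(query):
    return ("negative", 0.99)

models = [("tiny", tiny), ("large", large)]
print(route_with_escalation("great movie", models))                       # ('tiny', 'positive')
print(route_with_escalation("not exactly what I had hoped for", models))  # ('large', 'negative')
```

Escalated requests pay for both models, so this pattern helps only when the cheap model is confident on a large share of traffic and its confidence scores are well calibrated.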

Speculative Execution

Query: "Translate: Hello"
        │
        ├──▶ Small model (draft): "Bonjour" (5ms)
        │
        └──▶ Large model (verify): Check "Bonjour" (10ms parallel)
             │
             ├── Accept: Return immediately
             └── Reject: Generate with large model

Speedup: 2-3x when drafts are often accepted
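
At the token level, the draft-and-verify idea is commonly implemented as greedy speculative decoding. The sketch below is a simplified illustration with toy "models" (functions that map a token sequence to the next token); names and the `k` parameter are assumptions. In a real system the verification of all `k` drafted positions happens in one batched pass of the large model, which is where the speedup comes from.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; output equals what the target alone would produce."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. Draft k tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify: keep going while the target agrees with the draft.
        for tok in draft:
            expected = target_next(tokens)
            tokens.append(expected)      # the emitted token is always the target's
            if expected != tok or len(tokens) >= limit:
                break                    # mismatch: discard the rest of the draft
    return tokens[len(prompt):]

# Toy models: the target counts upward mod 10; the draft is wrong after a 4.
def target(seq):
    return (seq[-1] + 1) % 10

def draft(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

print(speculative_decode(target, draft, prompt=[0], max_new_tokens=8))
# [1, 2, 3, 4, 5, 6, 7, 8]
```

Because rejected drafts are replaced by the target's own token, output quality matches the large model; only the speed depends on how often drafts are accepted.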

Cascade Models

Input → ┌────────┐
        │ Filter │ ← Cheap filter (reject obvious negatives)
        └────────┘
             │ (candidates only)
             ▼
        ┌────────┐
        │ Stage 1│ ← Fast model (coarse ranking)
        └────────┘
             │ (top-100)
             ▼
        ┌────────┐
        │ Stage 2│ ← Accurate model (fine ranking)
        └────────┘
             │ (top-10)
             ▼
         Output

Benefit: up to 10x cheaper at similar accuracy, because the accurate model only scores the top-100 candidates
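
The three cascade stages map onto a short function. This is a sketch with made-up scoring functions; `cascade_rank` and its parameters are hypothetical names, and real systems would batch the model calls.

```python
def cascade_rank(items, cheap_filter, fast_score, accurate_score, k1=100, k2=10):
    """Three-stage cascade: filter, coarse-rank to k1, fine-rank to k2."""
    candidates = [x for x in items if cheap_filter(x)]
    stage1 = sorted(candidates, key=fast_score, reverse=True)[:k1]
    stage2 = sorted(stage1, key=accurate_score, reverse=True)[:k2]
    return stage2

accurate_calls = []

def accurate(x):
    accurate_calls.append(x)   # track how often the expensive model runs
    return x

top = cascade_rank(
    range(1000),
    cheap_filter=lambda x: x % 2 == 0,   # stand-in for a rule-based filter
    fast_score=lambda x: x // 10,        # coarse, cheap approximation
    accurate_score=accurate,             # stand-in for the expensive model
)
print(top)
print(len(accurate_calls))   # 100: the expensive model scored 100 of 1000 items
```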

Optimization Checklist

Pre-Deployment

Profile baseline performance
Identify primary bottleneck (model, data, system)
Apply compiler optimizations (TensorRT, ONNX)
Evaluate quantization (INT8 usually safe)
Tune batch size for target throughput
Test accuracy after optimization

Deployment

Configure appropriate hardware
Enable caching where applicable
Set up monitoring (latency, throughput, errors)
Configure auto-scaling policies
Implement graceful degradation

Post-Deployment

Monitor p99 latency
Track accuracy metrics
Analyze cache hit rates
Review cost efficiency
Plan iterative improvements

Related Skills

llm-serving-patterns – LLM-specific serving optimization
ml-system-design – End-to-end ML pipeline design
quality-attributes-taxonomy – Performance as quality attribute
estimation-techniques – Capacity planning for ML systems

Version History

v1.0.0 (2025-12-26): Initial release – ML inference optimization patterns

Last Updated

Date: 2025-12-26
