browser-onnx
npx skills add https://github.com/thongnt0208/browser-onnx-skills --skill browser-onnx
Skill Documentation
Browser-Based ONNX Inference
This skill provides a comprehensive workflow for executing ONNX models locally in the browser using ONNX Runtime Web (ORT-Web). Local inference offers significant advantages: improved data privacy, reduced server costs, and scalability that grows with the user base, since each user brings their own compute power.
1. Setup and Installation
Install the required library via npm:
npm install onnxruntime-web
Note: For experimental features like WebGPU or WebNN, use the nightly version onnxruntime-web@dev.
2. Global Environment Configuration
Set global ort.env flags before creating a session to optimize the runtime environment.
- WebAssembly (CPU): Enable multi-threading by setting ort.env.wasm.numThreads (default is half of hardware concurrency) and use a Proxy Worker (ort.env.wasm.proxy = true) to keep the UI responsive.
- WASM Paths: If binaries are not in the same directory as the JS bundle, manually override paths using ort.env.wasm.wasmPaths to point to local assets or a CDN.
- WebGPU (GPU): Use ort.env.webgpu.profiling = { mode: 'default' } for performance diagnosis during development. A combined configuration sketch follows this list.
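A minimal configuration sketch, assuming four worker threads suit the target hardware and that the .wasm binaries are served from a CDN (the jsDelivr URL below is a placeholder; point it at wherever you host the files):

import * as ort from 'onnxruntime-web';

// Configure the WASM (CPU) backend before any InferenceSession is created.
ort.env.wasm.numThreads = 4;   // multi-threaded CPU inference (default: half of hardware concurrency)
ort.env.wasm.proxy = true;     // run inference in a proxy Web Worker to keep the UI responsive

// Override the .wasm binary location if it differs from the JS bundle (URL is an assumption).
ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/';

// Optional during development: emit WebGPU kernel profiling information.
ort.env.webgpu.profiling = { mode: 'default' };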
3. Creating an Inference Session
Initialize the session by choosing the appropriate Execution Provider (EP):
import * as ort from 'onnxruntime-web';
const session = await ort.InferenceSession.create('./model.onnx', {
executionProviders: ['webgpu', 'wasm'], // Prioritize GPU, fallback to CPU
graphOptimizationLevel: 'all' // Enable all graph-level optimizations
});
4. Data Preprocessing
Input data must match the model’s training format (e.g., NCHW for vision models).
- Image-to-Tensor: Use libraries like JIMP or OpenCV.js to resize, normalize (divide by 255.0), and convert RGBA to RGB.
- Tensor Creation: Use new ort.Tensor('float32', float32Data, dims) to prepare the input feeds, where dims is the model's expected input shape (see the preprocessing sketch after this list).
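A minimal preprocessing sketch, assuming a 224x224 RGBA ImageData source (for example, drawn to a canvas) and an input name of 'input'; the shape, normalization, and feed name are assumptions that must match your model:

import * as ort from 'onnxruntime-web';

// Convert RGBA pixel data into a normalized float32 tensor in NCHW layout.
function imageDataToTensor(imageData, width = 224, height = 224) {
  const { data } = imageData;                     // RGBA, 4 bytes per pixel
  const planeSize = width * height;
  const float32Data = new Float32Array(3 * planeSize);
  for (let i = 0; i < planeSize; i++) {
    float32Data[i] = data[i * 4] / 255.0;                      // R plane
    float32Data[i + planeSize] = data[i * 4 + 1] / 255.0;      // G plane
    float32Data[i + 2 * planeSize] = data[i * 4 + 2] / 255.0;  // B plane (alpha dropped)
  }
  return new ort.Tensor('float32', float32Data, [1, 3, height, width]);
}

// Usage: the input name 'input' is an assumption; check session.inputNames.
const feeds = { input: imageDataToTensor(imageData) };
const results = await session.run(feeds);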
5. Optimized Inference Patterns
- Graph Capture: For models with static shapes on WebGPU, enable enableGraphCapture: true to reduce CPU overhead by replaying recorded kernel executions.
- IO Binding: For transformer models, keep data on the GPU by using ort.Tensor.fromGpuBuffer() and setting preferredOutputLocation: 'gpu-buffer' to avoid expensive memory copies (see the sketch after this list).
- Quantization: Prefer uint8 quantized models for CPU (WASM) inference to improve performance; avoid float16 on CPU as it lacks native support and is slow.
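A hedged sketch combining graph capture and IO binding; myGpuBuffer, the input name 'input', and the dims are placeholders for your model's actual inputs:

import * as ort from 'onnxruntime-web';

// Graph capture + GPU-resident outputs (WebGPU, static shapes only).
const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu'],
  enableGraphCapture: true,               // replay recorded kernel launches on later runs
  preferredOutputLocation: 'gpu-buffer',  // keep outputs on the GPU instead of copying back
});

// IO binding: wrap an existing WebGPU buffer as an input tensor (no CPU copy).
const inputTensor = ort.Tensor.fromGpuBuffer(myGpuBuffer, {
  dataType: 'float32',
  dims: [1, 3, 224, 224],
});
const results = await session.run({ input: inputTensor });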
6. Large Model Handling (>2GB)
- Platform Limits: Browsers like Chrome limit ArrayBuffer to ~2GB. Models exceeding this must be exported with external data.
- Loading External Data: Explicitly link external weight files in the session options:
const session = await ort.InferenceSession.create(modelUrl, {
  externalData: [{ path: './model.data', data: dataUrl }]
});
7. Common Edge Cases
- Memory Management: Explicitly call tensor.dispose() for GPU tensors to prevent memory leaks (see the disposal sketch after this list).
- Zero-Sized Tensors: ORT-Web treats tensors with a dimension of 0 as CPU tensors regardless of the selected EP.
- Thermal Throttling: Sustained inference on mobile devices may trigger frequency scaling, doubling latency. Use lightweight “tiny” models to maintain thermal equilibrium.
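A minimal disposal sketch, assuming GPU-resident outputs (for example via preferredOutputLocation: 'gpu-buffer'); output names depend on your model:

// Release GPU-backed output tensors once their data has been consumed.
const results = await session.run(feeds);
try {
  // ... read or download whatever data you need from results ...
} finally {
  for (const tensor of Object.values(results)) {
    tensor.dispose();   // frees the underlying GPU buffer
  }
}

// When the model is no longer needed, release the session itself.
await session.release();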
8. Examples
Multilingual Translation
Offload heavy translation tasks to a separate Web Worker using a singleton pattern to ensure the model (e.g., NLLB-200) loads only once.
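A singleton sketch for the worker side; the model path './nllb-200.onnx' is a placeholder, and tokenization/detokenization are elided:

// translation-worker.js
import * as ort from 'onnxruntime-web';

let sessionPromise = null;   // singleton: the model is loaded at most once

function getSession() {
  // Reusing the same promise means concurrent requests all wait on the same load.
  sessionPromise ??= ort.InferenceSession.create('./nllb-200.onnx', {
    executionProviders: ['webgpu', 'wasm'],
  });
  return sessionPromise;
}

self.onmessage = async (event) => {
  const session = await getSession();
  // ... tokenize event.data.text, build feeds, await session.run(feeds), detokenize ...
  self.postMessage({ status: 'done' /* , translation: ... */ });
};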
Object Detection (YOLO)
Implement Non-Maximum Suppression (NMS). If the browser lacks support for specific NMS ops, run a separate NMS ONNX model to filter overlapping boxes locally.
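A hedged sketch of chaining a detector with a standalone NMS model; the file names and the input/output names ('images', 'boxes', 'scores') are assumptions that depend on how the models were exported:

import * as ort from 'onnxruntime-web';

const detector = await ort.InferenceSession.create('./yolo.onnx', {
  executionProviders: ['webgpu', 'wasm'],
});
const nms = await ort.InferenceSession.create('./nms.onnx', {
  executionProviders: ['wasm'],
});

// 1. Run the detector to get raw candidate boxes and class scores.
const detOut = await detector.run({ images: inputTensor });

// 2. Feed the boxes/scores into the NMS model to filter overlapping detections locally.
const nmsOut = await nms.run({ boxes: detOut.boxes, scores: detOut.scores });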