onnx-webgpu-converter
npx skills add https://github.com/jakerains/agentskills --skill onnx-webgpu-converter
Skill Documentation
ONNX WebGPU Model Converter
Convert any HuggingFace model to ONNX and run it in the browser with Transformers.js + WebGPU.
Workflow Overview
- Check if ONNX version already exists on HuggingFace
- Set up Python environment with optimum
- Export model to ONNX with optimum-cli
- Quantize for target deployment (WebGPU vs WASM)
- Upload to HuggingFace Hub (optional)
- Use in Transformers.js with WebGPU
Step 1: Check for Existing ONNX Models
Before converting, check if the model already has an ONNX version:
- Search `onnx-community/<model-name>` on HuggingFace Hub
- Check the model repo for an `onnx/` folder
- Browse https://huggingface.co/models?library=transformers.js (1,200+ pre-converted models)
If found, skip to Step 6.
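This check can also be scripted. A minimal sketch using the `huggingface_hub` client; the model name below is illustrative:

```python
# Sketch: check the Hub for an existing ONNX conversion before exporting yourself.
from huggingface_hub import list_models, list_repo_files

model_name = "Qwen2.5-0.5B-Instruct"  # illustrative candidate model

# 1) Is there an onnx-community mirror that already ships .onnx weights?
try:
    files = list_repo_files(f"onnx-community/{model_name}")
    has_onnx_mirror = any(f.endswith(".onnx") for f in files)
except Exception:
    has_onnx_mirror = False

# 2) Search models already tagged for Transformers.js
hits = list_models(library="transformers.js", search=model_name, limit=5)

print("onnx-community mirror:", has_onnx_mirror)
print("transformers.js matches:", [m.id for m in hits])
```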
Step 2: Environment Setup
# Create venv (recommended)
python -m venv onnx-env && source onnx-env/bin/activate
# Install optimum with ONNX support
pip install "optimum[onnx]" onnxruntime
# For GPU-accelerated export (optional)
pip install onnxruntime-gpu
Verify installation:
optimum-cli export onnx --help
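You can also sanity-check the runtime from Python. A small sketch (assuming the venv above is active) that confirms onnxruntime imports and lists its available execution providers:

```python
# Sketch: verify onnxruntime and see which execution providers it can use.
import onnxruntime as ort

print(ort.__version__)
# e.g. ["CPUExecutionProvider"], plus CUDA providers if onnxruntime-gpu is installed
print(ort.get_available_providers())
```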
Step 3: Export to ONNX
Basic Export (auto-detect task)
optimum-cli export onnx --model <model_id_or_path> ./output_dir/
With Explicit Task
optimum-cli export onnx \
--model <model_id> \
--task <task> \
./output_dir/
Common tasks: text-generation, text-classification, feature-extraction, image-classification, automatic-speech-recognition, object-detection, image-segmentation, question-answering, token-classification, zero-shot-classification
For decoder models, append -with-past for KV cache reuse (default behavior):
text-generation-with-past, text2text-generation-with-past, automatic-speech-recognition-with-past
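If you prefer to stay in Python, the same export can be driven through optimum's `ORTModel` classes. A minimal sketch for a decoder model; the model id and output path are placeholders:

```python
# Sketch: export a text-generation model to ONNX via the optimum Python API
# (equivalent to `optimum-cli export onnx --task text-generation-with-past`).
from optimum.onnxruntime import ORTModelForCausalLM

# export=True triggers the ONNX export; KV-cache ("with-past") export is the default
model = ORTModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", export=True)
model.save_pretrained("./output_dir/")
```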
Full CLI Reference
| Flag | Description |
|---|---|
| `-m MODEL, --model MODEL` | HuggingFace model ID or local path (required) |
| `--task TASK` | Export task (auto-detected if on Hub) |
| `--opset OPSET` | ONNX opset version (default: auto) |
| `--device DEVICE` | Export device: `cpu` (default) or `cuda` |
| `--optimize {O1,O2,O3,O4}` | ONNX Runtime optimization level |
| `--monolith` | Force a single ONNX file (vs. split encoder/decoder) |
| `--no-post-process` | Skip post-processing (e.g., decoder merging) |
| `--trust-remote-code` | Allow custom model code from the Hub |
| `--pad_token_id ID` | Override pad token (needed for some models) |
| `--cache_dir DIR` | Cache directory for downloaded models |
| `--batch_size N` | Batch size for dummy inputs |
| `--sequence_length N` | Sequence length for dummy inputs |
| `--framework {pt}` | Source framework |
| `--atol ATOL` | Absolute tolerance for validation |
Optimization Levels
| Level | Description |
|---|---|
| O1 | Basic general optimizations |
| O2 | Basic + extended + transformer fusions |
| O3 | O2 + GELU approximation |
| O4 | O3 + mixed precision fp16 (GPU only, requires --device cuda) |
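These levels can also be applied after export with optimum's `ORTOptimizer`. A sketch applying O2 to a previously exported classification model; the paths are placeholders:

```python
# Sketch: apply ONNX Runtime graph optimizations (here O2) to an exported model.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
optimizer = ORTOptimizer.from_pretrained(model)

# optimization_level=2 corresponds to O2 (basic + extended + transformer fusions)
config = OptimizationConfig(optimization_level=2)
optimizer.optimize(save_dir="./optimized_dir/", optimization_config=config)
```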
Step 4: Quantize for Web Deployment
Quantization Types for Transformers.js
| dtype | Precision | Best For | Size Reduction |
|---|---|---|---|
| `fp32` | Full 32-bit | Maximum accuracy | None (baseline) |
| `fp16` | Half 16-bit | WebGPU default quality | ~50% |
| `q8` / `int8` | 8-bit | WASM default, good balance | ~75% |
| `q4` / `bnb4` | 4-bit | Maximum compression | ~87% |
| `q4f16` | 4-bit weights, fp16 compute | WebGPU + small size | ~87% |
Using optimum-cli quantization
# Dynamic quantization (post-export)
optimum-cli onnxruntime quantize \
--onnx_model ./output_dir/ \
--avx512 \
-o ./quantized_dir/
Using Python API for finer control
from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig
model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
quantizer = ORTQuantizer.from_pretrained(model)
config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized_dir/", quantization_config=config)
Producing Multiple dtype Variants for Transformers.js
To provide fp32, fp16, q8, and q4 variants (like onnx-community models), organize output as:
model_onnx/
├── onnx/
│   ├── model.onnx            # fp32
│   ├── model_fp16.onnx       # fp16
│   ├── model_quantized.onnx  # q8
│   └── model_q4.onnx         # q4
├── config.json
├── tokenizer.json
└── tokenizer_config.json
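One way to produce the extra variants is to derive them from the exported fp32 graph. A hedged sketch, assuming `onnxruntime` and the `onnxconverter-common` package are installed and the file names follow the layout above:

```python
# Sketch: derive fp16 and q8 variants from the exported fp32 model and write
# them next to it using the Transformers.js naming convention.
import onnx
from onnxconverter_common import float16
from onnxruntime.quantization import QuantType, quantize_dynamic

src = "model_onnx/onnx/model.onnx"  # fp32 export from optimum-cli

# fp16: cast the graph to half precision
fp32_model = onnx.load(src)
fp16_model = float16.convert_float_to_float16(fp32_model)
onnx.save(fp16_model, "model_onnx/onnx/model_fp16.onnx")

# q8: dynamic 8-bit weight quantization
quantize_dynamic(src, "model_onnx/onnx/model_quantized.onnx", weight_type=QuantType.QUInt8)
```

Producing the 4-bit variants (`model_q4.onnx`) generally requires ONNX Runtime's block-wise 4-bit MatMul quantization tooling and is not covered by this sketch.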
Step 5: Upload to HuggingFace Hub (Optional)
# Login
huggingface-cli login
# Upload
huggingface-cli upload <your-username>/<model-name>-onnx ./output_dir/
# Add transformers.js tag to model card for discoverability
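The same upload can be scripted with the `huggingface_hub` Python client. A sketch, with the repo id and folder as placeholders:

```python
# Sketch: create the target repo (if needed) and upload the exported folder.
from huggingface_hub import HfApi

api = HfApi()  # uses the token stored by `huggingface-cli login`
repo_id = "<your-username>/<model-name>-onnx"  # placeholder

api.create_repo(repo_id, repo_type="model", exist_ok=True)
api.upload_folder(folder_path="./output_dir/", repo_id=repo_id, repo_type="model")
```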
Step 6: Use in Transformers.js with WebGPU
Install
npm install @huggingface/transformers
Basic Pipeline with WebGPU
import { pipeline } from "@huggingface/transformers";
const pipe = await pipeline("task-name", "model-id-or-path", {
device: "webgpu", // GPU acceleration
dtype: "q4", // Quantization level
});
const result = await pipe("input text");
Per-Module dtypes (encoder-decoder models)
Some models (Whisper, Florence-2) need different quantization per component:
import { Florence2ForConditionalGeneration } from "@huggingface/transformers";
const model = await Florence2ForConditionalGeneration.from_pretrained(
"onnx-community/Florence-2-base-ft",
{
dtype: {
embed_tokens: "fp16",
vision_encoder: "fp16",
encoder_model: "q4",
decoder_model_merged: "q4",
},
device: "webgpu",
},
);
For detailed Transformers.js WebGPU usage patterns: See references/webgpu-usage.md
Troubleshooting
For conversion errors and common issues: See references/conversion-guide.md
Quick Fixes
- "Task not found": Use the `--task` flag explicitly. For decoder models, try `text-generation-with-past`
- "trust_remote_code": Add the `--trust-remote-code` flag for custom model architectures
- Out of memory: Use `--device cpu` and a smaller `--batch_size`
- Validation fails: Try `--no-post-process` or increase `--atol`
- Model not supported: Check the list of supported architectures (120+ architectures supported)
- WebGPU fallback to WASM: Ensure browser supports WebGPU (Chrome 113+, Edge 113+)
Supported Task → Pipeline Mapping
| Task | Transformers.js Pipeline | Example Model |
|---|---|---|
| text-classification | sentiment-analysis | `distilbert-base-uncased-finetuned-sst-2` |
| text-generation | text-generation | `Qwen2.5-0.5B-Instruct` |
| feature-extraction | feature-extraction | `mxbai-embed-xsmall-v1` |
| automatic-speech-recognition | automatic-speech-recognition | `whisper-tiny.en` |
| image-classification | image-classification | `mobilenetv4_conv_small` |
| object-detection | object-detection | `detr-resnet-50` |
| image-segmentation | image-segmentation | `segformer-b0` |
| zero-shot-image-classification | zero-shot-image-classification | `clip-vit-base-patch32` |
| depth-estimation | depth-estimation | `depth-anything-small` |
| translation | translation | `nllb-200-distilled-600M` |
| summarization | summarization | `bart-large-cnn` |