onnx-webgpu-converter

Repository: jakerains/agentskills

Install command
npx skills add https://github.com/jakerains/agentskills --skill onnx-webgpu-converter

Skill Documentation

ONNX WebGPU Model Converter

Convert any HuggingFace model to ONNX and run it in the browser with Transformers.js + WebGPU.

Workflow Overview

  1. Check if ONNX version already exists on HuggingFace
  2. Set up Python environment with optimum
  3. Export model to ONNX with optimum-cli
  4. Quantize for target deployment (WebGPU vs WASM)
  5. Upload to HuggingFace Hub (optional)
  6. Use in Transformers.js with WebGPU

Step 1: Check for Existing ONNX Models

Before converting, check if the model already has an ONNX version:
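
One way to check (a sketch against the public Hub API; the repository ids below are only examples, not part of this skill) is to list a repo's files and look for .onnx entries, both on the original repo and in the onnx-community namespace:

// Sketch: query the HF Hub API for a repo's file listing and look for .onnx files.
async function hasOnnxFiles(repoId) {
  const res = await fetch(`https://huggingface.co/api/models/${repoId}`);
  if (!res.ok) return false;                        // repo missing, gated, or private
  const { siblings = [] } = await res.json();
  return siblings.some((f) => f.rfilename.endsWith(".onnx"));
}

console.log(await hasOnnxFiles("Qwen/Qwen2.5-0.5B-Instruct"));            // original repo
console.log(await hasOnnxFiles("onnx-community/Qwen2.5-0.5B-Instruct"));  // community ONNX port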

If found, skip to Step 6.

Step 2: Environment Setup

# Create venv (recommended)
python -m venv onnx-env && source onnx-env/bin/activate

# Install optimum with ONNX support
pip install "optimum[onnx]" onnxruntime

# For GPU-accelerated export (optional)
pip install onnxruntime-gpu

Verify installation:

optimum-cli export onnx --help

Step 3: Export to ONNX

Basic Export (auto-detect task)

optimum-cli export onnx --model <model_id_or_path> ./output_dir/

With Explicit Task

optimum-cli export onnx \
  --model <model_id> \
  --task <task> \
  ./output_dir/

Common tasks: text-generation, text-classification, feature-extraction, image-classification, automatic-speech-recognition, object-detection, image-segmentation, question-answering, token-classification, zero-shot-classification

For decoder models, append -with-past to export with KV-cache support (this is the default when the task is auto-detected): text-generation-with-past, text2text-generation-with-past, automatic-speech-recognition-with-past

Full CLI Reference

| Flag | Description |
| --- | --- |
| -m MODEL, --model MODEL | HuggingFace model ID or local path (required) |
| --task TASK | Export task (auto-detected if on Hub) |
| --opset OPSET | ONNX opset version (default: auto) |
| --device DEVICE | Export device, cpu (default) or cuda |
| --optimize {O1,O2,O3,O4} | ONNX Runtime optimization level |
| --monolith | Force single ONNX file (vs split encoder/decoder) |
| --no-post-process | Skip post-processing (e.g., decoder merging) |
| --trust-remote-code | Allow custom model code from Hub |
| --pad_token_id ID | Override pad token (needed for some models) |
| --cache_dir DIR | Cache directory for downloaded models |
| --batch_size N | Batch size for dummy inputs |
| --sequence_length N | Sequence length for dummy inputs |
| --framework {pt} | Source framework |
| --atol ATOL | Absolute tolerance for validation |

Optimization Levels

| Level | Description |
| --- | --- |
| O1 | Basic general optimizations |
| O2 | Basic + extended + transformer fusions |
| O3 | O2 + GELU approximation |
| O4 | O3 + mixed precision fp16 (GPU only, requires --device cuda) |

Step 4: Quantize for Web Deployment

Quantization Types for Transformers.js

| dtype | Precision | Best For | Size Reduction |
| --- | --- | --- | --- |
| fp32 | Full 32-bit | Maximum accuracy | None (baseline) |
| fp16 | Half 16-bit | WebGPU default, good quality | ~50% |
| q8 / int8 | 8-bit | WASM default, good balance | ~75% |
| q4 / bnb4 | 4-bit | Maximum compression | ~87% |
| q4f16 | 4-bit weights, fp16 compute | WebGPU + small size | ~87% |

Using optimum-cli quantization

# Dynamic quantization (post-export)
optimum-cli onnxruntime quantize \
  --onnx_model ./output_dir/ \
  --avx512 \
  -o ./quantized_dir/

Using Python API for finer control

from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the exported ONNX model, then apply dynamic int8 quantization
model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
quantizer = ORTQuantizer.from_pretrained(model)
config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized_dir/", quantization_config=config)

Producing Multiple dtype Variants for Transformers.js

To provide fp32, fp16, q8, and q4 variants (like onnx-community models), organize output as:

model_onnx/
├── onnx/
│   ├── model.onnx              # fp32
│   ├── model_fp16.onnx         # fp16
│   ├── model_quantized.onnx    # q8
│   └── model_q4.onnx           # q4
├── config.json
├── tokenizer.json
└── tokenizer_config.json
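
When loading this folder, Transformers.js picks the file inside onnx/ based on the dtype option (model.onnx for fp32, model_fp16.onnx for fp16, model_quantized.onnx for q8, model_q4.onnx for q4). A minimal loading sketch, assuming model_onnx/ is served next to the page and happens to be an embedding model:

import { pipeline, env } from "@huggingface/transformers";

// Serve the converted folder locally instead of fetching from the Hub.
env.allowRemoteModels = false;
env.localModelPath = "./";               // directory that contains model_onnx/

const pipe = await pipeline("feature-extraction", "model_onnx", {
  device: "webgpu",
  dtype: "fp16",                         // resolves to model_onnx/onnx/model_fp16.onnx
});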

Step 5: Upload to HuggingFace Hub (Optional)

# Login
huggingface-cli login

# Upload
huggingface-cli upload <your-username>/<model-name>-onnx ./output_dir/

# Add transformers.js tag to model card for discoverability

Step 6: Use in Transformers.js with WebGPU

Install

npm install @huggingface/transformers

Basic Pipeline with WebGPU

import { pipeline } from "@huggingface/transformers";

const pipe = await pipeline("task-name", "model-id-or-path", {
  device: "webgpu",    // GPU acceleration
  dtype: "q4",         // Quantization level
});

const result = await pipe("input text");

Per-Module dtypes (encoder-decoder models)

Some models (Whisper, Florence-2) need different quantization per component:

import { Florence2ForConditionalGeneration } from "@huggingface/transformers";

const model = await Florence2ForConditionalGeneration.from_pretrained(
  "onnx-community/Florence-2-base-ft",
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      encoder_model: "q4",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  },
);

For detailed Transformers.js WebGPU usage patterns: See references/webgpu-usage.md

Troubleshooting

For conversion errors and common issues: See references/conversion-guide.md

Quick Fixes

  • “Task not found”: Use --task flag explicitly. For decoder models try text-generation-with-past
  • “trust_remote_code”: Add --trust-remote-code flag for custom model architectures
  • Out of memory: Use --device cpu and smaller --batch_size
  • Validation fails: Try --no-post-process or increase --atol
  • Model not supported: Check the list of supported architectures (120+ architectures are covered)
  • WebGPU fallback to WASM: Ensure the browser supports WebGPU (Chrome 113+, Edge 113+); see the detection sketch below
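
For the WebGPU point, a feature-detection sketch (the model id is illustrative; requestAdapter() can return null even when navigator.gpu exists, e.g. on blocklisted GPUs):

import { pipeline } from "@huggingface/transformers";

// Prefer WebGPU when an adapter is actually available; otherwise fall back to WASM + q8.
const adapter = navigator.gpu ? await navigator.gpu.requestAdapter() : null;
const useWebGPU = adapter !== null;

const pipe = await pipeline("text-generation", "onnx-community/Qwen2.5-0.5B-Instruct", {
  device: useWebGPU ? "webgpu" : "wasm",
  dtype: useWebGPU ? "q4f16" : "q8",
});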

Supported Task → Pipeline Mapping

| Task | Transformers.js Pipeline | Example Model |
| --- | --- | --- |
| text-classification | sentiment-analysis | distilbert-base-uncased-finetuned-sst-2 |
| text-generation | text-generation | Qwen2.5-0.5B-Instruct |
| feature-extraction | feature-extraction | mxbai-embed-xsmall-v1 |
| automatic-speech-recognition | automatic-speech-recognition | whisper-tiny.en |
| image-classification | image-classification | mobilenetv4_conv_small |
| object-detection | object-detection | detr-resnet-50 |
| image-segmentation | image-segmentation | segformer-b0 |
| zero-shot-image-classification | zero-shot-image-classification | clip-vit-base-patch32 |
| depth-estimation | depth-estimation | depth-anything-small |
| translation | translation | nllb-200-distilled-600M |
| summarization | summarization | bart-large-cnn |
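
As a worked example for the feature-extraction row (the model id is illustrative; any Transformers.js-compatible embedding model works the same way):

import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", {
  device: "webgpu",
  dtype: "fp16",
});

// Mean-pooled, L2-normalized sentence embeddings.
const output = await extractor(["Convert models once, run them anywhere."], {
  pooling: "mean",
  normalize: true,
});
console.log(output.dims); // e.g. [1, 384]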