msmodelslim
npx skills add https://github.com/ascend-ai-coding/awesome-ascend-skills --skill msmodelslim
msModelSlim – Ascend Model Compression Tool
MindStudio ModelSlim (msModelSlim) is a model compression tool optimized for Huawei Ascend AI processors. It supports quantization and compression for LLMs, MoE models, and multimodal models.
Quick Start
One-Click Quantization (V1 – Recommended)
V1 automatic quantization reads its settings from configuration files in the lab_practice/ directory.
# Basic W8A8 quantization for Qwen2.5-7B-Instruct
# Config files located at: lab_practice/<model_series>/<model>-<quant_type>-v1.yaml
msmodelslim quant \
--model_path /path/to/Qwen2.5-7B-Instruct \
--save_path /path/to/output \
--device npu \
--model_type Qwen2.5-7B-Instruct \
--config_path /path/to/msmodelslim/lab_practice/qwen2.5/qwen2.5-7b-w8a8-v1.yaml \
--trust_remote_code True
# MoE model quantization (Qwen3-30B-A3B W4A8)
msmodelslim quant \
--model_path /path/to/Qwen3-30B-A3B \
--save_path /path/to/output \
--device npu \
--model_type Qwen3-30B \
--config_path /path/to/msmodelslim/lab_practice/qwen3_moe/qwen3-30b-w4a8-v1.yaml \
--trust_remote_code True
# Multi-device distributed quantization
msmodelslim quant \
--model_path /path/to/model \
--save_path /path/to/output \
--device npu:0,1,2,3 \
--model_type Qwen2.5-72B-Instruct \
--config_path /path/to/msmodelslim/lab_practice/qwen2.5/qwen2.5-72b-w8a8c8-v1.yaml \
--trust_remote_code True
Note: Find config files in the lab_practice/ directory of the msmodelslim repository:
- Structure: lab_practice/<model_series>/<model>-<quant_type>-v1.yaml
- Example: lab_practice/qwen2.5/qwen2.5-7b-w8a8-v1.yaml
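The naming convention above can be turned into a small helper. This is a minimal sketch, not part of msmodelslim: `config_path` and `quant_command` are hypothetical names, and the command uses only the flags shown in the examples above.

```python
from pathlib import PurePosixPath
from typing import List


def config_path(repo_root: str, series: str, model: str, quant_type: str) -> str:
    """Build <repo_root>/lab_practice/<model_series>/<model>-<quant_type>-v1.yaml."""
    name = f"{model}-{quant_type}-v1.yaml"
    return str(PurePosixPath(repo_root) / "lab_practice" / series / name)


def quant_command(model_path: str, save_path: str, model_type: str,
                  config: str, device: str = "npu") -> List[str]:
    """Return the argv list for a one-click run (flags copied from the examples above)."""
    return [
        "msmodelslim", "quant",
        "--model_path", model_path,
        "--save_path", save_path,
        "--device", device,
        "--model_type", model_type,
        "--config_path", config,
        "--trust_remote_code", "True",
    ]


cfg = config_path("/path/to/msmodelslim", "qwen2.5", "qwen2.5-7b", "w8a8")
cmd = quant_command("/path/to/Qwen2.5-7B-Instruct", "/path/to/output",
                    "Qwen2.5-7B-Instruct", cfg)
print(" ".join(cmd))
```

The list returned by `quant_command` can be passed to `subprocess.run(cmd, check=True)` once the paths point at real directories.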
Traditional Quantization (V0)
cd msmodelslim
python3 example/Qwen/quant_qwen.py \
--model_path /path/to/Qwen2.5-7B-Instruct \
--save_directory /path/to/output \
--calib_file example/common/boolq.jsonl \
--w_bit 8 --a_bit 8 \
--device_type npu \
--trust_remote_code True
Installation
Prerequisites
- Python: 3.8+ (3.9+ recommended for some environments)
- CANN: 8.2.RC1+ (8.3.RC1 or 8.5.0 recommended)
- PyTorch Ascend: Ascend Extension for PyTorch
Install Steps
# 1. Clone repository
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
# 2. Run installation script
bash install.sh
# 3. For Atlas 300I Duo (sparse quantization support)
cd ${PYTHON_SITE_PACKAGES}/msmodelslim/pytorch/weight_compression/compress_graph/
sudo bash build.sh ${CANN_INSTALL_PATH}/ascend-toolkit/latest
chmod -R 550 build
Note: Do not run msmodelslim commands from within the source directory to avoid module path conflicts.
See references/installation.md for detailed environment setup.
Quantization Types
| Type | Weight | Activation | Description | Use Case |
|---|---|---|---|---|
| W8A8 | INT8 | INT8 | Standard 8-bit quantization | General use, balanced precision/performance |
| W8A16 | INT8 | FP16 | Weight-only quantization | Higher precision needs (MindIE only) |
| W4A8 | INT4 | INT8 | Low-bit weight quantization | Higher compression ratio |
| W8A8C8 | INT8 | INT8 + KV Cache | With KV Cache quantization | Long sequence inference |
| W8A8S | INT8 Sparse | INT8 | Sparse quantization | Atlas 300I Duo optimization |
| W16A16S | FP16 Sparse | FP16 | Float sparse quantization | High compression needs |
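To make the compression effect of the weight bit-width concrete, here is a rough back-of-the-envelope calculation. It is only a sketch: the 7-billion parameter count is a hypothetical round number, and the estimate ignores quantization scales and offsets, activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, weight_bits: int) -> float:
    """Approximate weight storage in GB (10^9 bytes): params * bits / 8."""
    return num_params * weight_bits / 8 / 1e9


PARAMS = 7e9  # hypothetical 7-billion-parameter model

for label, bits in [("FP16/BF16", 16), ("W8 (INT8)", 8), ("W4 (INT4)", 4)]:
    print(f"{label:10s} {weight_memory_gb(PARAMS, bits):5.1f} GB")
```

Halving the weight bit-width halves the weight footprint, which is why W4A8 is listed for higher compression ratios.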
Quantization Type Selection
| Priority | Recommended Type |
|---|---|
| Precision first | W8A16 > W8A8 > W4A8 |
| Memory first | W4A8 > W8A8 > W8A16 |
| Long sequence | W8A8C8 (with KV Cache quant) |
| Atlas 300I Duo | W8A8S or W16A16S |
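The selection table can be encoded as a simple lookup. This is an illustrative sketch only: `pick_quant_type` and the priority keys are hypothetical names, and the orderings are copied from the table above.

```python
from typing import Dict, List, Optional, Set

# Preference order per priority, taken from the selection table.
RECOMMENDED: Dict[str, List[str]] = {
    "precision": ["W8A16", "W8A8", "W4A8"],
    "memory": ["W4A8", "W8A8", "W8A16"],
    "long_sequence": ["W8A8C8"],
    "atlas_300i_duo": ["W8A8S", "W16A16S"],
}


def pick_quant_type(priority: str, available: Optional[Set[str]] = None) -> str:
    """Return the first recommended type that the target runtime supports."""
    for qtype in RECOMMENDED[priority]:
        if available is None or qtype in available:
            return qtype
    raise ValueError(f"no recommended type for {priority!r} is available")


# W8A16 is marked MindIE-only in the types table, so another runtime might exclude it.
print(pick_quant_type("precision"))
print(pick_quant_type("precision", available={"W8A8", "W4A8"}))
```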
BFLOAT16 Model Notes
For models with torch_dtype=bfloat16 weights (e.g., Qwen3-30B-A3B):
If you encounter AclNN_Parameter_Error(EZ1001): Tensor self not implemented for DT_BFLOAT16, the cause is likely the Docker image rather than a limitation of msmodelslim.
Quick Diagnosis:
# Test if torch_npu works correctly
python3 -c "import torch; import torch_npu; a = torch.tensor(1).npu(); print('NPU OK')"
If this fails, your Docker image has compatibility issues. Try:
- Use a different/updated Docker image
- Reinstall torch_npu matching your CANN version
- Ensure CANN 8.3.RC1+ for BF16 support
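The one-line check above can be extended to exercise bfloat16 explicitly. The following is a hedged sketch: `diagnose` is a hypothetical helper, and it assumes that importing `torch_npu` registers the `.npu()` tensor method as in the quick diagnosis command. Whether it reproduces the exact EZ1001 error depends on which operator fails in your environment.

```python
def diagnose() -> dict:
    """Return a dict mapping each check name to 'ok' or a 'failed: ...' message."""
    results = {}
    try:
        import torch
        import torch_npu  # noqa: F401  (importing registers the NPU backend)
    except Exception as exc:  # missing package or broken installation
        results["import"] = f"failed: {exc}"
        return results
    results["import"] = "ok"

    for name, dtype in [("float16", torch.float16), ("bfloat16", torch.bfloat16)]:
        try:
            x = torch.ones(4, dtype=dtype).npu()  # move a small tensor to the NPU
            (x + x).cpu()                         # run one op and copy the result back
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    return results


for check, status in diagnose().items():
    print(f"{check}: {status}")
```

If float16 passes and bfloat16 fails, the image or the torch_npu/CANN pairing is the likely culprit, which matches the remedies listed above.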
Container Setup: See ascend-docker for proper Docker container creation with NPU device mappings. Refer to references/docker-setup.md for msmodelslim-specific container configuration.
Algorithm Selection
Outlier Suppression Algorithms
| Algorithm | Description | When to Use |
|---|---|---|
| SmoothQuant | Co-scale activation and weight | Standard outlier suppression |
| QuaRot | Orthogonal rotation matrix | High precision requirements |
| Iterative Smooth | Iterative smoothing | Complex distributions |
| Flex Smooth | Grid search for optimal alpha/beta | Different architectures |
| KV Smooth | KV Cache smoothing | KV Cache quantization |
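To illustrate what "co-scale activation and weight" means, here is a minimal pure-Python sketch of the scaling rule from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). It is a generic illustration, not msmodelslim's implementation; the matrices and function names are made up for the example.

```python
def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel scale s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


X = [[0.5, 40.0], [-1.0, -60.0]]  # tokens x in_features; channel 1 holds outliers
W = [[0.2, -0.4], [0.1, 0.3]]     # in_features x out_features

s = smooth_scales(X, W)
X_s = [[x / sj for x, sj in zip(row, s)] for row in X]  # divide activations by s
W_s = [[w * sj for w in row] for row, sj in zip(W, s)]  # multiply weight rows by s

print("original product:", matmul(X, W))
print("smoothed product:", matmul(X_s, W_s))
print("largest activation before/after:",
      max(abs(v) for r in X for v in r), max(abs(v) for r in X_s for v in r))
```

The product is unchanged up to floating-point rounding, while the activation outliers shrink and part of the quantization difficulty moves into the weights.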
Quantization Algorithms
| Algorithm | Description | When to Use |
|---|---|---|
| AutoRound | SignSGD optimization for rounding | 4-bit ultra-low quantization |
| GPTQ | Column-wise optimization | High precision weight quantization |
| SSZ | Iterative scale/offset search | Uneven weight distributions |
| PDMIX | Dynamic (prefill) + static (decode) | Large model inference |
| FA3 | Per-head INT8 attention | Long sequence, MLA models |
| MinMax | Min-max range statistics | Basic quantization |
| Histogram | Histogram distribution analysis | Filter outliers |
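As a reference point for the MinMax row, the sketch below shows a generic symmetric per-tensor INT8 quantizer whose scale comes from the observed absolute maximum. It assumes the symmetric variant and is not taken from msmodelslim's code; the sample weights are invented.

```python
def minmax_quantize(values, num_bits=8):
    """Symmetric quantization: scale = max|v| / qmax, q = clamp(round(v / scale))."""
    qmax = 2 ** (num_bits - 1) - 1                    # 127 for INT8
    scale = (max(abs(v) for v in values) / qmax) or 1.0  # avoid dividing by zero
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [qi * scale for qi in q]


weights = [0.02, -0.51, 0.33, 1.27, -0.9]
q, scale = minmax_quantize(weights)
restored = dequantize(q, scale)

print("codes:   ", q)
print("restored:", restored)
print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Because the scale is set by the single largest value, one outlier stretches the range for every other value, which is the problem the histogram-based and outlier suppression methods in the tables above are meant to address.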
Quick Selection Guide
- Beginners: Use one-click quantization with --config_path pointing to lab_practice/ config files
- Precision priority: QuaRot + AutoRound
- Long sequence: FA3 + KVCache Quant
- Custom model: See references/model-integration.md
See references/quantization-algorithms.md for algorithm details.
Supported Models
Large Language Models
| Model Series | One-Click | V0 Script | Notes |
|---|---|---|---|
| Qwen3 | ✓ | example/Qwen/ | Qwen3-8B/14B/32B |
| Qwen2.5 | ✓ | example/Qwen/ | 7B/32B/72B/Coder |
| Qwen2 | – | example/Qwen/ | 7B/72B |
| DeepSeek-V3 | ✓ | example/DeepSeek/ | V3/V3.1/V3.2, R1 |
| LLaMA | – | example/Llama/ | LLaMA2, LLaMA3.1 |
| GLM | – | example/GLM/ | GLM-4, GLM-5 |
| InternLM2 | – | example/InternLM2/ | InternLM2-20B |
| Kimi | – | example/Kimi/ | Kimi K2 |
| HunYuan | – | example/HunYuan/ | HunYuan-A52B |
MoE Models
| Model | One-Click | Notes |
|---|---|---|
| Qwen3-MoE | ✓ | Qwen3-30B-A3B, Qwen3-235B-A22B |
| DeepSeek MoE | ✓ | DeepSeek-V2, V3 series |
Multimodal Models
| Type | Models | Example Script |
|---|---|---|
| Vision-Language | Qwen-VL, Qwen2-VL, Qwen3-VL, InternVL2, LLaVA, GLM-4.1V | example/multimodal_vlm/ |
| Generation | FLUX, SD3, HunyuanVideo, OpenSoraPlan, Wan2.1 | example/multimodal_sd/ |
See references/model-support.md for complete support matrix.
Custom Model Integration
Quick Overview
- Create adapter file: msmodelslim/model/my_model/model_adapter.py
- Define adapter class: Inherit TransformersModel + interface classes
- Implement interfaces: handle_dataset, init_model, generate_model_visit, etc.
- Register model: Add to config/config.ini
Example
from msmodelslim.model.interface_hub import ModelSlimPipelineInterfaceV1
from msmodelslim.model.common.transformers import TransformersModel


class MyModelAdapter(TransformersModel, ModelSlimPipelineInterfaceV1):
    def handle_dataset(self, dataset, device):
        return self._get_tokenized_data(dataset, device)

    def init_model(self, device):
        return self._load_model(device)

    def generate_model_visit(self, model):
        from msmodelslim.model.common.layer_wise_forward import generated_decoder_layer_visit_func
        yield from generated_decoder_layer_visit_func(model)

    def generate_model_forward(self, model, inputs):
        from msmodelslim.model.common.layer_wise_forward import transformers_generated_forward_func
        yield from transformers_generated_forward_func(model, inputs)
See references/model-integration.md and scripts/model_adapter_template.py for complete guide.
Precision Auto-Tuning
Sensitive Layer Analysis
# Analyze model sensitivity
msmodelslim analyze --model_path /path/to/model --model_type Qwen2.5-7B-Instruct
Analysis Algorithms:
- std: Standard deviation based (recommended for general use)
- quantile: Quantile/IQR based (for long-tail distributions)
- kurtosis: Kurtosis based (for extreme value detection)
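The three statistics named above behave differently in the presence of outliers. The sketch below computes them with the standard library on invented data. It shows the statistics themselves and makes no claim about how msmodelslim turns them into per-layer sensitivity scores.

```python
import statistics


def std_score(values):
    """Population standard deviation."""
    return statistics.pstdev(values)


def iqr_score(values):
    """Interquartile range (Q3 - Q1); robust to a few extreme values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def kurtosis_score(values):
    """Population kurtosis m4 / m2**2; a normal distribution gives about 3."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m4 = sum((v - mean) ** 4 for v in values) / len(values)
    return m4 / m2 ** 2


calm = [-1.0, -0.5, 0.0, 0.5, 1.0] * 4   # well-behaved values
spiky = calm[:-1] + [25.0]               # same values with one extreme outlier

for name, data in [("calm", calm), ("spiky", spiky)]:
    print(f"{name:5s} std={std_score(data):.3f} "
          f"iqr={iqr_score(data):.3f} kurtosis={kurtosis_score(data):.3f}")
```

A single outlier inflates the standard deviation and, far more strongly, the kurtosis, while the IQR barely moves. This is consistent with using kurtosis for extreme value detection and quantiles for long-tail distributions.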
Auto-Tuning Strategy
Standing High: Uses binary search to find the smallest set of fallback layers (layers kept at higher precision) that still meets the precision target.
# Use auto-tuning config
msmodelslim quant \
--model_path /path/to/model \
--save_path /path/to/output \
--model_type Qwen2.5-7B-Instruct \
--config_path /path/to/auto_tuning_config.yaml
See references/precision-tuning.md for tuning strategies.
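The binary search idea behind the Standing High strategy can be sketched generically. This is an assumption-laden illustration, not the tool's code: `min_fallback_layers` is a hypothetical function, the layer names are invented, and `meets_target` stands in for a real accuracy evaluation that is assumed to be monotonic (if k fallback layers suffice, so do k + 1).

```python
def min_fallback_layers(ranked_layers, meets_target):
    """Return the shortest prefix of ranked_layers (most sensitive first) that
    satisfies meets_target, or None if even falling back every layer fails."""
    lo, hi = 0, len(ranked_layers)
    if not meets_target(ranked_layers[:hi]):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if meets_target(ranked_layers[:mid]):
            hi = mid          # mid layers are enough; try fewer
        else:
            lo = mid + 1      # not enough; need more fallback layers
    return ranked_layers[:lo]


# Hypothetical layers ordered by sensitivity, most sensitive first.
ranked = ["layers.31", "layers.0", "layers.17", "layers.5"]
evaluations = []


def meets_target(fallback):
    """Toy stand-in for an accuracy check: passes once two layers fall back."""
    evaluations.append(len(fallback))
    return len(fallback) >= 2


chosen = min_fallback_layers(ranked, meets_target)
print("fallback layers:", chosen)
print("evaluations run:", len(evaluations))
```

Each evaluation in practice means quantizing and scoring the model, so the logarithmic number of checks is the point of using binary search instead of testing every prefix.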
Deployment
vLLM-Ascend
# Online service
vllm serve /path/to/quantized-model \
--served-model-name "Qwen2.5-7B-w8a8" \
--max-model-len 4096 \
--quantization ascend
# Offline inference (Python)
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/quantized-model",
    max_model_len=4096,
    quantization="ascend",
)
outputs = llm.generate(["Hello"], SamplingParams(temperature=0.6))
MindIE
# Deploy with MindIE
# See MindIE documentation for details
Weight Conversion
# Convert to AutoAWQ/AutoGPTQ format
python3 example/common/ms_to_vllm.py --input /path/to/quantized --output /path/to/converted
See references/deployment.md for deployment details.
Output Files
After quantization, the output directory contains:
output/
├── config.json                          # Original model config
├── generation_config.json               # Generation config
├── quant_model_description.json         # Quantization description
├── quant_model_weight_w8a8.safetensors  # Quantized weights
├── tokenizer_config.json                # Tokenizer config
├── tokenizer.json                       # Tokenizer vocabulary
└── vocab.json                           # Vocabulary (if applicable)
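A quick sanity check of the output directory can catch an incomplete run before deployment. The sketch below is illustrative: `check_output` is a hypothetical helper, the file names come from the listing above, and the weight pattern assumes the file name varies with the quantization type. The demo runs against a temporary directory with empty placeholder files.

```python
import tempfile
from pathlib import Path

REQUIRED = ["config.json", "quant_model_description.json"]


def check_output(output_dir):
    """Report missing required files and list any quantized weight files found."""
    out = Path(output_dir)
    missing = [name for name in REQUIRED if not (out / name).is_file()]
    weights = sorted(p.name for p in out.glob("quant_model_weight_*.safetensors"))
    return {"missing": missing, "weights": weights, "ok": not missing and bool(weights)}


# Demo against a fake output directory containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["config.json", "quant_model_weight_w8a8.safetensors"]:
        (Path(tmp) / name).touch()
    report = check_output(tmp)

print(report)
```

In this demo the report flags quant_model_description.json as missing, so `ok` is False even though a weight file is present.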
Troubleshooting
Common Issues
Q: Out of memory during quantization?
# Use layer-by-layer quantization (default in V1)
# Or use CPU quantization
msmodelslim quant --device cpu ...
Q: Precision degradation after quantization?
- Use higher precision type (W8A8 instead of W4A8)
- Check lab_practice/ for best practice configs
- Enable outlier suppression algorithms
- See references/precision-tuning.md
Q: Model type not supported?
- Check references/model-support.md
- Implement custom adapter: references/model-integration.md
Q: How to enable debug logging?
export MSMODELSLIM_LOG_LEVEL=DEBUG
msmodelslim quant ...
Scripts & Assets
Scripts
- scripts/check_env.sh – Environment check
- scripts/quantize_model.sh – Quantization template
- scripts/model_adapter_template.py – Model adapter template
Config Templates (assets/)
- assets/quant_config_w8a8.yaml – W8A8 config
- assets/quant_config_w4a8.yaml – W4A8 config
- assets/quant_config_sparse.yaml – Sparse quantization
- assets/quant_config_pdmix.yaml – PDMIX config
Official References
- Documentation: https://msmodelslim.readthedocs.io/zh-cn/latest/
- GitCode Repository: https://gitcode.com/Ascend/msmodelslim
- vLLM-Ascend: https://docs.vllm.ai/projects/ascend/en/latest/
- Huawei Ascend: https://www.hiascend.com/document
Related Skills
- atc-model-converter – Model conversion for Ascend
- npu-smi – NPU device management
- hccl-test – HCCL performance testing