msmodelslim

Install command
npx skills add https://github.com/ascend-ai-coding/awesome-ascend-skills --skill msmodelslim

Skill Documentation

msModelSlim – Ascend Model Compression Tool

MindStudio ModelSlim (msModelSlim) is a model compression tool optimized for Huawei Ascend AI processors. It supports quantization and compression for LLMs, MoE models, and multimodal models.


Quick Start

One-Click Quantization (V1 – Recommended)

V1 automatic quantization uses configuration files from the lab_practice/ directory.

# Basic W8A8 quantization for Qwen2.5-7B-Instruct
# Config files located at: lab_practice/<model_series>/<model>-<quant_type>-v1.yaml
msmodelslim quant \
    --model_path /path/to/Qwen2.5-7B-Instruct \
    --save_path /path/to/output \
    --device npu \
    --model_type Qwen2.5-7B-Instruct \
    --config_path /path/to/msmodelslim/lab_practice/qwen2.5/qwen2.5-7b-w8a8-v1.yaml \
    --trust_remote_code True

# MoE model quantization (Qwen3-30B-A3B W4A8)
msmodelslim quant \
    --model_path /path/to/Qwen3-30B-A3B \
    --save_path /path/to/output \
    --device npu \
    --model_type Qwen3-30B \
    --config_path /path/to/msmodelslim/lab_practice/qwen3_moe/qwen3-30b-w4a8-v1.yaml \
    --trust_remote_code True

# Multi-device distributed quantization
msmodelslim quant \
    --model_path /path/to/model \
    --save_path /path/to/output \
    --device npu:0,1,2,3 \
    --model_type Qwen2.5-72B-Instruct \
    --config_path /path/to/msmodelslim/lab_practice/qwen2.5/qwen2.5-72b-w8a8c8-v1.yaml \
    --trust_remote_code True

Note: Config files are in the lab_practice/ directory of the msmodelslim repository:

  • Structure: lab_practice/<model_series>/<model>-<quant_type>-v1.yaml
  • Example: lab_practice/qwen2.5/qwen2.5-7b-w8a8-v1.yaml
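
The naming pattern above can be turned into a small path helper. This is a minimal sketch: `lab_practice_config` is a hypothetical function, not part of msmodelslim, and it only assembles a path from the documented pattern without checking that the file exists.

```python
from pathlib import PurePosixPath


def lab_practice_config(repo_root: str, model_series: str, model: str, quant_type: str) -> str:
    """Build <repo_root>/lab_practice/<model_series>/<model>-<quant_type>-v1.yaml.

    Hypothetical helper: it mirrors the documented naming pattern only.
    """
    filename = f"{model}-{quant_type}-v1.yaml"
    return str(PurePosixPath(repo_root) / "lab_practice" / model_series / filename)


print(lab_practice_config("/path/to/msmodelslim", "qwen2.5", "qwen2.5-7b", "w8a8"))
```

The result can be passed directly as the value of --config_path.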

Traditional Quantization (V0)

cd msmodelslim
python3 example/Qwen/quant_qwen.py \
    --model_path /path/to/Qwen2.5-7B-Instruct \
    --save_directory /path/to/output \
    --calib_file example/common/boolq.jsonl \
    --w_bit 8 --a_bit 8 \
    --device_type npu \
    --trust_remote_code True

Installation

Prerequisites

  • Python: 3.8+ (3.9+ recommended for some environments)
  • CANN: 8.2.RC1+ (8.3.RC1 or 8.5.0 recommended)
  • PyTorch Ascend: Ascend Extension for PyTorch

Install Steps

# 1. Clone repository
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim

# 2. Run installation script
bash install.sh

# 3. For Atlas 300I Duo (sparse quantization support)
cd ${PYTHON_SITE_PACKAGES}/msmodelslim/pytorch/weight_compression/compress_graph/
sudo bash build.sh ${CANN_INSTALL_PATH}/ascend-toolkit/latest
chmod -R 550 build

Note: Do not run msmodelslim commands from within the source directory to avoid module path conflicts.

See references/installation.md for detailed environment setup.


Quantization Types

| Type | Weight | Activation | Description | Use Case |
| --- | --- | --- | --- | --- |
| W8A8 | INT8 | INT8 | Standard 8-bit quantization | General use, balanced precision/performance |
| W8A16 | INT8 | FP16 | Weight-only quantization | Higher precision needs (MindIE only) |
| W4A8 | INT4 | INT8 | Low-bit weight quantization | Higher compression ratio |
| W8A8C8 | INT8 | INT8 + KV Cache | With KV Cache quantization | Long sequence inference |
| W8A8S | INT8 Sparse | INT8 | Sparse quantization | Atlas 300I Duo optimization |
| W16A16S | FP16 Sparse | FP16 | Float sparse quantization | High compression needs |
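
A rough sense of what the weight bit-width buys in storage follows from bytes = parameters × bits / 8. The sketch below applies that arithmetic to a hypothetical 7-billion-parameter model; it ignores quantization scales and offsets, any layers kept in higher precision, and sparsity, so real checkpoint sizes will differ.

```python
def weight_bytes(num_params: int, weight_bits: int) -> float:
    """Approximate weight storage in bytes, ignoring scales, offsets, and
    any layers that are kept in higher precision."""
    return num_params * weight_bits / 8


params = 7_000_000_000  # hypothetical 7-billion-parameter model
for name, bits in [("FP16/BF16", 16), ("W8A8 / W8A16", 8), ("W4A8", 4)]:
    print(f"{name}: {weight_bytes(params, bits) / 1e9:.1f} GB")
```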

Quantization Type Selection

| Priority | Recommended Type |
| --- | --- |
| Precision first | W8A16 > W8A8 > W4A8 |
| Memory first | W4A8 > W8A8 > W8A16 |
| Long sequence | W8A8C8 (with KV Cache quant) |
| Atlas 300I Duo | W8A8S or W16A16S |
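
The preference orders above can be encoded as a lookup that picks the first recommended type a given deployment actually supports. This is an illustrative sketch: the dictionary keys and the `recommend` function are hypothetical names, and only the orderings come from the table.

```python
from typing import List, Optional

# Preference orders copied from the selection table; the keys are
# hypothetical labels chosen for this sketch.
RECOMMENDED = {
    "precision": ["W8A16", "W8A8", "W4A8"],
    "memory": ["W4A8", "W8A8", "W8A16"],
    "long_sequence": ["W8A8C8"],
    "atlas_300i_duo": ["W8A8S", "W16A16S"],
}


def recommend(priority: str, available: List[str]) -> Optional[str]:
    """Return the first recommended type for `priority` that is in `available`."""
    for quant_type in RECOMMENDED[priority]:
        if quant_type in available:
            return quant_type
    return None


# W8A16 is marked MindIE-only above, so a stack without it falls back to W8A8.
print(recommend("precision", ["W8A8", "W4A8"]))
```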

BFLOAT16 Model Notes

For models with torch_dtype=bfloat16 weights (e.g., Qwen3-30B-A3B):

If you encounter AclNN_Parameter_Error(EZ1001): Tensor self not implemented for DT_BFLOAT16, this is likely a Docker image issue, not an msmodelslim limitation.

Quick Diagnosis:

# Test if torch_npu works correctly
python3 -c "import torch; import torch_npu; a = torch.tensor(1).npu(); print('NPU OK')"

If this fails, your Docker image has compatibility issues. Try:

  1. Use a different/updated Docker image
  2. Reinstall torch_npu matching your CANN version
  3. Ensure CANN 8.3.RC1+ for BF16 support

Container Setup: See ascend-docker for proper Docker container creation with NPU device mappings. Refer to references/docker-setup.md for msmodelslim-specific container configuration.


Algorithm Selection

Outlier Suppression Algorithms

| Algorithm | Description | When to Use |
| --- | --- | --- |
| SmoothQuant | Co-scale activation and weight | Standard outlier suppression |
| QuaRot | Orthogonal rotation matrix | High precision requirements |
| Iterative Smooth | Iterative smoothing | Complex distributions |
| Flex Smooth | Grid search for optimal alpha/beta | Different architectures |
| KV Smooth | KV Cache smoothing | KV Cache quantization |
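
To illustrate what "co-scale activation and weight" means, the sketch below follows the published SmoothQuant formulation: each input channel j gets a scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), activations are divided by it, and weights are multiplied by it, so the layer output stays the same while the activation outliers shrink. It is written in plain Python for readability and is not msModelSlim's implementation; alpha = 0.5 and the toy tensors are assumptions, and all channel maxima are assumed to be non-zero.

```python
def smooth_scales(acts, weights, alpha=0.5):
    """Per-input-channel scales s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).

    acts:    list of rows, shape [tokens][in_channels]
    weights: list of rows, shape [in_channels][out_channels]
    """
    scales = []
    for j in range(len(weights)):
        act_max = max(abs(row[j]) for row in acts)
        w_max = max(abs(w) for w in weights[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


acts = [[0.1, 40.0], [-0.2, 35.0]]      # channel 1 carries the outliers
weights = [[0.5, -0.3], [0.02, 0.01]]

s = smooth_scales(acts, weights)
smoothed_acts = [[x / s[j] for j, x in enumerate(row)] for row in acts]
smoothed_weights = [[w * s[j] for w in row] for j, row in enumerate(weights)]

# The layer output is unchanged up to float rounding ...
print(matmul(acts, weights))
print(matmul(smoothed_acts, smoothed_weights))
# ... while the outlier channel of the activations has a much smaller range.
print(max(abs(row[1]) for row in smoothed_acts))
```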

Quantization Algorithms

| Algorithm | Description | When to Use |
| --- | --- | --- |
| AutoRound | SignSGD optimization for rounding | 4-bit ultra-low quantization |
| GPTQ | Column-wise optimization | High precision weight quantization |
| SSZ | Iterative scale/offset search | Uneven weight distributions |
| PDMIX | Dynamic (prefill) + static (decode) | Large model inference |
| FA3 | Per-head INT8 attention | Long sequence, MLA models |
| MinMax | Min-max range statistics | Basic quantization |
| Histogram | Histogram distribution analysis | Filter outliers |
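
As a point of reference for the simplest entry in the table, here is a textbook min-max quantizer: the observed minimum and maximum define the scale and zero point of an asymmetric integer grid. This is a generic sketch and may differ from how msModelSlim's MinMax is configured (for example symmetric versus asymmetric, or per-channel versus per-tensor).

```python
def minmax_quantize(values, num_bits=8):
    """Asymmetric per-tensor quantization from min/max statistics."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for constant input
    zero_point = round(qmin - lo / scale)
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map integer codes back to approximate real values."""
    return [(q - zero_point) * scale for q in quantized]


values = [-1.0, -0.25, 0.0, 0.5, 3.0]
q, scale, zero_point = minmax_quantize(values)
print(q, scale, zero_point)
print(dequantize(q, scale, zero_point))
```

A single extreme value stretches the range and coarsens the grid for every other value, which is why the outlier-suppression and histogram-based options exist.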

Quick Selection Guide

  • Beginners: Use one-click quantization with --config_path pointing to lab_practice/ config files
  • Precision priority: QuaRot + AutoRound
  • Long sequence: FA3 + KVCache Quant
  • Custom model: See references/model-integration.md

See references/quantization-algorithms.md for algorithm details.


Supported Models

Large Language Models

| Model Series | One-Click | V0 Script | Notes |
| --- | --- | --- | --- |
| Qwen3 | ✓ | example/Qwen/ | Qwen3-8B/14B/32B |
| Qwen2.5 | ✓ | example/Qwen/ | 7B/32B/72B/Coder |
| Qwen2 | | example/Qwen/ | 7B/72B |
| DeepSeek-V3 | ✓ | example/DeepSeek/ | V3/V3.1/V3.2, R1 |
| LLaMA | | example/Llama/ | LLaMA2, LLaMA3.1 |
| GLM | | example/GLM/ | GLM-4, GLM-5 |
| InternLM2 | | example/InternLM2/ | InternLM2-20B |
| Kimi | | example/Kimi/ | Kimi K2 |
| HunYuan | | example/HunYuan/ | HunYuan-A52B |

MoE Models

| Model | One-Click | Notes |
| --- | --- | --- |
| Qwen3-MoE | ✓ | Qwen3-30B-A3B, Qwen3-235B-A22B |
| DeepSeek MoE | ✓ | DeepSeek-V2, V3 series |

Multimodal Models

| Type | Models | Example Script |
| --- | --- | --- |
| Vision-Language | Qwen-VL, Qwen2-VL, Qwen3-VL, InternVL2, LLaVA, GLM-4.1V | example/multimodal_vlm/ |
| Generation | FLUX, SD3, HunyuanVideo, OpenSoraPlan, Wan2.1 | example/multimodal_sd/ |

See references/model-support.md for complete support matrix.


Custom Model Integration

Quick Overview

  1. Create adapter file: msmodelslim/model/my_model/model_adapter.py
  2. Define adapter class: Inherit TransformersModel + interface classes
  3. Implement interfaces: handle_dataset, init_model, generate_model_visit, etc.
  4. Register model: Add to config/config.ini

Example

from msmodelslim.model.interface_hub import ModelSlimPipelineInterfaceV1
from msmodelslim.model.common.transformers import TransformersModel

class MyModelAdapter(TransformersModel, ModelSlimPipelineInterfaceV1):
    def handle_dataset(self, dataset, device):
        return self._get_tokenized_data(dataset, device)
    
    def init_model(self, device):
        return self._load_model(device)
    
    def generate_model_visit(self, model):
        from msmodelslim.model.common.layer_wise_forward import generated_decoder_layer_visit_func
        yield from generated_decoder_layer_visit_func(model)
    
    def generate_model_forward(self, model, inputs):
        from msmodelslim.model.common.layer_wise_forward import transformers_generated_forward_func
        yield from transformers_generated_forward_func(model, inputs)

See references/model-integration.md and scripts/model_adapter_template.py for complete guide.


Precision Auto-Tuning

Sensitive Layer Analysis

# Analyze model sensitivity
msmodelslim analyze --model_path /path/to/model --model_type Qwen2.5-7B-Instruct

Analysis Algorithms:

  • std: Standard deviation based (recommended for general use)
  • quantile: Quantile/IQR based (for long-tail distributions)
  • kurtosis: Kurtosis based (for extreme value detection)
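
The three statistics named above can be computed with the standard library alone. The sketch below only shows what each metric measures on a flat list of values; which tensors msModelSlim analyses and how it turns the scores into a layer ranking are not specified here, so the function names and sample data are assumptions.

```python
import statistics


def std_score(values):
    """Population standard deviation: overall spread."""
    return statistics.pstdev(values)


def iqr_score(values):
    """Interquartile range: spread of the middle half, robust to long tails."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def kurtosis_score(values):
    """Excess kurtosis: fourth central moment / variance**2, minus 3."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    fourth = sum((v - mean) ** 4 for v in values) / len(values)
    return fourth / var ** 2 - 3.0


well_behaved = [-1.0, -0.5, 0.0, 0.5, 1.0] * 4
with_outlier = well_behaved + [25.0]  # one extreme value

for name, data in [("well_behaved", well_behaved), ("with_outlier", with_outlier)]:
    print(name, std_score(data), iqr_score(data), kurtosis_score(data))
```

On this toy data the single extreme value raises the standard deviation and the kurtosis sharply, while the interquartile range is far less affected.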

Auto-Tuning Strategy

Standing High: Binary search to minimize fallback layers while maintaining precision.

# Use auto-tuning config
msmodelslim quant \
    --model_path /path/to/model \
    --save_path /path/to/output \
    --model_type Qwen2.5-7B-Instruct \
    --config_path /path/to/auto_tuning_config.yaml
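
The binary-search idea behind minimizing fallback layers can be sketched as follows. This is not the tool's implementation: `min_fallback_layers` and the `is_precise_enough` callback are hypothetical, and the sketch assumes that layers are already ranked by sensitivity, that keeping more layers in high precision never hurts accuracy, and that falling back every layer always passes.

```python
from typing import Callable


def min_fallback_layers(num_layers: int, is_precise_enough: Callable[[int], bool]) -> int:
    """Smallest k such that falling back the k most sensitive layers passes.

    Assumes monotonicity: if k layers suffice, any larger k also suffices,
    and k == num_layers always passes.
    """
    lo, hi = 0, num_layers
    while lo < hi:
        mid = (lo + hi) // 2
        if is_precise_enough(mid):
            hi = mid        # mid works, try fewer fallback layers
        else:
            lo = mid + 1    # mid is not enough, need more fallback layers
    return lo


# Stand-in for a real accuracy evaluation: pretend 5 fallback layers are needed.
print(min_fallback_layers(32, lambda k: k >= 5))
```

Each call to the callback stands for a full quantize-and-evaluate run, so the search needs about log2(num_layers) evaluations instead of one per layer.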

See references/precision-tuning.md for tuning strategies.


Deployment

vLLM-Ascend

# Online service
vllm serve /path/to/quantized-model \
    --served-model-name "Qwen2.5-7B-w8a8" \
    --max-model-len 4096 \
    --quantization ascend

# Offline inference (Python)
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/quantized-model",
    max_model_len=4096,
    quantization="ascend"
)
outputs = llm.generate(["Hello"], SamplingParams(temperature=0.6))
for output in outputs:
    print(output.outputs[0].text)  # generated text for each prompt

MindIE

# Deploy with MindIE
# See MindIE documentation for details

Weight Conversion

# Convert to AutoAWQ/AutoGPTQ format
python3 example/common/ms_to_vllm.py --input /path/to/quantized --output /path/to/converted

See references/deployment.md for deployment details.


Output Files

After quantization, the output directory contains:

output/
├── config.json                      # Original model config
├── generation_config.json           # Generation config
├── quant_model_description.json     # Quantization description
├── quant_model_weight_w8a8.safetensors  # Quantized weights
├── tokenizer_config.json            # Tokenizer config
├── tokenizer.json                   # Tokenizer vocabulary
└── vocab.json                       # Vocabulary (if applicable)
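
A quick sanity check after a run is to confirm that the listed files are present. The helper below is a hypothetical convenience, not part of msmodelslim. It treats vocab.json as optional, as the listing does, and it assumes that other quantization types use the same weight file name with a different suffix.

```python
from pathlib import Path

# Files from the layout above; vocab.json is optional there, so it is not required here.
REQUIRED_FILES = [
    "config.json",
    "generation_config.json",
    "quant_model_description.json",
    "tokenizer_config.json",
    "tokenizer.json",
]
# Assumption: other quantization types follow the same naming with a different suffix.
WEIGHT_PATTERN = "quant_model_weight_*.safetensors"


def missing_outputs(output_dir):
    """Return the expected entries that are absent from output_dir."""
    root = Path(output_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any(root.glob(WEIGHT_PATTERN)):
        missing.append(WEIGHT_PATTERN)
    return missing


print(missing_outputs("/path/to/output"))
```

An empty list means every expected file was found.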

Troubleshooting

Common Issues

Q: Out of memory during quantization?

# Use layer-by-layer quantization (default in V1)
# Or use CPU quantization
msmodelslim quant --device cpu ...

Q: Precision degradation after quantization?

  • Use a higher-precision type (for example, W8A8 instead of W4A8)
  • Check lab_practice/ for best-practice configs
  • Enable an outlier suppression algorithm
  • See references/precision-tuning.md

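Q: Model type not supported?

  • Add an adapter for the model as described under Custom Model Integration
  • See references/model-integration.md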

Q: How to enable debug logging?

export MSMODELSLIM_LOG_LEVEL=DEBUG
msmodelslim quant ...
