ais-bench
npx skills add https://github.com/ascend-ai-coding/awesome-ascend-skills --skill ais-bench
AISBench Benchmark Tool
AISBench Benchmark is a model evaluation tool built on OpenCompass. It supports both accuracy and performance evaluation of AI models on Ascend NPUs.
Overview
- Accuracy Evaluation: Accuracy verification of service-deployed models and local models on various QA and reasoning benchmark datasets, covering text, multimodal, and other scenarios.
- Performance Evaluation: Latency and throughput evaluation of service-deployed models, extreme performance testing under stress test scenarios, steady-state performance evaluation, and real business traffic simulation.
Supported Scenarios
| Scenario | Description |
|---|---|
| Accuracy Evaluation | Model accuracy on text/multimodal datasets |
| Performance Evaluation | Latency, throughput, stress testing |
| Steady-State Performance | Obtain true optimal system performance |
| Real Traffic Simulation | Simulate real business traffic patterns |
| Multi-turn Dialogue | Evaluate multi-turn conversation models |
| Function Call (BFCL) | Function calling capability evaluation |
Supported Benchmarks
- Text: GSM8K, MMLU, Ceval, FewCLUE series, dapo_math, leval
- Multimodal: docvqa, infovqa, ocrbench_v2, omnidocbench, mmmu, mmmu_pro, mmstar, videomme, textvqa, videobench, vocalsound
- Multi-turn Dialogue: sharegpt, mtbench
- Function Call: BFCL (Berkeley Function Calling Leaderboard)
Installation
Environment Requirements
Python Version: Only Python 3.10, 3.11, or 3.12 is supported.
# Create conda environment
conda create --name ais_bench python=3.10 -y
conda activate ais_bench
Install from Source
git clone https://github.com/AISBench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517
Verify installation:
ais_bench -h
Optional Dependencies
# For service-deployed model evaluation (vLLM, Triton, etc.)
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt
# For Huggingface multimodal / vLLM offline inference
pip3 install -r requirements/hf_vl_dependency.txt
# For BFCL Function Calling evaluation
pip3 install -r requirements/datasets/bfcl_dependencies.txt --no-deps
Quick Start
Basic Command Structure
ais_bench --models <model_task> --datasets <dataset_task> [--summarizer example]
- `--models`: Specifies the model task configuration
- `--datasets`: Specifies the dataset task configuration
- `--summarizer`: Result presentation task (default: `example`)
Find Configuration Files
# Print the config file paths for the given model and dataset tasks
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --search
Example: Service Model Accuracy Evaluation
1. Start the vLLM inference service (follow the vLLM documentation).

2. Prepare the dataset:
   - Download GSM8K from OpenCompass
   - Extract it to `ais_bench/datasets/gsm8k/`

3. Modify the model configuration (`vllm_api_general_chat.py`):

   ```python
   from ais_bench.benchmark.models import VLLMCustomAPIChat

   models = [
       dict(
           attr="service",
           type=VLLMCustomAPIChat,
           abbr='vllm-api-general-chat',
           path="",
           model="",
           stream=False,
           request_rate=0,
           retry=2,
           api_key="",
           host_ip="localhost",
           host_port=8080,
           url="",
           max_out_len=512,
           batch_size=1,
           trust_remote_code=False,
           generation_kwargs=dict(
               temperature=0.01,
               ignore_eos=False,
           )
       )
   ]
   ```

4. Run the evaluation:
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt
Output Results
dataset version metric mode vllm_api_general_chat
----------------------- -------- -------- ----- ----------------------
demo_gsm8k 401e4c accuracy gen 62.50
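If you need the summary programmatically, the plain-text table above can be parsed with a small helper. This is a sketch over the example output format only (whitespace-separated columns, one dashed separator line), not an official ais_bench API:

```python
# Sketch: parse an AISBench plain-text summary table into a list of dicts.
# Assumes the layout shown above: header row, dashed separator, data rows.
def parse_summary(text: str) -> list[dict]:
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[2:]:  # skip the dashed separator line
        rows.append(dict(zip(header, ln.split())))
    return rows

sample = """\
dataset     version  metric    mode  vllm_api_general_chat
----------- -------- --------  ----- ----------------------
demo_gsm8k  401e4c   accuracy  gen   62.50
"""
rows = parse_summary(sample)
```

For structured post-processing, prefer the `summary_*.csv` files under the output directory; this parser is only a fallback for console logs.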
Model Task Types
Service-Deployed Models
| Model Type | Description |
|---|---|
| `vllm_api_general_chat` | General vLLM API chat model |
| `vllm_api_function_call_chat` | Function calling model (BFCL) |
| `triton_api_*` | Triton inference service |
Local Models
| Model Type | Description |
|---|---|
| `hf_*` | HuggingFace models |
| `vllm_offline_*` | vLLM offline inference |
Performance Evaluation
Key Metrics
| Metric | Description |
|---|---|
| TTFT | Time to First Token |
| TPOT | Time Per Output Token |
| Throughput | Tokens per second |
| Latency | Request latency (P50, P90, P99) |
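To make the metrics concrete, here is an illustrative computation of TPOT, throughput, and latency percentiles from raw per-request measurements. The numbers and the nearest-rank percentile method are assumptions for demonstration; the tool computes these internally:

```python
# Sketch: derive the metrics above from (ttft_s, total_latency_s, output_tokens)
# tuples. Values are made up for illustration.
def percentile(values, p):
    """Nearest-rank percentile on a sorted copy."""
    s = sorted(values)
    idx = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[idx]

requests = [
    (0.12, 2.0, 256),
    (0.10, 1.6, 200),
    (0.30, 4.1, 512),
]

latencies = [r[1] for r in requests]
# TPOT: time spent per output token after the first token arrived
tpots = [(lat - ttft) / (toks - 1) for ttft, lat, toks in requests]
# Throughput: total generated tokens over the wall time of the batch
throughput = sum(r[2] for r in requests) / max(latencies)

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```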
Performance Test Example
ais_bench --models vllm_api_general_chat --datasets custom_performance \
--mode performance --concurrency 100
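Conceptually, a concurrency-100 stress test keeps at most 100 requests in flight at any moment. The sketch below shows that pattern with a bounded semaphore; the request coroutine is a stand-in, not part of ais_bench:

```python
# Sketch: bounded-concurrency request dispatch, the core idea behind a
# stress test with a fixed concurrency level.
import asyncio

async def fake_request(i: int) -> int:
    await asyncio.sleep(0)  # stands in for one HTTP inference call
    return i

async def run_stress(num_requests: int, concurrency: int) -> list[int]:
    sem = asyncio.Semaphore(concurrency)  # at most `concurrency` in flight

    async def bounded(i: int) -> int:
        async with sem:
            return await fake_request(i)

    return await asyncio.gather(*(bounded(i) for i in range(num_requests)))

results = asyncio.run(run_stress(num_requests=20, concurrency=5))
```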
Steady-State Performance
Used to measure the system's true optimal performance under sustained, steady-state load:
ais_bench --models vllm_api_general_chat --datasets sharegpt \
--stable-stage --duration 300
Real Traffic Simulation
ais_bench --models vllm_api_general_chat --datasets custom \
--rps-distribution rps_config.json
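Real traffic is commonly modeled as a Poisson process: request gaps are exponentially distributed around a target RPS. The exact `rps_config.json` schema is tool-specific and not reproduced here; this sketch only illustrates the arrival-time idea behind traffic simulation:

```python
# Sketch: generate Poisson-process arrival times for a target request rate.
import random

def poisson_arrivals(rps: float, duration_s: float, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rps)  # mean gap between requests = 1 / rps
        if t > duration_s:
            return times
        times.append(t)

arrivals = poisson_arrivals(rps=10.0, duration_s=5.0)
```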
Multi-task Evaluation
Multiple Models
ais_bench --models model1 model2 model3 --datasets dataset1
Multiple Datasets
ais_bench --models model1 --datasets dataset1 dataset2 dataset3
Parallel Execution
ais_bench --models model1 model2 --datasets dataset1 dataset2 --parallel 4
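A multi-model, multi-dataset run expands into the cross product of tasks, which `--parallel` then executes N at a time. A minimal sketch of that scheduling idea, with placeholder task names and a stub in place of a real evaluation run:

```python
# Sketch: expand the model x dataset matrix and run up to 4 tasks at once.
from itertools import product
from concurrent.futures import ThreadPoolExecutor

models = ["model1", "model2"]
datasets = ["dataset1", "dataset2"]
tasks = list(product(models, datasets))  # 2 x 2 = 4 tasks

def run_task(task):  # stand-in for one evaluation run
    model, dataset = task
    return f"{model}/{dataset}"

with ThreadPoolExecutor(max_workers=4) as pool:  # like --parallel 4
    finished = list(pool.map(run_task, tasks))
```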
Custom Datasets
Performance Custom Dataset
Create a JSONL file with custom requests:
{"input": "Your prompt here", "max_output_length": 512}
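A dataset in this format can be generated with the standard `json` module, one object per line. The prompts below are placeholders:

```python
# Sketch: build a performance custom dataset in the JSONL format shown above.
import json

requests = [
    {"input": "Summarize the plot of Hamlet.", "max_output_length": 512},
    {"input": "Translate 'hello' into French.", "max_output_length": 64},
]

lines = [json.dumps(r, ensure_ascii=False) for r in requests]
jsonl = "\n".join(lines) + "\n"
# In practice, write `jsonl` to a file such as custom_perf.jsonl
```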
Accuracy Custom Dataset
Refer to Custom Dataset Guide
Output Structure
outputs/default/20250628_151326/
├── configs/       # Combined configuration
├── logs/          # Execution logs
│   ├── eval/      # Evaluation logs
│   └── infer/     # Inference logs
├── predictions/   # Raw inference results
├── results/       # Calculated scores
└── summary/       # Final summaries
    ├── summary_*.csv
    ├── summary_*.md
    └── summary_*.txt
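Given this layout, the summary files of the most recent run can be located with standard `pathlib` globbing. A sketch, demonstrated against a miniature stand-in tree rather than a real run:

```python
# Sketch: find summary_*.csv files in the newest run directory under
# outputs/default/, mirroring the directory layout shown above.
import pathlib
import tempfile

def latest_summaries(outputs_root: pathlib.Path) -> list[pathlib.Path]:
    runs = sorted(p for p in outputs_root.iterdir() if p.is_dir())
    if not runs:
        return []
    return sorted(runs[-1].glob("summary/summary_*.csv"))

# Build a miniature outputs tree to demonstrate.
root = pathlib.Path(tempfile.mkdtemp()) / "outputs" / "default"
run = root / "20250628_151326"
(run / "summary").mkdir(parents=True)
(run / "summary" / "summary_20250628.csv").write_text("dataset,accuracy\n")

found = latest_summaries(root)
```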
Task Management Interface
During execution, a real-time task management interface displays:
- Task name and progress
- Time cost and status
- Log path
- Extended parameters
Controls:
- `P` key: Pause/Resume screen refresh
- `Ctrl+C`: Exit
Common CLI Options
| Option | Description |
|---|---|
| `--models` | Model task name(s) |
| `--datasets` | Dataset task name(s) |
| `--summarizer` | Result summarizer |
| `--search` | List config file paths |
| `--debug` | Print detailed logs |
| `--mode` | Evaluation mode (accuracy/performance) |
| `--parallel` | Number of parallel tasks |
| `--resume` | Resume from breakpoint |
| `--failed-only` | Re-run failed cases only |
Advanced Features
Breakpoint Resume
ais_bench --models model1 --datasets dataset1 --resume outputs/default/20250628_151326
Failed Case Re-run
ais_bench --models model1 --datasets dataset1 --failed-only --resume outputs/default/20250628_151326
Multi-file Dataset Merge
For datasets like MMLU with multiple files:
ais_bench --models model1 --datasets mmlu_merged
Repeated Inference for pass@k
ais_bench --models model1 --datasets dataset1 --repeat-n 5
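With n samples per prompt, pass@k is typically estimated with the standard unbiased combinatorial formula pass@k = 1 - C(n-c, k)/C(n, k), where c is the number of correct samples. An illustrative implementation (not ais_bench's internal code):

```python
# Sketch: unbiased pass@k estimator for one prompt, given n samples of
# which c were correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 samples (--repeat-n 5), 2 correct, k = 1:
score = pass_at_k(n=5, c=2, k=1)
```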
Troubleshooting
Installation Issues
- Python version mismatch: Use Python 3.10/3.11/3.12
- Dependency conflicts: Use conda environment
- bfcl_eval pathlib issue: Install with the `--no-deps` flag
Runtime Issues
- Model connection failed: Check `host_ip`, `host_port`, and service status
- Dataset not found: Download the dataset to `ais_bench/datasets/`
- Memory issues: Reduce `batch_size` or use a smaller dataset
Helper Scripts
Quick utility scripts for common operations:
| Script | Description |
|---|---|
| scripts/check_env.sh | Verify environment setup |
| scripts/run_accuracy_test.sh | Quick accuracy test runner |
| scripts/run_performance_test.sh | Quick performance test runner |
| scripts/parse_results.py | Parse and summarize results |
# Check environment
bash scripts/check_env.sh
# Quick accuracy test
bash scripts/run_accuracy_test.sh vllm_api_general_chat demo_gsm8k --host-port 8080
# Quick performance test
bash scripts/run_performance_test.sh vllm_api_general_chat sharegpt --concurrency 100
# Parse results
python scripts/parse_results.py outputs/default/20250628_151326
References
Detailed documentation for specific use cases:
- Model Configuration Reference: All model types (vLLM, MindIE, Triton, TGI, offline) with parameter explanations
- CLI Reference: Complete CLI options for accuracy and performance evaluation
Templates
Ready-to-use templates for custom evaluation:
| Template | Description |
|---|---|
| assets/model_config_template.py | Model configuration template |
| assets/custom_qa_template.jsonl | QA dataset template |
| assets/custom_mcq_template.csv | Multiple choice dataset template |
| assets/custom_meta_template.json | Dataset metadata template |