hccl-test
npx skills add https://github.com/ascend-ai-coding/awesome-ascend-skills --skill hccl-test
HCCL Performance Test
The HCCL performance test tool verifies the functional correctness and measures the performance of HCCL (Huawei Collective Communication Library) collective communication.
Overview
- Use case: collective communication performance testing for distributed training
- Source location: ${INSTALL_DIR}/tools/hccl_test
- Supported versions: CANN 8.3.RC1, CANN 8.5, CANN 25.RC
Supported product models
| Product series | Max ranks | Notes |
|---|---|---|
| Atlas training series | 4096 | – |
| Atlas A2 training series | 32K | – |
| Atlas A3 training series / Atlas A3 inference series | 32K | AlltoAll/AlltoAllV: max 8K |
| Atlas 300I Duo inference card | – | – |
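The rank limits in the table above can be checked before launching a large job. The following is a minimal sketch, not part of the tool: the series keys and the function name are hypothetical, and "32K" and "8K" are assumed to mean 32 × 1024 and 8 × 1024, which the table does not spell out.

```python
# Limits copied from the table above. ASSUMPTION: "32K" = 32 * 1024 and "8K" = 8 * 1024.
MAX_RANKS = {
    "Atlas training": 4096,
    "Atlas A2 training": 32 * 1024,
    "Atlas A3": 32 * 1024,
}
A3_ALLTOALL_MAX = 8 * 1024  # AlltoAll/AlltoAllV limit on Atlas A3


def rank_count_ok(series, nodes, npus_per_node, operator="AllReduce"):
    """Return True if nodes * npus_per_node stays within the documented limit."""
    total = nodes * npus_per_node
    limit = MAX_RANKS[series]
    if series == "Atlas A3" and operator in ("AlltoAll", "AlltoAllV"):
        limit = min(limit, A3_ALLTOALL_MAX)
    return total <= limit


print(rank_count_ok("Atlas A2 training", nodes=2, npus_per_node=8))
```

The Atlas 300I Duo row is left out because the table gives no limit for it.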
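Core operators (recommended tests)
Most commonly used in distributed training:
| Operator | Executable | Communication pattern | Use case | Priority |
|---|---|---|---|---|
| AllReduce | all_reduce_test | Many-to-many | Gradient aggregation, parameter synchronization | ⭐⭐⭐ Required |
| AllGather | all_gather_test | Many-to-many | Data aggregation, parameter collection | ⭐⭐⭐ Required |
| Broadcast | broadcast_test | One-to-many | Configuration distribution, initialization | ⭐⭐ Optional |
| AlltoAll | alltoall_test | Many-to-many | Data rearrangement, load balancing | ⭐⭐ Optional |
Tip: see references/parameters.md for the full operator list.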
Quick Reference
# 1. Pre-test checks (required for multi-node tests)
./scripts/pre-test-check.sh 175.99.1.2 175.99.1.3
# 2. Build the tool
cd ${INSTALL_DIR}/tools/hccl_test
make MPI_HOME=/usr/local/mpich ASCEND_DIR=${INSTALL_DIR}
# 3. Quick connectivity test (single node)
mpirun -n 8 ./bin/all_reduce_test -p 8 -b 8K -e 64M -f 2 -d fp32 -o sum
# 4. Full performance test (multi-node, recommended)
mpirun -f hostfile -n 16 ./bin/all_reduce_test -p 8 -b 8K -e 1G -f 2 -d fp32 -o sum
1. Pre-test Checklist (required for multi-node tests)
⚠️ Important: complete all of the following checks before a multi-node test; otherwise link setup may time out or the test may fail.
1.1 Passwordless SSH login
Passwordless SSH login must be configured between all nodes:
# 1. Generate an SSH key pair (skip if one already exists)
ssh-keygen -t rsa
# 2. Copy the public key to every node (including the local node)
ssh-copy-id -i ~/.ssh/id_rsa.pub root@<node1_ip>
ssh-copy-id -i ~/.ssh/id_rsa.pub root@<node2_ip>
# 3. Verify passwordless login
ssh root@<node1_ip> "echo 'SSH OK'"
ssh root@<node2_ip> "echo 'SSH OK'"
1.2 CANN version consistency check
All nodes must run the same CANN version; a mismatch causes the test to fail:
# Check the CANN version on every node
for node in 175.99.1.2 175.99.1.3; do
  echo "=== $node ==="
  ssh "root@$node" "grep runtime_running_version /usr/local/Ascend/ascend-toolkit/latest/version.cfg"
done
Note: if the versions differ, align all nodes on a single version (the latest RC release is recommended).
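Once the version line has been collected from each node, spotting a mismatch is a matter of grouping nodes by the string they report. The sketch below assumes the strings have already been gathered (for example from the loop above); the function name and the sample version strings are hypothetical and only illustrate the comparison.

```python
from collections import defaultdict


def group_nodes_by_version(versions):
    """Map each reported version string to the list of nodes reporting it."""
    groups = defaultdict(list)
    for node, version in versions.items():
        groups[version.strip()].append(node)
    return dict(groups)


# Hypothetical values for illustration; real strings come from version.cfg on each node.
reported = {"175.99.1.2": "8.3.RC1", "175.99.1.3": "8.5"}
groups = group_nodes_by_version(reported)
if len(groups) > 1:
    print("CANN version mismatch:", groups)
else:
    print("All nodes report the same CANN version")
```

More than one group means at least one node needs to be reinstalled or upgraded before testing.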
1.3 NPU health check
Confirm that every NPU is healthy before testing:
# Check NPU health on every node (-i 0 queries NPU 0; repeat for the other card IDs)
for node in 175.99.1.2 175.99.1.3; do
  echo "=== $node NPU Health ==="
  ssh "root@$node" "npu-smi info -t health -i 0"
done
NPU status reference:
| Status | Meaning | Recommended action |
|---|---|---|
| OK | Normal | ✅ Ready to use |
| Alarm | Alarm | ⚠️ Troubleshoot the fault |
| Offline | Offline | ❌ Do not use |
If any card is in the Alarm state, exclude the faulty card. For example, if NPU 0 is faulty, test with the remaining 7 cards.
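The "exclude the faulty card" rule can be expressed as a simple filter over per-card status. This is only a sketch: the status dictionary is filled in by hand here, the function name is hypothetical, and how the remaining card IDs are passed to the test tool is not covered.

```python
def usable_cards(health):
    """Return the sorted IDs of cards whose status is exactly "OK"."""
    return sorted(card for card, status in health.items() if status == "OK")


# Hypothetical 8-card node where NPU 0 reports Alarm.
health = {card: "OK" for card in range(8)}
health[0] = "Alarm"

cards = usable_cards(health)
print(f"{len(cards)} usable cards: {cards}")
```

Both Alarm and Offline cards are dropped, which matches the status table above.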
1.4 One-step pre-test check script
# Use the provided check script (recommended)
./scripts/pre-test-check.sh 175.99.1.2 175.99.1.3
2. MPI Installation
The HCCL performance test tool relies on MPI to launch multiple processes; MPICH is the default.
2.1 MPICH Installation (Recommended)
Download: https://www.mpich.org/static/downloads/
| Product series | Recommended version |
|---|---|
| Atlas A3 training series / Atlas A3 inference series | MPICH 4.1.3 |
| Atlas A2 training series | MPICH 3.2.1 |
| Atlas training series | MPICH 3.2.1 |
| Atlas 300I Duo inference card | MPICH 3.2.1 |
Installation steps:
# 1. Extract the archive
tar -zxvf mpich-${version}.tar.gz
cd mpich-${version}
# 2. Configure the build
# Atlas A3 products (must use the TCP protocol)
./configure --disable-fortran --prefix=/usr/local/mpich --with-device=ch3:nemesis
# Other products
./configure --disable-fortran --prefix=/usr/local/mpich
# 3. Build and install
make -j 32 && make install
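The version table and the two configure variants above can be folded into one lookup, which is handy when the same install script serves several product lines. The sketch below is illustrative only: the shortened series keys and the function name are hypothetical, while the versions and flags are copied from this section.

```python
# Recommended MPICH versions, copied from the table above (keys shortened for illustration).
RECOMMENDED_MPICH = {
    "Atlas A3": "4.1.3",
    "Atlas A2": "3.2.1",
    "Atlas training": "3.2.1",
    "Atlas 300I Duo": "3.2.1",
}


def configure_args(series, prefix="/usr/local/mpich"):
    """Build the ./configure argument list for the given product series."""
    args = ["./configure", "--disable-fortran", f"--prefix={prefix}"]
    if series == "Atlas A3":
        # Atlas A3 needs the extra device option shown in step 2.
        args.append("--with-device=ch3:nemesis")
    return args


print("MPICH", RECOMMENDED_MPICH["Atlas A3"])
print(" ".join(configure_args("Atlas A3")))
```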
2.2 Open MPI Installation (Alternative)
Use Open MPI when IPv6 support is required.
tar -zxvf openmpi-4.1.5.tar.gz
cd openmpi-4.1.5
./configure --disable-fortran --enable-ipv6 --prefix=/usr/local/openmpi
make -j 32 && make install
2.3 Environment setup
# MPICH environment
export INSTALL_DIR=/usr/local/Ascend/ascend-toolkit/latest
export PATH=/usr/local/mpich/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/mpich/lib:${INSTALL_DIR}/lib64:$LD_LIBRARY_PATH
3. Tool Compilation
cd ${INSTALL_DIR}/tools/hccl_test
# MPICH
make MPI_HOME=/usr/local/mpich ASCEND_DIR=${INSTALL_DIR}
# Open MPI
make MPI_HOME=/usr/local/openmpi ASCEND_DIR=${INSTALL_DIR}
After a successful build, 10 executables are generated in the bin directory.
4. Testing Scenarios
4.1 Quick connectivity test
Verifies basic HCCL connectivity; the data sizes are small, so it runs quickly:
# Single-node quick test
mpirun -n 8 ./bin/all_reduce_test -p 8 -b 8K -e 64M -f 2 -d fp32 -o sum
# Multi-node quick test
mpirun -f hostfile -n 16 ./bin/all_reduce_test -p 8 -b 8K -e 64M -f 2 -d fp32 -o sum
4.2 Full performance test (recommended)
Measures performance on high-bandwidth networks; data sizes go up to 1 GB, which better reflects real training workloads:
# Single-node full performance test
mpirun -n 8 ./bin/all_reduce_test -p 8 -b 8K -e 1G -f 2 -d fp32 -o sum
# Multi-node full performance test (recommended)
mpirun -f hostfile -n 16 ./bin/all_reduce_test -p 8 -b 8K -e 1G -f 2 -d fp32 -o sum
Parameters:
- -b 8K: starting data size, 8 KB
- -e 1G: ending data size, 1 GB (64M only covers small data sizes)
- -f 2: multiplication factor; tests 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1M, 2M, 4M, 8M, 16M, 32M, 64M, 128M, 256M, 512M, 1G
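To see how -b, -e and -f combine, the sweep can be reproduced in a few lines. This is a sketch under two assumptions: K/M/G are 1024-based (consistent with 8K appearing as 8192 in the sample output), and the sweep stops at the last size that does not exceed the end value. The function names are hypothetical.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}  # ASSUMPTION: 1024-based suffixes


def parse_size(text):
    """Convert a size such as "8K" or "1G" to bytes."""
    text = text.strip().upper()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def sweep(begin, end, factor):
    """List the data sizes visited when multiplying from begin up to end."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    size, stop = parse_size(begin), parse_size(end)
    sizes = []
    while size <= stop:
        sizes.append(size)
        size *= factor
    return sizes


sizes = sweep("8K", "1G", 2)
print(len(sizes), sizes[0], sizes[-1])
```

With -e 64M the same sweep ends after 14 sizes, which is why the quick test finishes sooner but says little about large-message bandwidth.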
4.3 Core operator tests
# Test the core operators with quick-verify.sh (recommended)
./scripts/quick-verify.sh 8
# Test the core operators manually
./bin/all_reduce_test -p 8 -b 8K -e 1G -f 2 -d fp32 -o sum
./bin/all_gather_test -p 8 -b 8K -e 1G -f 2 -d fp32
4.4 Hostfile configuration
MPICH format (node IP:number of cards):
# Single-node test
175.99.1.3:8
# Two-node test
175.99.1.3:8
175.99.1.4:8
List AI Servers that belong to the same supernode together; interleaved entries are not supported.
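For larger clusters the hostfile is easier to generate than to type. The sketch below writes the MPICH "IP:cards" format described above and derives the matching value for mpirun -n; the function names are hypothetical, and the caller is expected to pass nodes already ordered so that servers of the same supernode are adjacent.

```python
def mpich_hostfile(nodes, npus_per_node):
    """Render one "ip:cards" line per node, in the order given."""
    return "".join(f"{ip}:{npus_per_node}\n" for ip in nodes)


def total_ranks(nodes, npus_per_node):
    """Value to pass to mpirun -n: one rank per NPU."""
    return len(nodes) * npus_per_node


nodes = ["175.99.1.3", "175.99.1.4"]
print(mpich_hostfile(nodes, 8), end="")
print("mpirun -n", total_ranks(nodes, 8))
```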
5. Parameters
5.1 Core parameter quick reference
| Parameter | Description | Example |
|---|---|---|
| -p <npus> | Number of NPUs per node taking part in training | -p 8 |
| -b <size> | Starting data size | -b 8K |
| -e <size> | Ending data size | -e 1G |
| -f <factor> | Multiplication factor | -f 2 |
| -d <type> | Data type: fp32/fp16/int32 | -d fp32 |
| -o <op> | Operation type: sum/prod/max/min | -o sum |
| -n <iters> | Number of iterations (default 20) | -n 20 |
| -c <0/1> | Enable result checking (default 1) | -c 1 |
See references/parameters.md for detailed parameter descriptions.
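When the same test is launched repeatedly with different sizes or operators, assembling the command from the parameters above avoids copy-paste mistakes. The following sketch builds the argument list only and does not run anything; the function name and its defaults are hypothetical, and -o is added only when an operation is given because the all_gather_test example in this document omits it.

```python
import shlex


def build_command(binary, total_ranks, npus_per_node, begin="8K", end="1G",
                  factor=2, dtype="fp32", op=None, hostfile=None):
    """Assemble an mpirun command line as a list of arguments."""
    cmd = ["mpirun"]
    if hostfile:
        cmd += ["-f", hostfile]
    cmd += ["-n", str(total_ranks), f"./bin/{binary}",
            "-p", str(npus_per_node), "-b", begin, "-e", end,
            "-f", str(factor), "-d", dtype]
    if op:
        cmd += ["-o", op]
    return cmd


cmd = build_command("all_reduce_test", 16, 8, op="sum", hostfile="hostfile")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run as-is, which sidesteps shell quoting.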
5.2 Data size configuration examples
# Fixed data size
-b 100M -e 100M
# Multiplication factor (tests 100M, 200M, 400M)
-b 100M -e 400M -f 2
# Full performance test (recommended)
-b 8K -e 1G -f 2
6. Results
6.1 Output format
data_size avg_time(us) alg_bandwidth(GB/s) check_result
8192 125.3 0.065 success
16384 132.1 0.124 success
...
| Field | Description |
|---|---|
| data_size | Data size taking part in the collective on a single NPU (bytes) |
| avg_time | Execution time of the collective operator (us) |
| alg_bandwidth | Algorithm bandwidth (GB/s), computed as collective data size / elapsed time |
| check_result | Result check status: success/failed/NULL |
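The alg_bandwidth formula can be checked against the sample rows in 6.1. The sketch below assumes 1 GB = 10^9 bytes, chosen because it reproduces the sample values; the tool's actual unit convention is not stated here, and the function name is hypothetical.

```python
def alg_bandwidth_gbps(data_size_bytes, avg_time_us):
    """Algorithm bandwidth = data size / elapsed time.

    Bytes per microsecond equals 10**6 bytes per second, so dividing by
    1000 gives GB/s under the ASSUMPTION that 1 GB = 10**9 bytes.
    """
    return data_size_bytes / avg_time_us / 1000.0


for size, time_us in [(8192, 125.3), (16384, 132.1)]:
    print(size, round(alg_bandwidth_gbps(size, time_us), 3))
```

Rounded to three decimals, the two rows give 0.065 and 0.124, matching the sample output.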
6.2 Result parsing
# Parse a result file
./scripts/parse-hccl-result.py output.log
# Output a Markdown table
./scripts/parse-hccl-result.py output.log -f markdown
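The internals of parse-hccl-result.py are not shown here. As a rough idea of what such parsing involves, the sketch below extracts rows from the four-column layout in 6.1; it assumes whitespace-separated columns and skips any line that does not start with an integer data size. The function name and dictionary keys are hypothetical.

```python
def parse_rows(text):
    """Extract result rows from hccl_test output in the 6.1 layout."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4 or not parts[0].isdigit():
            continue  # header, separators, or unrelated log lines
        try:
            rows.append({
                "data_size": int(parts[0]),
                "avg_time_us": float(parts[1]),
                "alg_bandwidth_gbps": float(parts[2]),
                "check_result": parts[3],
            })
        except ValueError:
            continue  # numeric columns did not parse; ignore the line
    return rows


sample = """data_size avg_time(us) alg_bandwidth(GB/s) check_result
8192 125.3 0.065 success
16384 132.1 0.124 success
...
"""
rows = parse_rows(sample)
peak = max(rows, key=lambda r: r["alg_bandwidth_gbps"])
print(len(rows), "rows; peak", peak["alg_bandwidth_gbps"], "GB/s at", peak["data_size"], "bytes")
```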
7. Actual Test Results
7.1 Two-node 16×910B3 test data
We tested all 10 operators in a two-node 16×910B3 environment:
| Operator | Result | Notes |
|---|---|---|
| AllReduce | ✅ PASS | Peak bandwidth 48.59 GB/s (32MB) |
| AllGather | ✅ PASS | – |
| AllGatherV | ❌ FAIL | retcode: 5 (variable-length parameter issue) |
| AlltoAll | ✅ PASS | – |
| AlltoAllV | ✅ PASS | – |
| Broadcast | ✅ PASS | – |
| Reduce | ✅ PASS | – |
| ReduceScatter | ✅ PASS | – |
| ReduceScatterV | ❌ FAIL | retcode: 5 (variable-length parameter issue) |
| Scatter | ✅ PASS | – |
Pass rate: 8/10 (80%); all core operators passed.
7.2 Test environment reference
- Test nodes: 175.100.2.3, 175.100.2.4
- NPU: 910B3 × 8 per node (16 cards total)
- CANN: 25.3.rc1
- Communication NIC: enp189s0f0
- MPI: MPICH 3.2.1
8. Common Issues
8.1 Prerequisite issues
| Issue | Cause | Solution |
|---|---|---|
| Passwordless SSH login fails | SSH keys not configured | Run ssh-copy-id to set up passwordless login |
| CANN version mismatch | Nodes run different CANN versions | Align the CANN version on all nodes |
| NPU in Alarm state | NPU hardware fault | Check npu-smi info -t health and exclude the faulty card |
8.2 Runtime issues
| Issue | Cause | Solution |
|---|---|---|
| gethostbyname failed | Hostname cannot be resolved | Configure /etc/hosts |
| retcode: 7 | Interference from leftover processes | Run the cleanup command: mpirun -f hostfile -n 16 pkill -9 -f "all_reduce_test" |
| retcode: 5 | Variable-length parameters misconfigured | AllGatherV/ReduceScatterV require special parameter configuration |
8.3 Other notes
- Docker containers: when running the test in a Docker container, use host network mode
- Logs: when a test fails, check the latest logs under ~/ascend/log/debug/plog
- Process cleanup: before testing, check whether other processes occupy the cards and clean them up manually if needed
See references/common-issues.md for detailed troubleshooting.
9. Scripts
9.1 Pre-test check script
# Pre-test checks for multi-node tests
./scripts/pre-test-check.sh 175.99.1.2 175.99.1.3
Checks performed:
- Passwordless SSH login
- CANN version consistency
- NPU health
- Network connectivity
9.2 Quick verification script
# Test the core operators
./scripts/quick-verify.sh 8
# Full performance test
./scripts/quick-verify.sh 8 full
9.3 Multi-node test script
# One-step multi-node test (automatic pre-test checks + test)
./scripts/multi-node-test.sh --nodes 175.99.1.2,175.99.1.3 --npus 8 --mode full
Official References
- CANN 8.3.RC1: https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/83RC1/devaids/hccltool/HCCLpertest_16_0001.html
- CANN 8.5: https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/850/devaids/hccltool/HCCLpertest_16_0001.html
- HCCL FAQ: https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/83RC1/devaids/hccltool/HCCLpertest_16_0008.html