V0.6.0 Test Plan

Open yukirora opened this issue 3 years ago • 0 comments

Test Cases

Single-node Test

Machine Type	#Node * #GPU * GPU Type	PyTorch Version	Accelerated Computing Toolkit	Status
ND A100 v4	1 * 8 * A100 40GB SXM	PyTorch 1.8	CUDA 11.1	Not started
NDm A100 v4	1 * 8 * A100 80GB SXM	PyTorch 1.8	CUDA 11.1	Not started
Hayabusa	1 * 16 * MI200	PyTorch 1.9	ROCm 5.1	Not started

Single-node Micro-benchmark Test

[x] ib-loopback

  [x] Fix issues in IB loopback benchmark (#369)

  [x] Fix stability issue in IB loopback benchmark (#386)

  [x] Fix port conflict in IB loopback benchmark (#375)

[x] rccl-tests/nccl-tests

  [x] Update Dockerfile for NCCL/RCCL version, tag name, and verbose output (#371)

  [x] Support node_num=1 in MPI mode (#372)

SuperBench Improvement

[x] Support running on host directly without Docker (#356, #358, #362)

[x] Support automatic selection of the configuration YAML on Azure VMs

[x] Add return code for timeout (#383, #385)

[ ] Support ROCm 5.1.1 (#353, #354) and ROCm 5.1.3 (#361)
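
For anyone verifying the timeout return code item above, the general pattern can be sketched as follows. This is a minimal illustration, not SuperBench's implementation: the function name `run_with_timeout` and the value 124 (borrowed from the GNU `timeout` convention) are assumptions made for the example.

```python
import subprocess
import sys

# Assumption: 124 mirrors GNU `timeout`; the value SuperBench uses may differ.
TIMEOUT_RETURN_CODE = 124


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or TIMEOUT_RETURN_CODE on timeout."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return TIMEOUT_RETURN_CODE
    return completed.returncode


if __name__ == "__main__":
    # A child process that sleeps longer than the allowed time.
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(sleeper, timeout_s=0.5))
```

A dedicated code lets the caller tell a benchmark that timed out apart from one that failed for another reason.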

Tools

[x] Data diagnosis

  [x] Fix bugs in data diagnosis (#355)

  [x] Add failure check function in data_diagnosis.py (#378)

  [x] Support JSON and JSONL in data diagnosis (#388)

  [x] Add support for storing metric values in data diagnosis (#392)
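
When testing the JSON and JSONL item above, it may help to recall how the two formats differ: a JSON file holds one document, while a JSONL file holds one JSON object per line. The sketch below shows a generic loader for both. The function name `load_results` and the choice to dispatch on the file extension are assumptions for illustration, not the behavior of data_diagnosis.py.

```python
import json
from pathlib import Path


def load_results(path):
    """Return a list of records from a .json or .jsonl file (hypothetical helper)."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if path.suffix == ".json":
        # A single document; wrap a lone object so callers always get a list.
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported file type: {path.suffix}")
```

A useful check is that the same records stored in either format load to identical lists.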

Multi-node Test

Test Table

Machine Type	#Node * #GPU * GPU Type	PyTorch Version	Accelerated Computing Toolkit	Status
ND A100 v4	16 * 8 * A100 40GB SXM	PyTorch 1.8	CUDA 11.1	Not started

Distributed Micro-benchmark Test

[ ] ib-traffic

  [x] Support multiple IB/GPU pair-wise IB benchmark (#363)

  [x] Fix bugs in IB benchmark in all-pair mode (#370, #377)

  [ ] Topology-aware IB benchmark (#373, #381)
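
To give a sense of the coverage expected from an all-pair run, the sketch below enumerates node pairs in rounds so that no node appears twice in the same round. It uses the standard round-robin (circle) method, which is an assumption chosen for illustration and not necessarily how the IB benchmark schedules its pairs. The function name `all_pair_rounds` is hypothetical.

```python
def all_pair_rounds(num_nodes):
    """Schedule every unordered node pair once, in rounds of disjoint pairs."""
    if num_nodes < 2:
        return []
    nodes = list(range(num_nodes))
    if len(nodes) % 2:
        nodes.append(None)  # placeholder: its partner sits out that round
    n = len(nodes)
    rounds = []
    for _ in range(n - 1):
        # Pair the i-th node from the front with the i-th from the back.
        pairs = [(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        # Keep the first node fixed and rotate the rest by one position.
        nodes = [nodes[0], nodes[-1]] + nodes[1:-1]
    return rounds


if __name__ == "__main__":
    for index, pairs in enumerate(all_pair_rounds(4)):
        print(index, pairs)
```

For the 16-node configuration in the test table, this scheme gives 15 rounds of 8 pairs each, covering all 120 node pairs.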

Aug 22 '22 06:08 yukirora