V0.6.0 Test Plan
Test Cases
Single-node Test
| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
|---|---|---|---|---|
| ND A100 v4 | 1 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | Not started |
| NDm A100 v4 | 1 * 8 * A100 80GB SXM | PyTorch 1.8 | CUDA 11.1 | Not started |
| Hayabusa | 1 * 16 * MI200 | PyTorch 1.9 | ROCm 5.1 | Not started |
Single-node Micro-benchmark Test
- [x] ib-loopback
  - [x] Fix issues in IB loopback benchmark (#369)
  - [x] Fix stability issue in IB loopback benchmark (#386)
  - [x] Fix port conflict in IB loopback benchmark (#375)
- [x] rccl-tests/nccl-tests
  - [x] Update Dockerfile for NCCL/RCCL version, tag name, and verbose output (#371)
  - [x] Support node_num=1 in MPI mode (#372)
SuperBench Improvement
- [x] Support running directly on the host without Docker (#356, #358, #362)
- [x] Support automatic selection of the YAML configuration on Azure VMs
- [x] Add return code for timeout (#383, #385)
- [ ] Support ROCm 5.1.1 (#353, #354), Support ROCm 5.1.3 (#361)
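
As an illustration of the timeout return code item above, here is a minimal sketch of how a runner could report a timeout as its own return code, separate from the command's exit status. The function name `run_with_timeout` and the value `124` are assumptions made for this example and are not taken from SuperBench.

```python
import subprocess
import sys

# Hypothetical value chosen for this sketch; not SuperBench's actual return code.
TIMEOUT_RETURN_CODE = 124


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or TIMEOUT_RETURN_CODE on timeout."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return TIMEOUT_RETURN_CODE
    return completed.returncode


if __name__ == "__main__":
    # A command that sleeps longer than the allowed time is reported as a timeout.
    slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow_cmd, timeout_s=0.5))
```

Giving the timeout a distinct code lets a test harness tell a hung benchmark apart from one that ran and failed.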
Tools
- [x] data diagnosis
  - [x] Fix bugs in data diagnosis (#355)
  - [x] Add failure check function in data_diagnosis.py (#378)
  - [x] Support JSON and JSONL in data diagnosis (#388)
  - [x] Add support for storing metric values in data diagnosis (#392)
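
For readers unfamiliar with the two formats named in the JSON/JSONL item above, the following sketch shows one way to read either into a list of records. It is not the code in data_diagnosis.py. The helper name `load_records` and the dispatch on file extension are assumptions made for illustration.

```python
import json
from pathlib import Path


def load_records(path):
    """Load a list of records from a .json or .jsonl file (hypothetical helper)."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # JSON Lines: one JSON value per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if path.suffix == ".json":
        data = json.loads(text)
        # Accept either a single object or an array of objects.
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported file extension: {path.suffix}")
```

A JSONL file can be appended to one node at a time, while a JSON file has to be parsed as a whole. Supporting both is a common convenience for per-node result data.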
Multi-node Test
Test Table
| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
|---|---|---|---|---|
| ND A100 v4 | 16 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | Not started |
Distributed Micro-benchmark Test
- [ ] ib-traffic
  - [x] Support multiple IB/GPU pair-wise IB benchmark (#363)
  - [x] Fix bugs in IB benchmark in all-pair mode (#370, #377)
  - [ ] Topology-aware IB benchmark (#373, #381)