V0.6.0 Test Plan

Open yukirora opened this issue 3 years ago • 0 comments

Test Cases

Single-node Test

| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
|---|---|---|---|---|
| ND A100 v4 | 1 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | Not started |
| NDm A100 v4 | 1 * 8 * A100 80GB SXM | PyTorch 1.8 | CUDA 11.1 | Not started |
| Hayabusa | 1 * 16 * MI200 | PyTorch 1.9 | ROCm 5.1 | Not started |

Single-node Micro-benchmark Test

  1. [x] ib-loopback
     • [x] Fix issues in ib loopback benchmark (#369)
     • [x] Fix stability issue in ib loopback benchmark (#386)
     • [x] Fix port conflict in ib loopback (#375)
  2. [x] rccl-test/nccl-test
     • [x] Update Dockerfile for NCCL/RCCL version, tag name, and verbose output (#371)
     • [x] Support node_num=1 in mpi mode (#372)

SuperBench Improvement

  • [x] Support running on host directly without Docker (#356, #358, #362)
  • [x] Support automatic configuration YAML selection on Azure VM
  • [x] Add return code for timeout (#383, #385)
  • [ ] Support ROCm 5.1.1 (#353, #354) and ROCm 5.1.3 (#361)

Tools

  1. [x] Data diagnosis
     • [x] Fix bugs in data diagnosis (#355)
     • [x] Add failure check function in data_diagnosis.py (#378)
     • [x] Support JSON and JSONL in data diagnosis (#388)
     • [x] Add support to store values of metrics in data diagnosis (#392)
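
For anyone testing the JSON/JSONL item above, the difference between the two formats can be sketched as follows. This is only an illustration and not the code from #388: the function name `load_records` and the field names in the usage note are hypothetical, and the sketch assumes the format is chosen by file extension.

```python
import json
from pathlib import Path


def load_records(path):
    """Load records from a .json or .jsonl file into a list of dicts.

    Hypothetical helper for illustration; not part of SuperBench.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # JSON Lines: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if path.suffix == ".json":
        data = json.loads(text)
        # Accept either a single object or a list of objects.
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported file extension: {path.suffix!r}")
```

With this sketch, a `.jsonl` file holding two lines such as `{"node": "n0"}` and `{"node": "n1"}` and a `.json` file holding the list `[{"node": "n0"}, {"node": "n1"}]` both load to the same two-element list.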

Multiple-node Test

Test Table

| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
|---|---|---|---|---|
| ND A100 v4 | 16 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | Not started |

Distributed Micro-benchmark Test

  1. [ ] ib-traffic
     • [x] Support multiple IB/GPU in pair-wise IB benchmark (#363)
     • [x] Fix bugs in IB benchmark in all-pair mode (#370, #377)
     • [ ] Topology-aware IB benchmark (#373, #381)

yukirora · Aug 22 '22 06:08