superbenchmark
V0.13.0 Release Plan
Release Manager
@cp5555
Endgame
- [ ] Code freeze: Oct 2025
- [ ] Bug Bash date: TBD
- [ ] Release date: TBD
Main Features
SuperBench Improvement
- [x] Add cuda13.0.dockerfile support (#739)
- [x] Add nsys and PyTorch profiler debug trace support (#744)
Micro-benchmark Improvement
- [ ] Collect per-snapshot per-GPU flops/temp in gpu burn (#735)
- [x] Add simultaneous all-to-host / host-to-all bandwidth test cases to nvbandwidth (#736)
- [x] Add ncu profile support in cublaslt-gemm (#740)
- [x] Support verification and parallel run for disk performance benchmark (#741)
- [x] Add NUMA support for nvbandwidth (#742)
- [x] Change cublasLtMatmulDescCreate scaleType from CUDA_R_32F to CUDA_R_16F in FP16 dist inference (#732)
- [ ] Support GEMM correctness check in cublaslt-gemm
- [ ] Multi-node NCCL validation enhancement
- [ ] mscclpp support
- [ ] Add new busbw metrics for NCCL/MSCCL testing with specific algorithm
- [ ] Fix NVBandwidth benchmark results parsing bug
- [ ] Support FP4 kernels for cutlass benchmark
Model Benchmark Improvement
- [x] Add option to exclude data copy time in model benchmarks (#734)
- [ ] Support state-of-the-art LLM training performance benchmarks, including DeepSeek and Qwen
- [ ] Support state-of-the-art LLM inference performance benchmarks, including DeepSeek and Qwen
- [ ] Support state-of-the-art LLM module and model correctness benchmarks
- [ ] Deterministic training support (#731)
Bug fix
- [ ] dist-inference raises cublaslt error
- [ ] Add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733)
- [ ] NVBandwidth benchmark results parsing bug (#748)
- [x] CI/CD - Fix image merge in GitHub Action (#749)
- [x] Fix pipelines - Update mlc version in dockerfiles from v3.11 to v3.12 (#752)
- [x] CI/CD - Fix python3.10 pipeline (#753)
- [x] CI/CD - Fix Azure test pipeline (#754)
Tools
- [ ] System info enhancement