[WIP] Update benchmark data
Summary
Rerun all benchmark scripts to get the latest data, so we have a reliable baseline for future optimization.
Note: ORPO fails with compile=True (plotting with old data for now), and the qwen2vl_mrope script failed.
A complete comparison figure will be uploaded to this PR later.
Fused Linear Chunked Loss
Alignment
- [x] CPO
  - speed
- [x] DPO
  - speed
- [x] KTO
  - speed
- [x] ORPO
  - speed
- [x] SimPO
  - speed
Distillation
- [x] JSD
  - speed
Others
- [x] Cross Entropy
  - speed
- [x] Fused Linear Cross Entropy
  - speed
- [x] JSD
  - speed
- [ ] Fused Linear JSD
  - speed
- [x] DyT
  - speed
- [x] Embedding
  - speed
- [x] GeGLU
  - speed
- [x] GroupNorm
  - speed
- [x] KL Div
  - speed
- [x] LayerNorm
  - speed
- [x] RMSNorm
  - speed
- [x] RoPE
  - speed
- [ ] Swiglu
  - speed
- [x] TVD
  - speed
Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
@shivam15s @lancerts @yundai424 I'm trying to refactor the benchmark visualizer and the utils for storing data, and there are a few questions I want to figure out first:
- Some of the data is quite outdated; do we need to keep data from old versions (< v0.5.0)?
- For future benchmarking, do we keep only the latest data (overwriting data from older versions), or do we want to keep track of it for performance comparison over time?
- This PR only updates H100 data for now; do we need the latest A100 benchmark data as well?
Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark results -- this way we can let git history help us keep track of the performance 😄 I'd like to hear your opinions.
> Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark results -- this way we can let git history help us keep track of the performance 😄 I'd like to hear your opinions.
Strong +1, which can also help detect performance regression early.
> @shivam15s @lancerts @yundai424 I'm trying to refactor the benchmark visualizer and the utils for storing data, and there are a few questions I want to figure out first:
> - Some of the data is quite outdated; do we need to keep data from old versions (< v0.5.0)?
> - For future benchmarking, do we keep only the latest data (overwriting data from older versions), or do we want to keep track of it for performance comparison over time?
> - This PR only updates H100 data for now; do we need the latest A100 benchmark data as well?
1. I don't think we need to keep the old data.
2. Keeping the latest data should be enough, and we can have git help us track it. We should guardrail against performance regressions for each release.
3. I think we will still need the A100 data in the near future.
@yundai424 @lancerts
> Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark results -- this way we can let git history help us keep track of the performance 😄 I'd like to hear your opinions.
Totally agree! An official benchmark result is definitely better.
> 1. I don't think we need to keep the old data.
> 2. Keeping the latest data should be enough, and we can have git help us track it. We should guardrail against performance regressions for each release.
> 3. I think we will still need the A100 data in the near future.
Besides the benchmark that goes along with new releases, I think it would be great to have an additional nightly (or weekly) benchmark, so we can detect performance regressions earlier and handle them before the version bump.
Is it possible to set up a scheduled CI job to periodically update the nightly benchmark?
If so, instead of the current all_benchmark_data, we can create two benchmark data files: one for version releases (full benchmark) and one for nightly runs (simple benchmark). The release file keeps a complete benchmark result for the latest version, as the current one does. The nightly file can hold multiple recent results (10-20 commits, or weeks/months), but only for the most representative config, e.g., batch_size, seq_len, hidden_size, and vocab_size of llama. This way, we can set the x-axis to the date and visualize it for readability. Best case, we can plot it in the online/offline docs.
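For illustration, a minimal sketch of plotting such a nightly file over time (the CSV name, column names, and config string below are hypothetical, not the existing all_benchmark_data schema):

```python
# Sketch: plot nightly speed results over time for one representative config.
# "nightly_benchmark_data.csv" and its columns (date, kernel, metric, config, value)
# are assumptions for illustration only.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("nightly_benchmark_data.csv", parse_dates=["date"])

# Keep only the speed metric for a llama-like fused linear cross entropy config.
subset = df[
    (df["kernel"] == "fused_linear_cross_entropy")
    & (df["metric"] == "speed")
    & (df["config"] == "B=8,T=1024,H=4096,V=128256")
].sort_values("date")

plt.plot(subset["date"], subset["value"], marker="o")
plt.xlabel("date")
plt.ylabel("time (ms)")
plt.title("fused_linear_cross_entropy speed (nightly)")
plt.savefig("nightly_speed.png")
```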
> Besides the benchmark that goes along with new releases, I think it would be great to have an additional nightly (or weekly) benchmark, so we can detect performance regressions earlier and handle them before the version bump.
Agree 🤔 Ideally something like https://hud.pytorch.org/benchmark/compilers, and host the results somewhere else on a server so we don't flood git history with a bunch of benchmark numbers.
@Manan17 is an intern in our team who will be working on this.
Hello, so we discussed an approach to solve this: running all the benchmarks takes less than an hour. I tried it on a single H100 GPU and it took me 55 minutes. So we thought we can run the benchmark scripts after every PR is merged into the main repo, similar to what we do for docs: the benchmark script will run whenever anything is pushed to the main branch.

Initially, the benchmark data will be stored in the gh-pages branch. Once the script is run (make run-benchmarks), the new data will be appended to the CSV file in the gh-pages branch with an extra column, github-commit-hash. So for every commit we will have benchmark data, and it will be visualized using GitHub Pages, for example at https://linkedin.github.io/Liger-Kernel/benchmarks. Using JavaScript, the data will be rendered as charts just like https://hud.pytorch.org/benchmark/compilers, with filters so users can see benchmark improvements for speed and memory.

Let me know what you guys think about this. @Tcc0403 @yundai424 @lancerts
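For reference, a minimal sketch of the append step described above (the file names, column layout, and commit-hash column name are assumptions for illustration, not the repo's actual format):

```python
# Sketch: tag freshly generated benchmark rows with the current commit hash and
# append them to the CSV tracked on the gh-pages branch.
# "new_benchmark_data.csv" and "all_benchmark_data.csv" are illustrative names.
import os
import subprocess

import pandas as pd

commit_hash = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

new_rows = pd.read_csv("new_benchmark_data.csv")
new_rows["github_commit_hash"] = commit_hash

# Append to the accumulated file; write the header only if the file doesn't exist yet.
target = "all_benchmark_data.csv"
new_rows.to_csv(target, mode="a", header=not os.path.exists(target), index=False)
```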
@Manan17 Sounds great! Let's open an issue to discuss this instead of doing it in this PR, since it's not likely to be merged anyway 😅
Update benchmark data with CI instead. See #744.
@Tcc0403 reopening to fix the benchmark tests
@shimizust It's ready for review.