[WIP] Update benchmark data
Summary
Rerun all benchmark scripts to get the latest data, so we have a reliable baseline for future optimization.
Note: ORPO fails with compile=True (plotting with old data for now), and the qwen2vl_mrope script failed.
A complete comparison figure will be uploaded to this PR later.
Fused Linear Chunked Loss
Alignment
- [x] CPO
  - speed
- [x] DPO
  - speed
- [x] KTO
  - speed
- [x] ORPO
  - speed
- [x] SimPO
  - speed
Distillation
- [x] JSD
  - speed
Others
- [x] Cross Entropy
  - speed
- [x] Fused Linear Cross Entropy
  - speed
- [x] JSD
  - speed
- [ ] Fused Linear JSD
  - speed
- [x] DyT
  - speed
- [x] Embedding
  - speed
- [x] GeGLU
  - speed
- [x] GroupNorm
  - speed
- [x] KL Div
  - speed
- [x] LayerNorm
  - speed
- [x] RMSNorm
  - speed
- [x] RoPE
  - speed
- [ ] Swiglu
  - speed
- [x] TVD
  - speed
Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
@shivam15s @lancerts @yundai424 I'm trying to refactor the benchmark visualizer and the utils for storing data, and there are a few questions I want to figure out first:
- Some of the data is quite outdated; do we need to keep data from old versions (< v0.5.0)?
- For future benchmarking, do we keep only the latest data (overwriting data from older versions), or do we want to keep track of it for performance comparison over time?
- This PR only updates H100 data for now; do we need the latest A100 benchmark data as well?
Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark results -- this way we can let git history help us keep track of the performance 😄 I'd like to hear your opinions.
> Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark results -- this way we can let git history help us keep track of the performance 😄 I'd like to hear your opinions.
Strong +1, which can also help detect performance regression early.
> @shivam15s @lancerts @yundai424 I'm trying to refactor the benchmark visualizer and the utils for storing data, and there are a few questions I want to figure out first:
> - Some of the data is quite outdated; do we need to keep data from old versions (< v0.5.0)?
> - For future benchmarking, do we keep only the latest data (overwriting data from older versions), or do we want to keep track of it for performance comparison over time?
> - This PR only updates H100 data for now; do we need the latest A100 benchmark data as well?
1. I don't think we need to keep the old data.
2. Keeping the latest data should be enough, and we can have git help us track it. We should guardrail against performance regressions for each release.
3. I think we will still need the A100 data in the near future.
@yundai424 @lancerts
> Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark results -- this way we can let git history help us keep track of the performance 😄 I'd like to hear your opinions.
Totally agree! An official benchmark result is definitely better.
> 1. I don't think we need to keep the old data.
> 2. Keeping the latest data should be enough, and we can have git help us track it. We should guardrail against performance regressions for each release.
> 3. I think we will still need the A100 data in the near future.
Besides the benchmark that goes along with new releases, I think it would be great to have an additional nightly (or weekly) benchmark, so we can detect performance regressions earlier and handle them before the version bump.
Is it possible to set up a scheduled CI job to periodically update the nightly benchmark?
If so, instead of the current all_benchmark_data, we can create two benchmark data files: one for version releases (full benchmark) and one for nightly runs (simple benchmark). The release file keeps a complete benchmark result for the latest version, as the current one does. The nightly file can hold multiple recent results (10-20 commits, or weeks/months), but only for the most representative config, e.g., batch_size, seq_len, hidden_size, and vocab_size of llama. This way, we can set the x-axis to the date and visualize it for readability. Best case, we can plot it in the online/offline docs.
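For illustration, a minimal sketch of plotting such a nightly file over time (the CSV name, column names, and config string below are hypothetical, not the existing all_benchmark_data schema):

```python
# Sketch: plot nightly speed results over time for one representative config.
# "nightly_benchmark_data.csv" and its columns (date, kernel, metric, config, value)
# are assumptions for illustration only.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("nightly_benchmark_data.csv", parse_dates=["date"])

# Keep only the speed metric for a llama-like fused linear cross entropy config.
subset = df[
    (df["kernel"] == "fused_linear_cross_entropy")
    & (df["metric"] == "speed")
    & (df["config"] == "B=8,T=1024,H=4096,V=128256")
].sort_values("date")

plt.plot(subset["date"], subset["value"], marker="o")
plt.xlabel("date")
plt.ylabel("time (ms)")
plt.title("fused_linear_cross_entropy speed (nightly)")
plt.savefig("nightly_speed.png")
```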
> Besides the benchmark that goes along with new releases, I think it would be great to have an additional nightly (or weekly) benchmark, so we can detect performance regressions earlier and handle them before the version bump.
Agree 🤔 Ideally something like https://hud.pytorch.org/benchmark/compilers, and host the results somewhere else on a server so we don't flood git history with a bunch of benchmark numbers.
@Manan17 is an intern in our team who will be working on this.
Hello, so we discussed an approach to solve this: running all the benchmarks takes less than an hour. I tried it on a single H100 GPU and it took me 55 minutes. So we thought we can run the benchmark scripts after every PR is merged into the main repo, similar to what we do for docs: the benchmark script will run whenever anything is pushed to the main branch.

Initially, the benchmark data will be stored in the gh-pages branch. Once the script is run (make run-benchmarks), the new data will be appended to the CSV file in the gh-pages branch with an extra column, github-commit-hash. So for every commit we will have benchmark data, and it will be visualized using GitHub Pages, for example at https://linkedin.github.io/Liger-Kernel/benchmarks. Using JavaScript, the data will be rendered as charts just like https://hud.pytorch.org/benchmark/compilers, with filters so users can see benchmark improvements for speed and memory.

Let me know what you guys think about this. @Tcc0403 @yundai424 @lancerts
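For reference, a minimal sketch of the append step described above (the file names, column layout, and commit-hash column name are assumptions for illustration, not the repo's actual format):

```python
# Sketch: tag freshly generated benchmark rows with the current commit hash and
# append them to the CSV tracked on the gh-pages branch.
# "new_benchmark_data.csv" and "all_benchmark_data.csv" are illustrative names.
import os
import subprocess

import pandas as pd

commit_hash = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

new_rows = pd.read_csv("new_benchmark_data.csv")
new_rows["github_commit_hash"] = commit_hash

# Append to the accumulated file; write the header only if the file doesn't exist yet.
target = "all_benchmark_data.csv"
new_rows.to_csv(target, mode="a", header=not os.path.exists(target), index=False)
```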
@Manan17 Sounds great! Let's open an issue to discuss this instead of doing it in this PR, since it's not likely to be merged anyway 😅
Update benchmark data with CI instead. See #744.
@Tcc0403 reopening to fix the benchmark tests
@shimizust It's ready for review.