[WIP] Update benchmark data

Open · Tcc0403 opened this pull request 10 months ago · 6 comments

Summary

Rerun all benchmark scripts to get the latest data, so we have a reliable baseline for future optimization.

Note: the orpo benchmark fails with compile=True (plotting with old data for now), and the qwen2vl_mrope script failed.

A complete comparison figure will be uploaded in this PR later.

Fused Linear Chunked Loss

Alignment

  • [x] CPO

    speed (figure: fused_linear_cpo_loss_speed)

  • [x] DPO

    speed (figure: dpo_loss_speed)

  • [x] KTO

    speed (figure: kto_loss_speed)

  • [x] ORPO

    speed (figure: fused_linear_orpo_loss_speed)

  • [x] SimPO

    speed (figure: fused_linear_simpo_loss_speed)

Distillation

  • [x] JSD

    speed (figure: distill_jsd_loss_speed)

Others

  • [x] Cross Entropy

    speed (figure: cross_entropy_speed)

  • [x] Fused Linear Cross Entropy

    speed (figure: fused_linear_cross_entropy_speed)

  • [x] JSD

    speed (figure: jsd_speed)

  • [ ] Fused Linear JSD

    speed

  • [x] DyT

    speed (figure: dyt_speed)

  • [x] Embedding

    speed (figure: embedding_speed)

  • [x] GeGLU

    speed (figure: geglu_speed)

  • [x] GroupNorm

    speed (figure: group_norm_speed)

  • [x] KL Div

    speed (figure: kl_div_speed)

  • [x] LayerNorm

    speed (figure: layer_norm_speed)

  • [x] RMSNorm

    speed (figure: rms_norm_speed)

  • [x] RoPE

    speed (figure: rope_speed)

  • [ ] SwiGLU

    speed

  • [x] TVD

    speed (figure: tvd_speed)

Testing Done

  • Hardware Type: <BLANK>
  • [ ] run make test to ensure correctness
  • [ ] run make checkstyle to ensure code style
  • [ ] run make test-convergence to ensure convergence

Tcc0403 · Apr 02 '25 10:04

@shivam15s @lancerts @yundai424 I'm trying to refactor the benchmark visualizer and the utils for storing data, and there are a few questions I want to figure out first:

  1. Some data are quite outdated; do we need to keep data from old versions (< v0.5.0)?
  2. For future benchmarking, do we keep the latest data only (overwriting data from older versions)? Or do we want to keep older results for performance comparison over time?
  3. This PR only updates H100 data for now; do we need the latest A100 benchmark data as well?

Tcc0403 · Apr 07 '25 12:04

Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark result -- this way we can let git history help us keep track of the performance 😄 I'd like to hear your opinions.

yundai424 · Apr 07 '25 18:04

Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark result -- this way we can let git history help us keep track of the performance 😄 I'd like to hear your opinions.

Strong +1; this can also help detect performance regressions early.

lancerts · Apr 07 '25 18:04

@shivam15s @lancerts @yundai424 I'm trying to refactor the benchmark visualizer and the utils for storing data, and there are a few questions I want to figure out first:

  1. Some data are quite outdated; do we need to keep data from old versions (< v0.5.0)?
  2. For future benchmarking, do we keep the latest data only (overwriting data from older versions)? Or do we want to keep older results for performance comparison over time?
  3. This PR only updates H100 data for now; do we need the latest A100 benchmark data as well?

1. I don't think we need to keep the old data. 2. Keeping the latest data should be enough, and we can have git help us track it. We should guardrail against performance regressions for each release. 3. I think we will still need the A100 data in the near future.

lancerts · Apr 07 '25 18:04

@yundai424 @lancerts

Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark result -- this way we can let git history help us keep track of the performance 😄 I'd like to hear your opinions.

Totally agree! An official benchmark result is definitely better.

1. I don't think we need to keep the old data. 2. Keeping the latest data should be enough, and we can have git help us track it. We should guardrail against performance regressions for each release. 3. I think we will still need the A100 data in the near future.

Besides the benchmark that goes with new releases, I think it would be great to have an additional nightly (or weekly) benchmark, so we can detect performance regressions earlier and handle them before the version bump.

Is it possible to set up a scheduled CI job to periodically update the nightly benchmark?

If so, instead of the current all_benchmark_data, we can create two benchmark data files: one for version releases (full benchmark) and the other for nightly runs (simple benchmark). The release one keeps a complete benchmark result for the latest version, as the current file does. The nightly one can hold multiple recent results (10-20 commits, or weeks/months), but only with the most representative config, e.g., the batch_size, seq_len, hidden_size, and vocab_size of llama. That way we can set the x-axis to date and visualize it for readability. In the best case, we can plot it in the online/offline docs.
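
For illustration, a rough sketch of what plotting such a nightly file could look like. The column names, values, and output path below are assumptions made up for this example, not the actual benchmark schema:

```python
# Minimal sketch, assuming a simplified nightly schema (all names and numbers
# here are illustrative, not the real all_benchmark_data format).
import pandas as pd
import matplotlib.pyplot as plt

nightly = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-04-01", "2025-04-02", "2025-04-03"]),
        "commit": ["abc1234", "def5678", "0123abc"],
        "kernel": ["fused_linear_cross_entropy"] * 3,
        "metric": ["speed"] * 3,
        "time_ms": [41.2, 40.8, 43.9],  # synthetic numbers for illustration
    }
)

# Keep only the representative config and put the date on the x-axis.
subset = nightly[
    (nightly["kernel"] == "fused_linear_cross_entropy") & (nightly["metric"] == "speed")
]
plt.plot(subset["date"], subset["time_ms"], marker="o")
plt.xlabel("date")
plt.ylabel("time (ms)")
plt.title("nightly fused_linear_cross_entropy speed (illustrative data)")
plt.tight_layout()
plt.savefig("nightly_flce_speed.png")
```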

Tcc0403 · Apr 08 '25 12:04

Besides the benchmark that goes with new releases, I think it would be great to have an additional nightly (or weekly) benchmark, so we can detect performance regressions earlier and handle them before the version bump.

Agree 🤔 Ideally something like https://hud.pytorch.org/benchmark/compilers, with the results hosted somewhere else on a server so we don't flood git history with a bunch of benchmark numbers.

yundai424 · Apr 08 '25 16:04

@Manan17 is an intern on our team who will be working on this.

vaibhavjindal · Jun 04 '25 20:06

Hello, we discussed an approach to solve this. Running all the benchmarks takes less than an hour; I tried it on a single H100 GPU and it took 55 minutes. So we thought we could run the benchmark scripts after every PR is merged into the main repo. Similar to what we do for docs, we will run the benchmark scripts after anything is pushed to the main branch. The benchmark data will be stored in the gh-pages branch: once the scripts are run (make run-benchmarks), the new data will be appended to the CSV file in the gh-pages branch with an extra column for the GitHub commit hash. That way, for every commit we will have benchmark data, and it will be visualized via GitHub Pages at, for example, https://linkedin.github.io/Liger-Kernel/benchmarks. The data will be visualized with JavaScript charts, just like https://hud.pytorch.org/benchmark/compilers, with filters so users can see speed and memory improvements. Let me know what you guys think about this. @Tcc0403 @yundai424 @lancerts
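
A rough sketch of the append step described above, assuming a small Python helper in the CI job; the paths, the gh-pages layout, and the extra column name are illustrative, not the final implementation:

```python
# Sketch only: paths, the gh-pages layout, and the extra column name are
# assumptions for illustration, not the final implementation.
import os
import subprocess

import pandas as pd

# Tag the fresh rows with the commit that triggered the run.
commit_hash = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], text=True
).strip()

new_rows = pd.read_csv("benchmark/data/all_benchmark_data.csv")  # output of `make run-benchmarks`
new_rows["github_commit_hash"] = commit_hash  # extra column, one value per commit

# Append to the history CSV kept on the gh-pages branch (checkout path is hypothetical).
history_path = "gh-pages/benchmarks/all_benchmark_data.csv"
new_rows.to_csv(
    history_path,
    mode="a",
    header=not os.path.exists(history_path),  # write the header only on the first run
    index=False,
)
```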

Manan17 · Jun 05 '25 00:06

@Manan17 Sounds great! Let's open an issue to discuss it instead of this PR, since this PR isn't likely to be merged anyway 😅

Tcc0403 · Jun 05 '25 01:06

Update the benchmark data with CI instead. See #744.

Tcc0403 · Jun 05 '25 01:06

@Tcc0403 reopening to fix the benchmark tests

shimizust · Jul 08 '25 22:07

@shimizust It's ready for review.

Tcc0403 · Jul 22 '25 15:07