
[benchmarks] overhaul benchmarks

Open sayakpaul opened this issue 9 months ago • 11 comments

What does this PR do?

This PR considerably simplifies how we do benchmarks. Instead of running entire pipeline-level benchmarks across different tasks, we will now ONLY benchmark the diffusion network, which is the most compute-intensive part of a standard diffusion workflow.

To make the estimates more realistic, we will make use of pre-trained checkpoints and dummy inputs with reasonable dimensionalities.

I ran benchmarking_flux.py on an 80GB A100 with a batch size of 1 and got the following results:

[image: Flux benchmarking results]

Analyze the results in this Space: https://huggingface.co/spaces/diffusers/benchmark-analyzer

By default, all benchmarks will use a batch size of 1, eliminating CFG.
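For context, model-level timing in benchmarks like these typically follows a warmup-then-measure pattern, so that one-time costs don't pollute the numbers. A minimal, framework-agnostic sketch (this is illustrative, not the PR's actual utility; `benchmark_fn`, `warmup`, and `runs` are made-up names):

```python
import time

def benchmark_fn(fn, warmup=3, runs=10):
    """Return the mean wall-clock time per call of `fn`, in seconds."""
    # Warmup calls absorb one-time costs (lazy init, torch.compile, caches).
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

elapsed = benchmark_fn(lambda: sum(range(10_000)))
```

In a real run, `fn` would be the forward pass of the diffusion network under `torch.no_grad()`, with a CUDA synchronize before reading the clock.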

How to add your benchmark?

Adding benchmarks for a new model class (SanaTransformer2DModel, for example) boils down to the following:

  1. Define the dummy inputs for the model.
  2. Define the benchmarking scenarios to run.

This is what benchmarking_flux.py does. More modularization can be shipped afterward.
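The two steps above could be sketched roughly as follows. This is an assumption about the shape of the benchmark scripts, not the PR's actual API; `BenchmarkScenario`, `get_dummy_inputs`, and the shapes used are all illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkScenario:
    name: str                      # e.g. "bf16", "layerwise-upcasting"
    model_id: str                  # pretrained checkpoint to load
    extra_kwargs: dict = field(default_factory=dict)

def get_dummy_inputs(batch_size=1, height=1024, width=1024):
    # Shapes mirror a realistic latent/text-embedding setup; values are placeholders.
    return {
        "hidden_states_shape": (batch_size, (height // 16) * (width // 16), 64),
        "encoder_hidden_states_shape": (batch_size, 512, 4096),
    }

scenarios = [
    BenchmarkScenario("bf16", "black-forest-labs/FLUX.1-dev"),
    BenchmarkScenario("group-offload-leaf", "black-forest-labs/FLUX.1-dev",
                      {"offload_type": "leaf_level"}),
]
```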

The idea is to merge this PR with pre-configured benchmarks for a few popular models and open up the rest to the community.

TODOs

Utilities:

  • [x] Fire the execution of the individual model-level benchmarks sequentially.
  • [x] Combine CSVs from multiple different model classes.
  • [x] Central dataset update and Slack notification.
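The CSV-combining utility can be as simple as concatenating per-model CSVs that share a header. A stdlib-only sketch (function and file names are hypothetical, not the PR's implementation):

```python
import csv

def combine_csvs(paths, out_path):
    """Concatenate CSV files with identical headers into a single file."""
    rows, header = [], None
    for p in paths:
        with open(p, newline="") as f:
            reader = csv.reader(f)
            h = next(reader)
            if header is None:
                header = h
            elif h != header:
                raise ValueError(f"{p} has a mismatched header")
            rows.extend(reader)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```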

@DN6 could you give the approach a quick look? I can then work on resolving the TODOs.

sayakpaul avatar May 16 '25 08:05 sayakpaul

Added SDXL, Wan (14B), and LTX (13B) on top of Flux:

Results
| scenario | model_cls | num_params_M | flops_M | time_plain_s | mem_plain_GB | time_compile_s | mem_compile_GB | fullgraph | mode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wan-AI/Wan2.1-T2V-14B-Diffusers-bf16 | WanTransformer3DModel | 14288.5 | 7.85612e+08 | 10.797 | 31.17 | 8.974 | 31.77 | 1 | default |
| Wan-AI/Wan2.1-T2V-14B-Diffusers-layerwise-upcasting | WanTransformer3DModel | 14288.5 | 7.85612e+08 | 10.702 | 26.78 | nan | nan | nan | nan |
| Wan-AI/Wan2.1-T2V-14B-Diffusers-group-offload-leaf | WanTransformer3DModel | 14288.5 | 7.85612e+08 | 10.83 | 4.48 | nan | nan | nan | nan |
| stabilityai/stable-diffusion-xl-base-1.0-bf16 | UNet2DConditionModel | 2567.46 | 5.9791e+06 | 0.085 | 5.05 | 0.058 | 5.39 | 1 | default |
| stabilityai/stable-diffusion-xl-base-1.0-layerwise-upcasting | UNet2DConditionModel | 2567.46 | 5.9791e+06 | 0.175 | 4.89 | nan | nan | nan | nan |
| stabilityai/stable-diffusion-xl-base-1.0-group-offload-leaf | UNet2DConditionModel | 2567.46 | 5.9791e+06 | 0.383 | 0.2 | nan | nan | nan | nan |
| black-forest-labs/FLUX.1-dev-bf16 | FluxTransformer2DModel | 11901.4 | 5.95295e+07 | 0.535 | 22.61 | 0.388 | 22.85 | 1 | default |
| black-forest-labs/FLUX.1-dev-bnb-nf4 | FluxTransformer2DModel | 5952.25 | 17263.8 | 0.574 | 6.7 | nan | nan | nan | nan |
| black-forest-labs/FLUX.1-dev-layerwise-upcasting | FluxTransformer2DModel | 11901.4 | 5.95295e+07 | 0.621 | 22.18 | nan | nan | nan | nan |
| black-forest-labs/FLUX.1-dev-group-offload-leaf | FluxTransformer2DModel | 11901.4 | 5.95295e+07 | 1.536 | 0.53 | nan | nan | nan | nan |
| Lightricks/LTX-Video-0.9.7-dev-bf16 | LTXVideoTransformer3DModel | 13042.6 | 1.67583e+08 | 1.446 | 25.21 | 1.137 | 25.63 | 1 | default |
| Lightricks/LTX-Video-0.9.7-dev-layerwise-upcasting | LTXVideoTransformer3DModel | 13042.6 | 1.67583e+08 | 1.529 | 24.38 | nan | nan | nan | nan |
| Lightricks/LTX-Video-0.9.7-dev-group-offload-leaf | LTXVideoTransformer3DModel | 13042.6 | 1.67583e+08 | 1.917 | 1.04 | nan | nan | nan | nan |

sayakpaul avatar May 20 '25 07:05 sayakpaul

Cc: @a-r-r-o-w if you want to add some caching benchmarks (in a later PR), I think that would be really great!

sayakpaul avatar May 20 '25 11:05 sayakpaul

@DN6 this is ready for a review.

This is what the final CSV for this stage looks like: https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/collated_results.csv

I have confirmed in this run that it works as expected: https://github.com/huggingface/diffusers/actions/runs/15138495257/job/42570011907

sayakpaul avatar May 20 '25 12:05 sayakpaul

> Cc: @a-r-r-o-w if you want to add some caching benchmarks (in a later PR), I think that would be really great!

Sounds good, I'll take it up in the near future once this PR is in.

a-r-r-o-w avatar May 20 '25 12:05 a-r-r-o-w

@DN6 a gentle ping.

sayakpaul avatar Jun 02 '25 06:06 sayakpaul

@McPatate I have updated this PR to add a DB population script and also adjusted the workflow that will trigger it.

I decided to create a new table because the benchmarking workflow differs from the one used in transformers, and this format has served us well. So, we would like to stick with it.

LMK what you think.

sayakpaul avatar Jun 05 '25 03:06 sayakpaul

> I decided to create a new table because the benchmarking workflow differs from the one used in transformers

Looking at your SQL queries, everything you need is already present in the current DB schema. I'd reuse it rather than create a new table. I understand the structure is a bit different, but it shouldn't be too hard to make it fit:

  • create a benchmark (with INSERT INTO ... RETURNING benchmark_id)
  • create two sets of rows, one for device measurements and one for model measurements (I guess you don't even need device measurements, as I don't see you measuring those values)
  • for each measurement, create a dict with:
        scenario,
        model_cls,
        num_params_M,
        flops_M,
        time_plain_s,
        mem_plain_GB,
        time_compile_s,
        mem_compile_GB,
        fullgraph,
        mode,
  • add the query to the batch:

```sql
INSERT INTO model_measurements (
  benchmark_id,
  measurements
) VALUES (%s, %s)
```
  • profit 😄
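The steps above could be assembled like this. A sketch only: `build_measurement_params` is a hypothetical helper, the column names come from the shared transformers benchmark schema, and the actual execution would go through a PostgreSQL driver such as psycopg2 (shown commented out, not run here):

```python
import json

INSERT_MEASUREMENT_SQL = """
INSERT INTO model_measurements (
  benchmark_id,
  measurements
) VALUES (%s, %s)
"""

def build_measurement_params(benchmark_id, row):
    """Build the (benchmark_id, JSONB payload) parameter tuple for one CSV row."""
    keys = [
        "scenario", "model_cls", "num_params_M", "flops_M",
        "time_plain_s", "mem_plain_GB", "time_compile_s",
        "mem_compile_GB", "fullgraph", "mode",
    ]
    measurements = {k: row.get(k) for k in keys}
    return (benchmark_id, json.dumps(measurements))

# With a real connection (hypothetical usage):
# cur.execute("INSERT INTO benchmarks (...) VALUES (...) RETURNING benchmark_id")
# benchmark_id = cur.fetchone()[0]
# cur.executemany(INSERT_MEASUREMENT_SQL,
#                 [build_measurement_params(benchmark_id, r) for r in rows])
```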

In my benchmark script for transformers I use argparse to get some args and pass them from the workflow

EDIT: don't forget to set your repository's name to differentiate from transformers; I didn't put NOT NULL constraints everywhere 😅

McPatate avatar Jun 05 '25 12:06 McPatate

> create two sets of rows, one for device measurements and one for model measurements (I guess you don't even need device measurements as I don't see you measuring these values)

Yeah the current numbers suffice for the purpose.

Could you elaborate on the reasoning behind not creating a new table? With a separate table, there's no chance of messing up the transformers benchmarks, and we maintain a clear separation.

For something that is already working well for observability purposes, I really don't want to introduce intrusive changes to the workflow unless there's a good reason to. You also mentioned profit, which I fail to see. So, please enlighten :D

sayakpaul avatar Jun 05 '25 12:06 sayakpaul

> Could you elaborate on the reasoning behind not creating a new table? With a separate table, there's no chance of messing up the transformers benchmarks, and we maintain a clear separation.

I don't think there needs to be a separation of tables; it would create redundancy, especially given that you'll converge toward a similar data model (a benchmark table containing metadata, with a 1->N relation to a measurements table).

It does create a dependency between the two repos due to said data model, but I don't really see this being an issue since the measurements table is already a JSONB field and the benchmarks table shouldn't really change that often, if at all.

I understand the concerns about messing up the other repo's data, and that separating things might be useful in a protective manner. Personally, I definitely think that a shared data model helps with re-using dashboards, and a consistent benchmark structure between repositories makes things easier to reason about. If you really feel it's necessary, I'd create a duplicate table, diffusers_measurements or something of the kind, but still use the benchmarks table, i.e., a hybrid approach.

Overall, I don't think there's much risk: the database isn't updated that often, so data volumes will always be low. The DB is backed up, so if something bad happens, we can restore with minimal data loss. And data loss isn't a huge deal anyway; you can always re-run benchmarks at a given SHA. This is where I'm coming from when I say re-use the existing schema: it means less code maintenance in diffusers (while also potentially making it possible to re-use dashboards in Grafana, although I don't really see that happening yet).

> I really don't want to introduce intrusive changes to the workflow

Nothing intrusive about building upon existing work :)

McPatate avatar Jun 05 '25 13:06 McPatate

Okay will roll with it.

sayakpaul avatar Jun 05 '25 13:06 sayakpaul

@DN6 a gentle ping.

sayakpaul avatar Jun 23 '25 08:06 sayakpaul

@anijain2305 just a ping to let you know that we're merging this PR which will run the benchmarking suite bi-weekly and report the results here: https://huggingface.co/datasets/diffusers/benchmarks/blob/main/collated_results.csv

sayakpaul avatar Jul 02 '25 13:07 sayakpaul

Thanks for setting this up. This will be really helpful for tracking progress and identifying regressions.

anijain2305 avatar Jul 02 '25 14:07 anijain2305

Since everything is passing now, will merge this PR :) https://github.com/huggingface/diffusers/actions/runs/16065987231/job/45340516693

sayakpaul avatar Jul 04 '25 05:07 sayakpaul

Also cc @a-r-r-o-w for https://github.com/huggingface/diffusers/pull/11565#issuecomment-2893942224 (not urgent, when you get time).

sayakpaul avatar Jul 04 '25 05:07 sayakpaul