[benchmarks] overhaul benchmarks
What does this PR do?
This PR considerably simplifies how we do benchmarks. Instead of running entire pipeline-level benchmarks across different tasks, we will now ONLY benchmark the diffusion network, which is the most compute-intensive part of a standard diffusion workflow.
To make the estimates more realistic, we will make use of pre-trained checkpoints and dummy inputs with reasonable dimensionalities.
I ran benchmarking_flux.py on an 80GB A100 with a batch size of 1 and got the following results:
Analyze the results in this Space: https://huggingface.co/spaces/diffusers/benchmark-analyzer
By default, all benchmarks will use a batch size of 1, with classifier-free guidance (CFG) disabled.
How to add your benchmark?
Adding benchmarks for a new model class (SanaTransformer2DModel, for example) boils down to the following:
- Define the dummy inputs of the model.
- Define the benchmarking scenarios to run the model under.
This is what benchmarking_flux.py does. More modularization can be shipped afterward.
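To make the two steps concrete, here is a rough sketch of what a model-level benchmark definition could look like. All names here (`BenchmarkScenario`, `get_dummy_inputs`, the checkpoint id) are illustrative placeholders, not the actual benchmarking_flux.py API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: one scenario entry per row in the results table.
# These class/function names are made up for illustration.

@dataclass
class BenchmarkScenario:
    name: str                      # e.g. "bf16", "layerwise-upcasting"
    model_cls: str                 # e.g. "SanaTransformer2DModel"
    checkpoint: str                # pre-trained checkpoint to load
    extra_kwargs: dict = field(default_factory=dict)

def get_dummy_inputs(batch_size: int = 1) -> dict:
    """Dummy input shapes with reasonable dimensionalities for the network."""
    return {
        "hidden_states": (batch_size, 16, 64, 64),        # latent input
        "encoder_hidden_states": (batch_size, 77, 2048),  # text embeddings
        "timestep": (batch_size,),
    }

# The scenarios this model should be benchmarked under.
SCENARIOS = [
    BenchmarkScenario("bf16", "SanaTransformer2DModel", "org/sana-checkpoint"),
    BenchmarkScenario(
        "layerwise-upcasting", "SanaTransformer2DModel", "org/sana-checkpoint"
    ),
]
```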
The idea is to merge this PR with pre-configured benchmarks for a few popular models and leave the rest open to the community.
TODOs
Utilities:
- [x] Run the individual model-level benchmarks sequentially.
- [x] Combine CSVs from multiple different model classes.
- [x] Central dataset update and Slack notification.
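For the CSV-combining utility, a minimal stdlib sketch could look like the following (the function name and the assumption that all per-model CSVs share the same header are mine, not the actual implementation):

```python
import csv

def combine_csvs(csv_paths, out_path):
    """Concatenate per-model-class benchmark CSVs into one collated file.

    Assumes all inputs share the same header; rows are appended in order.
    Returns the number of data rows written.
    """
    rows, fieldnames = [], None
    for path in csv_paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            rows.extend(reader)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```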
@DN6 could you give the approach a quick look? I can then work on resolving the TODOs.
Added SDXL, Wan (14B), and LTX (13B) on top of Flux:
Results
| | scenario | model_cls | num_params_M | flops_M | time_plain_s | mem_plain_GB | time_compile_s | mem_compile_GB | fullgraph | mode |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wan-AI/Wan2.1-T2V-14B-Diffusers-bf16 | WanTransformer3DModel | 14288.5 | 7.85612e+08 | 10.797 | 31.17 | 8.974 | 31.77 | 1 | default |
| 1 | Wan-AI/Wan2.1-T2V-14B-Diffusers-layerwise-upcasting | WanTransformer3DModel | 14288.5 | 7.85612e+08 | 10.702 | 26.78 | nan | nan | nan | nan |
| 2 | Wan-AI/Wan2.1-T2V-14B-Diffusers-group-offload-leaf | WanTransformer3DModel | 14288.5 | 7.85612e+08 | 10.83 | 4.48 | nan | nan | nan | nan |
| 3 | stabilityai/stable-diffusion-xl-base-1.0-bf16 | UNet2DConditionModel | 2567.46 | 5.9791e+06 | 0.085 | 5.05 | 0.058 | 5.39 | 1 | default |
| 4 | stabilityai/stable-diffusion-xl-base-1.0-layerwise-upcasting | UNet2DConditionModel | 2567.46 | 5.9791e+06 | 0.175 | 4.89 | nan | nan | nan | nan |
| 5 | stabilityai/stable-diffusion-xl-base-1.0-group-offload-leaf | UNet2DConditionModel | 2567.46 | 5.9791e+06 | 0.383 | 0.2 | nan | nan | nan | nan |
| 6 | black-forest-labs/FLUX.1-dev-bf16 | FluxTransformer2DModel | 11901.4 | 5.95295e+07 | 0.535 | 22.61 | 0.388 | 22.85 | 1 | default |
| 7 | black-forest-labs/FLUX.1-dev-bnb-nf4 | FluxTransformer2DModel | 5952.25 | 17263.8 | 0.574 | 6.7 | nan | nan | nan | nan |
| 8 | black-forest-labs/FLUX.1-dev-layerwise-upcasting | FluxTransformer2DModel | 11901.4 | 5.95295e+07 | 0.621 | 22.18 | nan | nan | nan | nan |
| 9 | black-forest-labs/FLUX.1-dev-group-offload-leaf | FluxTransformer2DModel | 11901.4 | 5.95295e+07 | 1.536 | 0.53 | nan | nan | nan | nan |
| 10 | Lightricks/LTX-Video-0.9.7-dev-bf16 | LTXVideoTransformer3DModel | 13042.6 | 1.67583e+08 | 1.446 | 25.21 | 1.137 | 25.63 | 1 | default |
| 11 | Lightricks/LTX-Video-0.9.7-dev-layerwise-upcasting | LTXVideoTransformer3DModel | 13042.6 | 1.67583e+08 | 1.529 | 24.38 | nan | nan | nan | nan |
| 12 | Lightricks/LTX-Video-0.9.7-dev-group-offload-leaf | LTXVideoTransformer3DModel | 13042.6 | 1.67583e+08 | 1.917 | 1.04 | nan | nan | nan | nan |
Cc: @a-r-r-o-w if you want to add some caching benchmarks (in a later PR), I think that would be really great!
@DN6 this is ready for a review.
This is how the final CSV for this stage looks: https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/collated_results.csv
I have confirmed in this run that it works as expected: https://github.com/huggingface/diffusers/actions/runs/15138495257/job/42570011907
> Cc: @a-r-r-o-w if you want to add some caching benchmarks (in a later PR), I think that would be really great!
Sounds good, I'll take it up in the near future once this PR is in.
@DN6 a gentle ping.
@McPatate I have updated this PR to add a DB population script and also adjusted the workflow that will trigger it.
I decided to create a new table because our benchmarking workflow differs from the one used in transformers, and this format has served us well. So, we would like to stick to it.
LMK what you think.
> I decided to create a new table because our benchmarking workflow differs from the one used in transformers
Looking at your SQL queries, everything you need is already present in the current DB schema. I'd reuse it rather than create a new table. I understand the structure is a bit different, but it shouldn't be too hard to make it fit:
- create a benchmark (with `INSERT INTO ... RETURNING benchmark_id`)
- create two sets of rows, one for device measurements and one for model measurements (I guess you don't even need device measurements as I don't see you measuring these values)
- for each measurement, create a dict with: `scenario`, `model_cls`, `num_params_M`, `flops_M`, `time_plain_s`, `mem_plain_GB`, `time_compile_s`, `mem_compile_GB`, `fullgraph`, `mode`
- add the query to the batch:

```sql
INSERT INTO model_measurements (
    benchmark_id,
    measurements
) VALUES (%s, %s)
```

- profit 😄
In my benchmark script for transformers, I use argparse to get some args and pass them in from the workflow.
EDIT: don't forget to set your repository's name to differentiate from transformers (I didn't put NOT NULL constraints everywhere 😅).
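The insert flow described above could be sketched roughly as follows. The table/column names follow the discussion (`model_measurements`, `benchmark_id`, a JSONB `measurements` field), but the exact `benchmarks` schema, including the `repository` and `commit_sha` columns, is my assumption:

```python
import json

# Sketch of the suggested flow: create a benchmark row, get its id back,
# then batch-insert one JSONB payload per measurement. The `benchmarks`
# column names here are guesses, not the actual schema.

CREATE_BENCHMARK_SQL = """
INSERT INTO benchmarks (repository, commit_sha)
VALUES (%s, %s)
RETURNING benchmark_id
"""

INSERT_MEASUREMENT_SQL = """
INSERT INTO model_measurements (
    benchmark_id,
    measurements
) VALUES (%s, %s)
"""

MEASUREMENT_FIELDS = [
    "scenario", "model_cls", "num_params_M", "flops_M",
    "time_plain_s", "mem_plain_GB", "time_compile_s",
    "mem_compile_GB", "fullgraph", "mode",
]

def build_measurement_params(benchmark_id, row):
    """Turn one benchmark-result row into (benchmark_id, JSONB payload)."""
    payload = {k: row.get(k) for k in MEASUREMENT_FIELDS}
    return (benchmark_id, json.dumps(payload))

# With a real psycopg2 connection, the batch would then look something like:
#   cur.execute(CREATE_BENCHMARK_SQL, ("diffusers", sha))
#   benchmark_id = cur.fetchone()[0]
#   cur.executemany(INSERT_MEASUREMENT_SQL,
#                   [build_measurement_params(benchmark_id, r) for r in rows])
```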
> create two sets of rows, one for device measurements and one for model measurements (I guess you don't even need device measurements as I don't see you measuring these values)
Yeah the current numbers suffice for the purpose.
Could you elaborate on your hesitation about creating a new table? This way, there's no chance of messing up the transformers benchmarks, and we maintain a clear separation.
For something that is already working well for observability purposes, I really don't want to introduce intrusive changes to the workflow unless there's a good reason to. You mentioned profit as well, but I fail to see it. So, please enlighten :D
> Could you elaborate on your hesitation about creating a new table? This way, there's no chance of messing up the transformers benchmarks, and we maintain a clear separation.
I don't think there needs to be a separation of tables; it'll create redundancy, especially given you'll converge towards a similar data model (a benchmarks table containing metadata with a 1->N relation to a measurements table).
It does create a dependency between the two repos due to said data model, but I don't really see this being an issue since the measurements table is already a JSONB field and the benchmarks table shouldn't really change that often, if at all.
I understand the concerns about messing up the other repo's data and that separating things might be useful in a protective manner. Personally, I definitely think a shared data model helps with re-using dashboards, and having a consistent benchmark structure between repositories makes it easier to reason about. If you really feel it's necessary, I'd create a duplicate table, diffusers_measurements or something of the kind, but still use the benchmarks table, so a hybrid approach.
Overall, I don't think there's much risk: the database isn't updated that often, so data volumes will always stay small. The DB is backed up, so if something bad happens we can restore with minimal data loss. Data loss isn't a huge deal anyway, since you can always re-run benchmarks at a given SHA. This is where I'm coming from when suggesting re-using the existing schema: it'll make for less code maintenance in diffusers (while also potentially making it possible to re-use dashboards in Grafana, although I don't really see that happening yet).
> I really don't want to introduce intrusive changes to the workflow
Nothing intrusive about building upon existing work :)
Okay, will roll with it.
@DN6 a gentle ping.
@anijain2305 just a ping to let you know that we're merging this PR which will run the benchmarking suite bi-weekly and report the results here: https://huggingface.co/datasets/diffusers/benchmarks/blob/main/collated_results.csv
Thanks for setting this up. This will be really helpful for tracking progress and identifying regressions.
Since everything is passing now, will merge this PR :) https://github.com/huggingface/diffusers/actions/runs/16065987231/job/45340516693
Also cc @a-r-r-o-w for https://github.com/huggingface/diffusers/pull/11565#issuecomment-2893942224 (not urgent, when you get time).