[benchmarks] overhaul benchmarks
What does this PR do?
This PR considerably simplifies how we do benchmarks. Instead of running entire pipeline-level benchmarks across different tasks, we will now ONLY benchmark the diffusion network, which is the most compute-intensive part of a standard diffusion workflow.
To make the estimates more realistic, we will make use of pre-trained checkpoints and dummy inputs with reasonable dimensionalities.
I ran benchmarking_flux.py on an 80GB A100 with a batch size of 1 and got the following results:
Analyze the results in this Space: https://huggingface.co/spaces/diffusers/benchmark-analyzer
By default, all benchmarks will use a batch size of 1, with classifier-free guidance (CFG) disabled.
How to add your benchmark?
Adding benchmarks for a new model class (SanaTransformer2DModel, for example) boils down to the following:
- Define the dummy inputs of the model.
- Define the benchmarking scenarios to run the model under.
This is what benchmarking_flux.py does. More modularization can be shipped afterward.
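To make the two steps concrete, here is a rough sketch of what a model-level benchmark definition could look like. All names here (`BenchmarkScenario`, `get_dummy_inputs`, the checkpoint id) are illustrative placeholders, not the actual benchmarking_flux.py API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: one scenario entry per row in the results table.
# These class/function names are made up for illustration.

@dataclass
class BenchmarkScenario:
    name: str                      # e.g. "bf16", "layerwise-upcasting"
    model_cls: str                 # e.g. "SanaTransformer2DModel"
    checkpoint: str                # pre-trained checkpoint to load
    extra_kwargs: dict = field(default_factory=dict)

def get_dummy_inputs(batch_size: int = 1) -> dict:
    """Dummy input shapes with reasonable dimensionalities for the network."""
    return {
        "hidden_states": (batch_size, 16, 64, 64),        # latent input
        "encoder_hidden_states": (batch_size, 77, 2048),  # text embeddings
        "timestep": (batch_size,),
    }

# The scenarios this model should be benchmarked under.
SCENARIOS = [
    BenchmarkScenario("bf16", "SanaTransformer2DModel", "org/sana-checkpoint"),
    BenchmarkScenario(
        "layerwise-upcasting", "SanaTransformer2DModel", "org/sana-checkpoint"
    ),
]
```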
The idea is to merge this PR with pre-configured benchmarks for a few popular models and leave the rest open to the community.
TODOs
Utilities:
- [x] Run the individual model-level benchmarks sequentially.
- [x] Combine CSVs from multiple different model classes.
- [x] Central dataset update and Slack notification.
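For the CSV-combining utility, a minimal stdlib sketch could look like the following (the function name and the assumption that all per-model CSVs share the same header are mine, not the actual implementation):

```python
import csv

def combine_csvs(csv_paths, out_path):
    """Concatenate per-model-class benchmark CSVs into one collated file.

    Assumes all inputs share the same header; rows are appended in order.
    Returns the number of data rows written.
    """
    rows, fieldnames = [], None
    for path in csv_paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            rows.extend(reader)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```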
@DN6 could you give the approach a quick look? I can then work on resolving the TODOs.
Added SDXL, Wan (14B), and LTX (13B) on top of Flux:
Results
| | scenario | model_cls | num_params_M | flops_M | time_plain_s | mem_plain_GB | time_compile_s | mem_compile_GB | fullgraph | mode |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wan-AI/Wan2.1-T2V-14B-Diffusers-bf16 | WanTransformer3DModel | 14288.5 | 7.85612e+08 | 10.797 | 31.17 | 8.974 | 31.77 | 1 | default |
| 1 | Wan-AI/Wan2.1-T2V-14B-Diffusers-layerwise-upcasting | WanTransformer3DModel | 14288.5 | 7.85612e+08 | 10.702 | 26.78 | nan | nan | nan | nan |
| 2 | Wan-AI/Wan2.1-T2V-14B-Diffusers-group-offload-leaf | WanTransformer3DModel | 14288.5 | 7.85612e+08 | 10.83 | 4.48 | nan | nan | nan | nan |
| 3 | stabilityai/stable-diffusion-xl-base-1.0-bf16 | UNet2DConditionModel | 2567.46 | 5.9791e+06 | 0.085 | 5.05 | 0.058 | 5.39 | 1 | default |
| 4 | stabilityai/stable-diffusion-xl-base-1.0-layerwise-upcasting | UNet2DConditionModel | 2567.46 | 5.9791e+06 | 0.175 | 4.89 | nan | nan | nan | nan |
| 5 | stabilityai/stable-diffusion-xl-base-1.0-group-offload-leaf | UNet2DConditionModel | 2567.46 | 5.9791e+06 | 0.383 | 0.2 | nan | nan | nan | nan |
| 6 | black-forest-labs/FLUX.1-dev-bf16 | FluxTransformer2DModel | 11901.4 | 5.95295e+07 | 0.535 | 22.61 | 0.388 | 22.85 | 1 | default |
| 7 | black-forest-labs/FLUX.1-dev-bnb-nf4 | FluxTransformer2DModel | 5952.25 | 17263.8 | 0.574 | 6.7 | nan | nan | nan | nan |
| 8 | black-forest-labs/FLUX.1-dev-layerwise-upcasting | FluxTransformer2DModel | 11901.4 | 5.95295e+07 | 0.621 | 22.18 | nan | nan | nan | nan |
| 9 | black-forest-labs/FLUX.1-dev-group-offload-leaf | FluxTransformer2DModel | 11901.4 | 5.95295e+07 | 1.536 | 0.53 | nan | nan | nan | nan |
| 10 | Lightricks/LTX-Video-0.9.7-dev-bf16 | LTXVideoTransformer3DModel | 13042.6 | 1.67583e+08 | 1.446 | 25.21 | 1.137 | 25.63 | 1 | default |
| 11 | Lightricks/LTX-Video-0.9.7-dev-layerwise-upcasting | LTXVideoTransformer3DModel | 13042.6 | 1.67583e+08 | 1.529 | 24.38 | nan | nan | nan | nan |
| 12 | Lightricks/LTX-Video-0.9.7-dev-group-offload-leaf | LTXVideoTransformer3DModel | 13042.6 | 1.67583e+08 | 1.917 | 1.04 | nan | nan | nan | nan |
Cc: @a-r-r-o-w if you want to add some caching benchmarks (in a later PR), I think that would be really great!
@DN6 this is ready for a review.
This is how the final CSV for this stage looks: https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/collated_results.csv
I have confirmed in this run that it works as expected: https://github.com/huggingface/diffusers/actions/runs/15138495257/job/42570011907
> Cc: @a-r-r-o-w if you want to add some caching benchmarks (in a later PR), I think that would be really great!
Sounds good, I'll take it up in the near future once this PR is in.
@DN6 a gentle ping.
@McPatate I have updated this PR to add a DB population script and also adjusted the workflow that will trigger it.
I decided to create a new table because our benchmarking workflow differs from the one used in transformers, and this format has served us well. So, we would like to stick to it.
LMK what you think.
> I decided to create a new table because our benchmarking workflow differs from the one used in transformers
Looking at your SQL queries, everything you need is already present in the current DB schema. I'd reuse it rather than create a new table. I understand the structure is a bit different, but it shouldn't be too hard to make it fit:
- create a benchmark (with `INSERT INTO ... RETURNING benchmark_id`)
- create two sets of rows, one for device measurements and one for model measurements (I guess you don't even need device measurements as I don't see you measuring these values)
- for each measurement, create a dict with: `scenario`, `model_cls`, `num_params_M`, `flops_M`, `time_plain_s`, `mem_plain_GB`, `time_compile_s`, `mem_compile_GB`, `fullgraph`, `mode`
- add the query to the batch:

```sql
INSERT INTO model_measurements (
    benchmark_id,
    measurements
) VALUES (%s, %s)
```

- profit 😄
In my benchmark script for transformers, I use argparse to get some args and pass them in from the workflow.
EDIT: don't forget to set your repository's name to differentiate from transformers (I didn't put NOT NULL constraints everywhere 😅).
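The insert flow described above could be sketched roughly as follows. The table/column names follow the discussion (`model_measurements`, `benchmark_id`, a JSONB `measurements` field), but the exact `benchmarks` schema, including the `repository` and `commit_sha` columns, is my assumption:

```python
import json

# Sketch of the suggested flow: create a benchmark row, get its id back,
# then batch-insert one JSONB payload per measurement. The `benchmarks`
# column names here are guesses, not the actual schema.

CREATE_BENCHMARK_SQL = """
INSERT INTO benchmarks (repository, commit_sha)
VALUES (%s, %s)
RETURNING benchmark_id
"""

INSERT_MEASUREMENT_SQL = """
INSERT INTO model_measurements (
    benchmark_id,
    measurements
) VALUES (%s, %s)
"""

MEASUREMENT_FIELDS = [
    "scenario", "model_cls", "num_params_M", "flops_M",
    "time_plain_s", "mem_plain_GB", "time_compile_s",
    "mem_compile_GB", "fullgraph", "mode",
]

def build_measurement_params(benchmark_id, row):
    """Turn one benchmark-result row into (benchmark_id, JSONB payload)."""
    payload = {k: row.get(k) for k in MEASUREMENT_FIELDS}
    return (benchmark_id, json.dumps(payload))

# With a real psycopg2 connection, the batch would then look something like:
#   cur.execute(CREATE_BENCHMARK_SQL, ("diffusers", sha))
#   benchmark_id = cur.fetchone()[0]
#   cur.executemany(INSERT_MEASUREMENT_SQL,
#                   [build_measurement_params(benchmark_id, r) for r in rows])
```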
> create two sets of rows, one for device measurements and one for model measurements (I guess you don't even need device measurements as I don't see you measuring these values)
Yeah the current numbers suffice for the purpose.
Could you elaborate on your hesitation about creating a new table? This way, there's no chance of messing up the transformers benchmarks, and we maintain a clear separation.
For something that is already working well for observability purposes, I really don't want to introduce intrusive changes to the workflow unless there's a good reason to. You mentioned profit as well, but I fail to see it. So, please enlighten :D
> Could you elaborate on your hesitation about creating a new table? This way, there's no chance of messing up the transformers benchmarks, and we maintain a clear separation.
I don't think there needs to be a separation of tables; it'll create redundancy, especially given you'll converge towards a similar data model (a benchmarks table containing metadata with a 1->N relation to a measurements table).
It does create a dependency between the two repos due to said data model, but I don't really see this being an issue since the measurements table is already a JSONB field and the benchmarks table shouldn't really change that often, if at all.
I understand the concerns about messing up the other repo's data and that separating things might be useful in a protective manner. Personally, I definitely think a shared data model helps with re-using dashboards, and having a consistent benchmark structure between repositories makes it easier to reason about. If you really feel it's necessary, I'd create a duplicate table, diffusers_measurements or something of the kind, but still use the benchmarks table, so a hybrid approach.
Overall, I don't think there's much risk: the database isn't updated that often, so data volumes will always stay small. The DB is backed up, so if something bad happens we can restore with minimal data loss. Data loss isn't a huge deal anyway, since you can always re-run benchmarks at a given SHA. This is where I'm coming from when suggesting re-using the existing schema: it'll make for less code maintenance in diffusers (while also potentially making it possible to re-use dashboards in Grafana, although I don't really see that happening yet).
> I really don't want to introduce intrusive changes to the workflow
Nothing intrusive about building upon existing work :)
Okay, will roll with it.
@DN6 a gentle ping.
@anijain2305 just a ping to let you know that we're merging this PR which will run the benchmarking suite bi-weekly and report the results here: https://huggingface.co/datasets/diffusers/benchmarks/blob/main/collated_results.csv
Thanks for setting this up. This will be really helpful for tracking progress and identifying regressions.
Since everything is passing now, will merge this PR :) https://github.com/huggingface/diffusers/actions/runs/16065987231/job/45340516693
Also cc @a-r-r-o-w for https://github.com/huggingface/diffusers/pull/11565#issuecomment-2893942224 (not urgent, when you get time).