
Diffusion Transformers quantization

kabachuha opened this issue 1 year ago · 12 comments

Is your feature request related to a problem? Please describe.

Due to the soaring success of OpenAI's DALL·E 3 and Sora, many projects, such as Stable Diffusion 3, PixArt, and Open-Sora, are trying to replicate their architecture, whose backbone is a Diffusion Transformer (DiT). Because it is a transformer, the same quantization principles could be applied, making these models more accessible through lower VRAM usage and faster inference. This would be especially useful for extending the context window of text2video models, as it would allow longer video lengths.

Describe the solution you'd like

Add 8-bit/4-bit quantization for DiT/PixArt-like diffusion transformers. It would be nice to have an equivalent of `load_in_4bit=True` when loading the pretrained models.
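Something like the following sketch, mirroring transformers' `load_in_4bit` flag (the flag and checkpoint ID here are purely illustrative; no such option existed in diffusers at the time of this request):

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical flag mirroring transformers' `load_in_4bit`;
# nothing equivalent existed in diffusers when this issue was filed.
pipe = DiffusionPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",  # illustrative checkpoint
    torch_dtype=torch.float16,
    load_in_4bit=True,                   # the requested option
)
```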

Describe alternatives you've considered

Running the diffusion transformer as is with higher memory consumption and lower speed.

Additional context

- DiT Pipeline in Diffusers: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/dit.md
- PixArt Pipeline in Diffusers: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/pixart.md
- Open-Sora project for video generation with (ST)DiT: https://hpcaitech.github.io/Open-Sora/ (has code and the checkpoint!)

Crosslink to the issue in Open-Sora: https://github.com/hpcaitech/Open-Sora/issues/128

kabachuha avatar Mar 18 '24 14:03 kabachuha

Let's join forces with 🤗 quanto!

tolgacangoz avatar Mar 18 '24 15:03 tolgacangoz

> Let's join forces with 🤗 quanto!

Awesome discussion by Sayak and David (maintainer of quanto): https://github.com/huggingface/diffusers/discussions/7023

a-r-r-o-w avatar Mar 18 '24 16:03 a-r-r-o-w

This issue is more specific to the transformer architecture (without UNet blocks), though, which I think already has tailored quantization methods such as bitsandbytes and AutoGPTQ.

kabachuha avatar Mar 18 '24 16:03 kabachuha

There's a little problem however :)

Things like LLM.int8(), AutoGPTQ, etc. are all quite specific to the LLM arena. Yes, I am aware that the underlying architecture isn't changing much here, but the pretraining varies substantially. Hence, these methods aren't exactly transferable.

See https://github.com/huggingface/diffusers/issues/6500 for an elaborate discussion. Cc: @younesbelkada for awareness.

sayakpaul avatar Mar 19 '24 08:03 sayakpaul

Hi! Thanks everyone! Yes, it could make sense to leverage existing LLM quantization methods on transformer blocks by replacing linear layers with quantized linear layers. One could try that out with bitsandbytes Linear4bit layers. One could also use quanto to quantize the entire module, since quanto supports quantized conv2d layers; I believe @sayakpaul and @dacorvo had some experiments with that!
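A rough sketch of the Linear4bit swap being suggested, assuming bitsandbytes' standard behavior of quantizing weights on the move to CUDA (the checkpoint ID is illustrative; real integrations such as transformers' replacement utilities handle many more edge cases):

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb
from diffusers import PixArtAlphaPipeline

def swap_linear_for_4bit(module: nn.Module) -> None:
    """Recursively replace nn.Linear layers with bitsandbytes Linear4bit.
    The weights are quantized when the module is moved to CUDA."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            qlayer = bnb.nn.Linear4bit(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                compute_dtype=torch.float16,
            )
            qlayer.weight = bnb.nn.Params4bit(
                child.weight.data, requires_grad=False
            )
            if child.bias is not None:
                qlayer.bias = child.bias
            setattr(module, name, qlayer)
        else:
            swap_linear_for_4bit(child)

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
)
swap_linear_for_4bit(pipe.transformer)
pipe.to("cuda")  # quantization happens on the device move
```

For the quanto route, the equivalent would be roughly `quantize(pipe.transformer, weights=qint8)` followed by `freeze(pipe.transformer)` (from optimum.quanto; early releases shipped as plain quanto), which covers the whole module, conv2d layers included.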

younesbelkada avatar Mar 19 '24 09:03 younesbelkada

And yes, see #6500 for more details.

younesbelkada avatar Mar 19 '24 09:03 younesbelkada

With diffusers you can already load a pipeline with dtype float8, but that only saves VRAM, because you need to upcast the pipeline to fp16 or bf16 for inference. I'm trying to find out how to train a diffusers model in fp8.
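A minimal sketch of the workflow described, assuming PyTorch >= 2.1 for the float8 dtype and a recent diffusers that tolerates it in `from_pretrained` (the checkpoint ID is illustrative):

```python
import torch
from diffusers import DiffusionPipeline

# Store weights in fp8 to halve memory relative to fp16; fp8 compute kernels
# are not generally available, so the weights must be upcast before inference.
pipe = DiffusionPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",  # illustrative checkpoint
    torch_dtype=torch.float8_e4m3fn,
)
pipe.to(dtype=torch.bfloat16, device="cuda")  # upcast for inference
```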

elismasilva avatar Apr 07 '24 04:04 elismasilva

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 01 '24 15:05 github-actions[bot]

@kabachuha have you tried hqq? Happy to assist if you need help to make it work.
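For reference, a hedged sketch of what quantizing a single DiT linear layer with hqq could look like, based on hqq's documented `BaseQuantizeConfig`/`HQQLinear` API (the layer shape is illustrative, and exact signatures may vary across hqq versions):

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Illustrative layer shaped like an MLP projection inside a DiT block.
linear = nn.Linear(1152, 4608)

# 4-bit weight quantization with per-group scales.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
qlinear = HQQLinear(
    linear, quant_config, compute_dtype=torch.float16, device="cuda"
)

x = torch.randn(1, 1152, dtype=torch.float16, device="cuda")
out = qlinear(x)  # forward pass through the quantized layer
```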

mobicham avatar May 10 '24 07:05 mobicham

@kabachuha We have recently trained a ternary DiT from scratch and open-sourced it. Maybe you can find more information here

Lucky-Lance avatar May 24 '24 10:05 Lucky-Lance

Nice work @Lucky-Lance !

mobicham avatar May 24 '24 11:05 mobicham

@Lucky-Lance Impressive! 👀

kabachuha avatar May 24 '24 11:05 kabachuha

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 14 '24 15:09 github-actions[bot]