Tencent Hunyuan Team: add HunyuanDiT related updates
This PR did the following things:
- Created `HunyuanDiTPipeline` in `src/diffusers/pipelines/hunyuandit/` and `HunyuanDiT2DModel` in `src/diffusers/models/transformers/`.
- To support `HunyuanDiT2DModel`, added `HunyuanDiTBlock` and helper functions in `src/diffusers/models/attention.py`.
- Uploaded the safetensors model to my Hugging Face: XCLiu/HunyuanDiT-0523.
- Tested that the output of the migrated model+code is the same as our repo (https://github.com/Tencent/HunyuanDiT). Have tested different resolutions and batch sizes > 1 and made sure they work correctly.
In this branch, you can run HunyuanDiT in FP32 with:
python3 test_hunyuan_dit.py
which includes the following code:
import torch
from diffusers import HunyuanDiTPipeline
pipe = HunyuanDiTPipeline.from_pretrained("XCLiu/HunyuanDiT-0523", torch_dtype=torch.float32)
pipe.to('cuda')
### NOTE: HunyuanDiT supports both Chinese and English inputs
prompt = "一个宇航员在骑马"
#prompt = "An astronaut riding a horse"
image = pipe(height=1024, width=1024, prompt=prompt).images[0]
image.save("./img.png")
Dependency:
maybe the timm package
TODO list:
- FP16 support: I didn't change the parameter `use_fp16` in `HunyuanDiTPipeline.__call__()`. The reason is `BertModel` does not support FP16 quantization. In our repo we only quantize the diffusion transformer to FP16. I guess there must be some smart way to support FP16.
- Simplify and refactor the `HunyuanDiTBlock`-related code in `src/diffusers/pipelines/hunyuandit/pipeline_hunyuandit.py`.
- Refactor the pipeline and HunyuanDiT2DModel to diffusers style.
- doc
Thank you so much! I'll be there and help with everything.
cc: @sayakpaul @yiyixuxu
Hi:
I removed HunyuanDiTAttention and HunyuanDiTCrossAttention to use our Attention class with a HunyuanAttnProcessor2_0 attention processor instead.
Feel free to test out the PR branch and cherry-pick this commit https://github.com/huggingface/diffusers/pull/8265/commits/3f85b1d257a9184b38772fa54997d334da1e1fae if the results look OK to you.
I included a testing script here https://github.com/huggingface/diffusers/pull/8265#issue-2314186991
Hi, I did the following things:
- cleaned the pipeline/transformer code according to Sayak's suggestions.
- switched to yiyi's refactored attention.
- new test code in `test_hunyuan_dit.py`: based on yiyi's test code, switched norm2 and norm3, fixed the generator to seed 0. The image should be:
For now, I will not change the remote state_dict XCLiu/HunyuanDiT-0523. I will update it after we finish everything.
Please review and comment, thx!
@sayakpaul @yiyixuxu
I made some improvements according to Sayak and Yiyi's suggestions.
Several additional problems:
- Our BertModel cannot work in FP16. So to make our pipeline FP16, we have to load only the transformer in FP16 and pass it in, i.e. `pipe = HunyuanDiTPipeline.from_pretrained("XCLiu/HunyuanDiT-0523", transformer=model, torch_dtype=torch.float16)`, instead of just casting the whole pipeline to FP16.
- I left some comments in the above conversations regarding the magic numbers. Please help me with the design choice.
> Our BertModel cannot work in FP16. So to make our pipeline FP16, we have to load only the transformer in FP16 and pass it in, i.e. `pipe = HunyuanDiTPipeline.from_pretrained("XCLiu/HunyuanDiT-0523", transformer=model, torch_dtype=torch.float16)`

Why wouldn't it work, though? Could you provide more details here? In order for the pipeline to operate in `torch.float16`, all the components need to be in `torch.float16`, or we have to deal with it differently, like so:
https://github.com/huggingface/diffusers/blob/b3d10d6d65a80593627c6738fbeded2f69b5129f/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L1264
Additionally, could you see if this worked so that we can avoid `self.text_encoder.pooler.to_empty(device='cpu')`?
I refactored the model here: https://github.com/huggingface/diffusers/pull/8310. You can see the changes I made in this commit: https://github.com/huggingface/diffusers/pull/8310/commits/b0e0da28c4d0f057824faacb23da6b06dd43a786
I did the things below:
- refactored `HunyuanDiT2DModel`: removed `HunyuanDiTPatchEmbed` and `HunyuanDiTTimestepEmbedder`
- refactored `apply_rotary_emb`: moved all the `get_2d_rotary_pos_embed_*` functions and `apply_rotary_emb` to `embeddings.py`
- moved `HunyuanDiTBlock` and other Hunyuan-specific blocks to the same file as `HunyuanDiT2DModel`
I changed arg names to be more aligned with our transformer models and blocks. I also removed a bunch of functionalities that are not used in this implementation. Let me know if I did anything wrong or if any of the changes do not make sense!
feel free to just pick the commit and make any modifications on your PR.
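For readers following along, the `apply_rotary_emb` operation moved to `embeddings.py` can be sketched roughly as follows. This is an illustrative standalone version, not diffusers' exact implementation; the interleaved channel-pairing convention and argument names here are assumptions.

```python
import torch

def apply_rotary_emb(x, cos, sin):
    # Rotate interleaved channel pairs of x by per-position angles.
    # x: (..., dim) with dim even; cos/sin broadcastable to x[..., 0::2].
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

With `cos = 1, sin = 0` this is the identity; with `cos = 0, sin = 1` each channel pair is rotated by 90 degrees.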
> Our BertModel cannot work in FP16. So to make our pipeline FP16, we have to load only the transformer in FP16 and pass it in, i.e. `pipe = HunyuanDiTPipeline.from_pretrained("XCLiu/HunyuanDiT-0523", transformer=model, torch_dtype=torch.float16)`
>
> Why wouldn't it work, though? Could you provide more details here? In order for the pipeline to operate in `torch.float16`, all the components need to be in `torch.float16`, or we have to deal with it differently, like so: https://github.com/huggingface/diffusers/blob/b3d10d6d65a80593627c6738fbeded2f69b5129f/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L1264
>
> Additionally, could you see if this worked so that we can avoid `self.text_encoder.pooler.to_empty(device='cpu')`?
Hi Sayak, my server was down. I restarted the server, reinstalled the environment, and the model works with FP16 now... I have no idea why, but the current environment is:
torch==2.0.1
transformers==4.41.1
I guess the reason is that I'm using the latest transformers library now. @sayakpaul
Okay great. It seems like the FP16 problem and also the to_empty() problem (solution here) are solved now?
I pushed a new version. In this version:
- Checked and merged Yiyi's refactor in https://github.com/huggingface/diffusers/commit/b0e0da28c4d0f057824faacb23da6b06dd43a786 (PR https://github.com/huggingface/diffusers/pull/8310)
- Fixed the FP16 and `to_empty` problems.
- Polished the whole codebase following the new comments.
The new test file is updated in test_hunyuan_dit.py
Thank you for the help! Please review the new version @yiyixuxu @sayakpaul
Note:
(1) FP16 fix: update transformers to 4.41.1 or later.
(2) to_empty: https://github.com/huggingface/diffusers/pull/8240/files#r1617108976
from transformers import BertModel
bert_model = BertModel.from_pretrained("XCLiu/HunyuanDiT-0523", add_pooling_layer=True, subfolder="text_encoder")
pipe = HunyuanDiTPipeline.from_pretrained("XCLiu/HunyuanDiT-0523", text_encoder=bert_model, transformer=model, torch_dtype=torch.float32)
pipe.to('cuda')
pipe.save_pretrained("HunyuanDiT-ckpt")
del pipe
pipe = HunyuanDiTPipeline.from_pretrained("HunyuanDiT-ckpt")
pipe.to("cuda")
cc @DN6 for a final review - I left a few questions for you, I think they can be addressed in a follow-up PR too
@sayakpaul can you help look into, or provide guidance on, optimization? i.e. make sure all our optimization methods work on HunyuanDiT :)
> @sayakpaul can you help look into, or provide guidance on, optimization? i.e. make sure all our optimization methods work on HunyuanDiT :)

Sure, I will give it a look. Memory optimization-wise, do we want to add feed-forward chunking and QKV fusion? If so, I can add that in a separate commit. Or would you rather handle it? Let me know. Cc: @gnobitab too.
> Memory optimization-wise, do we want to add feed-forward chunking and QKV fusion? If so, I can add that in a separate commit
yes please!
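For context, the two memory optimizations discussed above can be sketched roughly as follows. This is an illustrative standalone version under assumed names; diffusers' actual implementations live in its transformer/attention modules and differ in detail.

```python
import torch

def chunked_feed_forward(ff, hidden_states, chunk_size, dim=1):
    # Run `ff` on slices of the sequence dimension to cap peak memory,
    # then concatenate the results back together.
    if hidden_states.shape[dim] % chunk_size != 0:
        raise ValueError("sequence length must be divisible by chunk_size")
    num_chunks = hidden_states.shape[dim] // chunk_size
    return torch.cat(
        [ff(chunk) for chunk in hidden_states.chunk(num_chunks, dim=dim)], dim=dim
    )

def fuse_qkv(q, k, v):
    # Fuse three nn.Linear projections into one, so Q, K, V come from a
    # single matmul instead of three.
    fused = torch.nn.Linear(q.in_features, 3 * q.out_features, bias=q.bias is not None)
    with torch.no_grad():
        fused.weight.copy_(torch.cat([q.weight, k.weight, v.weight]))
        if q.bias is not None:
            fused.bias.copy_(torch.cat([q.bias, k.bias, v.bias]))
    return fused
```

After fusion, `fused(x).chunk(3, dim=-1)` yields the same tensors as `q(x)`, `k(x)`, `v(x)`, and chunked feed-forward produces the same output as applying `ff` to the full sequence.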
thanks!! I think we can merge this soon
We have tests and docs left. Docs can be added in a separate PR if you need more time, but let's quickly add a test; you can use the pixart_sigma tests as a reference: https://github.com/huggingface/diffusers/tree/main/tests/pipelines/pixart_sigma
@yiyixuxu What kind of docs should I add?
I changed XCLiu/HunyuanDiT-0523 and simplified test_hunyuan_dit.py
@gnobitab
> What kind of docs should I add?

we can add a page here: https://github.com/huggingface/diffusers/tree/main/docs/source/en/api/pipelines. Here are some examples:
- pixart sigma: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/pixart_sigma.md
- dit: https://huggingface.co/docs/diffusers/api/pipelines/dit
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
merging now!
I left a bunch of to-dos in the comments too
for @sayakpaul
can you help with optimization? make sure torch.compile works and all the other meaningful optimization methods work here
for @gnobitab
- add doc
- add more tests if possible
- I left a few questions for you about the ada layer norm and the scheduler
- move the checkpoint to the official account.
- do we need a conversion script? cc @sayakpaul too. Is the diffusers checkpoint the main checkpoint that is going to be used by the community?
here is the script that works now for the checkpoint in my PR to your repo https://huggingface.co/XCLiu/HunyuanDiT-0523/discussions/2
# integration test (hunyuan dit)
import torch
from diffusers import HunyuanDiTPipeline
device = "cuda"
dtype = torch.float16
repo = "XCLiu/HunyuanDiT-0523"
pipe = HunyuanDiTPipeline.from_pretrained(repo, revision="refs/pr/2", torch_dtype=dtype)
pipe.enable_model_cpu_offload()
### NOTE: HunyuanDiT supports both Chinese and English inputs
prompt = "一个宇航员在骑马"
#prompt = "An astronaut riding a horse"
generator=torch.Generator(device="cuda").manual_seed(0)
image = pipe(height=1024, width=1024, prompt=prompt, generator=generator).images[0]
image.save("yiyi_test_out.png")
and this is the script I used to convert the current checkpoint in "XCLiu/HunyuanDiT-0523".
Basically I made these two changes:
- changed the module/folder names for the t5 text_encoder and tokenizer to `text_encoder_2` and `tokenizer_2`
- updated the state dict for the transformer because I refactored the embedding layers for the extra conditions
import torch
from huggingface_hub import hf_hub_download
from diffusers import HunyuanDiTPipeline, HunyuanDiT2DModel
from transformers import T5EncoderModel, T5Tokenizer
import safetensors.torch
device = "cuda"
dtype = torch.float32
repo = "XCLiu/HunyuanDiT-0523"
tokenizer_2 = T5Tokenizer.from_pretrained(repo, subfolder = "tokenizer_t5")
text_encoder_2 = T5EncoderModel.from_pretrained(repo, subfolder = "embedder_t5", torch_dtype=dtype)
model_config = HunyuanDiT2DModel.load_config("XCLiu/HunyuanDiT-0523", subfolder="transformer")
model = HunyuanDiT2DModel.from_config(model_config).to(device)
ckpt_path = hf_hub_download(
"XCLiu/HunyuanDiT-0523",
filename ="diffusion_pytorch_model.safetensors",
subfolder="transformer",
)
state_dict = safetensors.torch.load_file(ckpt_path)
prefix = "time_extra_emb."
# modules moved under the new `time_extra_emb.` prefix
# (`time_embedding.*` is also renamed to `timestep_embedder.*`)
module_renames = {
    "time_embedding.linear_1": "timestep_embedder.linear_1",
    "time_embedding.linear_2": "timestep_embedder.linear_2",
    "pooler.k_proj": "pooler.k_proj",
    "pooler.q_proj": "pooler.q_proj",
    "pooler.v_proj": "pooler.v_proj",
    "pooler.c_proj": "pooler.c_proj",
    "extra_embedder.linear_1": "extra_embedder.linear_1",
    "extra_embedder.linear_2": "extra_embedder.linear_2",
}
for old, new in module_renames.items():
    for suffix in ("weight", "bias"):
        state_dict[f"{prefix}{new}.{suffix}"] = state_dict.pop(f"{old}.{suffix}")
# parameters without a weight/bias pair
state_dict[f"{prefix}pooler.positional_embedding"] = state_dict.pop("pooler.positional_embedding")
state_dict[f"{prefix}style_embedder.weight"] = state_dict.pop("style_embedder.weight")
model.load_state_dict(state_dict)
model.to(dtype)
pipe = HunyuanDiTPipeline.from_pretrained(
repo,
tokenizer_2 = tokenizer_2,
text_encoder_2 = text_encoder_2,
transformer = model,
torch_dtype=dtype)
@yiyixuxu
Thanks for merging!
Reply to your TODOs:
- Doc added here: https://moon-ci-docs.huggingface.co/docs/diffusers/pr_8383/en/api/pipelines/hunyuandit
- I manually tested and verified the results. The current version has the same output as the original model. I will try to add more rigorous tests later.
- (1) AdaLayerNormShift: I asked the team members and they were following the practice in SDXL: see https://github.com/huggingface/diffusers/blob/413604405fddb4692a8e9a9a9fb6c353d22881ea/src/diffusers/models/resnet.py#L343 (L343 - L351). I notice the place of `self.norm` in HunyuanDiT is different from SDXL, but I think it is just a small mistake. As far as I can tell, it is not a scientific innovation and I suggest keeping it inside `hunyuan_transformer_2d.py`.
  (2) Scheduler: I tested several fast samplers. From my tests, I think it is safe to switch from `DDPMScheduler` to `DDIMScheduler`. Other schedulers seem to be fine, but I would like to leave that to the community.
- 4 and 5: I merged your PR in XCLiu/HunyuanDiT-0523 and saved a new checkpoint for the diffusers pipeline. We are moving the new checkpoint to the official account after it goes through some internal checks. The name will be Tencent-Hunyuan/HunyuanDiT-Diffusers. Let's change the example doc in the pipeline file when it's officially online.