Tencent Hunyuan Team: add HunyuanDiT related updates
This PR did the following things:
- Created `HunyuanDiTPipeline` in `src/diffusers/pipelines/hunyuandit/` and `HunyuanDiT2DModel` in `src/diffusers/models/transformers/`.
- To support `HunyuanDiT2DModel`, added `HunyuanDiTBlock` and helper functions in `src/diffusers/models/attention.py`.
- Uploaded the safetensors model to my Hugging Face: XCLiu/HunyuanDiT-0523.
- Tested that the output of the migrated model+code is the same as our repo (https://github.com/Tencent/HunyuanDiT). Have tested different resolutions and batch sizes > 1 and made sure they work correctly.
In this branch, you can run HunyuanDiT in FP32 with:
python3 test_hunyuan_dit.py
which includes the following code:
import torch
from diffusers import HunyuanDiTPipeline
pipe = HunyuanDiTPipeline.from_pretrained("XCLiu/HunyuanDiT-0523", torch_dtype=torch.float32)
pipe.to('cuda')
### NOTE: HunyuanDiT supports both Chinese and English inputs
prompt = "一个宇航员在骑马"
#prompt = "An astronaut riding a horse"
image = pipe(height=1024, width=1024, prompt=prompt).images[0]
image.save("./img.png")
Dependency:
maybe the timm package
TODO list:
- FP16 support: I didn't change the parameter `use_fp16` in `HunyuanDiTPipeline.__call__()`. The reason is `BertModel` does not support FP16 quantization. In our repo we only quantize the diffusion transformer to FP16. I guess there must be some smart way to support FP16.
- Simplify and refactor the `HunyuanDiTBlock`-related code in `src/diffusers/pipelines/hunyuandit/pipeline_hunyuandit.py`.
- Refactor the pipeline and HunyuanDiT2DModel to diffusers style.
- doc
Thank you so much! I'll be there and help with everything.
cc: @sayakpaul @yiyixuxu
Hi:
I removed HunyuanDiTAttention and HunyuanDiTCrossAttention to use our Attention class with a HunyuanAttnProcessor2_0 attention processor instead.
Feel free to test out the PR branch and cherry-pick this commit https://github.com/huggingface/diffusers/pull/8265/commits/3f85b1d257a9184b38772fa54997d334da1e1fae if the results look OK to you.
I included a testing script here https://github.com/huggingface/diffusers/pull/8265#issue-2314186991
Hi, I did the following things:
- cleaned the pipeline/transformer code according to Sayak's suggestions.
- switched to yiyi's refactored attention.
- new test code in `test_hunyuan_dit.py`: based on yiyi's test code, switched norm2 and norm3, fixed the generator to seed 0. The image should be:
For now, I will not change the remote state_dict XCLiu/HunyuanDiT-0523. I will update it after we finish everything.
Please review and comment, thx!
@sayakpaul @yiyixuxu
I made some improvements according to Sayak and Yiyi's suggestions.
Several additional problems:
- Our BertModel cannot work in FP16. So to make our pipeline FP16, we have to load only the transformer in FP16 and pass it in, i.e. `pipe = HunyuanDiTPipeline.from_pretrained("XCLiu/HunyuanDiT-0523", transformer=model, torch_dtype=torch.float16)`, instead of just casting the whole pipeline to FP16.
- I left some comments in the above conversations regarding the magic numbers. Please help me with the design choice.
> Our BertModel cannot work in FP16. So to make our pipeline FP16, we have to load only the transformer in FP16 and pass it in, i.e. `pipe = HunyuanDiTPipeline.from_pretrained("XCLiu/HunyuanDiT-0523", transformer=model, torch_dtype=torch.float16)`

Why wouldn't it work, though? Could you provide more details here? In order for the pipeline to operate in `torch.float16`, all the components need to be in `torch.float16`, or we have to deal with it differently, like so:
https://github.com/huggingface/diffusers/blob/b3d10d6d65a80593627c6738fbeded2f69b5129f/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L1264
Additionally, could you see if this worked so that we can avoid `self.text_encoder.pooler.to_empty(device='cpu')`?
I refactored the model here: https://github.com/huggingface/diffusers/pull/8310. You can see the changes I made in this commit: https://github.com/huggingface/diffusers/pull/8310/commits/b0e0da28c4d0f057824faacb23da6b06dd43a786
I did the things below:
- refactored `HunyuanDiT2DModel`: removed `HunyuanDiTPatchEmbed` and `HunyuanDiTTimestepEmbedder`
- refactored `apply_rotary_emb`: moved all the `get_2d_rotary_pos_embed_*` functions and `apply_rotary_emb` to `embeddings.py`
- moved `HunyuanDiTBlock` and other Hunyuan-specific blocks to the same file as `HunyuanDiT2DModel`
I changed arg names to be more aligned with our transformer models and blocks. I also removed a bunch of functionalities that are not used in this implementation. Let me know if I did anything wrong or if any of the changes do not make sense!
feel free to just pick the commit and make any modifications on your PR.
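For readers following along, the `apply_rotary_emb` operation moved to `embeddings.py` can be sketched roughly as follows. This is an illustrative standalone version, not diffusers' exact implementation; the interleaved channel-pairing convention and argument names here are assumptions.

```python
import torch

def apply_rotary_emb(x, cos, sin):
    # Rotate interleaved channel pairs of x by per-position angles.
    # x: (..., dim) with dim even; cos/sin broadcastable to x[..., 0::2].
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

With `cos = 1, sin = 0` this is the identity; with `cos = 0, sin = 1` each channel pair is rotated by 90 degrees.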
> Our BertModel cannot work in FP16. So to make our pipeline FP16, we have to load only the transformer in FP16 and pass it in, i.e. `pipe = HunyuanDiTPipeline.from_pretrained("XCLiu/HunyuanDiT-0523", transformer=model, torch_dtype=torch.float16)`
>
> Why wouldn't it work, though? Could you provide more details here? In order for the pipeline to operate in `torch.float16`, all the components need to be in `torch.float16`, or we have to deal with it differently, like so: https://github.com/huggingface/diffusers/blob/b3d10d6d65a80593627c6738fbeded2f69b5129f/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L1264
>
> Additionally, could you see if this worked so that we can avoid `self.text_encoder.pooler.to_empty(device='cpu')`?
Hi Sayak, my server was down. I restarted the server, reinstalled the environment, and the model works with FP16 now... I have no idea why, but the current environment is:
torch==2.0.1
transformers==4.41.1
I guess the reason is that I'm using the latest transformers library now. @sayakpaul
Okay great. It seems like the FP16 problem and also the to_empty() problem (solution here) are solved now?
I pushed a new version. In this version:
- Checked and merged Yiyi's refactor in https://github.com/huggingface/diffusers/commit/b0e0da28c4d0f057824faacb23da6b06dd43a786 (PR https://github.com/huggingface/diffusers/pull/8310)
- Fixed the FP16 and `to_empty` problems.
- Polished the whole codebase following the new comments.
The new test file is updated in test_hunyuan_dit.py
Thank you for the help! Please review the new version @yiyixuxu @sayakpaul
Note:
(1) FP16 fix: update transformers to 4.41.1 or later.
(2) to_empty: https://github.com/huggingface/diffusers/pull/8240/files#r1617108976
from transformers import BertModel
bert_model = BertModel.from_pretrained("XCLiu/HunyuanDiT-0523", add_pooling_layer=True, subfolder="text_encoder")
pipe = HunyuanDiTPipeline.from_pretrained("XCLiu/HunyuanDiT-0523", text_encoder=bert_model, transformer=model, torch_dtype=torch.float32)
pipe.to('cuda')
pipe.save_pretrained("HunyuanDiT-ckpt")
del pipe
pipe = HunyuanDiTPipeline.from_pretrained("HunyuanDiT-ckpt")
pipe.to("cuda")
cc @DN6 for a final review - I left a few questions for you, I think they can be addressed in a follow-up PR too
@sayakpaul can you help look into, or provide guidance on, optimization? i.e. make sure all our optimization methods work on HunyuanDiT :)
> @sayakpaul can you help look into, or provide guidance on, optimization? i.e. make sure all our optimization methods work on HunyuanDiT :)

Sure, I will give it a look. Memory optimization-wise, do we want to add feed-forward chunking and QKV fusion? If so, I can add that in a separate commit. Or would you rather handle it? Let me know. Cc: @gnobitab too.
> Memory optimization-wise, do we want to add feed-forward chunking and QKV fusion? If so, I can add that in a separate commit
yes please!
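For context, the two memory optimizations discussed above can be sketched roughly as follows. This is an illustrative standalone version under assumed names; diffusers' actual implementations live in its transformer/attention modules and differ in detail.

```python
import torch

def chunked_feed_forward(ff, hidden_states, chunk_size, dim=1):
    # Run `ff` on slices of the sequence dimension to cap peak memory,
    # then concatenate the results back together.
    if hidden_states.shape[dim] % chunk_size != 0:
        raise ValueError("sequence length must be divisible by chunk_size")
    num_chunks = hidden_states.shape[dim] // chunk_size
    return torch.cat(
        [ff(chunk) for chunk in hidden_states.chunk(num_chunks, dim=dim)], dim=dim
    )

def fuse_qkv(q, k, v):
    # Fuse three nn.Linear projections into one, so Q, K, V come from a
    # single matmul instead of three.
    fused = torch.nn.Linear(q.in_features, 3 * q.out_features, bias=q.bias is not None)
    with torch.no_grad():
        fused.weight.copy_(torch.cat([q.weight, k.weight, v.weight]))
        if q.bias is not None:
            fused.bias.copy_(torch.cat([q.bias, k.bias, v.bias]))
    return fused
```

After fusion, `fused(x).chunk(3, dim=-1)` yields the same tensors as `q(x)`, `k(x)`, `v(x)`, and chunked feed-forward produces the same output as applying `ff` to the full sequence.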
thanks!! I think we can merge this soon
We have tests and docs left. Docs can be added in a separate PR if you need more time, but let's quickly add a test; you can use the pixart_sigma tests as a reference: https://github.com/huggingface/diffusers/tree/main/tests/pipelines/pixart_sigma
@yiyixuxu What kind of docs should I add?
I changed XCLiu/HunyuanDiT-0523 and simplified test_hunyuan_dit.py
@gnobitab
> What kind of docs should I add?

we can add a page here: https://github.com/huggingface/diffusers/tree/main/docs/source/en/api/pipelines. Here are some examples:
- pixart sigma: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/pixart_sigma.md
- dit: https://huggingface.co/docs/diffusers/api/pipelines/dit
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
merging now!
I left a bunch of to-dos in the comments too
for @sayakpaul
can you help with optimization? make sure torch.compile works and all the other meaningful optimization methods work here
for @gnobitab
- add doc
- add more tests if possible
- I left a few questions for you about the ada layer norm and the scheduler
- move the checkpoint to the official account.
- do we need a conversion script? cc @sayakpaul too. Is the diffusers checkpoint the main checkpoint that is going to be used by the community?
here is the script that works now for the checkpoint in my PR to your repo https://huggingface.co/XCLiu/HunyuanDiT-0523/discussions/2
# integration test (hunyuan dit)
import torch
from diffusers import HunyuanDiTPipeline
device = "cuda"
dtype = torch.float16
repo = "XCLiu/HunyuanDiT-0523"
pipe = HunyuanDiTPipeline.from_pretrained(repo, revision="refs/pr/2", torch_dtype=dtype)
pipe.enable_model_cpu_offload()
### NOTE: HunyuanDiT supports both Chinese and English inputs
prompt = "一个宇航员在骑马"
#prompt = "An astronaut riding a horse"
generator=torch.Generator(device="cuda").manual_seed(0)
image = pipe(height=1024, width=1024, prompt=prompt, generator=generator).images[0]
image.save("yiyi_test_out.png")
and this is the script I used to convert the current checkpoint in "XCLiu/HunyuanDiT-0523".
Basically I made these two changes:
- changed the module/folder names for the t5 text_encoder and tokenizer to `text_encoder_2` and `tokenizer_2`
- updated the state dict for the transformer because I refactored the embedding layers for the extra conditions
import torch
from huggingface_hub import hf_hub_download
from diffusers import HunyuanDiTPipeline, HunyuanDiT2DModel
from transformers import T5EncoderModel, T5Tokenizer
import safetensors.torch
device = "cuda"
dtype = torch.float32
repo = "XCLiu/HunyuanDiT-0523"
tokenizer_2 = T5Tokenizer.from_pretrained(repo, subfolder = "tokenizer_t5")
text_encoder_2 = T5EncoderModel.from_pretrained(repo, subfolder = "embedder_t5", torch_dtype=dtype)
model_config = HunyuanDiT2DModel.load_config("XCLiu/HunyuanDiT-0523", subfolder="transformer")
model = HunyuanDiT2DModel.from_config(model_config).to(device)
ckpt_path = hf_hub_download(
"XCLiu/HunyuanDiT-0523",
filename ="diffusion_pytorch_model.safetensors",
subfolder="transformer",
)
state_dict = safetensors.torch.load_file(ckpt_path)
prefix = "time_extra_emb."
# modules moved under the new `time_extra_emb.` prefix
# (`time_embedding.*` is also renamed to `timestep_embedder.*`)
module_renames = {
    "time_embedding.linear_1": "timestep_embedder.linear_1",
    "time_embedding.linear_2": "timestep_embedder.linear_2",
    "pooler.k_proj": "pooler.k_proj",
    "pooler.q_proj": "pooler.q_proj",
    "pooler.v_proj": "pooler.v_proj",
    "pooler.c_proj": "pooler.c_proj",
    "extra_embedder.linear_1": "extra_embedder.linear_1",
    "extra_embedder.linear_2": "extra_embedder.linear_2",
}
for old, new in module_renames.items():
    for suffix in ("weight", "bias"):
        state_dict[f"{prefix}{new}.{suffix}"] = state_dict.pop(f"{old}.{suffix}")
# parameters without a weight/bias pair
state_dict[f"{prefix}pooler.positional_embedding"] = state_dict.pop("pooler.positional_embedding")
state_dict[f"{prefix}style_embedder.weight"] = state_dict.pop("style_embedder.weight")
model.load_state_dict(state_dict)
model.to(dtype)
pipe = HunyuanDiTPipeline.from_pretrained(
repo,
tokenizer_2 = tokenizer_2,
text_encoder_2 = text_encoder_2,
transformer = model,
torch_dtype=dtype)
@yiyixuxu
Thanks for merging!
Reply to your TODOs:
- Doc added here: https://moon-ci-docs.huggingface.co/docs/diffusers/pr_8383/en/api/pipelines/hunyuandit
- I manually tested and verified the results. The current version has the same output as the original model. I will try to add more rigorous tests later.
- (1) AdaLayerNormShift: I asked the team members and they were following the practice in SDXL: see https://github.com/huggingface/diffusers/blob/413604405fddb4692a8e9a9a9fb6c353d22881ea/src/diffusers/models/resnet.py#L343 (L343 - L351). I notice the place of `self.norm` in HunyuanDiT is different from SDXL, but I think it is just a small mistake. As far as I can tell, it is not a scientific innovation and I suggest keeping it inside `hunyuan_transformer_2d.py`.
  (2) Scheduler: I tested several fast samplers. From my tests, I think it is safe to switch from `DDPMScheduler` to `DDIMScheduler`. Other schedulers seem to be fine, but I would like to leave that to the community.
- 4 and 5: I merged your PR in XCLiu/HunyuanDiT-0523 and saved a new checkpoint for the diffusers pipeline. We are moving the new checkpoint to the official account after it goes through some internal checks. The name will be Tencent-Hunyuan/HunyuanDiT-Diffusers. Let's change the example doc in the pipeline file when it's officially online.