
Support dynamically loading/unloading loras with group offloading

Open a-r-r-o-w opened this issue 8 months ago • 3 comments

Fixes #11791.

reproducer
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video
from diffusers.hooks import apply_group_offloading
from diffusers.utils.logging import set_verbosity_debug

set_verbosity_debug()


model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.float16)
pipe.vae.enable_tiling()
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe.load_lora_weights("a-r-r-o-w/HunyuanVideo-tuxemons")
num_inference_steps = 20

apply_group_offloading(
    pipe.transformer, onload_device, offload_device, offload_type="leaf_level", use_stream=True
)
for module in (pipe.text_encoder, pipe.text_encoder_2, pipe.vae):
    module.to(onload_device)

output = pipe(
    prompt="Style of snomexut, a cat-like Tuxemon creature walks in alien-world grass, and observes its surroundings.",
    height=768,
    width=768,
    num_frames=33,
    num_inference_steps=num_inference_steps,
    generator=torch.Generator().manual_seed(73),
).frames[0]
export_to_video(output, "output.mp4", fps=15)

pipe.unload_lora_weights()

output_1 = pipe(
    prompt="Style of snomexut, a cat-like Tuxemon creature walks in alien-world grass, and observes its surroundings.",
    height=768,
    width=768,
    num_frames=33,
    num_inference_steps=num_inference_steps,
    generator=torch.Generator().manual_seed(73),
).frames[0]
export_to_video(output_1, "output2.mp4", fps=15)

pipe.load_lora_weights("a-r-r-o-w/HunyuanVideo-tuxemons")

output_2 = pipe(
    prompt="Style of snomexut, a cat-like Tuxemon creature walks in alien-world grass, and observes its surroundings.",
    height=768,
    width=768,
    num_frames=33,
    num_inference_steps=num_inference_steps,
    generator=torch.Generator().manual_seed(73),
).frames[0]
export_to_video(output_2, "output3.mp4", fps=15)

a-r-r-o-w avatar Jun 25 '25 07:06 a-r-r-o-w


cc @zhangvia I think this should fix the issues you were facing. Could you test? Thanks 🤗

a-r-r-o-w avatar Jun 25 '25 08:06 a-r-r-o-w

And maybe we could include disabling and enabling group offloading in the existing func_optionally*() function. But no strong opinions.

Yep, done!

a-r-r-o-w avatar Jun 25 '25 09:06 a-r-r-o-w

cc @zhangvia I think this should fix the issues you were facing. Could you test? Thanks 🤗

Thanks for the quick fix! I've confirmed it.

zhangvia avatar Jun 26 '25 02:06 zhangvia

A bit of a problem with the tests that weren't caught until now...

Group offloading with streams is limited in what it can do. If the same layer is invoked twice in the same parent layer's forward, the prefetching logic becomes completely incorrect. This is the case with:

  • CogView4 - invokes the same MLP on both hidden_states and encoder_hidden_states in succession
  • CogVideoX - invokes the same layernorm on both hidden_states and encoder_hidden_states in succession
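The failure mode above can be sketched as follows (a minimal hypothetical module, not the actual CogView4/CogVideoX code): because the same submodule runs twice inside one forward, any prefetching heuristic keyed on "the next layer to be invoked" will onload/offload the wrong module on the second call.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # Hypothetical block mirroring the problematic pattern: the SAME
    # submodule (self.mlp) is invoked twice within a single forward pass,
    # once for hidden_states and once for encoder_hidden_states.
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Linear(dim, dim)

    def forward(self, hidden_states, encoder_hidden_states):
        hidden_states = self.mlp(hidden_states)
        # Second invocation of the same layer: a prefetcher that assumed
        # each layer fires once per forward would already have offloaded it.
        encoder_hidden_states = self.mlp(encoder_hidden_states)
        return hidden_states, encoder_hidden_states
```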

One option to make it work is to concatenate inputs across sequence dim and then split again. This will, however, incur some extra perf cost because concatenation/split is not free.
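A minimal sketch of this first option (the helper name `fused_call` is made up for illustration): fuse both streams along the sequence dimension so the shared layer fires exactly once, then split the output back.

```python
import torch

def fused_call(layer, hidden_states, encoder_hidden_states):
    # Concatenate both data streams along the sequence dim (dim=1) so the
    # shared layer is invoked only once per forward, then split the result.
    # The extra cat/split is the perf cost mentioned above.
    seq_len = hidden_states.shape[1]
    fused = torch.cat([hidden_states, encoder_hidden_states], dim=1)
    fused = layer(fused)
    return fused[:, :seq_len], fused[:, seq_len:]
```

This is only equivalent for layers that act per-token (e.g. pointwise MLPs or layernorms over the channel dim), which is the case for the modules listed above.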

Another option is creating a separate layer for each data stream and sharing the weights. I don't think this should incur memory overhead since the data reference will be same for both layers, but need to test to be sure.
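The second option could look roughly like this (a sketch with a hypothetical helper, assuming plain `nn.Linear` layers): a distinct module object per data stream, so the prefetcher sees two separate layers, while the underlying parameters are shared.

```python
import torch.nn as nn

def make_shared_twin(layer: nn.Linear) -> nn.Linear:
    # Create a second module object for the other data stream. Reassigning
    # the Parameter objects makes both modules point at the same storage,
    # so no duplicate weight memory is kept alive.
    twin = nn.Linear(layer.in_features, layer.out_features, bias=layer.bias is not None)
    twin.weight = layer.weight  # shared nn.Parameter, same underlying tensor
    if layer.bias is not None:
        twin.bias = layer.bias
    return twin
```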

Any other ideas are welcome. If these don't sound good, I propose we skip the tests for now and wait for the group offloading logic to become more mature/stable. Other than that, the PR looks good to merge to me.

a-r-r-o-w avatar Jun 27 '25 14:06 a-r-r-o-w

Thanks for investigating these and also for proposing the potential solutions.

On the surface, I would say we evaluate both approaches and then decide. However, the two models you mentioned probably have limited usage at least with group offloading for now. So, xfailing them is the more reasonable approach. WDYT?

sayakpaul avatar Jun 27 '25 14:06 sayakpaul

I tested the first approach as it's a super simple change. The performance penalty is not noticeable end-to-end but only shows up at a small microsecond scale. I don't think it really matters because, like you mentioned, they probably have very limited usage in the context of group offloading.

For now, I'll specialize the tests for CogView4/CogVideoX by parameterizing with only non-stream tests instead of xfailing them, and add a note. Sounds good?

a-r-r-o-w avatar Jun 27 '25 14:06 a-r-r-o-w

For now, I'll specialize the tests for CogView4/CogVideoX by parameterizing with only non-stream tests instead of xfailing them, and add a note. Sounds good?

I am good!

sayakpaul avatar Jun 27 '25 14:06 sayakpaul

Updated the tests. There are some gymnastics involved in skipping tests marked with parameterized, because it seems they can't be overridden or specialized in child classes.

a-r-r-o-w avatar Jun 27 '25 15:06 a-r-r-o-w