How about forcing the first and last block on device when group offloading is used?
Is your feature request related to a problem? Please describe. When group offloading is enabled, the offload and onload cannot be streamed between steps, and this is a really time-consuming problem.
Describe the solution you'd like. Is it possible to add an option that forces the first and last block to stay on the device, to avoid the offload and onload?
@a-r-r-o-w Could you please give some help? Thanks so much.
I have tried to make the prefetch chain circular here: https://github.com/huggingface/diffusers/blob/v0.33.1/src/diffusers/hooks/group_offloading.py#L319 However, it does not seem to work.
Hi @seed93, forcing particular block(s), or other internal layers, to remain on the device should be possible. I'll try to prototype something later this week. I'd imagine the layers to keep on device will have to be specified via a regex, so that ModuleGroups of the remaining modules can be formed easily. Trying to prefetch the first layer from the last layer might also be possible 🤔
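For illustration, regex-based selection might look roughly like this (a hedged sketch; the stand-in model and the pattern are made up, not an existing diffusers API):

```python
import re
import torch.nn as nn

# Stand-in model: a 40-block stack, as in many DiT-style transformers.
model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(40)])

# Hypothetical pattern selecting the first and last block by module name.
keep_pattern = re.compile(r"^(0|39)$")
pinned_names = [name for name, _ in model.named_modules() if keep_pattern.fullmatch(name)]
print(pinned_names)  # ['0', '39']
```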
Thank you so much. I tried a circular prefetch, as described in my last comment, but it does not seem to work. I checked with nsys and the memcpy between steps still exists.
As the image below shows, the offload and onload were captured by the pre_forward and post_forward of GroupOffloadingHook.
```diff
diff --git a/src/diffusers/hooks/group_offloading.py b/src/diffusers/hooks/group_offloading.py
index 36abf3c41..d1397e237 100644
--- a/src/diffusers/hooks/group_offloading.py
+++ b/src/diffusers/hooks/group_offloading.py
@@ -435,11 +435,12 @@ class LazyPrefetchGroupOffloadingHook(ModelHook):
         base_module_group_offloading_hook.next_group = group_offloading_hooks[0].group
         base_module_group_offloading_hook.next_group.onload_self = False
 
-        for i in range(num_executed - 1):
+        for i in range(num_executed):
             name1, _ = self.execution_order[i]
-            name2, _ = self.execution_order[i + 1]
+            next_i = (i + 1) % num_executed
+            name2, _ = self.execution_order[next_i]
             logger.debug(f"Applying lazy prefetch group offloading from {name1} to {name2}")
-            group_offloading_hooks[i].next_group = group_offloading_hooks[i + 1].group
+            group_offloading_hooks[i].next_group = group_offloading_hooks[next_i].group
             group_offloading_hooks[i].next_group.onload_self = False
 
         return output
```
We can flag the first and last layers so that they are not offloaded in the post_forward method. The only issue is that we can only do that through _apply_lazy_group_offloading_hook, to be sure which layer is first and which is last. But still, this would be a net gain.
Another way would be to provide the first and last module names as an argument, but then that becomes a bit tedious to use.
I can write the code for the first idea if that path works for you.
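Roughly, the skip could look like this (a minimal sketch, not actual diffusers code; pin_on_device is a hypothetical flag on the group):

```python
# Sketch of GroupOffloadingHook.post_forward with a skip for pinned groups.
def post_forward(self, module, output):
    if getattr(self.group, "pin_on_device", False):
        # Pinned group: weights stay on the onload device between steps.
        return output
    self.group.offload_()  # existing path: stream weights back to the offload device
    return output
```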
I would like to work on this as part of the Diffusers MVP program; I'll take a deeper look at the issue and get back with a proposal. Thank you :)
Hi Diffusers team,
I’d like to work on this feature as part of the Diffusers MVP program.
The idea is to add a new flag/config to enable_group_offload, e.g. pin_first_last. When this flag is enabled, all group offloading for that module would be routed through the lazy path (_apply_lazy_group_offloading_hook), which already tracks runtime execution order via execution_order. After the first forward, we can identify the first and last executed groups and mark those groups so they stay on the onload device instead of being offloaded each step.
This avoids relying on container/module declaration order and works even for architectures where blocks are executed in a different order than they're defined (e.g. the separate blocks / vace_blocks stacks in transformer_wan_vace.py).
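A rough sketch of the marking step (pin_on_device and the helper name are placeholders, not merged API):

```python
def _pin_first_last(group_offloading_hooks):
    # Hypothetical helper called at the end of
    # LazyPrefetchGroupOffloadingHook.post_forward, once execution_order has
    # been recorded on the first forward pass. `pin_on_device` is an assumed
    # flag that GroupOffloadingHook.post_forward would check to skip offload.
    if len(group_offloading_hooks) < 2:
        return
    for hook in (group_offloading_hooks[0], group_offloading_hooks[-1]):
        hook.group.pin_on_device = True
```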
If this direction sounds good, I’m happy to open a PR for block_level group offloading first, and I’d be glad to coordinate with the previous contributor who mentioned this approach if they’re interested in collaborating :).
Excellent discussion here. I think bconstantine's pin_first_last approach is solid, but I'd propose we go one step further for maximum flexibility: what if we made this granular and composable? Instead of a single boolean flag, allow users to specify a pin_groups parameter that accepts:
- "first_last" (bconstantine's approach )
- "all" (keep everything on device )
- A callable that receives the execution_order and returns indices to pin
This way, the simple case (pin_groups="first_last") works out of the box for 90% of users, power users can write custom logic for architectures with unusual execution patterns (like the transformer_wan_vace.py case mentioned above), and debugging becomes trivial: just set pin_groups="all" to verify that offload overhead isn't the problem.

The implementation would live in _apply_lazy_group_offloading_hook since it already has visibility into the execution_order; we'd just need to mark pinned groups with a flag that the post_forward hook respects, roughly as sketched below. The circular prefetch attempt failed because it only prefetches; it doesn't prevent the offload. We need to actually skip the offload entirely for pinned groups, not just try to hide the latency.
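A sketch of how pin_groups could be resolved once the execution order is known (all names here are part of the proposal, not merged code):

```python
def resolve_pinned_indices(pin_groups, execution_order):
    # Map a pin_groups value to the set of execution-order indices to pin.
    n = len(execution_order)
    if pin_groups is None:
        return set()
    if pin_groups == "first_last":
        return {0, n - 1} if n else set()
    if pin_groups == "all":
        return set(range(n))
    if callable(pin_groups):
        # Power-user path: custom logic over the observed execution order.
        return set(pin_groups(execution_order))
    raise ValueError(f"Unsupported pin_groups value: {pin_groups!r}")
```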
Hey @sayakpaul, can I pick this up for the MVP program?
@bconstantine let's do this. @Aki-07 I would encourage collaborating with @bconstantine if possible.
Also let's remember: https://github.com/huggingface/diffusers/pull/12692
Sure, will collaborate with @bconstantine
Hey @sayakpaul, @Aki-07 and I have opened a fix for this in PR #12747.
Summary:
- We add an optional `pin_groups` argument to `enable_group_offload` at both the model and pipeline level, which expects one of the following values:
  - `None` (default): offload all groups to CPU (no pinned groups)
  - `"all"`: pin all executed groups to the accelerator device
  - `"first_last"`: pin only the first and last executed groups (based on execution order)
  - a callable with user-defined pinning logic. We support three signatures:
    - `fn(submodule)`
    - `fn(layer_name, submodule)`
    - `fn(layer_name, submodule, layer_idx)`, where `layer_name`/`submodule` come from the first-pass execution trace (`named_modules` + runtime order), and `layer_idx` is the index over the filtered executed modules that have parameters
- We track the `execution_order` at runtime by using the lazy hook.
- We implement tests for each of the cases above, as well as invalid cases: a user-provided callable with an invalid signature, and an invalid string value for `pin_groups`.
- We implement the block_level solution and leave the leaf_level path untouched for now, along with the standalone-layers problem mentioned in #12692, to prevent solution scope conflict, as we notice that PR is still active.
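For reference, usage would look roughly like this (the model id is a placeholder and the final merged API may differ from this sketch):

```python
import torch
from diffusers import DiffusionPipeline

# "some/model-id" is a placeholder; assumes a transformer-based pipeline.
pipe = DiffusionPipeline.from_pretrained("some/model-id", torch_dtype=torch.bfloat16)
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
    use_stream=True,
    pin_groups="first_last",  # new argument from this PR
)
```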
Would love to hear your feedback, thanks!
Thank you! We will review soonish. Apologies for the delay on our end.