
Clarification on mutually exclusive use of self-attention and cross-attention in DiffusionModelUNetMaisi

Open danielemolino opened this issue 7 months ago • 4 comments

Hi MONAI team,

While reading through the implementation of DiffusionModelUNetMaisi, I noticed the following logic for enabling attention at each level:

with_attn = attention_levels[i] and not with_conditioning        # self-attention only when conditioning is disabled
with_cross_attn = attention_levels[i] and with_conditioning      # cross-attention only when conditioning is enabled

This effectively means that self-attention is never used when the model is in conditioning mode (with_conditioning=True), even if attention_levels[i] is True.

Is this behavior intentional?

In other diffusion-based architectures such as Stable Diffusion, it is common practice to enable both self-attention and cross-attention simultaneously within the same layers.
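
For reference, here is a minimal sketch of that pattern in plain PyTorch (illustrative only; the class name and layer choices are mine, not taken from MONAI or Stable Diffusion):

import torch
import torch.nn as nn

class SelfThenCrossAttnBlock(nn.Module):
    """Hypothetical block: self-attention over the image tokens, then cross-attention to the conditioning."""

    def __init__(self, channels: int, context_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.self_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.cross_attn = nn.MultiheadAttention(
            channels, num_heads, kdim=context_dim, vdim=context_dim, batch_first=True
        )

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, channels) flattened spatial features; context: (batch, seq, context_dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]                   # self-attention
        h = self.norm2(x)
        x = x + self.cross_attn(h, context, context, need_weights=False)[0]      # cross-attention
        return x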

Would it be acceptable, or even recommended, to modify the logic to allow both mechanisms in parallel, as sketched below? Or is there a specific reason this mutual exclusivity was enforced?
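
Concretely, I have something along these lines in mind (a rough sketch keeping the existing variable names):

with_attn = attention_levels[i]                                  # self-attention wherever attention is enabled
with_cross_attn = attention_levels[i] and with_conditioning      # cross-attention only in conditioning mode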

Looking forward to your insights, and thank you for the great work on this model!

Best regards, Daniele Molino

danielemolino avatar Jun 05 '25 09:06 danielemolino

Any news?

danielemolino avatar Jun 26 '25 15:06 danielemolino

Hi @dongyang0122 @guopengf I think this is code you've worked on recently, would you have any insights here? Thanks!

ericspod avatar Jul 10 '25 17:07 ericspod

The DiffusionModelUNetMaisi implementation in MONAI is largely inspired by the references below. We did not explore this aspect in depth during development, so we would appreciate it if you could let us know whether enabling both attention mechanisms improves performance for your use case. Thank you.

https://github.com/Project-MONAI/GenerativeModels/blob/main/generative/networks/nets/diffusion_model_unet.py#L1787-L1788
https://github.com/Project-MONAI/MONAI/blob/dev/monai/networks/nets/diffusion_model_unet.py#L1643-L1644

dongyang0122 avatar Jul 22 '25 22:07 dongyang0122

Thank you for your clarification and for sharing the references. I manually implemented the option to enable both self- and cross-attention within the DiffusionModelUNetMaisi architecture and ran several experiments. In my case, I did not observe any clear performance improvement, especially considering the noticeable increase in computational cost.

danielemolino avatar Oct 14 '25 10:10 danielemolino