Clarification on mutually exclusive use of self-attention and cross-attention in DiffusionModelUNetMaisi
Hi MONAI team,
While reading through the implementation of DiffusionModelUNetMaisi, I noticed the following logic for enabling attention at each level:
```python
with_attn = attention_levels[i] and not with_conditioning
with_cross_attn = attention_levels[i] and with_conditioning
```
This effectively means that self-attention is never used when the model is in conditioning mode (with_conditioning=True), even if attention_levels[i] is True.
Is this behavior intentional?
In other diffusion-based architectures such as Stable Diffusion, it is common practice to enable both self-attention and cross-attention simultaneously within the same layers.
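For reference, those architectures typically apply self-attention and then cross-attention inside the same transformer block, roughly like the sketch below (a generic PyTorch illustration, not the MONAI/MAISI code; the class and parameter names are hypothetical):

```python
import torch
import torch.nn as nn


class SelfAndCrossAttnBlock(nn.Module):
    """Illustrative block combining self- and cross-attention, in the style of
    Stable Diffusion's transformer blocks (hypothetical names, not MONAI API)."""

    def __init__(self, dim: int, context_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(
            dim, num_heads, kdim=context_dim, vdim=context_dim, batch_first=True
        )
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Self-attention: image tokens attend to each other.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: image tokens attend to the conditioning tokens.
        h = self.norm2(x)
        x = x + self.cross_attn(h, context, context, need_weights=False)[0]
        # Position-wise feed-forward.
        return x + self.ff(self.norm3(x))
```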
Would it be acceptable, or even recommended, to modify the logic as follows to allow both mechanisms in parallel? Or is there a specific reason this mutual exclusivity was enforced?
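Concretely, I am thinking of something along these lines (just a sketch of the flag logic; I have not checked whether the downstream block constructors accept both flags being True at the same time):

```python
# Hypothetical change (not tested): keep self-attention wherever
# attention_levels[i] is True, and additionally enable cross-attention
# when the model is built with conditioning.
with_attn = attention_levels[i]
with_cross_attn = attention_levels[i] and with_conditioning
```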
Looking forward to your insights, and thank you for the great work on this model!
Best regards, Daniele Molino
Any news?
Hi @dongyang0122 @guopengf I think this is code you've worked on recently, would you have any insights here? Thanks!
The DiffusionModelUNetMaisi implementation in MONAI is largely inspired by the following references. We did not explore this aspect in depth during development. We would appreciate it if you could let us know whether enabling both settings improves performance for your use case. Thank you.
https://github.com/Project-MONAI/GenerativeModels/blob/main/generative/networks/nets/diffusion_model_unet.py#L1787-L1788
https://github.com/Project-MONAI/MONAI/blob/dev/monai/networks/nets/diffusion_model_unet.py#L1643-L1644
Thank you for the clarification and for sharing the references. I manually implemented the option to enable both self- and cross-attention within the DiffusionModelUNetMaisi architecture and ran several experiments. In my case, I did not observe any clear performance improvement, particularly when weighed against the noticeable increase in computational cost.