[Question] Shouldn't the inputs for memory_efficient_attention in CrossAttention have the [B, sequence length, Heads, embedding size] format?
According to the xformers documentation (https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.memory_efficient_attention ):

> Input tensors must be in format [B, M, H, K], where B is the batch size, M the sequence length, H the number of heads, and K the embedding size per head. If inputs have dimension 3, it is assumed that the dimensions are [B, M, K] and H=1.
But in the CrossAttention class https://github.com/huggingface/diffusers/blob/086c7f9ea8f0fde5c62e52289604ec5b178da207/src/diffusers/models/attention.py#L584 the inputs have (batch_size // head_size, seq_len, dim * head_size) format, from https://github.com/huggingface/diffusers/blob/086c7f9ea8f0fde5c62e52289604ec5b178da207/src/diffusers/models/attention.py#L557
Is that right? Wouldn't it affect the scaling factor of the attention?
Hey @Warvito,
Note that after this operation: https://github.com/huggingface/diffusers/blob/847daf25c7e461795932099c5097eb8ac489645c/src/diffusers/models/attention.py#LL344C45-L344C45
the tensor format is [batch_dim, seq_len, dim], where the head dim is folded into batch_dim, so I think this is correct :-)
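A minimal sketch (in NumPy, not the actual diffusers/xformers code) of why folding heads into the batch dim is safe: attention is computed independently per leading-dim element, and the softmax scale depends only on the per-head size K, which the reshape does not change. So running attention on a [B*H, M, K] tensor with H=1 gives the same result as per-head attention on [B, H, M, K].

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v: [..., M, K]; the scale uses only the last (per-head) dim K
    scale = q.shape[-1] ** -0.5
    return softmax((q @ k.swapaxes(-1, -2)) * scale) @ v

B, H, M, K = 2, 4, 8, 16
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((B, H, M, K)) for _ in range(3))

# Per-head attention on [B, H, M, K]
out_heads = attention(q, k, v)

# Heads folded into the batch dim (the diffusers layout): [B*H, M, K]
fold = lambda t: t.reshape(B * H, M, K)
out_folded = attention(fold(q), fold(k), fold(v)).reshape(B, H, M, K)

# Identical results, so the scaling factor is unaffected
assert np.allclose(out_heads, out_folded)
```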