Why is encoder_hidden_states used in the motion module?
Can you explain why encoder_hidden_states is passed to the motion module? The motion module, as described in the paper, is vanilla temporal self-attention, not cross-attention.
https://github.com/guoyww/AnimateDiff/blob/cf80ddeb47b69cf0b16f225800de081d486d7f21/animatediff/models/unet_blocks.py#L411
Looking inside the motion module's attention (VersatileAttention), encoder_hidden_states is replaced by hidden_states, so the attention ultimately operates as self-attention. In other words, encoder_hidden_states is not actually used in the motion module's attention.
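For reference, here is a minimal, self-contained sketch of that fallback pattern. SketchAttention and the shapes are illustrative stand-ins, not the repo's actual code; in the real repo the behavior comes from the CrossAttention base class in diffusers, where a missing encoder_hidden_states falls back to hidden_states:

```python
import torch
import torch.nn.functional as F


class SketchAttention(torch.nn.Module):
    """Minimal attention that falls back to self-attention when no
    encoder_hidden_states is given (illustrative, not the repo's code)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(dim, dim)
        self.to_v = torch.nn.Linear(dim, dim)

    def forward(self, hidden_states, encoder_hidden_states=None):
        # The key line: with encoder_hidden_states=None, keys and values
        # are computed from hidden_states itself -> plain self-attention.
        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states
        q = self.to_q(hidden_states)
        k = self.to_k(encoder_hidden_states)
        v = self.to_v(encoder_hidden_states)
        return F.scaled_dot_product_attention(q, k, v)


x = torch.randn(2, 16, 64)  # (batch * spatial positions, frames, channels)
attn = SketchAttention(64)
out_self = attn(x)                              # encoder_hidden_states omitted
out_explicit = attn(x, encoder_hidden_states=x) # self-attention written out
assert torch.allclose(out_self, out_explicit)   # identical: it is self-attention
```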
But that alone does not prove that encoder_hidden_states is None and gets replaced by hidden_states.
You should check TemporalTransformerBlock in motion_module.py. When the VersatileAttention blocks are created, cross_attention_dim is None, since attention_block_types is ["Temporal_Self", "Temporal_Self"]. From there you can confirm that encoder_hidden_states is None and is replaced by hidden_states, as sketched below.
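To make that concrete, here is a simplified stand-in for the wiring (SketchTemporalBlock is a hypothetical name for illustration; the real TemporalTransformerBlock and VersatileAttention carry much more configuration): a block whose type does not end in "_Cross" gets cross_attention_dim=None at construction and never receives the text embeddings in forward, so passing them in changes nothing.

```python
import torch

ATTENTION_BLOCK_TYPES = ("Temporal_Self", "Temporal_Self")  # from the config


class SketchTemporalBlock(torch.nn.Module):
    """Simplified stand-in for TemporalTransformerBlock (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.attention_blocks = torch.nn.ModuleList()
        for block_name in ATTENTION_BLOCK_TYPES:
            # Only "*_Cross" blocks would get a cross_attention_dim;
            # "Temporal_Self" blocks get None, so they cannot attend
            # to the text embeddings.
            attn = torch.nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            attn.is_cross_attention = block_name.endswith("_Cross")
            self.attention_blocks.append(attn)

    def forward(self, hidden_states, encoder_hidden_states=None):
        for attn in self.attention_blocks:
            # A self-attention block is handed encoder_hidden_states=None,
            # i.e. the text embeddings are dropped before attention runs.
            context = encoder_hidden_states if attn.is_cross_attention else None
            kv = context if context is not None else hidden_states
            out, _ = attn(hidden_states, kv, kv)
            hidden_states = out + hidden_states
        return hidden_states


x = torch.randn(2, 16, 64)     # (batch * spatial positions, frames, channels)
text = torch.randn(2, 77, 64)  # pretend text embeddings
block = SketchTemporalBlock(64)
a = block(x, encoder_hidden_states=text)
b = block(x, encoder_hidden_states=None)
assert torch.allclose(a, b)    # the text embeddings have no effect
```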