Why is encoder_hidden_states used in the motion module?
Can you explain why encoder_hidden_states is passed to the motion module? The motion module, as described in the paper, is vanilla temporal self-attention, not cross-attention.
https://github.com/guoyww/AnimateDiff/blob/cf80ddeb47b69cf0b16f225800de081d486d7f21/animatediff/models/unet_blocks.py#L411
Looking inside the motion module's attention (VersatileAttention), encoder_hidden_states is replaced by hidden_states, so the attention ultimately operates as self-attention. In other words, encoder_hidden_states is not actually used in the motion module's attention.
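For reference, here is a minimal, self-contained sketch of that fallback pattern. SketchAttention and the shapes are illustrative stand-ins, not the repo's actual code; in the real repo the behavior comes from the CrossAttention base class in diffusers, where a missing encoder_hidden_states falls back to hidden_states:

```python
import torch
import torch.nn.functional as F


class SketchAttention(torch.nn.Module):
    """Minimal attention that falls back to self-attention when no
    encoder_hidden_states is given (illustrative, not the repo's code)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(dim, dim)
        self.to_v = torch.nn.Linear(dim, dim)

    def forward(self, hidden_states, encoder_hidden_states=None):
        # The key line: with encoder_hidden_states=None, keys and values
        # are computed from hidden_states itself -> plain self-attention.
        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states
        q = self.to_q(hidden_states)
        k = self.to_k(encoder_hidden_states)
        v = self.to_v(encoder_hidden_states)
        return F.scaled_dot_product_attention(q, k, v)


x = torch.randn(2, 16, 64)  # (batch * spatial positions, frames, channels)
attn = SketchAttention(64)
out_self = attn(x)                              # encoder_hidden_states omitted
out_explicit = attn(x, encoder_hidden_states=x) # self-attention written out
assert torch.allclose(out_self, out_explicit)   # identical: it is self-attention
```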
But that alone does not prove that encoder_hidden_states is None and gets replaced by hidden_states.
You should check TemporalTransformerBlock in motion_module.py. When the VersatileAttention blocks are created, cross_attention_dim is None, since attention_block_types is ["Temporal_Self", "Temporal_Self"]. From there you can confirm that encoder_hidden_states is None and is replaced by hidden_states, as sketched below.
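To make that concrete, here is a simplified stand-in for the wiring (SketchTemporalBlock is a hypothetical name for illustration; the real TemporalTransformerBlock and VersatileAttention carry much more configuration): a block whose type does not end in "_Cross" gets cross_attention_dim=None at construction and never receives the text embeddings in forward, so passing them in changes nothing.

```python
import torch

ATTENTION_BLOCK_TYPES = ("Temporal_Self", "Temporal_Self")  # from the config


class SketchTemporalBlock(torch.nn.Module):
    """Simplified stand-in for TemporalTransformerBlock (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.attention_blocks = torch.nn.ModuleList()
        for block_name in ATTENTION_BLOCK_TYPES:
            # Only "*_Cross" blocks would get a cross_attention_dim;
            # "Temporal_Self" blocks get None, so they cannot attend
            # to the text embeddings.
            attn = torch.nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            attn.is_cross_attention = block_name.endswith("_Cross")
            self.attention_blocks.append(attn)

    def forward(self, hidden_states, encoder_hidden_states=None):
        for attn in self.attention_blocks:
            # A self-attention block is handed encoder_hidden_states=None,
            # i.e. the text embeddings are dropped before attention runs.
            context = encoder_hidden_states if attn.is_cross_attention else None
            kv = context if context is not None else hidden_states
            out, _ = attn(hidden_states, kv, kv)
            hidden_states = out + hidden_states
        return hidden_states


x = torch.randn(2, 16, 64)     # (batch * spatial positions, frames, channels)
text = torch.randn(2, 77, 64)  # pretend text embeddings
block = SketchTemporalBlock(64)
a = block(x, encoder_hidden_states=text)
b = block(x, encoder_hidden_states=None)
assert torch.allclose(a, b)    # the text embeddings have no effect
```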