How to do embedding arithmetic?
Hi.
Thanks for the great work.
I have a question about the embedding arithmetic you use to combine audio and image embeddings in main_multi_bind.py, at this line: https://github.com/sail-sg/BindDiffusion/blob/8c47ef7674be6eac0131b81e8446e8e397cd11bb/main_multi_bind.py#L331
```python
outs = model.embedder(inputs, normalize=False)
embeddings1 = outs[ModalityType.AUDIO]
embeddings2 = outs[ModalityType.VISION]
# embeddings1 = embeddings1 / torch.norm(embeddings1, dim=-1, keepdim=True)
# embeddings2_norm = torch.norm(embeddings2, dim=-1, keepdim=True)
# embeddings2 = embeddings2 / embeddings2_norm
# embeddings = (opt.alpha * embeddings1 + (1 - opt.alpha) * embeddings2) * embeddings2_norm
embeddings = (opt.alpha * embeddings1 + (1 - opt.alpha) * embeddings2)
```
In the ImageBind paper, they say: For arithmetic, we again use the embedding features after temperature scaling. We ℓ2 normalize the features and sum the embeddings after scaling them by 0.5. We use the combined feature to perform nearest neighbor retrieval using cosine distance, as described above.
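For reference, the paper's recipe (ℓ2-normalize, scale each embedding by 0.5, sum, then retrieve by cosine similarity) can be sketched as follows. This is a minimal sketch with random tensors standing in for ImageBind outputs, not the actual ImageBind API:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for ImageBind embeddings: one audio query,
# one image query, and a small gallery of 4 candidates (dim 8 for brevity).
audio = torch.randn(1, 8)
image = torch.randn(1, 8)
gallery = torch.randn(4, 8)

# Paper recipe: l2-normalize each modality, scale by 0.5, then sum.
combined = 0.5 * F.normalize(audio, dim=-1) + 0.5 * F.normalize(image, dim=-1)

# Nearest-neighbor retrieval using cosine similarity against the gallery.
sims = F.cosine_similarity(combined, gallery, dim=-1)
best = sims.argmax().item()
```
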
Aren't the embeddings ℓ2-normalized in the ImageBind code?
By calling outs = model.embedder(inputs, normalize=False), the embeddings are no longer normalized.
Could you please explain why you skip the normalization step in your implementation?
Thank you for your time and patience.
Hi. Thanks for the interest. The alignment of ImageBind is performed on normalized embeddings, and that normalization exists to match cosine-distance retrieval, as you said. The original Stability unCLIP model, however, is conditioned on embeddings without normalization. Therefore, a very simple and naive way to build the condition is to skip the normalization and fuse the embeddings directly.
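To illustrate why skipping normalization matters for conditioning: normalizing first collapses each embedding to unit norm, discarding the magnitude information that unnormalized unCLIP conditioning sees. A minimal sketch, using random tensors as hypothetical stand-ins for the two modality embeddings:

```python
import torch

torch.manual_seed(0)

# Hypothetical unnormalized embeddings for the two modalities
# (real CLIP/ImageBind embeddings are generally not unit-norm).
emb_audio = torch.randn(1, 8) * 3.0
emb_image = torch.randn(1, 8) * 3.0
alpha = 0.5

# Direct fusion on unnormalized embeddings, as in the repo's snippet.
fused = alpha * emb_audio + (1 - alpha) * emb_image

# Fusion after l2-normalization: a convex combination of unit vectors,
# so its norm is at most 1, regardless of the inputs' original scale.
fused_norm = (alpha * emb_audio / emb_audio.norm()
              + (1 - alpha) * emb_image / emb_image.norm())
```

The normalized fusion is the right object for cosine-distance retrieval, while the direct fusion keeps the scale the conditioning model was trained on.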
Dear @ikuinen. Thank you for your answer.
This means that if I want to do embedding arithmetic to retrieve images, I should follow the ImageBind normalization recipe, right?