How to do embedding arithmetic?
Hi.
Thanks for the great work.
I have a question about the embedding arithmetic you use to combine audio and image embeddings in main_multi_bind.py, at this line: https://github.com/sail-sg/BindDiffusion/blob/8c47ef7674be6eac0131b81e8446e8e397cd11bb/main_multi_bind.py#L331
```python
outs = model.embedder(inputs, normalize=False)
embeddings1 = outs[ModalityType.AUDIO]
embeddings2 = outs[ModalityType.VISION]
# embeddings1 = embeddings1 / torch.norm(embeddings1, dim=-1, keepdim=True)
# embeddings2_norm = torch.norm(embeddings2, dim=-1, keepdim=True)
# embeddings2 = embeddings2 / embeddings2_norm
# embeddings = (opt.alpha * embeddings1 + (1 - opt.alpha) * embeddings2) * embeddings2_norm
embeddings = (opt.alpha * embeddings1 + (1 - opt.alpha) * embeddings2)
```
In the ImageBind paper, they say: For arithmetic, we again use the embedding features after temperature scaling. We ℓ2 normalize the features and sum the embeddings after scaling them by 0.5. We use the combined feature to perform nearest neighbor retrieval using cosine distance, as described above.
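For reference, the paper's recipe (ℓ2-normalize, scale each embedding by 0.5, sum, then retrieve by cosine similarity) can be sketched as follows. This is a minimal sketch with random tensors standing in for ImageBind outputs, not the actual ImageBind API:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for ImageBind embeddings: one audio query,
# one image query, and a small gallery of 4 candidates (dim 8 for brevity).
audio = torch.randn(1, 8)
image = torch.randn(1, 8)
gallery = torch.randn(4, 8)

# Paper recipe: l2-normalize each modality, scale by 0.5, then sum.
combined = 0.5 * F.normalize(audio, dim=-1) + 0.5 * F.normalize(image, dim=-1)

# Nearest-neighbor retrieval using cosine similarity against the gallery.
sims = F.cosine_similarity(combined, gallery, dim=-1)
best = sims.argmax().item()
```
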
Aren't the embeddings ℓ2-normalized in the ImageBind code?
By calling outs = model.embedder(inputs, normalize=False), the embeddings are no longer normalized.
Could you please explain why you skip the normalization step in your implementation?
Thank you for your time and patience.
Hi. Thanks for the interest. The alignment of ImageBind is performed on normalized embeddings, and that normalization exists to match cosine-distance retrieval, as you said. The original Stability unCLIP model, however, is conditioned on embeddings without normalization. Therefore, a very simple and naive way to build the condition is to skip the normalization and fuse the embeddings directly.
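To illustrate why skipping normalization matters for conditioning: normalizing first collapses each embedding to unit norm, discarding the magnitude information that unnormalized unCLIP conditioning sees. A minimal sketch, using random tensors as hypothetical stand-ins for the two modality embeddings:

```python
import torch

torch.manual_seed(0)

# Hypothetical unnormalized embeddings for the two modalities
# (real CLIP/ImageBind embeddings are generally not unit-norm).
emb_audio = torch.randn(1, 8) * 3.0
emb_image = torch.randn(1, 8) * 3.0
alpha = 0.5

# Direct fusion on unnormalized embeddings, as in the repo's snippet.
fused = alpha * emb_audio + (1 - alpha) * emb_image

# Fusion after l2-normalization: a convex combination of unit vectors,
# so its norm is at most 1, regardless of the inputs' original scale.
fused_norm = (alpha * emb_audio / emb_audio.norm()
              + (1 - alpha) * emb_image / emb_image.norm())
```

The normalized fusion is the right object for cosine-distance retrieval, while the direct fusion keeps the scale the conditioning model was trained on.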
Dear @ikuinen. Thank you for your answer.
This means that if I want to do embedding arithmetic to retrieve images, I should follow the ImageBind normalization recipe, right?