Vision x Vision NOT what we want

Open zxyonaroll opened this issue 2 years ago • 3 comments

[image]

As you can see above, I used the original assets (text, image, audio) from the main branch and found that the Vision x Vision result looks incorrect: dog_image x dog_image is not 1, while it is 1 for the other two.

zxyonaroll avatar May 10 '23 07:05 zxyonaroll

Thanks for your question. Unlike other modalities, Vision logits are not scaled by a temperature: https://github.com/facebookresearch/ImageBind/blob/0f8620b6678fd24c35f172721ea6046ab5780890/models/imagebind_model.py#L432

If we look at the cosine similarity for Vision x Vision (so dropping the softmax), you can see the diagonal is exactly 1.0, which matches the expected behaviour.

tensor([[1.0000, 0.3682, 0.4185],
        [0.3682, 1.0000, 0.3172],
        [0.4185, 0.3172, 1.0000]], device='cuda:0')
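
To make the effect of the missing temperature concrete, here is a minimal sketch that applies a softmax to the cosine similarities above, once as they are and once after multiplying by a scale factor. The value `ASSUMED_SCALE = 20.0` is only an illustrative assumption, not a value taken from the Vision head, and `softmax` is a small helper written for this example rather than a call into the ImageBind code.

```python
import math


def softmax(row, scale=1.0):
    """Softmax over one row of logits, multiplied by `scale` first."""
    scaled = [scale * x for x in row]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Cosine similarities for Vision x Vision, copied from the tensor above.
cosine = [
    [1.0000, 0.3682, 0.4185],
    [0.3682, 1.0000, 0.3172],
    [0.4185, 0.3172, 1.0000],
]

# Hypothetical temperature-style scale, chosen only for illustration.
ASSUMED_SCALE = 20.0

unscaled = [softmax(row) for row in cosine]
scaled = [softmax(row, ASSUMED_SCALE) for row in cosine]

for i in range(len(cosine)):
    print(f"row {i}: unscaled diag = {unscaled[i][i]:.4f}, "
          f"scaled diag = {scaled[i][i]:.4f}")
```

Because cosine similarities lie in [-1, 1], a softmax over them without scaling stays fairly flat: the diagonal entries come out a little under 0.5 here even though each image matches itself perfectly. With a large scale the same softmax becomes sharply peaked and the diagonal rounds to 1.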

Please let us know if you have any questions.

aelnouby avatar May 10 '23 11:05 aelnouby

So when should softmax be used, and when cosine similarity? Is there a uniform measurement standard? I think that was the original intent of this large model ("One Embedding Space To Bind Them All"), so I would expect a single uniform output standard. What do you think? Thank you very much.

zxyonaroll avatar May 11 '23 03:05 zxyonaroll

So when to use softmax and when to use cosine? I am also trying to understand the above discussion. If I want to find the most similar image to a given image, what should I use and how? Thanks.

bakachan19 avatar May 23 '23 07:05 bakachan19
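
For the image-to-image case asked about above, one possible approach, following the earlier reply that the cosine similarity of Vision embeddings behaves as expected, is to rank candidates by cosine similarity and skip the softmax. The sketch below uses short made-up vectors in place of real embeddings; `l2_normalize`, `cosine_similarity`, and `most_similar` are hypothetical helpers written for this example, not ImageBind functions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def most_similar(query, gallery):
    """Return the index of the best-matching gallery vector and all scores."""
    scores = [cosine_similarity(query, g) for g in gallery]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy stand-ins for image embeddings (assumed values, not model outputs).
query = [1.0, 0.0, 0.5]
gallery = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.1, 1.0],   # nearly parallel to the query
    [-1.0, 0.0, 0.0],  # points away from the query
]

best_index, scores = most_similar(query, gallery)
print(best_index, [round(s, 4) for s in scores])
```

Ranking by cosine similarity gives the same ordering as ranking by a softmax over those similarities, since softmax is monotonic within a row; the softmax only changes how peaked the scores look.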