flex-dm Embedding dimensions

Hello, I'm a bit confused: the supplementary material and the paper say that image and text features are extracted as 768-dimensional vectors using CLIP, but in the code the embeddings have a 512-dimensional shape. Is there something I'm missing, or are you downscaling from 768 to 512 dimensions somehow?

Sep 17 '23 20:09 Ashwin-Pokharel

Hello, I'm also confused about this. Do you know how the dataloader loads the image and text features? I didn't see the CLIP extraction code or the code that loads the image and text features. Thanks a lot.

Sep 25 '23 09:09 KeyaoZhao

I used a random projection matrix to reduce the dimensionality, which looks odd in hindsight; if you are going to re-implement this yourself, I recommend not doing that.

Oct 17 '24 23:10 naoto0804
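
For readers wondering what a random projection from 768 to 512 dimensions can look like, here is a minimal NumPy sketch. It is not the repository's code: the thread does not say how the matrix was drawn, so the Gaussian entries, the 1/sqrt(out_dim) scaling, the fixed seed, and the names `make_random_projection` and `project` are all assumptions made for illustration. The input is random stand-in data, not real CLIP features.

```python
import numpy as np


def make_random_projection(in_dim=768, out_dim=512, seed=0):
    """Build a fixed Gaussian random projection matrix (hypothetical helper).

    Assumption: entries are drawn from N(0, 1) and scaled by 1/sqrt(out_dim),
    which keeps squared norms unchanged in expectation.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(out_dim)


def project(features, matrix):
    """Map (batch, in_dim) features to (batch, out_dim)."""
    return features @ matrix


# Stand-in for four 768-dimensional feature vectors.
features = np.random.default_rng(1).standard_normal((4, 768))

W = make_random_projection()
reduced = project(features, W)
print(reduced.shape)  # (4, 512)
```

The seed is fixed so that every feature vector goes through the same matrix; a projection that changes between runs would make stored embeddings incomparable. As the reply above recommends, a re-implementation can skip this step entirely and work with the extracted features at their original size.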