Embedding dimensions
Hello, I was a bit confused: the supplementary material and paper describe the image and text features as being extracted in 768 dimensions using CLIP, but in the code the embeddings are described as 512-dimensional. Is there something I'm missing, or are you downscaling from 768 to 512 dimensions somewhere?
Hello, I'm also confused about this. Do you know how the dataloader loads the image and text features? I didn't see the CLIP extraction code or the code that loads the image and text features. Thanks a lot.
I used a random projection matrix to reduce the dimensions from 768 to 512, which looks odd in hindsight. If you are going to re-implement this on your own, I recommend not doing that.
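
For readers wondering what such a reduction might look like, here is a minimal sketch, not the code from this repository. It assumes a fixed Gaussian matrix drawn once with a seed and scaled by 1/sqrt(512); the thread does not say how the matrix was actually generated. The names `make_random_projection` and `project` are hypothetical, and the input array is random stand-in data rather than real CLIP features.

```python
import numpy as np


def make_random_projection(in_dim=768, out_dim=512, seed=0):
    """Draw a fixed Gaussian projection matrix (assumed; the thread does not specify one)."""
    rng = np.random.default_rng(seed)
    # Scaling by 1/sqrt(out_dim) keeps vector norms roughly unchanged on average.
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(out_dim)


def project(features, proj):
    """Map features of shape (n, in_dim) to shape (n, out_dim)."""
    return features @ proj


# Stand-in for 768-dimensional features; real CLIP features would go here.
clip_features = np.random.default_rng(1).standard_normal((4, 768))

proj = make_random_projection()
reduced = project(clip_features, proj)
print(reduced.shape)
```

The same matrix would need to be applied to both image and text features so that the two stay comparable. As noted above, skipping the projection and working with the extracted features directly is the recommended route for a re-implementation.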