Issues about Image Classification
Hi, thanks for your outstanding work! I am trying to use meta-transformer to conduct image classification. I noticed that in the paper, you wrote "On image classification, with the help of CLIP [24] text encoder, Meta-Transformer delivers great performances under zero-shot classification". Does it mean that I need to use the CLIP text encoder to help realize image classification, rather than using a simple linear layer? Looking forward to your reply!
Zero-shot evaluation requires the text encoder.
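To illustrate why the text encoder is needed: in zero-shot classification the class "weights" are not a learned linear layer but text embeddings of the class names, compared against the image embedding by cosine similarity. A minimal sketch of that matching step, using random tensors as stand-ins for the actual Meta-Transformer image embedding and CLIP text embeddings (the shared 512-dim space and the prompt template are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins: in practice image_embed comes from the image encoder and
# text_embeds from the CLIP text encoder applied to prompts such as
# "a photo of a {class}" (hypothetical template for illustration).
image_embed = torch.randn(1, 512)                 # one encoded image
class_names = ["cat", "dog", "car"]
text_embeds = torch.randn(len(class_names), 512)  # one embedding per class prompt

# Zero-shot prediction: cosine similarity between image and text embeddings.
image_embed = F.normalize(image_embed, dim=-1)
text_embeds = F.normalize(text_embeds, dim=-1)
logits = image_embed @ text_embeds.t()            # shape (1, num_classes)
pred = class_names[logits.argmax(dim=-1).item()]
```

With a linear-probe setup instead, the text encoder is unnecessary: you train a plain linear head on frozen image features.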
Thanks for your reply! I am trying to reproduce image classification with Meta-Transformer. However, the classification accuracy is not very high when I run linear probe experiments. Could you share some details of the image classification setup, such as the learning rate, patch_size, img_size, ...?
I found the problem. I was trying to train the image tokenizer and classifier head from scratch, and the performance was not good. If I use the CLIP tokenizer and only train the classifier head, the classification performance is very good.
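The working setup described above is a standard linear probe: freeze the pretrained encoder and optimize only a linear head on top of its features. A minimal sketch of that training step, using a toy stand-in module in place of the real CLIP/Meta-Transformer encoder (the feature dimension, shapes, and optimizer settings here are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

# Toy stand-in for a frozen pretrained encoder; in practice you would load
# the CLIP (or Meta-Transformer) image encoder weights here instead.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
for p in encoder.parameters():
    p.requires_grad = False   # freeze the encoder: linear probe trains the head only
encoder.eval()

num_classes = 10
head = nn.Linear(512, num_classes)  # the only trainable module

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One toy training step on random data to illustrate the loop.
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, num_classes, (8,))

with torch.no_grad():         # no gradients flow through the frozen encoder
    features = encoder(images)
logits = head(features)       # shape (8, num_classes)
loss = criterion(logits, labels)
loss.backward()               # gradients reach only the head's parameters
optimizer.step()
```

Training the tokenizer from scratch, by contrast, means the encoder input distribution no longer matches what the pretrained weights expect, which is consistent with the drop in accuracy reported above.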
Exactly, we use CLIP for pretraining.
Hi, I have another question about the pretrained modality-agnostic model. If you pretrain it on LAION-2B using CLIP, can I use the pretrained CLIP model to conduct similar experiments? Have you tried to use the pretrained CLIP model?
Maybe. I think the key is the proposed tokenizer.