Issues about Image Classification
Hi, thanks for your outstanding work! I am trying to use meta-transformer to conduct image classification. I noticed that in the paper, you wrote "On image classification, with the help of CLIP [24] text encoder, Meta-Transformer delivers great performances under zero-shot classification". Does it mean that I need to use the CLIP text encoder to help realize image classification, rather than using a simple linear layer? Looking forward to your reply!
Zero-shot evaluation requires the text encoder.
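To illustrate why the text encoder is needed: in zero-shot classification the class "weights" are not a learned linear layer but text embeddings of the class names, compared against the image embedding by cosine similarity. A minimal sketch of that matching step, using random tensors as stand-ins for the actual Meta-Transformer image embedding and CLIP text embeddings (the shared 512-dim space and the prompt template are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins: in practice image_embed comes from the image encoder and
# text_embeds from the CLIP text encoder applied to prompts such as
# "a photo of a {class}" (hypothetical template for illustration).
image_embed = torch.randn(1, 512)                 # one encoded image
class_names = ["cat", "dog", "car"]
text_embeds = torch.randn(len(class_names), 512)  # one embedding per class prompt

# Zero-shot prediction: cosine similarity between image and text embeddings.
image_embed = F.normalize(image_embed, dim=-1)
text_embeds = F.normalize(text_embeds, dim=-1)
logits = image_embed @ text_embeds.t()            # shape (1, num_classes)
pred = class_names[logits.argmax(dim=-1).item()]
```

With a linear-probe setup instead, the text encoder is unnecessary: you train a plain linear head on frozen image features.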
Thanks for your reply! I am trying to reproduce image classification with Meta-Transformer. However, the classification accuracy is not very high when I run linear probe experiments. Could you share some details of the image classification setup, such as the learning rate, patch_size, img_size, ...?
I found the problem. I was trying to train the image tokenizer and classifier head from scratch, and the performance was not good. If I use the CLIP tokenizer and only train the classifier head, the classification performance is very good.
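The working setup described above is a standard linear probe: freeze the pretrained encoder and optimize only a linear head on top of its features. A minimal sketch of that training step, using a toy stand-in module in place of the real CLIP/Meta-Transformer encoder (the feature dimension, shapes, and optimizer settings here are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

# Toy stand-in for a frozen pretrained encoder; in practice you would load
# the CLIP (or Meta-Transformer) image encoder weights here instead.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
for p in encoder.parameters():
    p.requires_grad = False   # freeze the encoder: linear probe trains the head only
encoder.eval()

num_classes = 10
head = nn.Linear(512, num_classes)  # the only trainable module

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One toy training step on random data to illustrate the loop.
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, num_classes, (8,))

with torch.no_grad():         # no gradients flow through the frozen encoder
    features = encoder(images)
logits = head(features)       # shape (8, num_classes)
loss = criterion(logits, labels)
loss.backward()               # gradients reach only the head's parameters
optimizer.step()
```

Training the tokenizer from scratch, by contrast, means the encoder input distribution no longer matches what the pretrained weights expect, which is consistent with the drop in accuracy reported above.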
Exactly, we use CLIP for pretraining.
Hi, I have another question about the pretrained modality-agnostic model. If you pretrain it on LAION-2B using CLIP, can I use the pretrained CLIP model to conduct similar experiments? Have you tried to use the pretrained CLIP model?
Maybe. I think the key is the proposed tokenizer.