have you tried different CLIP models?
Hi @rmokady, Thank you for your nice work, I learned a lot from it. Since the default CLIP model you are using seems to be the ViT-B/32 version, I am wondering if you have tried other visual features, e.g. from ViT-L or the ResNet models? I can't find this mentioned in the paper. I'm trying to train a similar model at the moment and assume the features extracted from bigger vision encoders would contain more information.
Best, David
Have you tried it? I have the same question.
I tried ViT-L/14. You just have to change it in the inference code and the feature-extraction code.
For example, in parse_coco.py:
parser.add_argument('--clip_model_type', default="ViT-L/14", choices=('RN50', 'RN101', 'RN50x4', 'ViT-B/32', 'ViT-L/14'))
Just add the argument choice. But in the training code you also need to change prefix_dim; it is 768 for ViT-L/14.
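A minimal sketch of how the two changes could be tied together, so prefix_dim follows the chosen CLIP model automatically. The CLIP_EMBED_DIMS dict and build_parser helper are my own additions, not part of the repo; the dimensions are the published CLIP image-embedding widths.

```python
import argparse

# Published CLIP image-embedding widths per model name (assumption: these
# are the values prefix_dim should take for each backbone).
CLIP_EMBED_DIMS = {
    'RN50': 1024,
    'RN101': 512,
    'RN50x4': 640,
    'ViT-B/32': 512,
    'ViT-L/14': 768,
}

def build_parser():
    # Mirrors the --clip_model_type argument from parse_coco.py, with
    # the choices derived from the dict above.
    parser = argparse.ArgumentParser()
    parser.add_argument('--clip_model_type', default="ViT-L/14",
                        choices=tuple(CLIP_EMBED_DIMS))
    return parser

args = build_parser().parse_args(['--clip_model_type', 'ViT-L/14'])
prefix_dim = CLIP_EMBED_DIMS[args.clip_model_type]
print(prefix_dim)  # 768
```

This way the training code can look up prefix_dim from the same flag instead of hard-coding 768.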
Hi there, would you mind sharing your ViT/L-14 model checkpoints? Thanks.
I would be very interested in having access to the checkpoints too! :blush: