have you tried different CLIP models?
Hi @rmokady, Thank you for your nice work, I learned a lot from it. Since the default CLIP model you are using seems to be the ViT-B/32 version, I am wondering if you have tried other visual features, e.g. from ViT-L or the ResNet models? I can't find this mentioned in the paper. I'm trying to train a similar model at the moment and assume the features extracted from bigger vision encoders would contain more information.
Best, David
Have you tried it? I have the same question.
I tried ViT-L/14. You just have to change it in the inference code and the feature-extraction code.
For example, in parse_coco.py:
parser.add_argument('--clip_model_type', default="ViT-L/14", choices=('RN50', 'RN101', 'RN50x4', 'ViT-B/32', 'ViT-L/14'))
Just add the argument choice. But in the training code you also need to change prefix_dim; it is 768 for ViT-L/14.
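A minimal sketch of how the two changes could be tied together, so prefix_dim follows the chosen CLIP model automatically. The CLIP_EMBED_DIMS dict and build_parser helper are my own additions, not part of the repo; the dimensions are the published CLIP image-embedding widths.

```python
import argparse

# Published CLIP image-embedding widths per model name (assumption: these
# are the values prefix_dim should take for each backbone).
CLIP_EMBED_DIMS = {
    'RN50': 1024,
    'RN101': 512,
    'RN50x4': 640,
    'ViT-B/32': 512,
    'ViT-L/14': 768,
}

def build_parser():
    # Mirrors the --clip_model_type argument from parse_coco.py, with
    # the choices derived from the dict above.
    parser = argparse.ArgumentParser()
    parser.add_argument('--clip_model_type', default="ViT-L/14",
                        choices=tuple(CLIP_EMBED_DIMS))
    return parser

args = build_parser().parse_args(['--clip_model_type', 'ViT-L/14'])
prefix_dim = CLIP_EMBED_DIMS[args.clip_model_type]
print(prefix_dim)  # 768
```

This way the training code can look up prefix_dim from the same flag instead of hard-coding 768.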
Hi there, would you mind sharing your ViT/L-14 model checkpoints? Thanks.
I would be very interested in having access to the checkpoints too! :blush: