Does AIMv2 have a text encoder like CLIP and SigLIP(2) have?
I would like to do a text-image search.
Thanks a lot.
Hi! No, AIMv2 does not have a text encoder that you can use for cross-modal retrieval (see the paper for more information). However, we have a LiT-tuned version of the encoder that can do what you request: link
Thank you for your prompt answer.
Hi, I followed the example code at https://huggingface.co/apple/aimv2-large-patch14-224-lit, but I still have difficulty making it work. In the example:
```python
inputs = processor(
    images=image,
    text=text,
    add_special_tokens=True,
    truncation=True,
    padding=True,
    return_tensors="pt",
)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
```
However, what I need is a separate image encoder and a separate text encoder. When I call the processor without text input, the model always raises: `TypeError: AIMv2Model.forward() missing 1 required positional argument: 'input_ids'`
Can you suggest how to code up the image encoder and the text encoder separately, just like what CLIP or SigLIP have? Thank you.
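For reference, the pattern being asked for is the standard two-tower setup: encode each modality independently, L2-normalize, then take a scaled dot product. Below is a minimal runnable sketch of that pattern. The names `image_tower` and `text_tower` are placeholders (dummy linear layers, not AIMv2's actual API), so the snippet runs standalone; with the real LiT model you would substitute its vision and text encoders and its learned logit scale.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Placeholder towers standing in for AIMv2's vision encoder and
# its LiT-tuned text encoder (hypothetical, for illustration only).
embed_dim = 8
image_tower = torch.nn.Linear(16, embed_dim)
text_tower = torch.nn.Linear(12, embed_dim)

images = torch.randn(2, 16)  # stand-in for preprocessed pixel inputs
texts = torch.randn(3, 12)   # stand-in for tokenized text inputs

with torch.no_grad():
    # Each modality is encoded independently, then L2-normalized,
    # so embeddings can be computed (and cached) separately.
    img_emb = F.normalize(image_tower(images), dim=-1)
    txt_emb = F.normalize(text_tower(texts), dim=-1)

# CLIP-style similarity: scaled cosine similarity, softmax over texts.
logit_scale = 100.0  # assumed fixed here; CLIP-style models learn this
logits_per_image = logit_scale * img_emb @ txt_emb.T
probs = logits_per_image.softmax(dim=-1)
print(probs.shape)  # one probability distribution over texts per image
```

For text-image search you would typically precompute and store the normalized text (or image) embeddings once, then only run the other tower per query.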