
Does AIMv2 have a text encoder like CLIP and SigLIP(2) do?

Open wjtan99 opened this issue 9 months ago • 3 comments

I would like to do text-image search.
Does AIMv2 have a text encoder like CLIP and SigLIP(2) do?
Thanks a lot.

wjtan99 avatar Apr 14 '25 20:04 wjtan99

Hi! No, AIMv2 does not have a text encoder that you can use for cross-modal retrieval (see the paper for more information). However, we have a LiT-tuned version of the encoder that can do what you request: link

DonkeyShot21 avatar Apr 15 '25 07:04 DonkeyShot21

Thank you for your prompt answer.

wjtan99 avatar Apr 15 '25 14:04 wjtan99

Hi, I followed the example code at https://huggingface.co/apple/aimv2-large-patch14-224-lit, but I still have difficulty making it work. In the example:

```python
inputs = processor(
    images=image,
    text=text,
    add_special_tokens=True,
    truncation=True,
    padding=True,
    return_tensors="pt",
)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
```

However, what I need are separate image and text encoders. When I call the model with images only and no text input, it always fails with: `TypeError: AIMv2Model.forward() missing 1 required positional argument: 'input_ids'`

Can you suggest how to use the image encoder and the text encoder separately, just like CLIP or SigLIP allow? Thank you.
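For context, the separate-encoder workflow asked about here boils down to L2-normalizing the embedding from each tower and scoring pairs with a temperature-scaled dot product, which is exactly what `logits_per_image` in the example above contains. A minimal sketch of that scoring step with dummy embeddings (how to obtain the real embeddings depends on the AIMv2 model's remote code, so the tower calls are only indicated in comments):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the outputs of the two towers. In practice these would be
# the pooled features from the model's image branch and text branch,
# e.g. something like image_tower(pixel_values) and text_tower(input_ids).
image_embeds = F.normalize(torch.randn(2, 768), dim=-1)  # 2 images
text_embeds = F.normalize(torch.randn(3, 768), dim=-1)   # 3 text queries

# CLIP-style models multiply cosine similarities by a learned temperature;
# 100.0 is just a placeholder value here.
logit_scale = torch.tensor(100.0)
logits_per_image = logit_scale * image_embeds @ text_embeds.t()

# Softmax over the text axis gives, per image, a distribution over queries.
probs = logits_per_image.softmax(dim=-1)  # shape (2, 3), rows sum to 1
```

For a retrieval index you would typically skip the softmax, store the normalized image embeddings, and rank by the raw cosine similarities instead.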

wjtan99 avatar Apr 16 '25 03:04 wjtan99