mmf Minimal working example

Thank you guys for the amazing job and for releasing the FashionViL model.

I would like to use this model in an image-to-text retrieval setting, but I have not been able to extract the features from texts and images. Could you please provide a minimal working example that takes a text and an image as input and returns their features as output?

In other words, I'm asking for a snippet similar to the following one, but using FashionViL instead of CLIP:

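import clip
import PIL.Image

# Load the CLIP model together with its matching image preprocessing pipeline
model, preprocess = clip.load('RN50')
image = preprocess(PIL.Image.open('dog.jpg')).unsqueeze(0)
text = 'a photo of a dog'
tokenized_text = clip.tokenize(text)

# The encoders are methods of the loaded model, not of the clip module
image_features = model.encode_image(image)
text_features = model.encode_text(tokenized_text)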

Thanks again for the amazing work.

Oct 27 '22 16:10 ABaldrati

Hi, sorry for the late reply.

I think the output_dict in the following forward function is what you need.

https://github.com/BrandonHanx/mmf/blob/d63a31f83918ab60cd00cff3b72bd0e455ed1100/mmf/models/fashionvil/contrastive.py#L32-L57
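For the retrieval step that follows feature extraction, a minimal sketch is given below. It is not taken from the FashionViL code: this thread does not say which keys output_dict contains, so the sketch assumes the image and text embeddings have already been pulled out as plain lists of floats. The helper names l2_normalize and rank_texts and the toy vectors are hypothetical and serve only to illustrate cosine-similarity ranking.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_texts(image_feature, text_features):
    """Return (index, cosine similarity) pairs for the texts, best match first."""
    img = l2_normalize(image_feature)
    scores = []
    for idx, feat in enumerate(text_features):
        txt = l2_normalize(feat)
        scores.append((idx, sum(a * b for a, b in zip(img, txt))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional embeddings standing in for real model outputs (hypothetical values)
image_feature = [1.0, 0.0, 0.0]
text_features = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]

ranking = rank_texts(image_feature, text_features)
for idx, score in ranking:
    print(idx, round(score, 4))
```

With real embeddings the same ranking would normally be computed on tensors in a single matrix product; the pure-Python version only makes the normalization and scoring steps explicit.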

Nov 14 '22 17:11 BrandonHanx

I assume this issue has been solved. Closed.

Dec 31 '22 12:12 BrandonHanx