ViT + CLIP
Would it be worth implementing a ViT and CLIP example?
Yeah, that one's on our list of examples to add! Are you interested in contributing it? If so, which model would you use?
I would like to contribute :) However, I would like to complete the implementation of `norm` first (https://github.com/ml-explore/mlx/pull/187). I would use models from the official CLIP repository: https://github.com/openai/CLIP. If you have an alternative in mind, please let me know.
@gboduljak I submitted a PR against your existing PR that adds a local implementation of `CLIPImageProcessor`: https://github.com/gboduljak/mlx-examples/pull/1
This should eliminate the dependency on `transformers`, aside from using it to download the model and tokenizer.
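For reference, here is a minimal sketch of what such a local preprocessor might look like, assuming the standard CLIP pipeline (bicubic resize of the shorter side to 224, center crop, scale to [0, 1], then per-channel normalization with the statistics published in the OpenAI CLIP repository). The actual `CLIPImageProcessor` port in the PR may differ in details:

```python
# Minimal sketch of CLIP-style image preprocessing without `transformers`.
# The resize/crop/normalize steps and the channel statistics follow the
# standard OpenAI CLIP pipeline; the PR's implementation may differ.
import numpy as np
from PIL import Image

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def preprocess(image: Image.Image, size: int = 224) -> np.ndarray:
    image = image.convert("RGB")

    # Resize so the shorter side equals `size`, preserving aspect ratio.
    w, h = image.size
    scale = size / min(w, h)
    image = image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)

    # Center-crop to a `size` x `size` square.
    w, h = image.size
    left, top = (w - size) // 2, (h - size) // 2
    image = image.crop((left, top, left + size, top + size))

    # Scale to [0, 1] and normalize per channel; returns an HWC float32 array.
    x = np.asarray(image, dtype=np.float32) / 255.0
    return (x - CLIP_MEAN) / CLIP_STD
```

Something like `preprocess(Image.open("cat.jpg"))` would then yield a normalized 224x224x3 array that can be converted to an `mx.array` and fed to the vision tower.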
@nkasmanoff Thanks for the help. I will take a look at your work now.
@nkasmanoff I merged your PR, corrected the nits, and refactored your implementation so that everything lives in the `preprocessing` folder. Many thanks for the help. In the future, we might drop this 'copy-paste' implementation from Hugging Face. Ideally, we should use mlx-data. If you have time, it would be awesome to have an mlx-data implementation of `CLIPImageProcessor` (a rough sketch follows below).
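In case it helps a future contributor, here is a rough, untested sketch of what that mlx-data version could look like, assuming mlx-data's image operations `load_image`, `image_resize_smallest_side`, `image_center_crop`, and `key_transform` (names and signatures should be checked against the mlx-data documentation):

```python
# Rough, untested sketch of CLIP image preprocessing with mlx-data.
import mlx.data as dx
import numpy as np

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def normalize(image):
    # `image` is an HWC uint8 array after `load_image`.
    return (image.astype(np.float32) / 255.0 - CLIP_MEAN) / CLIP_STD


def clip_image_stream(paths, size=224, batch_size=32):
    samples = [{"image": p.encode("ascii")} for p in paths]
    return (
        dx.buffer_from_vector(samples)
        .load_image("image")                        # decode file -> HWC uint8
        .image_resize_smallest_side("image", size)  # keep aspect ratio
        .image_center_crop("image", size, size)     # square center crop
        .key_transform("image", normalize)          # CLIP normalization
        .to_stream()
        .batch(batch_size)
        .prefetch(4, 2)
    )
```

The appeal over the copy-paste version would be that decoding, resizing, and cropping run inside mlx-data's native pipeline, with Python only supplying the normalization step.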