ViT + CLIP
Would it be worth implementing a ViT and CLIP example?
Yeah, that one's on our list of examples to add! Are you interested in contributing it? If so, which model would you use?
I would like to contribute :) However, I would like to complete the implementation of `norm` first (https://github.com/ml-explore/mlx/pull/187). I would use models from the official CLIP repository: https://github.com/openai/CLIP. If you have an alternative in mind, please let me know.
@gboduljak I submitted a PR against your existing PR that adds a local implementation of `CLIPImageProcessor`: https://github.com/gboduljak/mlx-examples/pull/1
This should eliminate the dependency on `transformers`, aside from using it to download the model and tokenizer.
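For reference, here is a minimal sketch of what such a local preprocessor might look like, assuming the standard CLIP pipeline (bicubic resize of the shorter side to 224, center crop, scale to [0, 1], then per-channel normalization with the statistics published in the OpenAI CLIP repository). The actual `CLIPImageProcessor` port in the PR may differ in details:

```python
# Minimal sketch of CLIP-style image preprocessing without `transformers`.
# The resize/crop/normalize steps and the channel statistics follow the
# standard OpenAI CLIP pipeline; the PR's implementation may differ.
import numpy as np
from PIL import Image

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def preprocess(image: Image.Image, size: int = 224) -> np.ndarray:
    image = image.convert("RGB")

    # Resize so the shorter side equals `size`, preserving aspect ratio.
    w, h = image.size
    scale = size / min(w, h)
    image = image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)

    # Center-crop to a `size` x `size` square.
    w, h = image.size
    left, top = (w - size) // 2, (h - size) // 2
    image = image.crop((left, top, left + size, top + size))

    # Scale to [0, 1] and normalize per channel; returns an HWC float32 array.
    x = np.asarray(image, dtype=np.float32) / 255.0
    return (x - CLIP_MEAN) / CLIP_STD
```

Something like `preprocess(Image.open("cat.jpg"))` would then yield a normalized 224x224x3 array that can be converted to an `mx.array` and fed to the vision tower.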
@nkasmanoff Thanks for the help. I will take a look at your work now.
@nkasmanoff I merged your PR, corrected the nits, and refactored your implementation so that everything lives in the `preprocessing` folder. Many thanks for the help. In the future, we might drop this 'copy-paste' implementation from Hugging Face. Ideally, we should use mlx-data. If you have time, it would be awesome to have an mlx-data implementation of `CLIPImageProcessor` (a rough sketch follows below).
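In case it helps a future contributor, here is a rough, untested sketch of what that mlx-data version could look like, assuming mlx-data's image operations `load_image`, `image_resize_smallest_side`, `image_center_crop`, and `key_transform` (names and signatures should be checked against the mlx-data documentation):

```python
# Rough, untested sketch of CLIP image preprocessing with mlx-data.
import mlx.data as dx
import numpy as np

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def normalize(image):
    # `image` is an HWC uint8 array after `load_image`.
    return (image.astype(np.float32) / 255.0 - CLIP_MEAN) / CLIP_STD


def clip_image_stream(paths, size=224, batch_size=32):
    samples = [{"image": p.encode("ascii")} for p in paths]
    return (
        dx.buffer_from_vector(samples)
        .load_image("image")                        # decode file -> HWC uint8
        .image_resize_smallest_side("image", size)  # keep aspect ratio
        .image_center_crop("image", size, size)     # square center crop
        .key_transform("image", normalize)          # CLIP normalization
        .to_stream()
        .batch(batch_size)
        .prefetch(4, 2)
    )
```

The appeal over the copy-paste version would be that decoding, resizing, and cropping run inside mlx-data's native pipeline, with Python only supplying the normalization step.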