Feature request: allow configuring a dedicated vision model instead of requiring a multimodal main model
Issue
Sometimes I am not working with multimodal models such as gpt-4o or claude-sonnet-3.5, but I would still like to use a dedicated vision model to read an image file or an image from the clipboard. So I would like a feature that lets me configure a specific vision model for this. The command would look like: `aider --model gpt-4o --weak-model llama-3.1-450b --vision-model <vLM>`. Then, if the user inputs an image, aider calls the vision model to generate text from it; if the user does not configure `--vision-model`, aider falls back to the main model for any input that requires multimodal support.
Version and model info
Aider v0.55.0
Have you found any implementation for this case? I want to feed an image of the web page I am developing to aider so it knows what the current state of the code looks like, and based on that it can develop the code further and make corrections and improvements. Let me know if you find anything in this direction.