
Support multimodal models with vLLM

mudler opened this issue 1 year ago

Is your feature request related to a problem? Please describe.
Many models are now becoming multimodal, that is, they can accept images, videos, or audio during inference. The llama.cpp project currently provides multimodal support and we expose it through that backend, but there are models it doesn't support yet (for instance #3535 and #3669; see also https://github.com/ggerganov/llama.cpp/issues/9455).

Describe the solution you'd like
LocalAI to support vLLM multimodal capabilities.

Describe alternatives you've considered

Additional context
See #3535 and #3669; tangentially related to #2318 and #3602.

https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_pixtral.py

https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py
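Both linked examples boil down to passing `multi_modal_data` alongside the text prompt. A minimal sketch along those lines; the model name and the `<image>` placeholder are illustrative, since each vision model defines its own prompt format:

```python
# Minimal sketch of vLLM multimodal inference, loosely following the linked
# examples. The model and the "<image>" placeholder are illustrative; each
# vision model expects its own placeholder token in the prompt.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # example vision-language model

image = Image.open("example.jpg")
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```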

mudler · Sep 26 '24 08:09

agree

3unnycheung · Sep 26 '24 11:09

I am very interested in support for vision models in LocalAI, particularly Llama-3.2-11B-Vision and Pixtral-12b.

SuperPat45 · Sep 26 '24 12:09

https://github.com/mudler/LocalAI/pull/3729 should cover most of the models and also add video understanding. Model configuration files need to specify the placeholders each model uses for image/video tags in the text prompt; I'm going to experiment with this once it's in master and update the model gallery with a few examples.
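For illustration, such a model configuration could look roughly like this; the `multimodal` keys below are hypothetical (the exact schema depends on how the PR lands), the point being that the config declares the placeholder string the model expects wherever an image or video is attached:

```yaml
# Hypothetical LocalAI model config sketch; the "multimodal" keys are
# assumptions, not the final schema from #3729.
name: pixtral
backend: vllm
parameters:
  model: mistral-community/pixtral-12b
multimodal:
  image_placeholder: "[IMG]"   # tag inserted in the prompt for each image
  video_placeholder: "[VID]"   # tag inserted in the prompt for each video
```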

mudler · Oct 04 '24 17:10

Great news! However, two Docker images are missing: master-cublas-cuda12-ffmpeg and master-aio-gpu-nvidia-cuda-12.

AlexM4H · Oct 05 '24 07:10