planning: Supporting vision models (LLaVA and Llama 3.2)
Problem Statement
To support Vision models on Cortex, we need the following:
- [ ] 1. Download model .gguf and mmproj file
- [ ] 2. v1/models/start takes in model_path (.gguf) and mmproj parameters
- [ ] 3. /chat/completions to take in messages content image_url
- [ ] 4. image_url has to be encoded in base64 (via Jan, or a link to a tool, e.g. https://base64.guru/converter/encode/image); see the request sketch after this list
- [ ] 5. model support - (side note: Jan currently supports BakLlava 1, llava 7B, Llava 13B)
- [ ] 6. Pull correct NGL settings from chat model. Ref issue #1763
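To make items 2-4 concrete, here is a minimal request sketch assuming an OpenAI-style message payload. The endpoint paths and parameter names (model_path, mmproj, image_url) are taken from the checklist above; the server address, model id, file names, and the exact payload/response shapes are assumptions, not the final API.

```python
import base64

import requests  # third-party HTTP client, used only to keep the sketch short

CORTEX = "http://127.0.0.1:39281"  # assumed local Cortex server address

# Items 1-2: start a vision model by passing both the chat GGUF and the mmproj projector.
requests.post(f"{CORTEX}/v1/models/start", json={
    "model": "llava-v1.6-mistral-7b",                          # hypothetical model id
    "model_path": "models/llava-v1.6-mistral-7b.Q3_K_M.gguf",
    "mmproj": "models/mmproj-model-f16.gguf",                   # hypothetical projector path
})

# Items 3-4: send a chat completion whose content carries a base64-encoded image_url.
with open("cat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(f"{CORTEX}/chat/completions", json={
    "model": "llava-v1.6-mistral-7b",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in the image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
})
print(resp.json())
```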
1. Downloading model .gguf and mmproj file:
To be fully compatible with Jan, Cortex should be able to pull the mmproj file along with the GGUF file. Take the following scenario as an example.
Scenario steps:
- The user wants to download a LLaVA model and expects it to support vision, so the user inputs either:
  - a direct URL to the GGUF file (e.g. llava-v1.6-mistral-7b.Q3_K_M.gguf), or
  - a URL to the repository (we list the options, filtered to .gguf files, for the user to select). Since the mmproj file also ends with .gguf, it is listed in the selection too.
- Cortex will only pull the selected GGUF file, ignoring that:
  - the mmproj .gguf alone won't work;
  - the traditional GGUF file alone (e.g. llava-v1.6-mistral-7b.Q3_K_M.gguf) will not have the vision feature.
So we need a way for Cortex to know when to download the mmproj file along with the traditional GGUF file.
cc @dan-homebrew , @louis-jan , @nguyenhoangthuan99, @vansangpfiev
Feature Idea
A couple of thoughts:
- File-name based (a sketch follows this list):
  1.1. For CLI: ignore file names containing mmproj when presenting the selection list, and download the mmproj file along with the selected traditional GGUF file.
  1.2. For API: always scan the directory at the same level as the provided URL. If a file name contains mmproj, Cortex adds it to the download list.
  - Edge case: if the user provides a direct URL to an mmproj file, return an error with a clear message.
- Thinking / you tell me.
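A minimal sketch of the file-name-based idea, assuming Cortex already has the list of file names at the same repository level as the selected URL. The helper name and the repository layout shown are hypothetical:

```python
from urllib.parse import urlparse


def resolve_download_list(selected_url: str, sibling_files: list[str]) -> list[str]:
    """Given the user's selected .gguf URL and the file names at the same
    repository level, decide which files Cortex should actually download."""
    selected_name = urlparse(selected_url).path.rsplit("/", 1)[-1]

    # Edge case: a direct URL to the projector alone is not runnable.
    if "mmproj" in selected_name.lower():
        raise ValueError("This is a multimodal projector (mmproj) file; "
                         "please select the chat model .gguf instead.")

    downloads = [selected_name]
    # If a projector sits next to the chat model, pull it too so the model
    # keeps its vision capability.
    downloads += [f for f in sibling_files
                  if f.lower().endswith(".gguf") and "mmproj" in f.lower()]
    return downloads


# Example: the llava-v1.6 repository layout from the scenario above.
print(resolve_download_list(
    "https://huggingface.co/some-repo/resolve/main/llava-v1.6-mistral-7b.Q3_K_M.gguf",
    ["llava-v1.6-mistral-7b.Q3_K_M.gguf", "mmproj-model-f16.gguf", "README.md"]))
# -> ['llava-v1.6-mistral-7b.Q3_K_M.gguf', 'mmproj-model-f16.gguf']
```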
Updates:
- CLI cortex pull presents .gguf and mmproj files
- mmproj param is added to /v1/models/start parameters in #1537
We should ensure that model.yaml supports this type of abstraction, cc @hahuyhoang411
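For the model.yaml abstraction, one possible shape is a single mmproj field next to the chat model file. The layout below is a hypothetical illustration of the idea, not an agreed schema:

```python
import yaml  # PyYAML, used only to illustrate parsing

# Hypothetical model.yaml for a vision model: the chat GGUF plus its projector.
MODEL_YAML = """
model: llava-v1.6-mistral-7b
engine: llama-cpp
files:
  - llava-v1.6-mistral-7b.Q3_K_M.gguf
mmproj: mmproj-model-f16.gguf
"""

config = yaml.safe_load(MODEL_YAML)
if config.get("mmproj"):
    # The loader would forward this to v1/models/start alongside model_path.
    print("vision model, projector:", config["mmproj"])
```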
@vansangpfiev and @hahuyhoang411 - can I get your thoughts to add to this list from my naive understanding?
To support Vision models on Cortex, we need the following:
- Download model - downloads .gguf and mmproj file -> What is the model pull UX?
- v1/models/start takes in model_path (.gguf) and mmproj parameters ✅
- /chat/completions to take in messages content image_url ✅
- image_url has to be encoded in base64 (via Jan, or link to tool eg https://base64.guru/converter/encode/image)
- model support - (side note: Jan currently supports BakLlava 1, llava 7B, Llava 13B) ..
> @vansangpfiev and @hahuyhoang411 - can I get your thoughts to add to this list from my naive understanding?
> To support Vision models on Cortex, we need the following:
> - Download model - downloads .gguf and mmproj file -> What is the model pull UX?
> - v1/models/start takes in model_path (.gguf) and mmproj parameters ✅
> - /chat/completions to take in messages content image_url ✅
> - image_url has to be encoded in base64 (via Jan, or link to tool eg https://base64.guru/converter/encode/image)
> - model support - (side note: Jan currently supports BakLlava 1, llava 7B, Llava 13B)
- I'm not sure about this yet, since 1 folder can have multiple chat model files with 1 mmproj file.
- Yes
- I'm not sure if this is a good UX
- image_url can be a local path to an image; the llama-cpp engine supports encoding the image to base64 and passing it to the model.
- The llama-cpp engine supports BakLlava 1, llava 7B, and llava 13B. llama.cpp upstream already supports MiniCPM-V 2.6, so we can integrate it into llama-cpp. llama.cpp upstream does not support Llama 3.2 vision yet.
We probably need to consider changing the UX for inferencing with vision models, for example:
cortex run llava-7b --image xx.jpg -p "What is in the image?"
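To make that proposed UX concrete, here is a rough client-side sketch: if --image points at a local file it is base64-encoded into a data URI (matching the note above that image_url can be a local path); otherwise it is passed through as a URL. The flag names mirror the example command; the data-URI handling and message shape are assumptions.

```python
import argparse
import base64
import mimetypes
import os


def image_to_content(image: str) -> dict:
    """Turn the --image value into an image_url content part.
    Local paths are base64-encoded; anything else is passed through as a URL."""
    if os.path.exists(image):
        mime = mimetypes.guess_type(image)[0] or "image/jpeg"
        with open(image, "rb") as f:
            data = base64.b64encode(f.read()).decode()
        return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}}
    return {"type": "image_url", "image_url": {"url": image}}


parser = argparse.ArgumentParser(prog="cortex run")
parser.add_argument("model")
parser.add_argument("--image", required=True)
parser.add_argument("-p", "--prompt", required=True)
args = parser.parse_args()

message = {
    "role": "user",
    "content": [{"type": "text", "text": args.prompt}, image_to_content(args.image)],
}
# `message` would then be sent to /chat/completions for the selected model.
print(args.model, message)
```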
Thank you @vansangpfiev and @hahuyhoang411! Quick notes from call:
- upstream llama.cpp -> cortex.llama-cpp needs to expose vision parameters to cortex.cpp
- Ease of model support: LLaVA first, then MiniCPM.
- Llama 3.2 vision (not yet supported in upstream llama.cpp)
Added an action item: model management should pull metadata from the chat model file instead of the projector file (just to make sure we track this).
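A small sketch of that action item, assuming the only decision here is which downloaded .gguf to read metadata from; the actual metadata parsing stays in the engine:

```python
def pick_metadata_source(gguf_files: list[str]) -> str:
    """Return the file whose metadata (layer count for NGL, etc.) should be read.
    The mmproj file carries the vision projector's metadata, not the chat
    model's, so it must never be used as the metadata source."""
    chat_models = [f for f in gguf_files if "mmproj" not in f.lower()]
    if not chat_models:
        raise ValueError("Only a projector (mmproj) file was found; no chat model .gguf.")
    return chat_models[0]


# Example with the llava layout used earlier in this thread.
print(pick_metadata_source(
    ["mmproj-model-f16.gguf", "llava-v1.6-mistral-7b.Q3_K_M.gguf"]))
# -> llava-v1.6-mistral-7b.Q3_K_M.gguf
```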