
planning: Supporting vision models (LLaVA and Llama 3.2)

namchuai opened this issue

Problem Statement

To support Vision models on Cortex, we need the following:

  • [ ] 1. Download the model .gguf and its mmproj file
  • [ ] 2. /v1/models/start takes in model_path (.gguf) and mmproj parameters (see the request sketch after this list)
  • [ ] 3. /chat/completions takes image_url content in messages
  • [ ] 4. image_url has to be base64-encoded (via Jan, or a link to a tool, e.g. https://base64.guru/converter/encode/image)
  • [ ] 5. Model support (side note: Jan currently supports BakLLaVA 1, LLaVA 7B, LLaVA 13B)
  • [ ] 6. Pull the correct NGL settings from the chat model. Ref issue #1763
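
For concreteness, here is a rough sketch of what items 2 to 4 could look like as HTTP calls. This is a sketch only: the model_path and mmproj parameter names come from this issue, while the host/port, model names, file paths, and the exact messages/content/image_url layout (OpenAI-style) are assumptions, not confirmed cortex.cpp behavior.

```bash
# Rough sketch: load a vision model by passing both the chat GGUF and the projector.
# model_path and mmproj come from this issue; host/port and paths are placeholders.
curl http://localhost:39281/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llava-v1.6-mistral-7b",
        "model_path": "/models/llava-v1.6-mistral-7b.Q3_K_M.gguf",
        "mmproj": "/models/mmproj-model-f16.gguf"
      }'

# Rough sketch: chat completion with a base64-encoded image.
# The messages/content/image_url shape follows the OpenAI convention and is an assumption here.
IMG_B64=$(base64 < cat.jpg | tr -d '\n')
curl http://localhost:39281/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
        \"model\": \"llava-v1.6-mistral-7b\",
        \"messages\": [{
          \"role\": \"user\",
          \"content\": [
            {\"type\": \"text\", \"text\": \"What is in the image?\"},
            {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMG_B64}\"}}
          ]
        }]
      }"
```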

1. Downloading the model .gguf and mmproj files:

To be fully compatible with Jan, cortex should be able to pull the mmproj file along with the GGUF file.

Let's take the image below as an example. [Screenshot 2024-10-16 at 08 35 06]

Scenario steps:

  1. The user wants to download a LLaVA model and expects it to support vision, so they provide either:
  • a direct URL to a GGUF file (e.g. llava-v1.6-mistral-7b.Q3_K_M.gguf), or
  • a URL to the repository, in which case we list the files (filtered to .gguf) for the user to select from. Since the mmproj file also ends in .gguf, it shows up in that selection too.
  2. Cortex will only pull the selected GGUF file, ignoring that:
  • the mmproj .gguf alone won't work, and
  • a traditional GGUF file alone (e.g. llava-v1.6-mistral-7b.Q3_K_M.gguf) won't have the vision feature.

So we need a way for cortex to know when to download the mmproj file along with the traditional GGUF file.

cc @dan-homebrew , @louis-jan , @nguyenhoangthuan99, @vansangpfiev

Feature Idea

Couple of thoughts:

  1. File-name based (see the sketch after this list):
  • 1.1. For the CLI: skip file names containing "mmproj" when presenting the selection list, and download the mmproj file along with the selected traditional GGUF file.
  • 1.2. For the API: always scan the directory at the same level as the provided URL; if a file name contains "mmproj", cortex adds it to the download list.
  • Edge case: if the user provides a direct URL to an mmproj file, return an error with a clear message.
  2. Thinking / you tell me.
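
A minimal sketch of the file-name based idea, assuming we already have the list of .gguf files at the same level as the provided URL (e.g. from the repository listing). The helper below is hypothetical, not actual cortex.cpp code; it only illustrates the pairing heuristic and the edge case above.

```bash
# Hypothetical helper illustrating the file-name heuristic from 1.1/1.2; not actual cortex.cpp code.
# $1 is the GGUF file the user selected; remaining args are all .gguf files found at the same level.
build_download_list() {
  local selected="$1"; shift
  case "$selected" in
    *mmproj*)
      echo "error: URL points to an mmproj projector; please select a chat model GGUF" >&2
      return 1 ;;
  esac
  echo "$selected"
  # Also queue any projector sitting next to the chat model.
  for f in "$@"; do
    case "$f" in
      *mmproj*) echo "$f" ;;
    esac
  done
}

# Example: the user picked the Q3_K_M chat model from a LLaVA repo listing.
build_download_list "llava-v1.6-mistral-7b.Q3_K_M.gguf" \
  "llava-v1.6-mistral-7b.Q3_K_M.gguf" "llava-v1.6-mistral-7b.Q5_K_M.gguf" "mmproj-model-f16.gguf"
# -> llava-v1.6-mistral-7b.Q3_K_M.gguf
# -> mmproj-model-f16.gguf
```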

namchuai · Oct 16, 2024

Updates:

  1. The CLI cortex pull now presents both .gguf and mmproj files (screenshot attached)
  2. The mmproj param was added to the /v1/models/start parameters in #1537

gabrielle-ong · Nov 7, 2024

We should ensure that model.yaml supports this type of abstraction, cc @hahuyhoang411
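
For reference, a minimal sketch of what that abstraction could look like in model.yaml, mirroring the /v1/models/start parameters from this issue. The field names and values below are illustrative placeholders, not a confirmed schema.

```bash
# Illustrative only: hypothetical model.yaml fields mirroring the /v1/models/start parameters.
cat > model.yaml <<'EOF'
model: llava-v1.6-mistral-7b
model_path: llava-v1.6-mistral-7b.Q3_K_M.gguf
mmproj: mmproj-model-f16.gguf   # projector pulled alongside the chat model
ngl: 33                         # placeholder; should come from the chat model metadata (issue #1763)
EOF
```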

dan-menlo · Nov 7, 2024

@vansangpfiev and @hahuyhoang411 - can I get your thoughts to add to this list, based on my naive understanding?

To support Vision models on Cortex, we need the following:

  1. Download model - downloads .gguf and mmproj file -> What is the model pull UX?
  2. /v1/models/start takes in model_path (.gguf) and mmproj parameters ✅
  3. /chat/completions takes image_url content in messages ✅
  4. image_url has to be base64-encoded (via Jan, or a link to a tool, e.g. https://base64.guru/converter/encode/image)
  5. Model support (side note: Jan currently supports BakLLaVA 1, LLaVA 7B, LLaVA 13B)

gabrielle-ong · Nov 7, 2024

Answering the points from the previous comment:

  1. Download model (.gguf + mmproj) / pull UX: I'm not sure about this yet, since one folder can have multiple chat model files alongside a single mmproj file.
  2. /v1/models/start with model_path and mmproj: Yes.
  3. /chat/completions taking image_url in messages: I'm not sure if this is a good UX.
  4. base64 encoding: image_url can also be a local path to an image; the llama-cpp engine supports encoding the image to base64 and passing it to the model.
  5. Model support: the llama-cpp engine supports BakLLaVA 1, LLaVA 7B, and LLaVA 13B. Upstream llama.cpp already supports MiniCPM-V 2.6, which we can integrate into llama-cpp. Upstream llama.cpp does not yet support Llama 3.2 vision.

We probably need to consider changing the UX for inference with vision models, for example:

cortex run llava-7b --image xx.jpg -p "What is in the image?"
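
If local paths are accepted as described in point 4 above, that CLI call could map to something like the request below. This is a sketch under that assumption; whether image_url may carry a raw file path, and the exact message shape, are not confirmed cortex.cpp behavior.

```bash
# Hypothetical equivalent of `cortex run llava-7b --image xx.jpg -p "What is in the image?"`,
# assuming the engine accepts a local path in image_url and base64-encodes it internally.
curl http://localhost:39281/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llava-7b",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "What is in the image?"},
            {"type": "image_url", "image_url": {"url": "xx.jpg"}}
          ]
        }]
      }'
```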

vansangpfiev · Nov 7, 2024

Thank you @vansangpfiev and @hahuyhoang411! Quick notes from call:

  • upstream llama.cpp -> cortex.llama-cpp needs to expose vision parameters to cortex.cpp
  • Ease of model support: LLaVA first, then MiniCPM.
  • Llama 3.2 Vision to follow (not yet supported in upstream llama.cpp).

gabrielle-ong · Nov 7, 2024

Added an action item: model management should pull metadata from the chat model file instead of the projector file (just to make sure we track this).

louis-jan · Dec 4, 2024