
Feature Request: Add support for DeepSeek-VL2

Open FredericoPerimLopes opened this issue 1 year ago • 4 comments

Prerequisites

  • [x] I am running the latest code. Mention the version if possible as well.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Add support for DeepSeek-VL2.

Motivation

DeepSeek released VL2, a very capable vision model.

Possible Implementation

https://github.com/deepseek-ai/DeepSeek-VL2/tree/main

FredericoPerimLopes avatar Feb 05 '25 15:02 FredericoPerimLopes

That is a multi-modal model.

As per issue #8010, multi-modal models (i.e. text + images) might or might not be supported by llama.cpp's command line (it certainly is not supported by llama.cpp's web server utility).

ollama says it supports multi-modal models - it might already support vl2, or it might be close to supporting it already.

NotCompsky avatar Feb 06 '25 14:02 NotCompsky

But llama.cpp already has support for LLaVA and other multimodal models, so I thought DeepSeek-VL2 support could be added in the future.

FredericoPerimLopes avatar Feb 06 '25 16:02 FredericoPerimLopes

Yes, if llava works then I guess multimodal is supported by the command line utility. It must be the web server that doesn't support it - there was a bit of conflicting information on the other issue.

In that case, I second this.

NotCompsky avatar Feb 07 '25 15:02 NotCompsky

llama.cpp supports multimodal LLMs. Many multimodal implementations are based on, or refer to, the LLaVA implementation; they can be found here: https://github.com/ggerganov/llama.cpp/tree/master/examples/llava. These include LLaVA, Qwen-VL, and MiniCPM-V (MiniCPM-o). The Qwen-VL implementation was poor enough that GG disabled GPU offload for it for a long time, to reduce issues caused by incorrect Qwen-VL output.
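For context, LLaVA-style models in that examples directory are run by loading the LLM weights and the vision projector as two separate GGUF files. A minimal sketch of a typical invocation (the model and image file names below are placeholders, not files shipped with llama.cpp):

```shell
# Run a LLaVA-style multimodal model with llama.cpp's llava example.
# -m        : the language-model GGUF (placeholder name)
# --mmproj  : the multimodal/vision projector GGUF (placeholder name)
# --image   : the input image to describe
./llama-llava-cli \
    -m llava-v1.5-7b.Q4_K_M.gguf \
    --mmproj mmproj-model-f16.gguf \
    --image photo.jpg \
    -p "Describe this image."
```

Supporting a new model such as DeepSeek-VL2 in this scheme would mean providing a GGUF conversion for its vision encoder/projector plus the glue code in the example, which is why each new architecture needs its own implementation work.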

In addition, my understanding is that ollama is a package built on top of llama.cpp, so its capabilities are inherited from llama.cpp.

I have looked at DeepSeek's Janus-Pro and DeepSeek-VL2 recently. Since Janus-Pro is non-standard and integrates both multimodal understanding and text-to-image generation (like Stable Diffusion) in a single 2-in-1 model, I don't think it is realistic to support it in llama.cpp in the short term, but DeepSeek-VL2 can be expected to be supported. I am also waiting for some guru, or DeepSeek themselves, to add DeepSeek-VL2 support to llama.cpp (as OpenBMB did for the MiniCPM series).

orca-zhang avatar Feb 15 '25 12:02 orca-zhang

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 01 '25 01:04 github-actions[bot]