
bug: `--chat-device` option broken (Mixed GPU + CPU for completion + chat models)

Open jtbr opened this issue 1 year ago • 9 comments

Please describe the feature you want

I've been using a large completion model on my GPU. I'd like to add a chat model as well, but there's not enough GPU memory for the large completion model plus a reasonably sized chat model. Since the latter is less latency-sensitive, it would seem to make sense to put it on the CPU; that way I don't have to sacrifice completion speed or quality. But I don't see a way (at least with Docker) to put the models on different devices. Am I missing something? This would seem to be a useful feature for many.


Please reply with a 👍 if you want this feature.

jtbr avatar Jun 27 '24 12:06 jtbr

Thank you for submitting the feature request. This aligns well with the need for more precise control over how the model is served. I recommend initiating the model serving backend independently and connecting Tabby to it through an HTTP backend. For a concise guide, please visit our documentation at https://tabby.tabbyml.com/docs/administration/model/#llamacpp. For example, you can launch the model serving backend using llama.cpp and manage the number of layers processed on the GPU with the -ngl flag.
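For illustration, a minimal sketch of that setup (the model filename and port here are hypothetical; `-ngl 0` keeps every layer on the CPU):

```shell
# Launch llama.cpp's HTTP server with the chat model served entirely from CPU.
# -ngl sets how many layers are offloaded to the GPU; 0 means none.
./llama-server -m mistral-7b-instruct.Q4_K_M.gguf -ngl 0 --port 8080
```

Tabby's `[model.chat.http]` backend can then be pointed at that server, per the linked documentation.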

Should you face any challenges during your experimentation, please don't hesitate to share them here.

wsxiaoys avatar Jun 27 '24 12:06 wsxiaoys

Thanks for your response. Am I correct in understanding your proposal is to run llama.cpp outside of tabby's container, and point tabby to that server for the chat completion? Or is this something that tabby can/will do within the docker container?

jtbr avatar Jun 27 '24 12:06 jtbr

Either way is possible - though you need to deal with the orchestration (process level within container or container level) carefully.

wsxiaoys avatar Jun 27 '24 15:06 wsxiaoys

I found that the `tabby serve` command has a `--chat-device` option that seems to be exactly what I was looking for.

However, it doesn't seem to be working for me in 0.12.0: if `--device` is `cuda`, both models are still placed into GPU memory even when `--chat-device cpu` is set.

(I also tried running a separate Tabby Docker instance for the chat model (in CPU mode), while pointing the main Docker instance to it with `[model.chat.http]`. However, I am currently blocked from testing this workaround by #2422.)
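For reference, the second instance in that workaround looks roughly like this (the port mapping, volume path, and model name are illustrative, not my exact setup):

```shell
# Second Tabby instance serving a chat model on CPU only
# (no --gpus flag, and --device cpu forces CPU inference).
docker run -it -p 8081:8080 -v $HOME/.tabby-chat:/data \
  tabbyml/tabby serve --model Mistral-7B --device cpu
```

The main (GPU) instance would then reference `http://<host>:8081` in its `[model.chat.http]` section.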

jtbr avatar Jun 27 '24 16:06 jtbr

You can consider the Ollama backend (https://tabby.tabbyml.com/docs/administration/model/#ollama) for this purpose; I am using it partly for that. It has an environment variable that controls how many models can be loaded at a time, which defaults to 1. This works well if you are OK with only one model (completion or chat) being active at a time: Ollama will automatically unload the completion model and load the chat model to fulfill a chat request, then switch back to the completion model when a new completion request arrives. I find this more convenient for models with ~10B+ parameters, because temporarily switching between them is much faster than running the model on CPU only.
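If it helps, the knob in question is an environment variable on the Ollama server (a sketch; the value shown matches Ollama's default behavior described above):

```shell
# Limit how many models Ollama keeps resident at once.
# With 1, Ollama swaps between the completion and chat models on demand
# instead of holding both in VRAM at the same time.
export OLLAMA_MAX_LOADED_MODELS=1
ollama serve
```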

SpeedCrash100 avatar Jun 27 '24 17:06 SpeedCrash100

This might not be the best place, but since I got the idea here, I'll ask: I tried using Ollama to do the "load only one model" thing, but with the config from the documentation, chat completions do not work. Tabby does a POST to "/chat/completions", which returns a 404.

Config:

```toml
[model.completion.http]
model_id = "Code"
kind = "ollama/completion"
model_name = "codellama:7b"
api_endpoint = "http://127.0.0.1:11434"
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>"

[model.chat.http]
model_id = "Chat"
kind = "openai/chat"
model_name = "mistral:7b"
api_endpoint = "http://localhost:11434"
```

Actual code completion in the IDE does work. I'd appreciate any help on this, as I assume it should not be too complicated to set up. However, if this is too much information to discuss in this issue, I'll gladly move it somewhere else.

CleyFaye avatar Sep 05 '24 00:09 CleyFaye

According to https://ollama.com/blog/openai-compatibility

would it be possible that you need to append `/v1` to your configuration? e.g.

```toml
[model.chat.http]
model_id = "Chat"
kind = "openai/chat"
model_name = "mistral:7b"
api_endpoint = "http://localhost:11434/v1"
```

wsxiaoys avatar Sep 05 '24 00:09 wsxiaoys

Ah, yes, that was it; I didn't dig enough. Sorry for the noise, and thanks, it works fine now!

CleyFaye avatar Sep 05 '24 00:09 CleyFaye

In case you wanna share your setup - feel free to start a discussion thread in https://github.com/TabbyML/tabby/discussions/categories/show-and-tell, thank you!

wsxiaoys avatar Sep 05 '24 00:09 wsxiaoys