[WIP] Nitro model management
Feature for https://github.com/janhq/nitro/issues/175
- [x] Load multiple models
- [ ] Add GET `models` to return the list of loaded models (sketched below)
- [ ] CUDA support for multiple model requests at the same time (CCU 1 for each model)
- [x] Metal support for multiple model requests at the same time (CCU 1 for each model)
- [x] CPU support for multiple model requests at the same time (CCU 1 for each model)
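As a rough sketch of the unchecked `GET` item, the models list could be queried like the other `llamacpp` routes. The route name below is an assumption for illustration only; this PR does not implement it yet.

```bash
# Hypothetical: list the currently loaded models.
# The exact route and response format are assumptions, not part of this PR.
curl --location 'http://localhost:3928/inferences/llamacpp/models'
```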
- Load model 1 first time:
```bash
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3",
  "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
  "ctx_len": 512,
  "ngl": 32,
  "embedding": false,
  "pre_prompt": "A chat between a curious user and an artificial intelligence",
  "user_prompt": "USER: ",
  "ai_prompt": "ASSISTANT: "
}'
```
- Load model 1 subsequent times:
- Model 1 `/chat/completions` works normally:
```bash
curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3",
  "messages": [
    {
      "role": "user",
      "content": "What is the biggest tech company in the world by market cap?"
    }
  ],
  "stream": true
}'
```
- Load model 2 first time:
```bash
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3-copy",
  "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
  "ctx_len": 512,
  "ngl": 32,
  "embedding": false,
  "pre_prompt": "A chat between a curious user and an artificial intelligence",
  "user_prompt": "USER: ",
  "ai_prompt": "ASSISTANT: "
}'
```
- Load model 2 subsequent times:
- Model 2 `/chat/completions` works normally:
```bash
curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3-copy",
  "messages": [
    {
      "role": "user",
      "content": "What is the biggest tech company in the world by market cap?"
    }
  ],
  "stream": true
}'
```
Test result for running 2 concurrent requests against 2 llama.cpp instances in the same process, given this change: it still crashes.
```
GGML_ASSERT: /home/hiro/jan/nitro/llama.cpp/ggml-cuda.cu:6742: ptr == (void *) (g_cuda_pool_addr[device] + g_cuda_pool_used[device])
```
The line: https://github.com/ggerganov/llama.cpp/blob/1bf681f90ef4cf37b36e6d604d3e30fc57eda650/ggml-cuda.cu#L6742
Tracing the variables `g_cuda_pool_addr` and `g_cuda_pool_used`:
https://github.com/ggerganov/llama.cpp/blob/1bf681f90ef4cf37b36e6d604d3e30fc57eda650/ggml-cuda.cu#L6664
```cpp
static CUdeviceptr g_cuda_pool_addr[GGML_CUDA_MAX_DEVICES] = {0};
static size_t g_cuda_pool_used[GGML_CUDA_MAX_DEVICES] = {0};
```
These are static global variables, so they are always shared across all threads in the same process. The global pool is used throughout the llama.cpp CUDA code base, so refactoring it would be hard and would require extensive llama.cpp knowledge. In short, it is not thread safe.
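A quick way to see how widely these globals are referenced (and why a refactor would be invasive) is to grep the vendored llama.cpp source. The path below is the one from the assert message above, so adjust it to your checkout.

```bash
# List and count every reference to the shared CUDA pool globals.
# Path taken from the GGML_ASSERT message above; change it to match your checkout.
grep -n "g_cuda_pool" /home/hiro/jan/nitro/llama.cpp/ggml-cuda.cu
grep -c "g_cuda_pool" /home/hiro/jan/nitro/llama.cpp/ggml-cuda.cu
```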
Steps to reproduce
- Load 2 models (we can just change the `model_id` param):
```bash
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3",
  "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
  "ctx_len": 512,
  "ngl": 32,
  "embedding": false,
  "pre_prompt": "A chat between a curious user and an artificial intelligence",
  "user_prompt": "USER: ",
  "ai_prompt": "ASSISTANT: "
}'

curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3-copy",
  "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
  "ctx_len": 512,
  "ngl": 32,
  "embedding": false,
  "pre_prompt": "A chat between a not very curious user and an artificial intelligence",
  "user_prompt": "USER: ",
  "ai_prompt": "ASSISTANT: "
}'
```
- Run the two commands below (each ends with `&` so they run in parallel; a scripted version is sketched after them):
```bash
curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3",
  "messages": [
    {
      "role": "user",
      "content": "What is the biggest tech company in the world by market cap?"
    }
  ],
  "stream": true
}' &

curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3-copy",
  "messages": [
    {
      "role": "user",
      "content": "What is the biggest tech company in the world by market cap?"
    }
  ],
  "stream": true
}' &
```
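For repeated runs, the same reproduction can be wrapped in a small script. The port, model IDs, and request body match the calls above; `wait` simply blocks until both background requests finish.

```bash
#!/usr/bin/env bash
# Fire one streaming completion per loaded model in parallel and wait for both.
# Model IDs and port follow the loadmodel requests above.
for model_id in "tinyllama-1.1b-chat-v0.3" "tinyllama-1.1b-chat-v0.3-copy"; do
  curl --location 'http://localhost:3928/v1/chat/completions' \
    --header 'Content-Type: application/json' \
    --data '{
      "model_id": "'"$model_id"'",
      "messages": [
        {
          "role": "user",
          "content": "What is the biggest tech company in the world by market cap?"
        }
      ],
      "stream": true
    }' &
done
wait
```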
I also added
```
"n_parallel": 2,
"cont_batching": true
```
but it does not help on CUDA. On CPU and Mac Metal, however, it works flawlessly without these extra parameters.
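For reference, a sketch of how those two settings can be passed, assuming they go in the `loadmodel` body alongside the other parameters (the note above does not say where they were added, so treat the placement as an assumption):

```bash
# Hedged sketch: load a model with parallel decoding settings.
# Placing n_parallel / cont_batching in the loadmodel body is an assumption.
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3",
  "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
  "ctx_len": 512,
  "ngl": 32,
  "n_parallel": 2,
  "cont_batching": true
}'
```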