[WIP] Nitro model management
Feature for https://github.com/janhq/nitro/issues/175
- [x] Load multiple models
- [ ] Add GET `models` to return the list of loaded models (sketched below)
- [ ] CUDA support for multiple model requests at the same time (CCU 1 for each model)
- [x] Metal support for multiple model requests at the same time (CCU 1 for each model)
- [x] CPU support for multiple model requests at the same time (CCU 1 for each model)
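As a rough sketch of the unchecked `GET` item, the models list could be queried like the other `llamacpp` routes. The route name below is an assumption for illustration only; this PR does not implement it yet.

```bash
# Hypothetical: list the currently loaded models.
# The exact route and response format are assumptions, not part of this PR.
curl --location 'http://localhost:3928/inferences/llamacpp/models'
```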
- Load model 1 first time:
```bash
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3",
  "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
  "ctx_len": 512,
  "ngl": 32,
  "embedding": false,
  "pre_prompt": "A chat between a curious user and an artificial intelligence",
  "user_prompt": "USER: ",
  "ai_prompt": "ASSISTANT: "
}'
```
- Load model 1 subsequent times:
- Model 1 `/chat/completions` works normally:
```bash
curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3",
  "messages": [
    {
      "role": "user",
      "content": "What is the biggest tech company in the world by market cap?"
    }
  ],
  "stream": true
}'
```
- Load model 2 first time:
```bash
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3-copy",
  "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
  "ctx_len": 512,
  "ngl": 32,
  "embedding": false,
  "pre_prompt": "A chat between a curious user and an artificial intelligence",
  "user_prompt": "USER: ",
  "ai_prompt": "ASSISTANT: "
}'
```
- Load model 2 subsequent times:
- Model 2 `/chat/completions` works normally:
```bash
curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3-copy",
  "messages": [
    {
      "role": "user",
      "content": "What is the biggest tech company in the world by market cap?"
    }
  ],
  "stream": true
}'
```
Test result for running 2 concurrent requests against 2 llama.cpp instances in the same process, given this change: it still crashes.
```
GGML_ASSERT: /home/hiro/jan/nitro/llama.cpp/ggml-cuda.cu:6742: ptr == (void *) (g_cuda_pool_addr[device] + g_cuda_pool_used[device])
```
The line: https://github.com/ggerganov/llama.cpp/blob/1bf681f90ef4cf37b36e6d604d3e30fc57eda650/ggml-cuda.cu#L6742
Tracing the variables `g_cuda_pool_addr` and `g_cuda_pool_used`:
https://github.com/ggerganov/llama.cpp/blob/1bf681f90ef4cf37b36e6d604d3e30fc57eda650/ggml-cuda.cu#L6664
```cpp
static CUdeviceptr g_cuda_pool_addr[GGML_CUDA_MAX_DEVICES] = {0};
static size_t g_cuda_pool_used[GGML_CUDA_MAX_DEVICES] = {0};
```
These are static global variables, so they are always shared across all threads in the same process. The global pool is used throughout the llama.cpp CUDA code base, so refactoring it would be hard and would require extensive llama.cpp knowledge. In short, it is not thread safe.
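A quick way to see how widely these globals are referenced (and why a refactor would be invasive) is to grep the vendored llama.cpp source. The path below is the one from the assert message above, so adjust it to your checkout.

```bash
# List and count every reference to the shared CUDA pool globals.
# Path taken from the GGML_ASSERT message above; change it to match your checkout.
grep -n "g_cuda_pool" /home/hiro/jan/nitro/llama.cpp/ggml-cuda.cu
grep -c "g_cuda_pool" /home/hiro/jan/nitro/llama.cpp/ggml-cuda.cu
```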
Steps to reproduce
- Load 2 models (we can just change the `model_id` param):
```bash
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3",
  "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
  "ctx_len": 512,
  "ngl": 32,
  "embedding": false,
  "pre_prompt": "A chat between a curious user and an artificial intelligence",
  "user_prompt": "USER: ",
  "ai_prompt": "ASSISTANT: "
}'

curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3-copy",
  "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
  "ctx_len": 512,
  "ngl": 32,
  "embedding": false,
  "pre_prompt": "A chat between a not very curious user and an artificial intelligence",
  "user_prompt": "USER: ",
  "ai_prompt": "ASSISTANT: "
}'
```
- Run the two commands below (each ends with `&` so they run in parallel; a scripted version is sketched after them):
```bash
curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3",
  "messages": [
    {
      "role": "user",
      "content": "What is the biggest tech company in the world by market cap?"
    }
  ],
  "stream": true
}' &

curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3-copy",
  "messages": [
    {
      "role": "user",
      "content": "What is the biggest tech company in the world by market cap?"
    }
  ],
  "stream": true
}' &
```
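For repeated runs, the same reproduction can be wrapped in a small script. The port, model IDs, and request body match the calls above; `wait` simply blocks until both background requests finish.

```bash
#!/usr/bin/env bash
# Fire one streaming completion per loaded model in parallel and wait for both.
# Model IDs and port follow the loadmodel requests above.
for model_id in "tinyllama-1.1b-chat-v0.3" "tinyllama-1.1b-chat-v0.3-copy"; do
  curl --location 'http://localhost:3928/v1/chat/completions' \
    --header 'Content-Type: application/json' \
    --data '{
      "model_id": "'"$model_id"'",
      "messages": [
        {
          "role": "user",
          "content": "What is the biggest tech company in the world by market cap?"
        }
      ],
      "stream": true
    }' &
done
wait
```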
I also added
```
"n_parallel": 2,
"cont_batching": true
```
but it does not help on CUDA. On CPU and Mac Metal, however, it works flawlessly without these extra parameters.
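For reference, a sketch of how those two settings can be passed, assuming they go in the `loadmodel` body alongside the other parameters (the note above does not say where they were added, so treat the placement as an assumption):

```bash
# Hedged sketch: load a model with parallel decoding settings.
# Placing n_parallel / cont_batching in the loadmodel body is an assumption.
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
  "model_id": "tinyllama-1.1b-chat-v0.3",
  "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
  "ctx_len": 512,
  "ngl": 32,
  "n_parallel": 2,
  "cont_batching": true
}'
```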