
[WIP] Nitro model management

Open · hiro-v opened this issue 2 years ago · 3 comments

Feature for https://github.com/janhq/nitro/issues/175

  • [x] Load multiple models
  • [ ] Add GET models to return the list of loaded models (see the sketch after this list)
  • [ ] CUDA support for multiple model requests at the same time (CCU 1 for each model)
  • [x] Metal support for multiple model requests at the same time (CCU 1 for each model)
  • [x] CPU support for multiple model requests at the same time (CCU 1 for each model)
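
For the unchecked GET models item, a possible shape for the request and response (the /v1/models path and the response fields are assumptions mirroring the OpenAI-style /v1/chat/completions route already used; the endpoint does not exist yet):

curl --location 'http://localhost:3928/v1/models'
# hypothetical response, one entry per loaded model_id:
# {"object": "list", "data": [{"id": "tinyllama-1.1b-chat-v0.3", "object": "model"}]}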

hiro-v · Jan 05 '24 18:01

  • Load model 1 for the first time (screenshot: CleanShot 2024-01-08 at 09 43 30):
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
    "model_id": "tinyllama-1.1b-chat-v0.3",
    "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
    "ctx_len": 512,
    "ngl": 32,
    "embedding": false,
    "pre_prompt": "A chat between a curious user and an artificial intelligence",
    "user_prompt": "USER: ",
    "ai_prompt": "ASSISTANT: "
}'
  • Load model 1 on subsequent calls (screenshot: CleanShot 2024-01-08 at 09 43 37)
  • Model 1 /chat/completions works normally (screenshot: CleanShot 2024-01-08 at 09 44 25):
curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model_id": "tinyllama-1.1b-chat-v0.3-copy",
    "messages": [
      {
        "role": "user",
        "content": "What is the biggest tech company in the world by market cap?"
      }
    ],
    "stream": true
  }'
  • Load model 2 for the first time (screenshot: CleanShot 2024-01-08 at 09 43 51):
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
    "model_id": "tinyllama-1.1b-chat-v0.3",
    "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
    "ctx_len": 512,
    "ngl": 32,
    "embedding": false,
    "pre_prompt": "A chat between a curious user and an artificial intelligence",
    "user_prompt": "USER: ",
    "ai_prompt": "ASSISTANT: "
}'
  • Load model 2 on subsequent calls (screenshot: CleanShot 2024-01-08 at 09 44 02)
  • Model 2 /chat/completions works normally (screenshot: CleanShot 2024-01-08 at 09 49 29):
curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model_id": "tinyllama-1.1b-chat-v0.3",
    "messages": [
      {
        "role": "user",
        "content": "What is the biggest tech company in the world by market cap?"
      }
    ],
    "stream": true
  }'

hiro-v · Jan 08 '24 02:01

Test result for running 2 concurrent requests against 2 llama.cpp models in the same process, given the change: it still crashes.

GGML_ASSERT: /home/hiro/jan/nitro/llama.cpp/ggml-cuda.cu:6742: ptr == (void *) (g_cuda_pool_addr[device] + g_cuda_pool_used[device])

The line: https://github.com/ggerganov/llama.cpp/blob/1bf681f90ef4cf37b36e6d604d3e30fc57eda650/ggml-cuda.cu#L6742

The variables g_cuda_pool_addr and g_cuda_pool_used trace back to: https://github.com/ggerganov/llama.cpp/blob/1bf681f90ef4cf37b36e6d604d3e30fc57eda650/ggml-cuda.cu#L6664

static CUdeviceptr g_cuda_pool_addr[GGML_CUDA_MAX_DEVICES] = {0};
static size_t g_cuda_pool_used[GGML_CUDA_MAX_DEVICES] = {0};

These are static global variables, shared across all threads in the same application. This global pool pattern appears throughout the llama.cpp CUDA code base, so refactoring it would be hard and would require extensive llama.cpp knowledge. In short, it's not thread safe.
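
One way to confirm this diagnosis is to catch the abort under gdb and dump every thread's backtrace, which shows which threads are inside the CUDA backend when the assert fires. A minimal sketch (the ./nitro invocation and its thread/host/port arguments are assumptions, adjust to however the server is normally launched):

gdb -ex run -ex 'thread apply all bt' --args ./nitro 1 127.0.0.1 3928
# trigger the two concurrent /chat/completions requests from another shell;
# when the GGML_ASSERT abort stops the process, the queued
# 'thread apply all bt' prints a backtrace for every thread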

Steps to reproduce

  • Load 2 models (we can just change the model_id param)
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
    "model_id": "tinyllama-1.1b-chat-v0.3",
    "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
    "ctx_len": 512,
    "ngl": 32,
    "embedding": false,
    "pre_prompt": "A chat between a curious user and an artificial intelligence",
    "user_prompt": "USER: ",
    "ai_prompt": "ASSISTANT: "
}'
curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
    "model_id": "tinyllama-1.1b-chat-v0.3-copy",
    "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
    "ctx_len": 512,
    "ngl": 32,
    "embedding": false,
    "pre_prompt": "A chat between a not very curious user and an artificial intelligence",
    "user_prompt": "USER: ",
    "ai_prompt": "ASSISTANT: "
}'
  • Run the two commands below (each ends with & so they run in parallel; a combined script follows them)
curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model_id": "tinyllama-1.1b-chat-v0.3",
    "messages": [
      {
        "role": "user",
        "content": "What is the biggest tech company in the world by market cap?"
      }
    ],
    "stream": true
  }' &
curl --location 'http://localhost:3928/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model_id": "tinyllama-1.1b-chat-v0.3-copy",
    "messages": [
      {
        "role": "user",
        "content": "What is the biggest tech company in the world by market cap?"
      }
    ],
    "stream": true
  }' &
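
The same test can be wrapped in one script with a trailing wait so the shell stays alive until both streamed responses finish (a sketch reusing the host, port and model_ids from the commands above):

#!/usr/bin/env bash
# Send one streamed completion per loaded model, concurrently.
for id in 'tinyllama-1.1b-chat-v0.3' 'tinyllama-1.1b-chat-v0.3-copy'; do
  curl --location 'http://localhost:3928/v1/chat/completions' \
    --header 'Content-Type: application/json' \
    --data '{
      "model_id": "'"$id"'",
      "messages": [
        {
          "role": "user",
          "content": "What is the biggest tech company in the world by market cap?"
        }
      ],
      "stream": true
    }' &
done
wait   # both requests are in flight at once, which is what triggers the CUDA assert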

hiro-v · Jan 08 '24 14:01

I also added

"n_parallel": 2,
"cont_batching": true

But it does not help on CUDA. On CPU and on Mac Metal, however, it works flawlessly without needing these extra parameters.
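
For reference, this is how the two fields fit into the loadmodel body (a sketch; it assumes they are passed alongside the other parameters in the same request, matching the loadmodel calls above):

curl --location 'http://localhost:3928/inferences/llamacpp/loadmodel' \
--header 'Content-Type: application/json' \
--data '{
    "model_id": "tinyllama-1.1b-chat-v0.3",
    "llama_model_path": "/Users/hiro/Downloads/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
    "ctx_len": 512,
    "ngl": 32,
    "n_parallel": 2,
    "cont_batching": true
}'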

hiro-v · Jan 08 '24 15:01