cortex.cpp bug: nitro CUDA on Windows has low performance on a machine with multiple GPUs

Describe the bug

My Windows machine has 3 GPUs. With all 3 GPUs enabled, token speed was slow (6-9 tokens/s) and it could not even load TinyLlama 1B. With 2 GPUs disabled and only 1 active, performance was back to normal.

Screenshots

3 GPUs active:
- Low performance
- Error loading TinyLlama

1 GPU active only:
- Performance back to normal

Desktop (please complete the following information):

OS: Windows 11
NVIDIA driver: 531.18
CUDA version: 12.3
Nitro version: 0.1.27
GPU:
- 1x RTX 4070 Ti
- 2x GTX 1660 Ti

Dec 14 '23 08:12 hiento09

@hiento09 I have a feeling that this problem comes from the communication between the different GPUs. I'll look out for this while reading through the codebase.

Dec 15 '23 00:12 KossBoii

@KossBoii that is exactly the multi-GPU problem. I tested again on that machine:

- Using only the 4070 Ti => 55 tok/sec
- Using either one of the two 1660 Ti cards => 28 tok/sec

Distributed inference requires:

- Good bandwidth between GPUs. This setup uses PCIe 3 and 4 rather than NVLink, so data has to be transmitted via the CPU to reach another GPU.
- GPUs that are not too different in speed (in this case the 4070 Ti has to wait for the 1660 Ti cards to finish computing).
- An explicitly set value for TP (tensor parallel) in nitro.
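To make the "fast GPU waits for slow GPU" point concrete, here is a rough back-of-the-envelope sketch. It assumes a simple layer-split model where each GPU runs its share of the layers in sequence for every token; this is an illustrative simplification, not a description of how nitro actually schedules work. The function name, the even split, and the `transfer_s` parameter are all hypothetical.

```python
def layer_split_tok_per_sec(gpu_speeds, shares, transfer_s=0.0):
    """Estimate tokens/sec when layers are split across GPUs and run in sequence.

    gpu_speeds: tok/sec each GPU reaches when running the whole model alone.
    shares:     fraction of the layers placed on each GPU (should sum to 1).
    transfer_s: assumed seconds per token lost on each GPU-to-GPU hop.
    """
    # Time per token is the sum of each GPU's share of the work at its own speed.
    compute_s = sum(share / speed for share, speed in zip(shares, gpu_speeds))
    hops = len(gpu_speeds) - 1
    return 1.0 / (compute_s + hops * transfer_s)


# Single-GPU figures measured in this thread: 4070 Ti, then the two 1660 Ti cards.
speeds = [55.0, 28.0, 28.0]
even_split = [1 / 3, 1 / 3, 1 / 3]  # assumption: layers divided evenly

ideal = layer_split_tok_per_sec(speeds, even_split)
with_overhead = layer_split_tok_per_sec(speeds, even_split, transfer_s=0.01)
```

Under these assumptions, even with zero transfer cost the three-GPU setup lands at roughly 33 tok/sec, below the 55 tok/sec of the 4070 Ti alone. The observed 6-9 tok/sec is far lower still, which would be consistent with transfer or synchronization overhead dominating, though this toy model cannot confirm that.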

It depends, but I think the option to run 1 model on a single GPU with the help of CUDA_VISIBLE_DEVICES makes sense in this case (i.e. a hardware sensing feature).
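A minimal sketch of that idea: pick the fastest GPU from measured throughput and build an environment that exposes only that device to CUDA. The helper names and the device indices (0 for the 4070 Ti, 1 and 2 for the 1660 Ti cards) are assumptions for illustration; only `CUDA_VISIBLE_DEVICES` itself comes from the discussion above.

```python
import os


def pick_fastest_gpu(tok_per_sec):
    """Return the GPU index with the highest measured tokens/sec."""
    return max(tok_per_sec, key=tok_per_sec.get)


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Throughput figures from this thread; the device indices are assumed.
measured = {0: 55.0, 1: 28.0, 2: 28.0}
best = pick_fastest_gpu(measured)
env = single_gpu_env(best)
# Pass `env` to the process that loads the model, e.g. subprocess.Popen(cmd, env=env).
```

The variable has to be set before the inference process starts, since CUDA reads it at initialization. Device indices may not match what `nvidia-smi` shows unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is also set, so it is worth checking which physical card each index refers to.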

Dec 17 '23 08:12 hiro-v

This should be properly supported with this instead: https://github.com/ggerganov/llama.cpp/pull/6017

Mar 22 '24 02:03 hiro-v

Closing in favor of tracking this more granularly, now that we have various engines.

Jul 01 '24 05:07 freelerobot

bug: nitro CUDA on Windows has low performance on a machine with multiple GPUs - tested using Jan App