
Failure to detect GPU driver

Open jobidon opened this issue 1 year ago • 4 comments

LocalAI version: v2.24.2

Environment: LXC under Proxmox.
Linux localAI 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 x86_64 x86_64 GNU/Linux
The system has 2 GPUs and the first one is disabled: card0 is a GTX 750, card1 is a Quadro P4000 (PCI passthrough).
NVIDIA driver v550.142, CUDA 12.4

Problem: LocalAI is not detecting the GPU driver despite recognizing the GPUs in the node. I suspect LocalAI uses the information from the first GPU to determine which driver is in use; since card0 is not passed through in this case, it might not be able to determine the driver correctly.
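The warnings in the logs below suggest the detection walks /sys/class/drm. For illustration only (this is my own sketch, not LocalAI's actual detection code), this is how such a scan ends up with an empty driver string when a card's device has no readable driver symlink inside the guest:

// Illustration only, not LocalAI's code: a sysfs scan of the kind the
// warnings below suggest. The driver name comes from the
// /sys/class/drm/cardN/device/driver symlink; if that link is missing or
// unreadable inside the guest, the driver field is empty even though the
// card itself is listed.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Single-digit card names are enough for this illustration.
	cards, _ := filepath.Glob("/sys/class/drm/card[0-9]")
	for _, card := range cards {
		driver := ""
		// The driver entry is a symlink such as ../../../bus/pci/drivers/nvidia.
		if target, err := os.Readlink(filepath.Join(card, "device", "driver")); err == nil {
			driver = filepath.Base(target)
		}
		fmt.Printf("%s -> driver: %q\n", card, driver)
	}
}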

Steps: configure PCI passthrough on the host server and install the NVIDIA drivers in both the host and the guest. Confirm that the drivers are installed correctly, then run the install script. The installer correctly reports that the GPU will be used, but the debug logs indicate that the driver is not detected.
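To confirm what the guest actually sees, I also check the driver's proc entry and the device nodes that the LXC config has to expose to the container. A minimal diagnostic sketch (my own, assuming the standard NVIDIA paths):

// Minimal diagnostic sketch (not part of LocalAI), assuming the standard
// paths created by the NVIDIA driver: check that the kernel module and its
// device nodes are visible from inside the LXC guest.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// /proc/driver/nvidia/version only exists when the nvidia module is
	// loaded and visible to the container.
	if data, err := os.ReadFile("/proc/driver/nvidia/version"); err == nil {
		fmt.Printf("driver visible:\n%s", data)
	} else {
		fmt.Println("driver NOT visible:", err)
	}

	// Device nodes (e.g. /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm)
	// that the container config has to pass through.
	nodes, _ := filepath.Glob("/dev/nvidia*")
	fmt.Println("device nodes:", nodes)
}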

Expected: detection of the NVIDIA driver on card1.

Logs: the first two warnings pertain to a second GPU installed on the host but not passed through to the LXC.

WARNING: failed to read int from file: open /sys/class/drm/card0/device/numa_node: no such file or directory
WARNING: error parsing the pci address "simple-framebuffer.0"
6:27AM DBG GPU count: 2
6:27AM DBG GPU: card #0 @simple-framebuffer.0
6:27AM DBG GPU: card #1 @0000:03:00.0 -> driver: '' class: 'Display controller' vendor: 'NVIDIA Corporation' product: 'GP104GL [Quadro P4000]'

Two GPUs are indeed detected. Card #0 is ignored. Card #1 is the main GPU: it is recognized and enabled, but its driver field is empty.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142                Driver Version: 550.142        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro P4000                   Off |   00000000:03:00.0 Off |                  N/A |
| 46%   29C    P8              5W /  105W |       2MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

GPU usage is always at 0% and there are never any active processes.

Additional context: LocalAI runs but is extremely slow because the GPU is detected but never activated. The NVIDIA drivers are active and detect the GPU correctly, but LocalAI apparently cannot.

jobidon avatar Jan 10 '25 06:01 jobidon

I saw similar behavior with EKS GPU nodes: nvidia-smi shows the GPU, but LocalAI announces a GPU count of 0, and no Python process can be seen in nvidia-smi. However, I can run models smoothly, and I can see GPU memory usage in nvidia-smi go from 2 MiB to more than 4 GiB... Really confusing 😵 haha

kyleli666 avatar Mar 20 '25 05:03 kyleli666

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jun 30 '25 02:06 github-actions[bot]

Still seeing this issue on 3.1.1. If I restart the Docker container, models load via the GPU for a day or two, but then they stop accessing the GPU, especially if I let everything sit unused for a few days.

Lurick73 avatar Jul 10 '25 18:07 Lurick73

That's a common bug; you have to run nvidia-smi on the host. The problem is that it's not using the GPU in my case either.

sugar012 avatar Sep 10 '25 13:09 sugar012

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Dec 10 '25 02:12 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Dec 15 '25 02:12 github-actions[bot]