
Can't offload layers to GPU

Open nneubacher opened this issue 1 year ago • 5 comments

Hello,

I am having trouble properly setting up the llama.cpp server with CUDA on WSL.

When trying to run the llama.cpp server, regardless of whether I pass the '--n_gpu' flag, I get no feedback on the number of offloaded layers, only a memory warning:

" ... llm_load_tensors: ggml ctx size = 0.11 MiB llm_load_tensors: CPU buffer size = 4165.37 MiB warning: failed to mlock 74465280-byte buffer (after previously locking 0 bytes): Cannot allocate memory Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root). ... "

I am not able to offload any layers to the GPU. I thought this might be because my WSL (Ubuntu) instance wasn't recognizing the hardware correctly, but running 'nvidia-smi' returns the info for my RTX:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.10              Driver Version: 551.61         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        On  |   00000000:01:00.0  On |                  N/A |
|  0%   56C    P8             16W /  170W |     747MiB /  12288MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

Regarding the ulimit warning, I've checked the limits:

" real-time non-blocking time (microseconds, -R) unlimited core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 31491 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1048576 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 31491 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited "

I'm very new to Linux and to using WSL, so I'm sorry if this seems trivial. I'm also not sure whether this is the right place to ask for help, but I don't really know where else to turn.

nneubacher avatar Mar 23 '24 15:03 nneubacher

warning: failed to mlock 74465280-byte buffer (after previously locking 0 bytes): Cannot allocate memory

You just need sudo; running it with sudo fixed this warning for me.

0x131315 avatar Mar 23 '24 16:03 0x131315

Glad you found a solution! Unfortunately, for me the warning as well as the problem of no layer offloading persists even when running as root...

nneubacher avatar Mar 23 '24 16:03 nneubacher

I think you'll find the issue is that WSL can only access 50% of your system RAM by default.

You can increase it by creating a .wslconfig file.

https://learn.microsoft.com/en-us/windows/wsl/wsl-config
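To confirm whether that cap is in effect, you can compare the memory visible inside WSL against the machine's physical RAM. A small Python sketch (assumes Linux's `/proc/meminfo`, which WSL2 provides):

```python
# Read total memory visible inside the WSL2 VM from /proc/meminfo.
# If this is roughly half of the Windows machine's RAM, the default
# WSL2 memory cap is likely in effect.
def wsl_visible_memory_kb(path="/proc/meminfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])  # value is reported in kB
    raise RuntimeError("MemTotal not found in /proc/meminfo")

mem_kb = wsl_visible_memory_kb()
print(f"MemTotal inside WSL: {mem_kb / 1024 / 1024:.1f} GiB")
```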

askmyteapot avatar Mar 25 '24 00:03 askmyteapot

An example `.wslconfig`:

```ini
[wsl2]
memory=52GB

[experimental]
networkingMode=mirrored
```

The mirrored networking mode lets you access the llama.cpp server over the network. One thing I do to speed up file transfers to and from WSL is building a 6.1.x kernel; I use https://learn.microsoft.com/en-us/community/content/wsl-user-msft-kernel-v6 as a guide on how to build and use it.

Hope that helps.

askmyteapot avatar Mar 25 '24 00:03 askmyteapot

It is hard to diagnose the issue with so little information. The mlock error is not relevant to GPU acceleration, and it is not clear why you are using this flag. My guess is that either you are using a build without CUDA, or the CUDA driver is not properly installed, in which case you would get a message about this.
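For the second possibility, one quick sanity check from inside WSL (assuming Python is available) is whether the CUDA driver library is visible to the dynamic linker at all. This is only a rough probe and does not prove the llama.cpp binary itself was built with CUDA support:

```python
import ctypes.util

# find_library searches the standard linker locations for the CUDA
# driver library (libcuda.so on Linux/WSL). None suggests the driver
# is not visible inside the VM, which would explain why llama.cpp
# falls back to CPU buffers without reporting offloaded layers.
cuda_lib = ctypes.util.find_library("cuda")
print("CUDA driver library:", cuda_lib if cuda_lib else "not found")
```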

slaren avatar Mar 25 '24 00:03 slaren