iamangus

Results: 13 comments of iamangus

I host code-server out of my local k8s cluster where I also run Ollama and would love to be able to integrate this.

I'm also having this issue with Backblaze. I want to add that I had this exact config deployed in December and it worked fine.

Here is an attempt using a model out of the llama.cpp walkthrough: https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf?download=true

```
[root@localhost ~]# ~/llama.cpp/build/bin/llama-cli -m ~/gemma-1.1-7b-it.Q4_K_M.gguf -p "Hello!" -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found...
```

It works when not using `-ngl 999`. I do see VRAM utilization when running `rocm-smi`, but the output is CPU-slow. I will clone again and do a CPU-only...
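
For comparison, a CPU-only run can also be forced without rebuilding by setting the GPU layer count to zero; this is a minimal sketch assuming the same binary and model path as in the session above:

```
# CPU-only comparison run: no layers offloaded to the GPU.
# Binary and model paths assume the session shown earlier.
~/llama.cpp/build/bin/llama-cli -m ~/gemma-1.1-7b-it.Q4_K_M.gguf -p "Hello!" -ngl 0
```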

Unfortunately the CPU build is failing.

```
[root@localhost llama-cpu.cpp]# cmake --build build --config Release
[  4%] Built target ggml-base
[  8%] Built target ggml-cpu
[  9%] Built target ggml
[ 19%]...
```
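
For reference, a plain CPU-only build of llama.cpp is normally just the default CMake configuration with no backend flags; exact behaviour can differ between checkouts, so treat this as a sketch (the directory name matches the llama-cpu.cpp checkout above):

```
# Sketch of a plain CPU-only llama.cpp build (no CUDA/HIP backend enabled).
cd ~/llama-cpu.cpp
cmake -B build
cmake --build build --config Release -j
```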

Unfortunately I do not have that binary and am not able to find it.

Building with `-DGGML_CUDA_NO_PEER_COPY=ON` does fix the issue. However, it looks like the entire model is loaded into the VRAM of both GPUs. Is there a way around this?
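
For context, the build looked roughly like this; the backend flag name (GGML_HIP on newer trees, GGML_HIPBLAS on older ones) and the gfx target are assumptions that depend on the llama.cpp version and the GPUs in the machine:

```
# Sketch of a HIP/ROCm build with peer-to-peer copies disabled.
# Assumptions: GGML_HIP as the backend flag and gfx1100 as the GPU target;
# compiler setup otherwise follows the llama.cpp HIP build instructions.
cmake -B build -DGGML_HIP=ON -DGGML_CUDA_NO_PEER_COPY=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release -j
```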

Upon further inspection, it looks like it is actually splitting the model between the GPUs. Are you able to provide any details on the impact of building with that flag?
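
In case it is useful for testing, the split behaviour can also be steered at run time with the llama-cli split options; this is a sketch, with placeholder ratios, reusing the model path from earlier:

```
# Split layers across both GPUs with an even (placeholder) ratio.
~/llama.cpp/build/bin/llama-cli -m ~/gemma-1.1-7b-it.Q4_K_M.gguf -p "Hello!" -ngl 999 -sm layer -ts 1,1
# Keep the whole model on a single GPU instead (-mg selects which one).
~/llama.cpp/build/bin/llama-cli -m ~/gemma-1.1-7b-it.Q4_K_M.gguf -p "Hello!" -ngl 999 -sm none -mg 0
```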

Shouldn't that be taken care of by installing amdgpu-dkms? This might be specific to the install method I went with on AlmaLinux. I will try doing the same...
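
For what it's worth, these are the sort of sanity checks I've been using to confirm the DKMS module and the ROCm stack are actually in place; tool locations depend on how ROCm was installed:

```
# Check that the amdgpu DKMS module is built and loaded,
# and that the ROCm runtime can see the GPUs.
dkms status | grep -i amdgpu
modinfo amdgpu | head -n 5
rocminfo | grep -i gfx
rocm-smi
```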

Okay, I have done some more testing. I tried AlmaLinux again, this time with ROCm 6.2.2, and got the same result. I just finished trying Ubuntu 22.04 with ROCm 6.2.2 and I am...