Can't install the GPU version on Windows despite many attempts.
Issues
I am trying to install the latest version of llama-cpp-python on Windows 11 with an RTX 3090 Ti (24 GB). Months ago I successfully installed llama-cpp-python==0.1.87 (I can't remember exactly) using:
set FORCE_CMAKE=1
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
But when I recently tried to install the latest version using:
set CMAKE_ARGS="-DLLAMA_CUDA=on"
pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
After loading the model, it is still using the CPU with BLAS = 0 (or has BLAS been replaced by some other flag that should be 1 in the new version?):
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'general.name': 'Meta-Llama-3-8B-Instruct-imatrix', 'general.architecture': 'llama', 'llama.block_count': '32', 'llama.context_length': '8192', 'tokenizer.ggml.eos_token_id': '128001', 'general.file_type': '18', 'llama.attention.head_count_kv': '8', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'llama.rope.freq_base': '500000.000000', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.model': 'gpt2', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '128000', 'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"}
Using gguf chat template: {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
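For context, this is roughly how the model is loaded (a minimal sketch; the model path is a placeholder, and on a working CUDA build n_gpu_layers=-1 should offload every layer):

from llama_cpp import Llama

# On a CUDA-enabled build, verbose=True should print CUDA buffer lines
# in the startup log; here it still reports CPU buffers and BLAS = 0.
llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct.gguf",  # placeholder path
    n_gpu_layers=-1,   # request offload of all layers to the GPU
    n_ctx=2048,
    n_batch=1024,
    verbose=True,      # print the system info / buffer allocation log
)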
I have also tried the pre-built wheel for CUDA 12.1 (pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121) and it still doesn't work. I added --verbose to see the output:
loading initial cache file C:\Users\Administrator\AppData\Local\Temp\tmpbbvy3nqu\build\CMakeInit.txt
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.22631.
-- The C compiler identification is MSVC 19.39.33523.0
-- The CXX compiler identification is MSVC 19.39.33523.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: F:/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: F:/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: F:/Git/cmd/git.exe (found version "2.44.0.windows.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM: x64
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Failed
-- Performing Test HAS_AVX512_2
-- Performing Test HAS_AVX512_2 - Failed
Environment
python=3.12
C++ compiler: Visual Studio 2022 (with the necessary C++ modules)
cmake --version = 3.29.2
nvcc -V = CUDA 12.1 (nvidia-smi reports CUDA 12.3, but I think that is unrelated to this issue)
I have downloaded and installed VS2022, the CUDA toolkit, CMake, and Anaconda, and I am wondering whether some steps are missing. Based on my previous experience, there is no need to git clone this repository and cd into it to build (though I did that on my Mac months ago to convert a .pth file to a .bin file).
My system variables are listed below:
- F:\Anaconda\Scripts
- F:\CMake\bin
- C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
- C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin
Questions
- Are there any steps I am missing to build llama-cpp-python with GPU support?
- How can I tell whether the build has GPU support right after running pip install llama-cpp-python, instead of loading a model to check BLAS = 1? (See the sketch after this list.)
- Do I need to git clone this repository and cd into some directory, or create some file or directory, before running pip install llama-cpp-python?
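For the second question, here is a quick check that does not require loading a model (a sketch assuming a recent llama-cpp-python that exposes these low-level llama.cpp bindings):

import llama_cpp

# True only if the wheel was compiled with GPU offload support (e.g. CUDA)
print(llama_cpp.llama_supports_gpu_offload())
# Prints the same AVX/BLAS flags line that appears when loading a model
print(llama_cpp.llama_print_system_info().decode())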
I checked #1352. Is there a known issue related to Windows 11? I assumed the problem was with my installation steps or my machine. Is there an official explanation, please?
I'm having the same problem but on Linux 20.04, using a Kaggle Notebook; it worked fine until yesterday.
edit: pip install llama-cpp-python==0.2.64 solves the problem.
> I'm having the same problem but on Linux 20.04, using a Kaggle Notebook; it worked fine until yesterday.
> edit: pip install llama-cpp-python==0.2.64 solves the problem.
Still not working. I have tried 0.2.64, 0.2.60, and 0.2.59 many times, and the build log says:
Creating "ggml_shared.dir\Release\ggml_shared.tlog\unsuccessfulbuild" because "AlwaysCreate" was specified.
Touching "ggml_shared.dir\Release\ggml_shared.tlog\unsuccessfulbuild".
CustomBuild:
Building Custom Rule C:/Users/Administrator/AppData/Local/Temp/pip-install-_thkprn2/llama-cpp-python_9fa670d7909f4acfb3ac1882363d1df6/vendor/llama.cpp/CMakeLists.txt
The llama.dll is Win32 and we are on 64-bit Windows 11. If I run the C++ checker program below as Win32, llama.dll loads successfully, but as 64-bit it does not.
#include <iostream>
#include <windows.h>

int main() {
    // Update this path to the actual location of llama.dll
    HINSTANCE hDLL = LoadLibrary(TEXT("C:\\my_path\\llama-cpp-python\\llama_cpp\\llama.dll"));
    if (hDLL == NULL) {
        std::cerr << "ERROR: unable to load DLL" << std::endl;
        return 1;
    }
    std::cout << "DLL loaded successfully" << std::endl;
    FreeLibrary(hDLL);
    return 0;
}
> The llama.dll is Win32 and we are on 64-bit Windows 11; as Win32 the C++ checker loads llama.dll successfully, but as 64-bit it does not.

Well, I think I understand why, but I still don't know how to fix the problem. Can you give more info or steps, please?
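To double-check the DLL's architecture without a C++ project, here is a small sketch that reads the machine field from the PE header (the path below is a placeholder):

import struct

def pe_machine(path):
    with open(path, "rb") as f:
        f.seek(0x3C)                  # e_lfanew: offset of the "PE\0\0" signature
        pe_offset = struct.unpack("<I", f.read(4))[0]
        f.seek(pe_offset + 4)         # skip the 4-byte PE signature
        machine = struct.unpack("<H", f.read(2))[0]
    # 0x014C = x86 (Win32), 0x8664 = x64
    return {0x014C: "x86 (Win32)", 0x8664: "x64"}.get(machine, hex(machine))

print(pe_machine(r"C:\my_path\llama-cpp-python\llama_cpp\llama.dll"))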
The CUDA version I'm using is v12.4 on Windows 10; I think it will also work with Windows 11.
I have tried this from Windows PowerShell and it works for me:
$env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"
$env:CUDACXX="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\nvcc.exe"
pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python
I'm having the same issue. I have CUDA installed, nvcc works, and CUDA_PATH is set. Doing:
set CMAKE_ARGS=-DLLAMA_CUDA=ON
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
I don't see any errors during installation, yet when I run it I get BLAS = 0:
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Getting the same result with:
set CMAKE_ARGS=-DLLAMA_CUBLAS=ON
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
> The CUDA version I'm using is v12.4 on Windows 10; I think it will also work with Windows 11. I have tried this from Windows PowerShell and it works for me:
> $env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"
> $env:CUDACXX="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\nvcc.exe"
> pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python
It seems to be a Windows 11 problem. I made it work on Windows 10 months ago (with llama-cpp-python==0.1.72), but with the latest version on Windows 11 it doesn't work :(
> I'm having the same issue. I have CUDA installed, nvcc works, and CUDA_PATH is set. Doing:
> set CMAKE_ARGS=-DLLAMA_CUDA=ON
> pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
> I don't see any errors during installation, yet when I run it I get BLAS = 0:
> AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
> Getting the same result with:
> set CMAKE_ARGS=-DLLAMA_CUBLAS=ON
> pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
Yep, if you make it work please let me know :) I will keep trying to find a solution as well.
Although it didn't work initially, I was able to download the prebuilt wheel and it works; I am now getting GPU inference. It does seem like there is an issue with my environment in some way.