Can't install llama-cpp-python with HIPBLAS/ROCm on Windows
I have an RX 6900 XT GPU, and after installing ROCm 5.7 I followed the instructions to install llama-cpp-python with HIPBLAS=on, but the build failed with "Building wheel for llama-cpp-python (pyproject.toml) did not run successfully".
Full error log: llama-cpp-python-hipblas-error.txt
As with the previously closed but unaddressed #1009, my debugging efforts have led me to believe that the wrong C and C++ compilers are being chosen for the cmake build:
- MSVC is selected instead of clang
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.19045.
-- The C compiler identification is MSVC 19.39.33523.0
-- The CXX compiler identification is MSVC 19.39.33523.0
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - works
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - works
-- Detecting CXX compile features
-- Detecting CXX compile features - done
- A clang-only option ('-x') is then ignored during compilation
ClCompile:
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.39.33519\bin\HostX64\x64\CL.exe /c /I"C:\Users\mbamg\AppData\Local\Temp\pip-install-ps72vnaz\llama-cpp-python_1bb2b9676f39468bba7efbe70e3a1f33\vendor\llama.cpp\." /nologo /W1 /WX- /diagnostics:column /O2 /Ob2 /D _MBCS /D WIN32 /D _WINDOWS /D NDEBUG /D GGML_SCHED_MAX_COPIES=4 /D GGML_USE_LLAMAFILE /D GGML_USE_HIPBLAS /D GGML_USE_CUDA /D GGML_CUDA_DMMV_X=32 /D GGML_CUDA_MMV_Y=1 /D K_QUANTS_PER_ITERATION=2 /D _CRT_SECURE_NO_WARNINGS /D _XOPEN_SOURCE=600 /D __HIP_PLATFORM_HCC__=1 /D __HIP_PLATFORM_AMD__=1 /D "CMAKE_INTDIR=\"Release\"" /Gm- /EHsc /MD /GS /arch:AVX2 /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inline /std:c11 /Fo"ggml.dir\Release\\" /Fd"ggml.dir\Release\ggml.pdb" /external:W0 /Gd /TC /errorReport:queue /external:I "C:/Program Files/AMD/ROCm/5.7/include" -x hip "C:\Users\mbamg\AppData\Local\Temp\pip-install-ps72vnaz\llama-cpp-python_1bb2b9676f39468bba7efbe70e3a1f33\vendor\llama.cpp\ggml.c" "C:\Users\mbamg\AppData\Local\Temp\pip-install-ps72vnaz\llama-cpp-python_1bb2b9676f39468bba7efbe70e3a1f33\vendor\llama.cpp\ggml-alloc.c" "C:\Users\mbamg\AppData\Local\Temp\pip-install-ps72vnaz\llama-cpp-python_1bb2b9676f39468bba7efbe70e3a1f33\vendor\llama.cpp\ggml-backend.c" "C:\Users\mbamg\AppData\Local\Temp\pip-install-ps72vnaz\llama-cpp-python_1bb2b9676f39468bba7efbe70e3a1f33\vendor\llama.cpp\ggml-quants.c"
cl : command line warning D9002: ignoring unknown option '-x' [C:\Users\mbamg\AppData\Local\Temp\tmpd9u_s_jt\build\vendor\llama.cpp\ggml.vcxproj]
- The subsequent argument ('hip') is interpreted as a non-existent source file
/hip(1,1): error C1083: Cannot open source file: 'hip': No such file or directory [C:\Users\mbamg\AppData\Local\Temp\tmpd9u_s_jt\build\vendor\llama.cpp\ggml.vcxproj]
(compiling source file '/hip')
- The build fails
"C:\Users\mbamg\AppData\Local\Temp\tmpd9u_s_jt\build\ALL_BUILD.vcxproj" (default target) (1) ->
"C:\Users\mbamg\AppData\Local\Temp\tmpd9u_s_jt\build\vendor\llama.cpp\ggml.vcxproj" (default target) (4) ->
(ClCompile target) ->
/hip(1,1): error C1083: Cannot open source file: 'hip': No such file or directory [C:\Users\mbamg\AppData\Local\Temp\tmpd9u_s_jt\build\vendor\llama.cpp\ggml.vcxproj]
1 Warning(s)
1 Error(s)
Time Elapsed 00:00:02.54
*** CMake build failed
As with the original reporter, I've also tried setting CMake environment variables to force Clang compilation, with no change in result:
[Environment]::SetEnvironmentVariable('CMAKE_ARGS', "-DLLAMA_HIPBLAS=on -DCMAKE_CXX_COMPILER='C:/Program Files/AMD/ROCm/5.7/bin/clang++.exe' -DCMAKE_C_ABI_COMPILED=FALSE -DCMAKE_CXX_ABI_COMPILED=FALSE -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_STANDARD_REQUIRED=ON -DCMAKE_CXX_EXTENSIONS=OFF")
[Environment]::SetEnvironmentVariable('CMAKE_ARGS', "-DLLAMA_HIPBLAS=on -DCXX='C:/Program Files/AMD/ROCm/5.7/bin/clang++.exe' -DCMAKE_C_ABI_COMPILED=FALSE -DCMAKE_CXX_ABI_COMPILED=FALSE -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_STANDARD_REQUIRED=ON -DCMAKE_CXX_EXTENSIONS=OFF")
PS: Reading through #40, it's rather concerning that MSVC might be needed for Windows compilation. Is this still the case?
I had the same problem, but eventually was able to install it using these flags:
pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python -C cmake.args="-DAMDGPU_TARGETS=gfx1032 -DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release"
Note that I already had Visual Studio installed.
Also note that I used the flag -DAMDGPU_TARGETS=gfx1032 because I have an RX 6650 XT.
I have also installed the HIP SDK for Windows, and the Python package appears to install correctly. However, when I run the model through LangChain, the program seems to use only my CPU and main memory: GPU usage doesn't change (even though n_gpu_layers is set to 35), and performance is the same as a llama-cpp-python install without any flags (i.e. slow).
I would be glad if someone could help me figure this out!
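For reference, the loading code looks roughly like this; a minimal sketch rather than my exact script, with a placeholder model path, and assuming LangChain's LlamaCpp wrapper (the import location varies between LangChain versions):
from langchain_community.llms import LlamaCpp  # older versions: from langchain.llms import LlamaCpp

# Minimal sketch of the LangChain setup; on a proper HIPBLAS build the
# verbose llama.cpp log should report layers being offloaded to the GPU.
llm = LlamaCpp(
    model_path="./models/model.gguf",  # placeholder path
    n_gpu_layers=35,   # has no visible effect on a CPU-only build
    n_ctx=4096,
    verbose=True,      # prints the llama.cpp load/offload log
)
print(llm.invoke("Hello"))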
I have the same problem with an RX 7900 XT. I have Visual Studio installed and am able to get it to work on the CPU, but not on the GPU.
Compiling llama.cpp for HIPBLAS on Windows needs a generator passed to CMake too. I tried -G Ninja at first, but it kept producing (excruciatingly slow) Debug builds no matter what I tried; -G "Ninja Multi-Config" works.
Unfortunately the build then hits another error, caused by how AMD built the HIP SDK, at these lines: https://github.com/abetlen/llama-cpp-python/blob/10b7c50cd2055db575405b8ab3bd9c07979d557a/CMakeLists.txt#L43-L50
CMake tries to install amdhip64.dll into the wheel but can't find it because it's in c:\windows.
After commenting those lines out it builds & runs. This is what I used in the end from a VS x64 Native Tools command prompt:
set CMAKE_ARGS=-DLLAMA_HIPBLAS=on -DAMDGPU_TARGETS=gfx1010 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -G "Ninja Multi-Config"
pip install --force-reinstall ./llama-cpp-python
I also have C:\Program Files\AMD\ROCm\5.7\bin set in my PATH.
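A quick way to confirm the resulting wheel really has GPU offload compiled in, assuming a llama-cpp-python version recent enough to expose llama_supports_gpu_offload in its low-level bindings:
import llama_cpp

# Reports whether this build of the library can offload layers at all;
# a CPU-only wheel prints False here regardless of n_gpu_layers.
print("llama-cpp-python:", llama_cpp.__version__)
print("GPU offload supported:", bool(llama_cpp.llama_supports_gpu_offload()))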
This worked to install it, and when I load the model it gets offloaded into GPU memory:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: no
Device 1: AMD Radeon(TM) Graphics, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.25 MiB
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
However, when I try to get a response from the model, I get this error:
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
current device: 1, in function ggml_cuda_compute_forward at C:/Code/llama-cpp/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:2360
err
GGML_ASSERT: C:/Code/llama-cpp/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:100: !"CUDA error"
I'm assuming that you used -DAMDGPU_TARGETS=gfx1100 for your 7900 XT, and that the second GPU is an iGPU that isn't gfx1030.
Try setting the environment variable HIP_VISIBLE_DEVICES=0 before & when running python so that device#1 is hidden from llama.cpp & rocblas.
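If it's easier to set from Python, a rough sketch of the same idea (placeholder model path; the variable has to be set before llama_cpp is imported so the HIP runtime sees it when it initializes):
import os

# Hide device 1 (the iGPU) from the HIP runtime so llama.cpp and rocBLAS
# only see the discrete card. Must happen before llama_cpp is imported,
# because the HIP runtime reads the variable when it initializes.
os.environ["HIP_VISIBLE_DEVICES"] = "0"

from llama_cpp import Llama  # import only after setting the variable

llm = Llama(
    model_path="./models/model.gguf",  # placeholder path
    n_gpu_layers=-1,                   # offload all layers
    verbose=True,
)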
This worked, Thank You!
@Engininja2 You are my hero! Thank you so much!!! I've spent hours and hours trying to figure out how to build this thing with no success. I was literally going crazy. You just saved my life! Kudos to you! I just wonder why devs can never explain properly how to build their own piece of sh*t? Really infuriating!
Troubleshooting llama-cpp-python with HIPBLAS/ROCm on Windows (AMD RX 5700 XT)
Hello everyone,
This post documents an extensive, but ultimately unsuccessful, attempt to get llama-cpp-python running with full GPU acceleration on an AMD Radeon RX 5700 XT on Windows.
I'm using an AI system where different "cores" interact, which requires a single, GPU-accelerated model instance to be shared between them to conserve VRAM. The system currently falls back to CPU-only operation, which prevents this.
Confirming the Issue: The first step was to prove the GPU wasn't being used. With verbose=True set in the Llama() constructor, the logs clearly and consistently showed dev = CPU for all model layers, confirming the installation was not a proper GPU build.
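A minimal sketch of that check (placeholder model path):
from llama_cpp import Llama

# Request full offload with verbose logging; a working HIPBLAS build logs
# "offloaded N/N layers to GPU", while this CPU-only install keeps showing
# dev = CPU for every layer.
llm = Llama(
    model_path="./models/model.gguf",  # placeholder
    n_gpu_layers=-1,
    verbose=True,
)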
Checking the ROCm/HIP SDK Installation: I initially suspected the SDK was missing or misconfigured.
The rocminfo command failed.
By inspecting the SDK's bin directory (C:\Program Files\AMD\ROCm\6.2\bin), I discovered that the correct diagnostic tool for this version is hipInfo.exe.
Running hipInfo worked perfectly, confirming the SDK was installed and could see the RX 5700 XT.
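hipInfo's output is also a convenient way to find the value for -DAMDGPU_TARGETS; a rough sketch, assuming the default 6.2 install path and that the architecture appears on a gcnArchName line (adjust both if your SDK differs):
import re
import subprocess

# Run the HIP SDK's hipInfo.exe and extract the gfx architecture string,
# which is what -DAMDGPU_TARGETS expects (e.g. gfx1010 for an RX 5700 XT).
HIPINFO = r"C:\Program Files\AMD\ROCm\6.2\bin\hipInfo.exe"
output = subprocess.run([HIPINFO], capture_output=True, text=True).stdout
print(re.findall(r"gcnArchName:\s*(\S+)", output))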
Targeted Reinstallation: Based on community feedback, I attempted a targeted reinstallation with specific flags for the AMD card.
I used set HIP_VISIBLE_DEVICES=0 to isolate the primary GPU.
I used set CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DAMDGPU_TARGETS=gfx1010" to target the RX 5700 XT's specific architecture.
Despite these flags, the reinstallation still resulted in a CPU-only build.
As a last resort, I performed a full manual build from source:
I used an x64 Native Tools Command Prompt for VS to ensure the correct C++ compiler was available.
I cloned the repository, set the same targeted environment variables, and ran pip install .
The build completed successfully without errors, indicating that all the necessary tools were found.
Current Status: Unresolved
Despite the successful manual compilation, the final logs show that the library is still loading all model layers onto the CPU.
The system is functional, but it remains limited to CPU performance.
plz help 😢