Can't install llama-cpp-python with GPU support (CUDA 12.9, CUDA Toolkit 12.9)
Description: I am trying to install llama-cpp-python with CUDA support, but I run into build errors. All the relevant information is attached below. Installing without GPU support works fine.
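For reference, the install is essentially the documented CUDA-enabled source build (my exact invocation may have used slightly different pip flags):

```bash
# Build llama-cpp-python from source with the CUDA backend enabled
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir --verbose
```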
Environment:
- GPU: NVIDIA RTX 5080
- OS: Ubuntu 24.04.2 LTS
- Python: 3.10/3.12 (tried both)
- GCC: 13.3.0
- G++: 13.3.0
- CUDA: 12.9
- CUDA Toolkit: 12.9
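(Roughly how I checked the versions above; comments show the values reported on this machine.)

```bash
gcc --version        # 13.3.0
g++ --version        # 13.3.0
nvcc --version       # release 12.9, V12.9.41
nvidia-smi           # CUDA Version: 12.9
python3 --version    # tried 3.10 and 3.12
```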
Error Log:
```
Building wheels for collected packages: llama-cpp-python
Building wheel for llama-cpp-python (pyproject.toml) ... error
error: subprocess-exited-with-error

× Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [161 lines of output]
*** scikit-build-core 0.11.2 using CMake 3.28.3 (wheel)
*** Configuring CMake...
loading initial cache file /tmp/tmpw194gw2i/build/CMakeInit.txt
-- The C compiler identification is GNU 13.3.0
-- The CXX compiler identification is GNU 13.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/x86_64-linux-gnu-gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/x86_64-linux-gnu-g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.43.0")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Found CUDAToolkit: /usr/local/cuda/targets/x86_64-linux/include (found version "12.9.41")
-- CUDA Toolkit found
-- Using CUDA architectures: native
-- The CUDA compiler identification is NVIDIA 12.9.41
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CUDA host compiler is GNU 13.3.0
-- Including CUDA backend
CMake Warning at vendor/llama.cpp/ggml/CMakeLists.txt:298 (message):
GGML build version fixed at 1 likely due to a shallow clone.
CMake Warning (dev) at CMakeLists.txt:13 (install):
Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
Call Stack (most recent call first):
CMakeLists.txt:97 (llama_cpp_python_install_target)
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Warning (dev) at CMakeLists.txt:21 (install):
Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
Call Stack (most recent call first):
CMakeLists.txt:97 (llama_cpp_python_install_target)
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Warning (dev) at CMakeLists.txt:13 (install):
Target ggml has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
Call Stack (most recent call first):
CMakeLists.txt:98 (llama_cpp_python_install_target)
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Warning (dev) at CMakeLists.txt:21 (install):
Target ggml has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
Call Stack (most recent call first):
CMakeLists.txt:98 (llama_cpp_python_install_target)
This warning is for project developers. Use -Wno-dev to suppress it.
-- Configuring done (6.8s)
-- Generating done (0.0s)
-- Build files have been written to: /tmp/tmpw194gw2i/build
*** Building project with Ninja...
Change Dir: '/tmp/tmpw194gw2i/build'
Run Build Command(s): /usr/bin/ninja -v
[1/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_CPU_AARCH64 -DGGML_USE_LLAMAFILE -DGGML_USE_OPENMP -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cpu_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -march=native -fopenmp -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-hbm.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-hbm.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-hbm.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-hbm.cpp
[2/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BUILD -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_base_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-threading.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-threading.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-threading.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-threading.cpp
[3/150] /usr/bin/x86_64-linux-gnu-gcc -DGGML_BUILD -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_base_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu11 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wdouble-promotion -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-alloc.c
[4/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_CPU_AARCH64 -DGGML_USE_LLAMAFILE -DGGML_USE_OPENMP -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cpu_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -march=native -fopenmp -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-traits.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-traits.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-traits.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-traits.cpp
[5/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_CPU_AARCH64 -DGGML_USE_LLAMAFILE -DGGML_USE_OPENMP -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cpu_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -march=native -fopenmp -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/mmq.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/mmq.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/mmq.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/amx/mmq.cpp
[6/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_CPU_AARCH64 -DGGML_USE_LLAMAFILE -DGGML_USE_OPENMP -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cpu_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -march=native -fopenmp -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/amx.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/amx.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/amx.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/amx/amx.cpp
[7/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BUILD -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_base_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-backend.cpp
[8/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BUILD -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_base_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-opt.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-opt.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-opt.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-opt.cpp
[9/150] /usr/bin/x86_64-linux-gnu-gcc -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_CPU_AARCH64 -DGGML_USE_LLAMAFILE -DGGML_USE_OPENMP -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cpu_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu11 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wdouble-promotion -march=native -fopenmp -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-quants.c.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-quants.c.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-quants.c.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-quants.c
[10/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_CPU_AARCH64 -DGGML_USE_LLAMAFILE -DGGML_USE_OPENMP -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cpu_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -march=native -fopenmp -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.cpp
[11/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BACKEND_SHARED -DGGML_BUILD -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_CPU -DGGML_USE_CUDA -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml.dir/ggml-backend-reg.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml.dir/ggml-backend-reg.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml.dir/ggml-backend-reg.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-backend-reg.cpp
[12/150] /usr/bin/x86_64-linux-gnu-gcc -DGGML_BUILD -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_base_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu11 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wdouble-promotion -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml.c
[13/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argmax.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argmax.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/argmax.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argmax.cu.o
FAILED: vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argmax.cu.o
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argmax.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argmax.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/argmax.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argmax.cu.o
/usr/include/c++/13/bits/basic_string.h(3163): error: default argument is not allowed
substr(size_type __pos = 0, size_type __n = npos) const
^
detected during instantiation of class "std::__cxx11::basic_string<_CharT, _Traits, _Alloc> [with _CharT=char32_t, _Traits=std::char_traits<char32_t>, _Alloc=std::allocator<char32_t>]" at line 4510
/usr/include/c++/13/bits/basic_string.h(3163): error: expected an expression
substr(size_type __pos = 0, size_type __n = npos) const
^
detected during instantiation of class "std::__cxx11::basic_string<_CharT, _Traits, _Alloc> [with _CharT=char32_t, _Traits=std::char_traits<char32_t>, _Alloc=std::allocator<char32_t>]" at line 4510
/usr/include/c++/13/bits/basic_string.h(3163): error: default argument is not allowed
substr(size_type __pos = 0, size_type __n = npos) const
^
detected during instantiation of class "std::__cxx11::basic_string<_CharT, _Traits, _Alloc> [with _CharT=char32_t, _Traits=std::char_traits<char32_t>, _Alloc=std::allocator<char32_t>]" at line 4510
/usr/include/c++/13/bits/basic_string.h(3163): error: expected an expression
substr(size_type __pos = 0, size_type __n = npos) const
^
detected during instantiation of class "std::__cxx11::basic_string<_CharT, _Traits, _Alloc> [with _CharT=char32_t, _Traits=std::char_traits<char32_t>, _Alloc=std::allocator<char32_t>]" at line 4510
4 errors detected in the compilation of "/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/argmax.cu".
[14/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_CPU_AARCH64 -DGGML_USE_LLAMAFILE -DGGML_USE_OPENMP -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cpu_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -march=native -fopenmp -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-aarch64.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-aarch64.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-aarch64.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
[15/150] /usr/bin/x86_64-linux-gnu-gcc -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_CPU_AARCH64 -DGGML_USE_LLAMAFILE -DGGML_USE_OPENMP -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cpu_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu11 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wdouble-promotion -march=native -fopenmp -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c
FAILED: vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o
/usr/bin/x86_64-linux-gnu-gcc -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_CPU_AARCH64 -DGGML_USE_LLAMAFILE -DGGML_USE_OPENMP -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cpu_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu11 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wdouble-promotion -march=native -fopenmp -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c
during RTL pass: cprop
/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c: In function ‘ggml_compute_forward_sub’:
/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:5426:1: internal compiler error: in try_forward_edges, at cfgcleanup.cc:580
5426 | }
| ^
0x108d8f4 internal_error(char const*, ...)
???:0
0x1083cf2 fancy_abort(char const*, int, char const*)
???:0
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-13/README.Bugs> for instructions.
[16/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/diagmask.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/diagmask.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/diagmask.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/diagmask.cu.o
[17/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/acc.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/acc.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/acc.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/acc.cu.o
[18/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/arange.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/arange.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/arange.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/arange.cu.o
[19/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/fattn.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn.cu.o
[20/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argsort.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argsort.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/argsort.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/argsort.cu.o
[21/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/count-equal.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/count-equal.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/count-equal.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/count-equal.cu.o
[22/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/clamp.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/clamp.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/clamp.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/clamp.cu.o
[23/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cross-entropy-loss.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cross-entropy-loss.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/cross-entropy-loss.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cross-entropy-loss.cu.o
[24/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/conv-transpose-1d.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/conv-transpose-1d.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/conv-transpose-1d.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/conv-transpose-1d.cu.o
[25/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/getrows.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/getrows.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/getrows.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/getrows.cu.o
[26/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/gla.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/gla.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/gla.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/gla.cu.o
[27/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/concat.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/concat.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/concat.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/concat.cu.o
[28/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmq.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmq.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/mmq.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmq.cu.o
[29/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_CPU_AARCH64 -DGGML_USE_LLAMAFILE -DGGML_USE_OPENMP -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cpu_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -march=native -fopenmp -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/llamafile/sgemm.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/llamafile/sgemm.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/llamafile/sgemm.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp
[30/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/out-prod.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/out-prod.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/out-prod.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/out-prod.cu.o
[31/150] /usr/bin/x86_64-linux-gnu-g++ -DGGML_BUILD -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_base_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu++17 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/gguf.cpp
[32/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/opt-step-adamw.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/opt-step-adamw.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/opt-step-adamw.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/opt-step-adamw.cu.o
[33/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/im2col.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/im2col.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/im2col.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/im2col.cu.o
[34/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn-tile-f16.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn-tile-f16.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/fattn-tile-f16.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn-tile-f16.cu.o
[35/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn-tile-f32.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn-tile-f32.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/fattn-tile-f32.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn-tile-f32.cu.o
[36/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cpy.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cpy.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/cpy.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cpy.cu.o
[37/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/pad.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/pad.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/pad.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/pad.cu.o
[38/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/pool2d.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/pool2d.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/pool2d.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/pool2d.cu.o
[39/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/convert.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/convert.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/convert.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/convert.cu.o
[40/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/norm.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/norm.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/norm.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/norm.cu.o
[41/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/binbcast.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/binbcast.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/binbcast.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/binbcast.cu.o
[42/150] /usr/bin/x86_64-linux-gnu-gcc -DGGML_BUILD -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_base_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -O3 -DNDEBUG -std=gnu11 -fPIC -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wdouble-promotion -MD -MT vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-quants.c.o -MF vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-quants.c.o.d -o vendor/llama.cpp/ggml/src/CMakeFiles/ggml-base.dir/ggml-quants.c.o -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-quants.c
[43/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmv.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmv.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/mmv.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmv.cu.o
[44/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/ggml-cuda.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/ggml-cuda.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/ggml-cuda.cu.o
[45/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn-wmma-f16.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn-wmma-f16.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/fattn-wmma-f16.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/fattn-wmma-f16.cu.o
[46/150] /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_GRAPHS -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/.. -I/tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/../include -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 -arch=native -Xcompiler=-fPIC -use_fast_math -compress-mode=size -Xcompiler "-Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-array-bounds -Wextra-semi -Wno-pedantic" -MD -MT vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmvq.cu.o -MF vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmvq.cu.o.d -x cu -c /tmp/pip-install-vrjn0p5p/llama-cpp-python_1e2b688442cf40f69f87a94888e22427/vendor/llama.cpp/ggml/src/ggml-cuda/mmvq.cu -o vendor/llama.cpp/ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/mmvq.cu.o
ninja: build stopped: subcommand failed.
*** CMake build failed
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects `
Reproduction Steps:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
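If the failure happens while nvcc is compiling the CUDA sources (as in the log above), one variant worth trying is pinning the CUDA architecture instead of relying on "native" detection. This is only a sketch, and it assumes an RTX 50-series card (compute capability sm_120, which requires CUDA 12.8 or newer):
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=120" \
  pip install llama-cpp-python --force-reinstall --no-cache-dir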
It's not working for CUDA 12.8 either. I had to edit CMakeLists.txt to remove the references to llava, and that was just the beginning of the adventure. I tried putting LLAMA_LLAVA=OFF in the environment, and then again in CMAKE_ARGS, but the setting was ignored.
git pull
git submodule update --remote vendor/llama.cpp
CC=gcc-13 CXX=g++-13 FORCE_CMAKE=1 CMAKE_BUILD_PARALLEL_LEVEL=7 \
CMAKE_ARGS="-DGGML_CUDA=on \
-DCMAKE_CUDA_FLAGS_RELEASE=-Wno-deprecated-gpu-targets \
-DLLAMA_LLAVA=OFF" \
pip install .[server] --upgrade --force-reinstall --no-cache-dir
CMake Error at CMakeLists.txt:150 (set_target_properties):
set_target_properties Can not find target to add properties to:
llava_shared
CMake Error at CMakeLists.txt:169-174 (target_include_directories):
Cannot specify include directories for target "llava" which is not built by
this project.
Exactly, the option is LLAVA_BUILD, not LLAMA_LLAVA.
More precisely, llama.cpp seems to have renamed/reworked its vision-language (llava) code, which broke compatibility with llama-cpp-python. For now, the workaround is simply to disable the llava build (no need to modify CMakeLists.txt).
Also, I'm on CUDA 12.4.
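For anyone else hitting the llava CMake errors above, a minimal sketch of an install command with the llava build disabled (LLAVA_BUILD is the flag that actually exists, as confirmed further down in this thread):
CMAKE_ARGS="-DGGML_CUDA=on -DLLAVA_BUILD=off" \
  pip install llama-cpp-python --force-reinstall --no-cache-dir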
Mine runs well on CPU but does not use the GPU. The llama.cpp server does, though. I had to install the llama.cpp server separately because I could never get llama-cpp-python to work with the GPU.
PyTorch version: 2.8.0.dev20250626+cu128 CUDA available: True CUDA version: 12.8 Device name: NVIDIA GeForce RTX 5070 ['sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120'] (pygpu) oba@mail:~/code$ python test_llama_cuda.py --- System Information --- Error during llama-cpp-python GPU check: cannot import name '_load_shared_library' from 'llama_cpp.llama_cpp' (/home/oba/miniconda3/envs/pygpu/lib/python3.10/site-packages/llama_cpp/llama_cpp.py) GPU Available: True Total RAM: 62.62 GB Available RAM: 60.14 GB Used RAM: 1.75 GB
load_tensors: loading model tensors, this can take a while... (mmap = true) load_tensors: layer 0 assigned to device CPU, is_swa = 1 load_tensors: layer 1 assigned to device CPU, is_swa = 0 load_tensors: layer 2 assigned to device CPU, is_swa = 1 load_tensors: layer 3 assigned to device CPU, is_swa = 0 load_tensors: layer 4 assigned to device CPU, is_swa = 1 load_tensors: layer 5 assigned to device CPU, is_swa = 0 load_tensors: layer 6 assigned to device CPU, is_swa = 1 load_tensors: layer 7 assigned to device CPU, is_swa = 0 load_tensors: layer 8 assigned to device CPU, is_swa = 1 load_tensors: layer 9 assigned to device CPU, is_swa = 0 load_tensors: layer 10 assigned to device CPU, is_swa = 1 load_tensors: layer 11 assigned to device CPU, is_swa = 0 load_tensors: layer 12 assigned to device CPU, is_swa = 1 load_tensors: layer 13 assigned to device CPU, is_swa = 0 load_tensors: layer 14 assigned to device CPU, is_swa = 1 load_tensors: layer 15 assigned to device CPU, is_swa = 0 load_tensors: layer 16 assigned to device CPU, is_swa = 1 load_tensors: layer 17 assigned to device CPU, is_swa = 0 load_tensors: layer 18 assigned to device CPU, is_swa = 1 load_tensors: layer 19 assigned to device CPU, is_swa = 0 load_tensors: layer 20 assigned to device CPU, is_swa = 1 load_tensors: layer 21 assigned to device CPU, is_swa = 0 load_tensors: layer 22 assigned to device CPU, is_swa = 1 load_tensors: layer 23 assigned to device CPU, is_swa = 0 load_tensors: layer 24 assigned to device CPU, is_swa = 1 load_tensors: layer 25 assigned to device CPU, is_swa = 0 load_tensors: layer 26 assigned to device CPU, is_swa = 0 load_tensors: tensor 'token_embd.weight' (q6_K) (and 132 others) cannot be used with preferred buffer type CPU_REPACK, using CPU instead load_tensors: CPU_REPACK model buffer size = 921.38 MiB load_tensors: CPU_Mapped model buffer size = 1623.67 MiB
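The "_load_shared_library" import in that check script is a private symbol that has moved between releases, so its failure doesn't tell you much. A less fragile check, plus an explicit request for full offload, might look like the sketch below (assuming a recent 0.3.x wheel, where the low-level bindings expose llama_supports_gpu_offload; the model path is just a placeholder):
# sketch: verify the installed wheel was built with a GPU backend, then ask for full offload
import llama_cpp
from llama_cpp import Llama

print("llama-cpp-python version:", llama_cpp.__version__)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())

# n_gpu_layers=-1 requests every layer on the GPU. The load_tensors log above shows
# all layers assigned to CPU, which is what you get when n_gpu_layers is left at 0
# or when the wheel was built without the CUDA backend.
llm = Llama(model_path="model.gguf", n_gpu_layers=-1, verbose=True)
If llama_supports_gpu_offload() prints False, the wheel itself was built CPU-only and no runtime flag will change that.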
(pygpu) oba@mail:~/code$ nvidia-smi
Sun Jul 6 19:22:16 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5070 On | 00000000:01:00.0 Off | N/A |
| 0% 36C P8 3W / 250W | 72MiB / 12227MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2028      G   /usr/lib/xorg/Xorg                       43MiB |
|    0   N/A  N/A            2275      G   /usr/bin/gnome-shell                      8MiB |
+-----------------------------------------------------------------------------------------+
(pygpu) oba@mail:~/code$
Oh my god, I can't solve this problem. I'm stuck on this issue too; a lot of things seem to be incompatible, considering that even EEVE doesn't work.
I can confirm: I have tried to get llama-cpp-python to work with my GPU and so far nothing has worked, including compiling it several times with several flags. It works well with CPU.
Try this:
https://www.youtube.com/watch?v=o5deOXLDpZw&t=779s
(llama-env) oba@mail:~/llama-cpp/llama-cpp-python$ nvidia-smi
Wed Jul 9 15:22:19 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5070 On | 00000000:01:00.0 On | N/A |
| 0% 38C P8 8W / 250W | 221MiB / 12227MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 3519 G /usr/lib/xorg/Xorg 100MiB | | 0 N/A N/A 3809 G /usr/bin/gnome-shell 22MiB | | 0 N/A N/A 5031 G .../6436/usr/lib/firefox/firefox 10MiB | | 0 N/A N/A 139334 G ...Ptr --variations-seed-version 40MiB | +-----------------------------------------------------------------------------------------+ (llama-env) oba@mail:~/llama-cpp/llama-cpp-python$ find /usr -name "cuda_runtime.h" 2>/dev/null ls -la /usr/include/cuda* ls -la /usr/local/cuda*/include/cuda_runtime.h 2>/dev/null /usr/include/cuda_runtime.h /usr/local/cuda-12.8/targets/x86_64-linux/include/cuda_runtime.h -rw-r--r-- 1 root root 9340 Jan 28 2023 /usr/include/cuda_awbarrier.h -rw-r--r-- 1 root root 12489 Jan 28 2023 /usr/include/cuda_awbarrier_helpers.h -rw-r--r-- 1 root root 4699 Jan 28 2023 /usr/include/cuda_awbarrier_primitives.h -rw-r--r-- 1 root root 148277 Jan 28 2023 /usr/include/cuda_bf16.h -rw-r--r-- 1 root root 104876 Jan 28 2023 /usr/include/cuda_bf16.hpp -rw-r--r-- 1 root root 39755 Jan 28 2023 /usr/include/cuda_device_runtime_api.h -rw-r--r-- 1 root root 39547 Apr 1 2024 /usr/include/cudaEGL.h -rw-r--r-- 1 root root 37111 Apr 1 2024 /usr/include/cuda_egl_interop.h -rw-r--r-- 1 root root 5645 Jan 28 2023 /usr/include/cudaEGLTypedefs.h -rw-r--r-- 1 root root 140857 Jan 28 2023 /usr/include/cuda_fp16.h -rw-r--r-- 1 root root 98560 Jan 28 2023 /usr/include/cuda_fp16.hpp -rw-r--r-- 1 root root 13358 Jan 28 2023 /usr/include/cuda_fp8.h -rw-r--r-- 1 root root 56491 Jan 28 2023 /usr/include/cuda_fp8.hpp -rw-r--r-- 1 root root 22501 Jan 28 2023 /usr/include/cudaGL.h -rw-r--r-- 1 root root 19150 Jan 28 2023 /usr/include/cuda_gl_interop.h -rw-r--r-- 1 root root 6576 Jan 28 2023 /usr/include/cudaGLTypedefs.h -rw-r--r-- 1 root root 899823 Apr 1 2024 /usr/include/cuda.h -rw-r--r-- 1 root root 4105 Jan 28 2023 /usr/include/cudalibxt.h -rw-r--r-- 1 root root 67179 Jan 28 2023 /usr/include/cuda_occupancy.h -rw-r--r-- 1 root root 8130 Jan 28 2023 /usr/include/cuda_pipeline.h -rw-r--r-- 1 root root 13852 Jan 28 2023 /usr/include/cuda_pipeline_helpers.h -rw-r--r-- 1 root root 8675 Jan 28 2023 /usr/include/cuda_pipeline_primitives.h -rw-r--r-- 1 root root 4566 Jan 28 2023 /usr/include/cuda_profiler_api.h -rw-r--r-- 1 root root 7019 Jan 28 2023 /usr/include/cudaProfiler.h -rw-r--r-- 1 root root 3297 Jan 28 2023 /usr/include/cudaProfilerTypedefs.h -rw-r--r-- 1 root root 2717 Jan 28 2023 /usr/include/cudart_platform.h -rw-r--r-- 1 root root 558852 Apr 1 2024 /usr/include/cuda_runtime_api.h -rw-r--r-- 1 root root 87596 Apr 1 2024 /usr/include/cuda_runtime.h -rw-r--r-- 1 root root 4093 Jan 28 2023 /usr/include/cuda_stdint.h -rw-r--r-- 1 root root 3688 Jan 28 2023 /usr/include/cuda_surface_types.h -rw-r--r-- 1 root root 3688 Jan 28 2023 /usr/include/cuda_texture_types.h -rw-r--r-- 1 root root 99644 Jan 28 2023 /usr/include/cudaTypedefs.h -rw-r--r-- 1 root root 12694 Jan 28 2023 /usr/include/cudaVDPAU.h -rw-r--r-- 1 root root 7727 Jan 28 2023 /usr/include/cuda_vdpau_interop.h -rw-r--r-- 1 root root 4144 Jan 28 2023 /usr/include/cudaVDPAUTypedefs.h
/usr/include/cuda: total 116 drwxr-xr-x 3 root root 4096 Jun 26 23:28 . drwxr-xr-x 57 root root 20480 Jul 1 06:51 .. -rw-r--r-- 1 root root 23822 Dec 15 2022 annotated_ptr -rw-r--r-- 1 root root 435 Dec 15 2022 atomic -rw-r--r-- 1 root root 436 Dec 15 2022 barrier -rw-r--r-- 1 root root 14474 Dec 15 2022 functional -rw-r--r-- 1 root root 434 Dec 15 2022 latch -rw-r--r-- 1 root root 32520 Dec 15 2022 pipeline -rw-r--r-- 1 root root 438 Dec 15 2022 semaphore drwxr-xr-x 3 root root 4096 Jun 26 23:28 std
/usr/include/cuda-gdb: total 92 drwxr-xr-x 2 root root 4096 Jun 26 23:28 . drwxr-xr-x 57 root root 20480 Jul 1 06:51 .. -rw-r--r-- 1 root root 6497 Jan 28 2023 cudacoredump.h -rw-r--r-- 1 root root 47318 Jan 28 2023 cudadebugger.h -rw-r--r-- 1 root root 4093 Jan 28 2023 cuda_stdint.h -rw-r--r-- 1 root root 4341 Jan 28 2023 libcudacore.h -rw-r--r-- 1 oba root 98570 Feb 13 03:29 /usr/local/cuda-12.8/include/cuda_runtime.h -rw-r--r-- 1 oba root 98570 Feb 13 03:29 /usr/local/cuda-12/include/cuda_runtime.h -rw-r--r-- 1 oba root 98570 Feb 13 03:29 /usr/local/cuda/include/cuda_runtime.h (llama-env) oba@mail:~/llama-cpp/llama-cpp-python$ export CC=/usr/bin/gcc export CXX=/usr/bin/g++ export NVCC=/usr/bin/nvcc export CUDA_HOME=/usr/local/cuda-12.8 export CUDA_ROOT=/usr/local/cuda-12.8 export CMAKE_ARGS="-DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-12.8 -DCMAKE_CUDA_COMPILER=/usr/bin/nvcc" pip install . Processing /home/oba/llama-cpp/llama-cpp-python Installing build dependencies ... done Getting requirements to build wheel ... done Preparing metadata (pyproject.toml) ... done Collecting typing-extensions>=4.5.0 (from llama_cpp_python==0.3.12) Using cached typing_extensions-4.14.1-py3-none-any.whl.metadata (3.0 kB) Collecting numpy>=1.20.0 (from llama_cpp_python==0.3.12) Using cached numpy-2.3.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (62 kB) Collecting diskcache>=5.6.1 (from llama_cpp_python==0.3.12) Using cached diskcache-5.6.3-py3-none-any.whl.metadata (20 kB) Collecting jinja2>=2.11.3 (from llama_cpp_python==0.3.12) Using cached jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB) Collecting MarkupSafe>=2.0 (from jinja2>=2.11.3->llama_cpp_python==0.3.12) Using cached MarkupSafe-3.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB) Using cached diskcache-5.6.3-py3-none-any.whl (45 kB) Using cached jinja2-3.1.6-py3-none-any.whl (134 kB) Using cached numpy-2.3.1-cp312-cp312-manylinux_2_28_x86_64.whl (16.6 MB) Using cached typing_extensions-4.14.1-py3-none-any.whl (43 kB) Using cached MarkupSafe-3.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23 kB) Building wheels for collected packages: llama_cpp_python Building wheel for llama_cpp_python (pyproject.toml) ... done Created wheel for llama_cpp_python: filename=llama_cpp_python-0.3.12-cp312-cp312-linux_x86_64.whl size=4287484 sha256=f02247be0df857ad27fa09e1df165380e087abab8aee4f9daacba0a22ff97131 Stored in directory: /home/oba/.cache/pip/wheels/5b/fd/89/b164b94d5e48b52fc5ebd1a61f9ea22cebdc966e83bd2cd4c3 Successfully built llama_cpp_python Installing collected packages: typing-extensions, numpy, MarkupSafe, diskcache, jinja2, llama_cpp_python Successfully installed MarkupSafe-3.0.2 diskcache-5.6.3 jinja2-3.1.6 llama_cpp_python-0.3.12 numpy-2.3.1 typing-extensions-4.14.1
(llama-env) oba@mail:~/llama-cpp/llama-cpp-python$ nano check.py (llama-env) oba@mail:~/llama-cpp/llama-cpp-python$ python check.py llama-cpp-python imported successfully! Import test passed Available backends: llama-cpp-python version: 0.3.12 Installation successful! Ready to load models. (llama-env) oba@mail:~/llama-cpp/llama-cpp-python$
I used CUDA 12.8.1 with the following CMake args and it worked for me, although I agree that 12.9.1 is still not supported due to breaking changes in macros.
$env:CMAKE_ARGS = '-DGGML_CUDA=on -DLLAVA_BUILD=off -DCMAKE_CUDA_ARCHITECTURES=all'
$env:CMAKE_ARGS = "-DGGML_CUDA_FORCE_MMQ=OFF $env:CMAKE_ARGS"
$env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DGGML_AVX2=off -DGGML_FMA=off -DGGML_F16C=off'
You can find the full config I used to build it from source here, https://github.com/ChamalGomesHSO/artifacts/blob/main/.github/workflows/llama-build-cuda.yaml
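Roughly the same configuration translated to a bash one-liner for the Linux users above (untested on my side, just the same flags moved over):
CMAKE_ARGS="-DGGML_CUDA=on -DLLAVA_BUILD=off -DCMAKE_CUDA_ARCHITECTURES=all -DGGML_CUDA_FORCE_MMQ=OFF -DGGML_AVX2=off -DGGML_FMA=off -DGGML_F16C=off" \
  pip install llama-cpp-python --force-reinstall --no-cache-dir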
Llama.cpp is a bit of a moving target right now, with daily changes. The stock llama-cpp-python compiled out of the box for me without errors last month, but "it didn't work": something about the tensor arrays having different shapes. I figured I'd just wait and check back later, since there is so much else going on.
I compiled builds based on CUDA 12.8.1 and CUDA 13.0.2 for use with 50-series Blackwell cards; maybe you can try one of these:
https://github.com/JamePeng/llama-cpp-python/releases/tag/v0.3.17-cu128-AVX2-win-20251209
https://github.com/JamePeng/llama-cpp-python/releases/tag/v0.3.17-cu128-AVX2-linux-20251209
https://github.com/JamePeng/llama-cpp-python/releases/tag/v0.3.17-cu130-AVX2-win-20251209
https://github.com/JamePeng/llama-cpp-python/releases/tag/v0.3.17-cu130-AVX2-linux-20251209
I'm trying 12.8 on Windows, but my build keeps failing at random points with this error:
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.8.targets(800,9): error MSB4018: The "CudaCompile"
task failed unexpectedly. [C:\Users\mkp\AppData\Local\Temp\tmpbuupfenq\build\CMakeFiles\4.0.3\CompilerIdCUDA\CompilerIdCUDA.vcxproj]
C:\Program Files (x86)\Microsoft Visual
Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA
12.8.targets(800,9): error MSB4018: System.IO.IOException: The process
cannot access the file
'C:\Users\mkp\AppData\Local\Temp\tmpaa006c04ce474a069019d89efb2c3c66.cmd'
because it is being used by another process.
[C:\Users\mkp\AppData\Local\Temp\tmpbuupfenq\build\CMakeFiles\4.0.3\CompilerIdCUDA\CompilerIdCUDA.vcxproj]
No combination of flags fixes it: MSVC 2019/2022, llava on/off, MinGW, etc. The only way to get it to build properly is CPU-only. The strangest part is that it's not consistent between invocations (and it never seems to cache any objects, even with ccache).
I think Windows Storage Sense is deleting temp files out from under the compiler. The build seems to be going a lot more smoothly since I disabled it.
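If you'd rather not disable Storage Sense globally, a sketch of another workaround (untested here): point TMP/TEMP at a folder Storage Sense won't clean for the duration of the build, since pip and CMake both create their temporary build trees under the user temp directory:
mkdir C:\llama-build-tmp
$env:TMP  = 'C:\llama-build-tmp'
$env:TEMP = 'C:\llama-build-tmp'
$env:CMAKE_ARGS = '-DGGML_CUDA=on -DLLAVA_BUILD=off'
pip install llama-cpp-python --no-cache-dir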
I've suffered from the same problem for two days. The task is to run third-generation Gemma 3 GGUF models; there are no problems with Gemma 2, but with CUDA 12.9 it doesn't work.
I ended up just compiling llama.cpp and using easy_llama. It's a lot better than I initially anticipated, given that this repository is a year out of date. The only thing wrong with it is that its high-level classes are borderline useless, so you end up using the exposed APIs to do most of the tedious work.
It's not that I'm set on using the RTX 3090 specifically; everything works for me, but I can't get beyond Gemma 2 or other second-generation models, even though newer models run on the same computer via LM Studio. I need to run them from Python for my tasks.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23305 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from gemma-3-12b-it-Q6_K.gguf (version GGUF V3 (latest))
.....................
File "C:\Users\Model24\chat\chat99x.py", line 112, in _init_model
  return Llama(
File "C:\Users\Model24\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_cpp\llama.py", line 369, in __init__
  internals.LlamaModel(
File "C:\Users\Model24\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_cpp\_internals.py", line 56, in __init__
  raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: gemma-3-12b-it-Q6_K.gguf
Can anyone tell me which version of llama-cpp-python works with third-generation (V3) GGUF models? V2 is no problem.
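Gemma 3 support landed in llama.cpp much later than Gemma 2, so the first thing to rule out is an old wheel. A trivial check (nothing deeper than the installed package version):
python -c "import llama_cpp; print(llama_cpp.__version__)"
If that prints an old 0.2.x or early 0.3.x version, the bundled llama.cpp almost certainly predates the Gemma 3 architecture and will fail to load those GGUFs regardless of the CUDA version.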
I suspect the issue is that it might work with 12.8 but not 12.9. <-- correction: to me it seems the real issue is that sm_120 is not supported in my case.
(tf) oba@mail:~/scripts$ ls *py
apache_db_agumented.py chkcuda_llamacpppython.py cnn_gru_nidsV3.py ip_weather.py logchkV1.py ssh_cnn_gru_trainv1.py
apache_db_agumentedV1.py chkpythorch.py cudatest.py list_blocked.py logchkV2.py ssh_cnn_gru_trainv2.py
apache_log_explorer.py chk_tf_gpu.py drop_dups_list_blocked_list.py logcheck_short1.py lstm_auth_log.py ssh_cnn_trainV3.py
apache_log_parser.py cnn_gru_nids_nostop.py drop_dups_list_blocked.py logcheck_short.py playwright_test.py test_tf_keras.py
apache_realtime_cnn_gru_monitor.py cnn_gru_nids.py fake.py logcheck_shortv1.py realtime_monitor.py torch_info.py
apache_realtime_cnn_gru_monitorV1.py cnn_gru_nidsV1.py geo_locate_ip.py logcheck_shortv2.py saygpu.py untitled.py
apache_realtime_cnn_gru_monitorV2.py cnn_gru_nidsV2.py ipblock.py logchk.py ssh_cnn_gru_train.py
(tf) oba@mail:~/scripts$ python saygpu.py
['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_37', 'sm_90']
Tue Aug 12 15:56:16 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5070 Ti On | 00000000:01:00.0 Off | N/A |
| 0% 36C P8 16W / 300W | 77MiB / 16303MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          113881      G   /usr/lib/xorg/Xorg                       43MiB |
|    0   N/A  N/A          114062      G   /usr/bin/gnome-shell                     10MiB |
+-----------------------------------------------------------------------------------------+
I apologize, I meant to say that sm_120 does not seem to be supported. I don't want to downgrade, because getting sm_120 working took me some effort as well. I think there are also issues with PyTorch GPU support on sm_120.
/home/oba/miniconda3/envs/tf/lib/python3.10/site-packages/torch/cuda/init.py:287: UserWarning: NVIDIA GeForce RTX 5070 Ti with CUDA capability sm_120 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_37 sm_90. If you want to use the NVIDIA GeForce RTX 5070 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
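That warning is the clearest signal in this thread: the installed torch wheel simply wasn't built with sm_120 code. A quick way to confirm the same kind of mismatch from Python (standard torch APIs; the analogous question for llama-cpp-python is whether nvcc was allowed to emit sm_120 at build time):
# sketch: check whether the installed torch build can target this GPU
import torch

print("CUDA available:   ", torch.cuda.is_available())
print("device capability:", torch.cuda.get_device_capability(0))  # (12, 0) on RTX 50-series
print("compiled arch list:", torch.cuda.get_arch_list())          # needs 'sm_120' in this list
On the llama-cpp-python side, "nvcc --list-gpu-code" (CUDA 12.8+) should include sm_120 if the local toolkit can target Blackwell at all; if it can, pinning -DCMAKE_CUDA_ARCHITECTURES=120 in CMAKE_ARGS (as sketched earlier in the thread) is worth a try.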
I've noticed that the CUDA version has little to do with whether this works. If we look at LM Studio, their software works without problems with any CUDA version, so perhaps the problem lies in llama-cpp-python itself. I still haven't understood what the maintainers get out of it, and they seem to have no obvious interest in developing this product deeply.
By the way, it's useless to ask DeepSeek or another AI: they will lead you in circles, offering to do this or that or to reinstall something. It's a dark forest, and it only gets worse, since their knowledge comes from reading the same threads where everyone offers their own option. It really is a dark forest with many paths, so if someone knows a specific build that can run V3 models, I would be extremely grateful.
I have tried many approaches, and it seems we all confirm what you stated. I only have one Nvidia card and only one system.
Linux 6.14.0-27-generic #27~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 22 17:38:49 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
I'm 100% sure it is not compatible with this system when using the GPU. It works fine on CPU.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 Ti     On  |   00000000:01:00.0 Off |                  N/A |
|  0%   37C    P8             16W /  300W |      85MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3126      G   /usr/lib/xorg/Xorg                       43MiB |
|    0   N/A  N/A            4333      G   /usr/bin/gnome-shell                     10MiB |
+-----------------------------------------------------------------------------------------+
CUDA on Windows: since Visual Studio 2022
- currently doesn't support BMI2 on Haswell and newer,
- and versions newer than 2022 don't work with CUDA 13...
the resulting binaries could be twice as fast with some other compiler, so I'm trying clang. Ref. https://github.com/ggml-org/llama.cpp/issues/534
My attempt at replacing MSVC with clang might not work: clang-cl is just Clang pretending to be MSVC, and the existing 2022 build scripts probably won't pick up BMI2. We'll see... testing:
winget install Chocolatey.Chocolatey   # Chocolatey's winget package ID, assuming choco isn't installed already
choco install ninja
choco install LLVM
# exit and restart pwsh. verify if 'C:\Program Files\LLVM\bin' is in %PATH%
clang --version
PS C:\Users\Test\Downloads\llama\llama.cpp> cmake -B build `
-DGGML_CUDA=on `
-G "Visual Studio 17 2022" -A x64 `
-DCMAKE_BUILD_TYPE=Release `
-DCMAKE_C_COMPILER=clang-cl `
-DCMAKE_CXX_COMPILER=clang-cl `
-DCMAKE_CUDA_ARCHITECTURES=86 `
-DCURL_INCLUDE_DIR="C:/Users/Test/Downloads/vcpkg/packages/curl_x64-windows/include" `
-DCURL_LIBRARY="C:/Users/Test/Downloads/vcpkg/packages/curl_x64-windows/lib/libcurl.lib"
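One caveat on that invocation (based on how CMake's Visual Studio generator behaves in general, not verified against this exact tree): with -G "Visual Studio 17 2022", the CMAKE_C_COMPILER/CMAKE_CXX_COMPILER settings are ignored, and the supported way to get clang-cl under that generator is the ClangCL toolset; nvcc on Windows will still use cl.exe as its CUDA host compiler either way. A sketch of the alternative:
cmake -B build -G "Visual Studio 17 2022" -A x64 -T ClangCL `
  -DGGML_CUDA=ON `
  -DCMAKE_CUDA_ARCHITECTURES=86 `
  -DCMAKE_BUILD_TYPE=Release `
  -DCURL_INCLUDE_DIR="C:/Users/Test/Downloads/vcpkg/packages/curl_x64-windows/include" `
  -DCURL_LIBRARY="C:/Users/Test/Downloads/vcpkg/packages/curl_x64-windows/lib/libcurl.lib"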