[INFORMATION REQUEST] Is it possible to build for a GPU-enabled target on a non-GPU host?
This is just a question about the blessed path here: is it possible to build a Docker image that includes llama-cpp-python on a non-GPU host, targeting deployment to a GPU host?
We build a base Docker image containing llama-cpp-python==0.2.53 using the following Dockerfile (only the relevant portion is included for brevity):
ARG CUDA_IMAGE="12.5.0-devel-ubuntu22.04"
FROM nvidia/cuda:${CUDA_IMAGE} AS base
...
# setting build related env vars
ENV CUDA_DOCKER_ARCH=all
ENV LLAMA_CUBLAS=1
ENV FORCE_CMAKE=1
ENV CUDACXX="/usr/local/cuda-12.5/bin/nvcc"
ENV CMAKE_CUDA_ARCHITECTURES=80
# Install llama-cpp-python (build with cuda)
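# (note: 0.2.53 predates llama.cpp's GGML_CUDA rename, so -DGGML_CUDA=on below is
#  likely ignored by this version; -DLLAMA_CUBLAS=on is the flag that takes effect)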
ENV CMAKE_ARGS="-DLLAMA_CURL=on -DGGML_CUDA=on -DLLAMA_CUBLAS=on -DCMAKE_CUDA_FLAGS='-arch=sm_80' -DCMAKE_CXX_FLAGS='-march=znver2'"
RUN pip install "llama-cpp-python==0.2.53" --no-cache-dir --force-reinstall --upgrade
...
We then use this as the base image and add our application code in a CI/CD build step. The application code uses the guidance library, if that is important :).
The initial build phase is very manual because we don't yet have GPU hosts available as workers in our CI/CD build system, so we need to manually spin up a GPU VPS, log in, pull down the code, build the image, and push it to our repository. This is error-prone and hard to automate, so we have just started moving the process into our CI/CD system. Before we invest resources in integrating a GPU worker into our build system, we would like to definitively rule in or out whether an image built on a non-GPU host can utilise a GPU when deployed on a GPU host.
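Concretely, the step we would like to automate on a CPU-only CI worker is roughly the following (the registry and tag names here are placeholders):
# Build the CUDA-enabled base image on a worker with no GPU, then push it
docker build -t registry.example.com/llama-base:0.2.53 .
docker push registry.example.com/llama-base:0.2.53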
Has this been done? Can it be? If not, can someone point me in the direction of the technical background as to why not? I'm new to GPU-accelerated ML, so any info is greatly appreciated.
Hey @m-o-leary were you able to figure it out? I am now facing the same issue.
If you have a workaround for it please let me know.
Thanks!
@zeidsolh in my testing I was unable to do it: I wasn't able to build on a non-GPU host and then successfully use the GPU at runtime.
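For reference, a quick way to check whether the GPU is actually being used at runtime (a sketch; the model path is a placeholder):
# Run inside the container on the GPU host; with verbose=True the load logs
# should show layers being offloaded to CUDA, and nvidia-smi should show memory usage
python3 -c "from llama_cpp import Llama; Llama('/models/model.gguf', n_gpu_layers=-1, verbose=True)"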
I didn't explore the prebuilt binaries though, so that could be an avenue to try?
What do you mean by prebuilt binaries, @m-o-leary? I would like to give it a try.
There's a mention of it here: https://llama-cpp-python.readthedocs.io/en/latest/
In the Supported Backends -> CUDA section
As I said, I haven't explored it at all, so it might be a dead end before you even start.
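If it helps, the docs describe installing a pre-built wheel from a version-specific package index instead of compiling, something like the following (cu121 is an example tag; pick the index matching your CUDA version):
# Install a pre-built CUDA wheel instead of compiling from source
# (cu121 corresponds to CUDA 12.1; see the docs for the available versions)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121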
Thanks!
@m-o-leary @zeidsolh any luck?
Not from me yet, but I haven't been able to dedicate much time to it.
I did try using a CUDA base image, but both building from source and installing the pre-built wheel gave errors (the latter failed with ERROR: Could not find a version that satisfies the requirement llama-cpp-python), so a GPU does seem necessary.
After some more testing with nvidia/cuda:12.8.0-devel-ubuntu24.04: yes, this is very much possible, the image is just big.
Would it be possible to share what you've done?
It might be possible to cut the size down with a multi-stage build that uses a runtime or base CUDA image as the final stage; a rough sketch follows.
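Since CMAKE_CUDA_ARCHITECTURES is pinned at build time, nvcc only needs the CUDA toolkit (not a physical GPU) to compile, which is why a CPU-only builder can work at all. An untested sketch of such a multi-stage build (the image tags, package list, and venv path are assumptions):
ARG CUDA_VERSION=12.8.0
# Build stage: full CUDA toolkit, compiles llama-cpp-python without needing a GPU
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu24.04 AS build
RUN apt-get update && apt-get install -y python3-venv build-essential git
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
ENV FORCE_CMAKE=1 CMAKE_ARGS="-DGGML_CUDA=on" CMAKE_CUDA_ARCHITECTURES=80
RUN pip install --no-cache-dir llama-cpp-python
# Final stage: the much smaller runtime image, with no compiler or CUDA headers
FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu24.04
RUN apt-get update && apt-get install -y python3 && rm -rf /var/lib/apt/lists/*
COPY --from=build /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
The GPU and driver are only required when the container actually runs (e.g. docker run --gpus all ...).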