
Your GPU is probably not used at all, which would explain the slow speed in answering.

Open · thomasmeneghelli opened this issue 1 year ago · 1 comment

Please help me configure BLAS=1 on an RTX 3070 running Windows 11. I have installed llama-cpp-python==0.2.23 with --no-cache-dir.

Thank you so much.

Your GPU is probably not used at all, which would explain the slow speed in answering.

You are using a 7-billion-parameter model without quantization, which means that with 16-bit weights (2 bytes each), the model alone is 14 GB in size.
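As a rough sanity check, the weight-memory arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch that counts only the weights; it ignores the KV cache, activations, and runtime overhead, which all add more on top:

```python
# Model-size estimate: parameter count x bytes per weight.
PARAMS = 7e9  # 7-billion-parameter model

for bits in (16, 8, 4, 2):  # fp16 plus common quantization widths
    size_gb = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{size_gb:.2f} GB")

# 16-bit weights: ~14.00 GB  -> far beyond a 6-8 GB consumer GPU
#  2-bit weights: ~ 1.75 GB  -> fits, but quality suffers badly
```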

As your GPU has only 6 GB, it will probably not be usable for any reasonably sized model.

For example, I have a 3070 with 8 GB, and even with the 2-bit quantized version of a 7-billion-parameter model (which probably has very low quality), I run out of GPU RAM because cuBLAS requires extra space.

Originally posted by @KonradHoeffner in https://github.com/PromtEngineer/localGPT/discussions/231#discussioncomment-6594143

thomasmeneghelli · Feb 18 '24
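On the original question about getting BLAS=1: llama-cpp-python only reports `BLAS = 1` at load time when it was compiled against a BLAS backend such as cuBLAS, so the usual fix is to reinstall it from source with the cuBLAS flag set. A sketch, assuming the CUDA Toolkit and a C++ compiler are already installed, and that the flag name matches the 0.2.x build options:

```powershell
# PowerShell on Windows 11; forces a source build with cuBLAS enabled
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
pip install llama-cpp-python==0.2.23 --no-cache-dir --force-reinstall --upgrade
```

Even with a cuBLAS build, the VRAM limits described above still apply. llama-cpp-python lets you offload only part of the model to the GPU via `n_gpu_layers`, so a quantized GGUF can be split between an 8 GB card and system RAM. A minimal sketch; the model path is a placeholder:

```python
from llama_cpp import Llama

# Offload some transformer layers to the GPU, keep the rest in system RAM.
# Tune n_gpu_layers down if you hit out-of-memory errors; -1 offloads everything.
llm = Llama(
    model_path="./models/your-7b-model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # number of layers placed on the GPU
    n_ctx=2048,        # context window; larger values use more memory
)

print(llm("Q: What is cuBLAS? A:", max_tokens=64)["choices"][0]["text"])
```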

Just for clarity: GGUF models are quantized models and are meant to run on the CPU and system memory. If you want to run the model on the GPU, you must select HF (Hugging Face) models in this code, which requires your account and an HF token to log in while downloading the model (first time only).

TechInnovate01 · Mar 01 '24
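For reference, loading a Hugging Face model onto the GPU with plain transformers looks roughly like this. This is a sketch, not localGPT's exact code: the model id is an example, `device_map="auto"` requires the accelerate package, and gated models additionally need `huggingface-cli login` or a token argument:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example id; gated, needs an HF token

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halves memory vs. fp32; still ~14 GB for 7B
    device_map="auto",          # places weights on the GPU, spilling to CPU if needed
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```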