
BLAS = 0 Always

Open erswelljustin opened this issue 2 years ago • 9 comments

Hi @PromtEngineer

I have followed the README instructions and also watched your latest YouTube video, but even when I set --device_type to cuda manually when running run_localGPT.py or run_localGPT_API, the BLAS value is always shown as BLAS = 0.

I am running Ubuntu 22.04 with an NVIDIA RTX 4080. This is my lspci output for reference:

	VGA compatible controller: NVIDIA Corporation Device 2704 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5112
	Flags: bus master, fast devsel, latency 0, IRQ 164
	Memory at 80000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 4000000000 (64-bit, prefetchable) [size=16G]
	Memory at 4400000000 (64-bit, prefetchable) [size=32M]
	I/O ports at 4000 [size=128]
	Expansion ROM at 81000000 [virtual] [disabled] [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

I am using the following model in constants.py:

MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF"
MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"

Can you advise? Currently it runs off the CPU, and ideally I'd like it to run on the very capable GPU.

Thanks!

erswelljustin avatar Sep 23 '23 14:09 erswelljustin

GGUF (formerly GGML) is only for CPU. If you are using CUDA, you need the GPTQ models.

thebetauser avatar Sep 23 '23 21:09 thebetauser

In my experience on Ubuntu 22.04, BLAS=0 happened when my build of llama-cpp-python failed to find my cuda-toolkit installation (including cublas.h) in an Anaconda environment. I used the --verbose flag to see the logs:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.83 --no-cache-dir --verbose
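Before rebuilding, it can help to confirm the toolkit is actually visible; a minimal sketch of that check (behaviour assumed typical for a cuda-toolkit install, not guaranteed):

```shell
# Check whether the CUDA compiler (installed with cuda-toolkit) is on PATH;
# if it is not, the llama-cpp-python build falls back to a CPU-only BLAS.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found - install cuda-toolkit or add it to PATH"
fi
```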

rcantada avatar Sep 23 '23 23:09 rcantada

@erswelljustin As mentioned above, GGUF is a great option if you are running localGPT on Apple silicon or CPU. If you have access to an NVIDIA GPU, I would recommend using GPTQ models. Also check whether you have pytorch installed with access to CUDA. In the same virtual env, open Python and run this code:

import torch
print(torch.cuda.is_available())
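A slightly fuller version of that check also prints the device name when CUDA is visible; a sketch, assuming PyTorch is installed (with a fallback message when it is not):

```python
# Report whether PyTorch can see a CUDA device, and which one it is.
try:
    import torch
    if torch.cuda.is_available():
        print("CUDA available:", torch.cuda.get_device_name(0))
    else:
        print("CUDA not available - check your PyTorch/CUDA install")
except ImportError:
    print("PyTorch is not installed in this environment")
```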

PromtEngineer avatar Sep 24 '23 04:09 PromtEngineer

This allowed me to use CPU and GPU simultaneously with GGUF, for Windows:

Set the environment variables properly:
$Env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"
$Env:FORCE_CMAKE="1"

Check that it worked:
echo $Env:CMAKE_ARGS

Uninstall the previous version of llama-cpp-python:
pip uninstall llama-cpp-python

Install the proper version:
pip install llama-cpp-python==0.1.83 --no-cache-dir
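Once the cuBLAS build is in place, offload is requested per model load via n_gpu_layers; a minimal sketch (the path and layer count below are illustrative, not from this thread):

```python
def load_gguf_with_offload(model_path: str, gpu_layers: int = 32):
    """Load a GGUF model via llama-cpp-python, offloading layers to the GPU.

    Requires a cuBLAS-enabled build; n_gpu_layers=0 keeps everything on CPU.
    """
    from llama_cpp import Llama  # lazy import so CPU-only setups can still load this module
    return Llama(model_path=model_path, n_gpu_layers=gpu_layers)
```

When the offload is active, the llama.cpp startup banner reports BLAS = 1 instead of BLAS = 0.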

@erswelljustin I would say, check your llama-cpp-python version.

EISMANN-DEV avatar Sep 24 '23 08:09 EISMANN-DEV

Thanks all for your help I will report back

erswelljustin avatar Sep 24 '23 13:09 erswelljustin

@PromtEngineer I am trying to use one of the models suggested in constants.py for GPTQ, as per your reply. I have also checked torch.cuda.is_available(), which returns True; however, I am getting an error that says:

FileNotFoundError: Could not find model in TheBloke/WizardLM-7B-uncensored-GPTQ

It is true that this isn't in the models folder, but I felt sure the tutorial said the model would be downloaded. I have uncommented lines 158 & 159 and commented out lines 98 & 99 of constants.py, and I am running python3 run_localGPT.py --device_type cuda --show_sources --use_history

erswelljustin avatar Sep 24 '23 14:09 erswelljustin

I have updated MODEL_BASENAME to "model.safetensors" and it is working now. Thanks for your help!
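For anyone hitting the same error, the working pair of entries in constants.py ends up roughly like this (line numbers vary between versions of the file):

```python
# GPTQ model entries for constants.py, per the fix above.
# MODEL_BASENAME must match the weights filename in the Hugging Face repo.
MODEL_ID = "TheBloke/WizardLM-7B-uncensored-GPTQ"
MODEL_BASENAME = "model.safetensors"
```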

erswelljustin avatar Sep 24 '23 16:09 erswelljustin

For Windows, BLAS=0 if we keep the double quotation marks on; it works with the GPU and shows BLAS=1 without the double quotation marks. The below worked for me:

setx CMAKE_ARGS -DLLAMA_CUBLAS=on
setx FORCE_CMAKE 1
pip install llama-cpp-python==0.1.83 --no-cache-dir

sanjeevzt avatar Oct 19 '23 08:10 sanjeevzt

EISMANN-DEV's steps above helped with my issue.

OldFansBG avatar Jun 11 '24 15:06 OldFansBG