
CUDA out of memory error

Open Alio241 opened this issue 2 years ago • 8 comments

Hi everyone,

I am currently trying to use localGPT for a project and I've encountered a problem.

Basically I have two setups:

  • my home setup: i5 8600K, 32 GB DDR4 and an RTX 2080
  • my work setup: i7 8700K, 128 GB DDR4 and an Nvidia A2

In both setups localGPT was installed the same way. When I run the ingest.py code I get no errors whatsoever; it is when I run the main program that I encounter problems.

Everything works perfectly on my home setup, but on my work setup I run into this error: torch.cuda.OutOfMemoryError, even though I have more VRAM on the A2. Also, I didn't change the model; I use the base one, which is "TheBloke/vicuna-7B-1.1-HF".

Do you guys know what's wrong?

Here is the full error :

Traceback (most recent call last):
  File "C:\Users\Ali_I\Documents\LocalGPT\localGPT\run_localGPT.py", line 235, in <module>
    main()
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\Ali_I\Documents\LocalGPT\localGPT\run_localGPT.py", line 213, in main
    res = qa(query)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\base.py", line 140, in __call__
    raise e
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\retrieval_qa\base.py", line 120, in _call
    answer = self.combine_documents_chain.run(
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\base.py", line 239, in run
    return self(kwargs, callbacks=callbacks)[self.output_keys[0]]
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\base.py", line 140, in __call__
    raise e
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\combine_documents\base.py", line 84, in _call
    output, extra_return_dict = self.combine_docs(
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\combine_documents\stuff.py", line 87, in combine_docs
    return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\llm.py", line 213, in predict
    return self(kwargs, callbacks=callbacks)[self.output_key]
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\base.py", line 140, in __call__
    raise e
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\llm.py", line 69, in _call
    response = self.generate([inputs], run_manager=run_manager)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\llm.py", line 79, in generate
    return self.llm.generate_prompt(
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\llms\base.py", line 134, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\llms\base.py", line 191, in generate
    raise e
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\llms\base.py", line 185, in generate
    self._generate(prompts, stop=stop, run_manager=run_manager)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\llms\base.py", line 436, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\llms\huggingface_pipeline.py", line 168, in _call
    response = self.pipeline(prompt)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\pipelines\text_generation.py", line 201, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\pipelines\base.py", line 1120, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\pipelines\base.py", line 1127, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\pipelines\base.py", line 1026, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\pipelines\text_generation.py", line 263, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\generation\utils.py", line 1522, in generate
    return self.greedy_search(
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\generation\utils.py", line 2339, in greedy_search
    outputs = self(
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\llama\modeling_llama.py", line 688, in forward
    outputs = self.model(
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\llama\modeling_llama.py", line 578, in forward
    layer_outputs = decoder_layer(
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "C:\Users\Ali_I\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\llama\modeling_llama.py", line 212, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 138.00 MiB (GPU 0; 14.84 GiB total capacity; 13.94 GiB already allocated; 77.19 MiB free; 13.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Alio241 avatar Jul 04 '23 13:07 Alio241

The error is caused because you have run out of GPU memory. To solve this you could try to use a smaller model, or allocate more VRAM.
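
For example, loading the same 7B model with 8-bit weights should fit in 16 GB. This is only a sketch of the general Hugging Face API that localGPT builds on, not localGPT's actual load_model code, and it assumes the bitsandbytes package is installed:

    # Sketch: load the model in 8-bit instead of float16 to roughly halve VRAM use.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "TheBloke/vicuna-7B-1.1-HF"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",   # let accelerate place layers on the GPU
        load_in_8bit=True,   # requires bitsandbytes; ~7 GB of weights instead of ~13 GB
    )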

KanAvR avatar Jul 09 '23 05:07 KanAvR

I have exactly the same issue. As an inexperienced user, I am looking for more detailed instructions to resolve it. Running Ubuntu 22.04 on a Lenovo ThinkStation P620 with an RTX A2000 graphics card and an AMD Ryzen Threadripper Pro 5955WX processor.

mingyuwanggithub avatar Jul 19 '23 13:07 mingyuwanggithub

It seems that ingest.py does not release the VRAM it uses. I get out-of-memory errors if I run it repeatedly, and I have 24 GB of VRAM.
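
As a workaround I would expect something like this to help between runs (a sketch; the variable names are illustrative, not the actual ones in ingest.py):

    # Sketch: drop references to the store/embeddings and release PyTorch's cache.
    import gc
    import torch

    db = None          # whatever holds the vector store
    embeddings = None  # whatever holds the embedding model
    gc.collect()
    torch.cuda.empty_cache()  # hand cached blocks back to the driver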

kolergy avatar Jul 23 '23 08:07 kolergy

@kolergy Did you come up with a correction/modification for it? Thanks.

mingyuwanggithub avatar Jul 24 '23 12:07 mingyuwanggithub

I'm not sure I have the expertise, but I'm trying to find where it comes from. It seems to be deeper than the ingest, maybe in the vector store call. I'll keep you updated if I find something useful.

kolergy avatar Jul 24 '23 14:07 kolergy

@mingyuwanggithub The documents are all loaded, then split into chunks, then the embeddings are generated, all without using the GPU.
The VRAM usage seems to come from DuckDB, which probably uses the GPU to compute the distances between the different vectors. To test it I took around 700 MB of PDF files, which generated around 320 KB of actual text. With a brand-new DB it used around 7.7 GB of VRAM to process the text from the documents, and it properly released the VRAM after the DB was set to None.
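
Roughly how I measured it (a sketch; build_db() stands in for whatever ingest.py actually does to create the store, it is not a real localGPT function):

    # Sketch: track peak VRAM around the DB build, then release it as described.
    import torch

    torch.cuda.reset_peak_memory_stats()
    db = build_db()  # hypothetical: load docs, split, embed, persist
    print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")

    db = None
    torch.cuda.empty_cache()
    print(f"after release: {torch.cuda.memory_allocated() / 2**30:.1f} GiB")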

But as I ran it again with the exact same documents, the VRAM requirement increased to 9.2 GB and the database contained 35k embeddings when loaded. The third time it used 7.9 GB of VRAM and the DB loaded 71k embeddings; on the fourth round it used 8.8 GB of VRAM and the DB loaded 107k embeddings.

So I did not manage to reproduce the CUDA out-of-memory error that I experienced. At that time I had had a few crashes due to PDF documents the parser did not like, and moments where I had manually killed the task; maybe it did not like that. I do not know.

In conclusion, it apparently tries to add the new docs to the database without checking whether they are already there, which increases the number of embeddings in the DB but does not really increase the VRAM requirements or the time to process the documents. However, it seems to be extremely memory hungry: 7.9 GB of VRAM for 320 KB of text seems like a lot.

It might be good to stage the text inputs to the DB in batches to stay within the bounds of the VRAM. It might also be good to check whether a doc has already been ingested and avoid ingesting it again.
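
Something along these lines, assuming the langchain Chroma store that ingest.py uses; load_and_split() and the batch size are made up for illustration:

    # Sketch: skip documents that are already in the store and add the rest in
    # small batches so the embedding/indexing step stays within VRAM limits.
    already_ingested = {m.get("source") for m in db.get()["metadatas"]}

    for path in new_document_paths:
        if path in already_ingested:
            continue  # don't embed the same document twice
        chunks = load_and_split(path)            # hypothetical helper
        for i in range(0, len(chunks), 100):     # stage inputs in batches of 100
            db.add_documents(chunks[i:i + 100])
    db.persist()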

kolergy avatar Jul 24 '23 23:07 kolergy

@kolergy Thank you for making the effort to investigate the GPU memory issue. It does not appear to be a straightforward one. Please update this thread if you do have a breakthrough!

mingyuwanggithub avatar Jul 26 '23 02:07 mingyuwanggithub

Any solutions?

KansaiUser avatar Jul 25 '24 15:07 KansaiUser