Alfredo Ortega
I think it's likely because of this code:
```python
params = {}
params['device_map'] = {'': 0}
#params['dtype'] = shared.model.dtype
shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)
```
See how it resets the 'device_map' that...
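For reference, a minimal sketch of what keeping the existing split could look like (an assumption on my part, not the actual fix: it presumes `shared.model` was loaded through Accelerate and therefore exposes `hf_device_map`, and that `shared` and `lora_name` come from the surrounding webui code):

```python
from pathlib import Path
from peft import PeftModel

# Sketch only: reuse the device map the base model was loaded with
# instead of forcing the whole model onto GPU 0.
params = {}
if hasattr(shared.model, 'hf_device_map'):
    # Keep the existing multi-GPU split produced by Accelerate.
    params['device_map'] = shared.model.hf_device_map
else:
    # Fall back to the current single-GPU behavior.
    params['device_map'] = {'': 0}

shared.model = PeftModel.from_pretrained(
    shared.model, Path(f"loras/{lora_name}"), **params
)
```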
I confirm that @sgsdxzy's patch now successfully loads alpaca-lora-30b on 2x3090 GPUs using int8 quantization.
I'm on 8747c74339cf1e7f1d45f4aa1dcc090e9eba94a3; it now loads the LoRA and the 30B model on 2x3090, no problem.
Set up at least a 40 GB swap; it needs about 130 GB of memory for merging the 30B model.
Try adding --auto-devices
Maybe increase the swap file? I have the same setup but with 96 GB RAM, and it uses swap.
OK, found the problem. This PR fixes it: https://github.com/artidoro/qlora/pull/44. But it is not yet merged (even though a comment says it is).
Yes, here's an HTML example. I'm using Lazarus 3.0.0, FPC 3.2.2 and Ubuntu (but the bug happens on Windows too):
```
body{background-color:white;}
table{width:100%;margin:0 auto;}
td{width:100%;word-wrap:break-word;}
pre{}
write a list of 10 words Abundance...
I can use it and it works, but it's slightly slower: 9 tok/s activated vs. 11.5 tok/s deactivated, inference on Llama3-70B-8bpw across 4x3090 GPUs.
I hit a similar bug. Environment: 4x3090, CUDA 12.4, Aphrodite 0.53, 96 GB of VRAM total, tensor parallel = 4. When I try to load elinas_Meta-Llama-3-120B-Instruct-4.0bpw-exl2 (61 GB), it runs out of VRAM instantly,...