Rodrigo
> Great, thanks! I have a 12GB VRAM GPU; does it also work for training? Currently I can't train: even with a batch size of 1 I get OOM...
EXL2 is one of the most popular quantization formats, alongside GGUF. It would be amazing to have support for it. I can help with a PR but to be honest...
Thank you @EricLBuehler! Right now I am just exploring some existing implementations (e.g. https://github.com/chu-tianxiang/vllm-gptq/tree/exl2) and trying to see how we could fit it in.
@EricLBuehler do you think it would be a good idea to create a new branch for EXL2 development? I have started on some parts, but it is only a draft for...
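For readers unfamiliar with these formats, here is a minimal C sketch of the group-quantization idea that GPTQ-style formats such as EXL2 build on: low-bit integer weights with one scale and zero point per group of consecutive values. This is illustrative only; the group size, the unpacked one-value-per-byte storage, and the `dequant_q4` helper are assumptions for this example, and real EXL2 uses variable bit widths and a packed on-disk layout.

```c
#include <stdio.h>
#include <stdint.h>

#define GROUP_SIZE 4  // real formats typically use 32 or 128

// Dequantize n 4-bit values (stored one per byte here for clarity;
// on disk they would be packed two per byte).
static void dequant_q4(const uint8_t *q, const float *scales,
                       const uint8_t *zeros, float *out, int n) {
    for (int i = 0; i < n; i++) {
        int g = i / GROUP_SIZE;  // group this weight belongs to
        out[i] = scales[g] * ((int)q[i] - (int)zeros[g]);
    }
}

int main(void) {
    uint8_t q[8]      = {0, 5, 8, 15, 3, 7, 9, 12};
    float   scales[2] = {0.10f, 0.05f};  // one scale per group
    uint8_t zeros[2]  = {8, 8};          // one zero point per group
    float   w[8];
    dequant_q4(q, scales, zeros, w, 8);
    for (int i = 0; i < 8; i++) printf("%.3f ", w[i]);
    printf("\n");
    return 0;
}
```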
This issue is independent of the split parameter. With only a 3080 I can load 40 layers of the model (no split parameter); with two RTX cards, OOM ... (also, no...
I am able to load the model with:

```
llama-server -m /mnt/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf --threads 28 --host 0.0.0.0 --port 5001 -c 8192 -ngl 99 -ot exps=CPU
```

| PID | DEV |...
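As an aside, a minimal sketch of what a name-based override like `-ot exps=CPU` amounts to conceptually: tensors whose names match the pattern stay on the CPU backend, while everything else is offloaded. The `place_tensor` helper and the plain substring match are assumptions for illustration, not the actual llama.cpp implementation (which matches patterns against tensor names).

```c
#include <stdio.h>
#include <string.h>

typedef enum { DEV_CPU, DEV_GPU } device_t;

// Keep tensors whose name matches the pattern (e.g. MoE expert
// weights, whose names contain "exps") on the CPU; offload the rest.
static device_t place_tensor(const char *name, const char *pattern) {
    return strstr(name, pattern) != NULL ? DEV_CPU : DEV_GPU;
}

int main(void) {
    const char *tensors[] = {
        "blk.0.attn_q.weight",
        "blk.0.ffn_gate_exps.weight",  // MoE expert tensor
        "blk.0.ffn_down_exps.weight",
        "output.weight",
    };
    for (size_t i = 0; i < sizeof(tensors) / sizeof(tensors[0]); i++) {
        device_t d = place_tensor(tensors[i], "exps");
        printf("%-28s -> %s\n", tensors[i], d == DEV_CPU ? "CPU" : "GPU");
    }
    return 0;
}
```

The effect is that the large, sparsely used expert weights stay in system RAM while the dense attention layers fit on the GPU.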
> Maybe try `-ngl 61` to keep the output layer on the CPU too (that oddly worked for me earlier when I was having trouble with the RPC stuff).

No...
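For readers following along, a hedged sketch of the layer-offload arithmetic behind that suggestion, assuming a model with 61 repeating blocks plus a separate output layer; the counts and the cutoff rule here are assumptions for illustration, not the actual llama.cpp logic.

```c
#include <stdio.h>

int main(void) {
    const int n_blocks = 61;  // repeating transformer blocks (assumed)
    for (int ngl = 60; ngl <= 62; ngl++) {
        // The first ngl repeating blocks go to the GPU; only if the
        // count exceeds the block count does the output layer move too.
        int blocks_on_gpu = ngl < n_blocks ? ngl : n_blocks;
        int output_on_gpu = ngl > n_blocks;
        printf("-ngl %d: %d/%d blocks on GPU, output layer on %s\n",
               ngl, blocks_on_gpu, n_blocks, output_on_gpu ? "GPU" : "CPU");
    }
    return 0;
}
```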
> It's trying to allocate a tensor of size 2^64, which suggests there is an integer overflow somewhere. If you set the environment variable `GGML_SCHED_DEBUG=2`, it will print the graph...
> Ok nvm, I think I see the problem. I will push a possible fix soon.

I confirm that the fix worked, thank you @slaren. For the record, I am...
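To illustrate the kind of bug @slaren describes, here is a minimal C sketch of how a signed overflow can surface as a ~2^64 allocation request. The dimensions are made up and this is not the actual ggml code path; it only shows how a negative intermediate value wraps when converted to an unsigned `size_t`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
    // Tensor dimensions multiplied in a type that is too narrow
    // (signed overflow is undefined behaviour in C; on typical
    // hardware it wraps to a negative value).
    int32_t ne0 = 129280;           // hypothetical dimensions
    int32_t ne1 = 18432;
    int32_t nelem = ne0 * ne1;      // overflows int32_t -> negative

    // Converting the negative result to size_t wraps modulo 2^64,
    // so the allocator sees a request of almost 2^64 bytes.
    size_t request = (size_t)(int64_t)nelem * sizeof(float);

    printf("nelem   = %lld\n", (long long)nelem);  // negative
    printf("request = %zu bytes\n", request);      // ~2^64
    void *p = malloc(request);                     // fails, returns NULL
    printf("malloc returned %p\n", p);
    free(p);
    return 0;
}
```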