waan1
Your fix resolved the reported compile issue, thank you for the quick and easy fix. But now there is a runtime issue. I'm not sure if it is a show stopper...
I installed your sources in a fresh virtualenv using your scripts, and I got __version__ = '0.1.99'. Do you suggest uninstalling it and installing 0.1.97?
Vicuna was loaded from HF. I'll try to load a smaller model, just in case.
You were right again :) tokenizer.model had been downloaded before I installed lfs. I downloaded it again and it works now. Thank you very much for the help! And now we know that...
Tested the chatbot: one performance core of the CPU (CPU3) is at 100% (i9-13900K), the other 23 cores are idle, and the P40 is at 100%. It took only 9.2 GB of VRAM out of 24 GB. I'm unclear of...
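In case it helps reproduce the numbers above, a generic NVML query like the one below reports GPU utilization and VRAM usage (pynvml is my own choice here, not something from this repo):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU, the P40 in this setup
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU util: {util.gpu}%  VRAM: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()
```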
I see. I will be watching your git for updates and will try again when it has something for older GPUs.
> Having read up on it a bit, good performance on P40 might be a ways off, unfortunately. Apparently its FP16 performance is 1/64 of its FP32 performance. Tesla P40...
Half precision would still require 2 bytes per weight, limiting model selection to 3B or 7B; 13B would not fit into 24 GB, right? Ideally, packing 8 weights x 4 bit...
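To make the back-of-the-envelope numbers explicit, here is a quick sketch of weight-only memory for a few model sizes and bit widths (plain arithmetic, nothing project-specific; activations and the KV cache come on top):

```python
def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Memory for the weights alone: params * bits / 8, in GiB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1024**3

for params in (3, 7, 13):
    for bits in (16, 4):
        print(f"{params:>2}B @ {bits:>2}-bit: {weight_gib(params, bits):5.1f} GiB")

# 13B at 16-bit is ~24.2 GiB for the weights alone, so it leaves no room on a
# 24 GB card once activations and the KV cache are added; at 4-bit it is ~6 GiB.
```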
Could you provide an example of how to use the quantized GGUF version? For some reason it does not work with the oobabooga text-generation webui.
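For reference, a minimal sketch of loading a quantized GGUF file directly with llama-cpp-python; this is an assumption on my part about the backend, and the model path is just a placeholder:

```python
from llama_cpp import Llama

# Placeholder path to a quantized GGUF file; adjust to the actual download.
llm = Llama(
    model_path="./models/vicuna-13b-v1.5.Q4_K_M.gguf",
    n_ctx=2048,       # context window
    n_gpu_layers=-1,  # offload all layers to the GPU when built with CUDA support
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```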
First simple example:

from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")

Gives me an error: Could not load library libcudnn_ops_infer.so.8....
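To check whether the problem is limited to the cuDNN setup, a CPU-only variant of the same call could be tried; this is just a sketch using the documented faster-whisper arguments, and the audio file name is a placeholder:

```python
from faster_whisper import WhisperModel

model_size = "large-v3"

# CPU-only variant: avoids cuDNN entirely, so it should run even when
# libcudnn_ops_infer.so.8 cannot be found (much slower than the GPU path).
model = WhisperModel(model_size, device="cpu", compute_type="int8")

# "audio.wav" is a placeholder file name.
segments, info = model.transcribe("audio.wav")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```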