cduk
Just wondering if there were any developments on this front. I guess my use case is a simple one as I have around 3-4 million embeddings to index which is...
I had the same problem. I used git to download the files, and the repo used LFS, which meant the files contained only pointers to the real files. A quick view of the file...
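One way around the pointer-stub problem (a sketch, assuming the repo in question is hosted on the Hugging Face Hub; the repo id below is illustrative) is to skip git entirely and let `huggingface_hub` resolve the real LFS files:

```python
# Sketch: fetch the actual weight files (not LFS pointer stubs) via huggingface_hub.
# The repo id is illustrative, not necessarily the one from the original thread.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="stabilityai/stablelm-tuned-alpha-7b")
print(local_dir)  # local cache directory containing the fully downloaded files
```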
> For the sake of convenience (2x less download size/RAM/VRAM), I've uploaded 16-bit versions of tuned models to HF Hub: https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-7b-16bit https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-3b-16bit

Would you mind showing how you made the...
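The rest of the question is cut off, but a minimal sketch of the usual way to produce such 16-bit copies (load in fp16 with `transformers`, then push to your own Hub repo) looks like this; whether it matches the author's exact procedure is an assumption, and the source checkpoint name is illustrative:

```python
# Minimal sketch: re-save a checkpoint in fp16 and push it to the Hub.
# Requires `huggingface-cli login`; the target repo name follows the links above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "stabilityai/stablelm-tuned-alpha-7b"       # original fp32 checkpoint (assumed)
dst = "vvsotnikov/stablelm-tuned-alpha-7b-16bit"  # target repo from the link above

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(src)

model.push_to_hub(dst)
tokenizer.push_to_hub(dst)
```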
@antheas So close! Have you considered quantizing to 8-bit and seeing how well that works? I wonder whether an 8-bit 7B would outperform an fp16 3B. Both seem like they would fit...
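For reference, a sketch of how 8-bit loading is commonly done through `transformers` and `bitsandbytes` (the model id is an assumption; whether 8-bit 7B actually beats fp16 3B would still need to be measured):

```python
# Sketch: load the 7B model in 8-bit via bitsandbytes (requires a CUDA GPU).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-tuned-alpha-7b"  # assumed checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```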
I don't know if this could be helpful: https://github.com/tpope/vim-dispatch. They seem to run async work in the background. For LLMs it may be more complex, as it impacts the...
The simpler way would be not to deal with loading and unloading at all: require that all models fit in VRAM, and then you select which one to use in the API...
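A sketch of that approach (everything resident in VRAM, selection by name per request; the model names and the `generate` helper are hypothetical):

```python
# Sketch: keep every model loaded at startup and pick one per API call.
# Only works if all models fit in VRAM at the same time.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_IDS = {
    "stablelm-3b": "stabilityai/stablelm-tuned-alpha-3b",
    "stablelm-7b": "stabilityai/stablelm-tuned-alpha-7b",
}

# Load everything once; no later loading/unloading.
MODELS = {
    name: (
        AutoModelForCausalLM.from_pretrained(repo, device_map="auto"),
        AutoTokenizer.from_pretrained(repo),
    )
    for name, repo in MODEL_IDS.items()
}

def generate(model_name: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Route the request to whichever resident model was asked for."""
    model, tokenizer = MODELS[model_name]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```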
What changes did you plan to make with the tokenizer?
From Google Translate:
> When we started using it in 2017, we only used master and volume. Recently, we wanted to reorganize the files within the file system. However, we don't...
I used an RTX 3090 (24 GB VRAM). I will quantize offline; that is more efficient anyway.