Andrei
@jmtatsch @MillionthOdin16 thank you! I still have a few questions on the best way to implement this; I'd appreciate any input. The basic features would allow you to: - Specify a...
Implemented in #931
To confirm, can you check out and test this llama.cpp commit with OPENBLAS (this is what v0.1.32 is based on), and also compare it against the latest llama.cpp? https://github.com/ggerganov/llama.cpp/tree/684da25926e5c505f725b4f10b5485b218fa1fc7
@Bloob-beep but without the chain, i.e. calling `llm(prompt)` directly, you don't get this error? Very strange.
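This is roughly the comparison I have in mind; a minimal sketch, assuming the chain here is a LangChain `LLMChain` wrapping the `LlamaCpp` wrapper (the model path and prompt are just placeholders):

```python
from llama_cpp import Llama

# Direct call, which I'd expect to work without the error.
llm = Llama(model_path="./models/ggml-model-q4_0.bin")
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])

# Same prompt routed through a chain (assumed setup), which is where the error shows up.
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain

chain_llm = LlamaCpp(model_path="./models/ggml-model-q4_0.bin")
prompt = PromptTemplate(template="Q: {question} A:", input_variables=["question"])
chain = LLMChain(prompt=prompt, llm=chain_llm)
print(chain.run("Name the planets in the solar system."))
```

If the direct call succeeds and only the chained call fails, that would point at the wrapper rather than the bindings.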
@Niek do you mind moving this to the build release workflow?
@Niek if possible, can we include @jmtatsch's nvidia-docker container example in this PR as well? The ability to `docker pull` and run a GPU-accelerated container would be very helpful.
@Niek finally got a chance to merge this, great work! We now have a Docker image. @jmtatsch if you're still interested, it would be awesome to get that cuBLAS-based image,...
@oobabooga thanks, I'll add the option like that. The biggest recent performance improvement has been the OpenBLAS / cuBLAS / CLBlast support added to llama.cpp; those can be enabled by...
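If you want to double-check which backend your build actually picked up, something like this should do it (assuming a llama-cpp-python version that exposes the low-level `llama_print_system_info` binding):

```python
# Print llama.cpp's system info string; a BLAS-accelerated build should report "BLAS = 1".
import llama_cpp

print(llama_cpp.llama_print_system_info().decode("utf-8"))
```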
@oobabooga merged in the changes for the GPU offloading and tested this out; it works well for me. Made a small change to the cache capacity parsing to default to bytes...
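For reference, this is roughly how I exercised the offload path; a sketch assuming a cuBLAS-enabled build and a local GGML model file (the path and layer count are placeholders):

```python
from llama_cpp import Llama

# Offload a number of transformer layers to the GPU; the remaining layers stay on the CPU.
llm = Llama(
    model_path="./models/ggml-model-q4_0.bin",
    n_gpu_layers=32,
)
out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```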
**EDIT**: It works, but I also needed to add the `-DCMAKE_CXX_FLAGS=-fPIC` and `-DCMAKE_C_FLAGS=-fPIC` flags to avoid the error below. @thomasantony thanks, I wasn't aware of those flags; however, this doesn't seem...