text-generation-inference
Add llama.cpp backend
This PR is an initial implementation of llama.cpp as a potential backend for TGI.
It mostly targets CPU inference with single- or multi-stream scheduling, potentially spawning multiple instances of the same model, each over a non-overlapping subset of the CPU cores.
The current implementation only allows a single request to run on a worker at a time; this constraint will be removed later on. The current implementation also duplicates the model weights for each worker; this constraint can potentially be removed later on as well.
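The core-partitioning scheme described above can be sketched as follows. This is a minimal illustration under assumed names, not the PR's actual code; the function and the worker/core counts are hypothetical, only `os.sched_setaffinity` is a real Linux API:

```python
import os

def partition_cores(num_workers: int, cores: list[int]) -> list[list[int]]:
    """Split the available CPU cores into non-overlapping,
    equally sized subsets, one per worker instance."""
    per_worker = len(cores) // num_workers
    if per_worker == 0:
        raise ValueError("more workers than cores")
    return [cores[i * per_worker:(i + 1) * per_worker]
            for i in range(num_workers)]

# Example: 8 cores shared by 2 model instances.
subsets = partition_cores(2, list(range(8)))
print(subsets)  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# Each spawned worker process could then pin itself to its subset
# (Linux-only), e.g.: os.sched_setaffinity(0, set(subsets[worker_id]))
```

Pinning each instance to a disjoint core set avoids cache thrashing between workers, at the cost (for now) of one copy of the weights per instance.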