
Add llama.cpp backend

Open mfuntowicz opened this issue 1 year ago • 0 comments

This PR is an initial implementation of llama.cpp as a potential backend for TGI.

It mostly targets CPU inference in a single/multi stream scheduling fashion, potentially spawning multiple instances of the same model over a non-overlapping subset of the CPU cores.
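To illustrate the idea of running multiple model instances over non-overlapping CPU core subsets, here is a minimal sketch; the function name and partitioning strategy are illustrative assumptions, not TGI's actual API:

```rust
/// Hypothetical sketch: split `total_cores` CPU cores into
/// `num_workers` non-overlapping, contiguous subsets, one per
/// worker instance. Leftover cores (when the division is uneven)
/// are simply left unassigned in this simplified version.
fn partition_cores(total_cores: usize, num_workers: usize) -> Vec<Vec<usize>> {
    let per_worker = total_cores / num_workers;
    (0..num_workers)
        .map(|w| (w * per_worker..(w + 1) * per_worker).collect())
        .collect()
}

fn main() {
    // 8 cores split across 2 workers -> cores [0..=3] and [4..=7]
    let sets = partition_cores(8, 2);
    println!("{:?}", sets);
}
```

Each worker would then pin its threads to its assigned subset, so instances never contend for the same cores.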

The current implementation allows only a single request to run on a worker at a time; this constraint will be removed later on. The current implementation also duplicates the weights for each worker; this constraint can potentially be removed later on as well.
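The two constraints above can be sketched together as a simple dispatcher: each worker owns its own copy of the weights and serves at most one request at a time. The `Scheduler`/`Worker` names below are hypothetical, not types from the PR:

```rust
/// Hypothetical sketch of the one-request-per-worker constraint.
struct Worker {
    _weights: Vec<f32>, // duplicated per worker in the current design
    busy: bool,
}

struct Scheduler {
    workers: Vec<Worker>,
}

impl Scheduler {
    fn new(num_workers: usize, weights: &[f32]) -> Self {
        Self {
            workers: (0..num_workers)
                // each worker gets its own full copy of the weights
                .map(|_| Worker { _weights: weights.to_vec(), busy: false })
                .collect(),
        }
    }

    /// Assign a request to the first idle worker; `None` if all are busy.
    fn dispatch(&mut self) -> Option<usize> {
        let idx = self.workers.iter().position(|w| !w.busy)?;
        self.workers[idx].busy = true;
        Some(idx)
    }

    /// Mark a worker as free once its request has finished.
    fn complete(&mut self, idx: usize) {
        self.workers[idx].busy = false;
    }
}

fn main() {
    let mut sched = Scheduler::new(2, &[0.1, 0.2]);
    assert_eq!(sched.dispatch(), Some(0));
    assert_eq!(sched.dispatch(), Some(1));
    assert_eq!(sched.dispatch(), None); // both workers busy
    sched.complete(0);
    assert_eq!(sched.dispatch(), Some(0));
    println!("ok");
}
```

Lifting the single-request constraint would mean letting a worker batch requests, and lifting the duplication constraint would mean sharing one read-only weight buffer (e.g. via memory mapping) across workers.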

mfuntowicz avatar Nov 04 '24 22:11 mfuntowicz