text-generation-inference
Add llama.cpp backend
This PR is an initial implementation of llama.cpp as a potential backend for TGI.
It mostly targets CPU inference with single- or multi-stream scheduling, potentially spawning multiple instances of the same model, each over a non-overlapping subset of the CPU cores.
The current implementation only allows a single request to run on a worker at a time; this constraint will be removed later on. The current implementation also duplicates the model weights for each worker; this constraint can potentially be removed later on as well.
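The core-partitioning scheme described above can be sketched as follows. This is a minimal illustration under assumed names, not the PR's actual code; the function and the worker/core counts are hypothetical, only `os.sched_setaffinity` is a real Linux API:

```python
import os

def partition_cores(num_workers: int, cores: list[int]) -> list[list[int]]:
    """Split the available CPU cores into non-overlapping,
    equally sized subsets, one per worker instance."""
    per_worker = len(cores) // num_workers
    if per_worker == 0:
        raise ValueError("more workers than cores")
    return [cores[i * per_worker:(i + 1) * per_worker]
            for i in range(num_workers)]

# Example: 8 cores shared by 2 model instances.
subsets = partition_cores(2, list(range(8)))
print(subsets)  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# Each spawned worker process could then pin itself to its subset
# (Linux-only), e.g.: os.sched_setaffinity(0, set(subsets[worker_id]))
```

Pinning each instance to a disjoint core set avoids cache thrashing between workers, at the cost (for now) of one copy of the weights per instance.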