Enable/add llama-cpp-python server --parallel | -np N | --n_parallel N
Is your feature request related to a problem? Please describe.
When executing chat completions, I need to wait for one prompt to complete before submitting the next. I'd like to be able to execute multiple prompts at the same time. Right now my GPU utilization on a g4dn.2xlarge instance peaks at 65-80% (model fully loaded in GPU memory), and I've already tinkered with n_batch, ctx, and a few other parameters. A sketch of the workload I have in mind is below.
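To make the workload concrete, here is a minimal sketch of what I'm trying to run, assuming the llama-cpp-python server is listening locally on port 8000 with its OpenAI-compatible chat completions endpoint (the URL, port, model name, and prompts are placeholders for my setup):

```python
# Fire several chat completion requests concurrently against the local server.
# Today they end up being processed one after another; with N parallel slots
# they could be decoded together and keep the GPU busy.
import asyncio
import httpx

BASE_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
PROMPTS = [
    "Summarize the plot of Hamlet in one sentence.",
    "Explain what a B-tree is.",
    "Write a haiku about GPUs.",
    "List three uses of vector databases.",
]

async def ask(client: httpx.AsyncClient, prompt: str) -> str:
    resp = await client.post(
        BASE_URL,
        json={
            "model": "local-model",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120.0,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

async def main() -> None:
    async with httpx.AsyncClient() as client:
        answers = await asyncio.gather(*(ask(client, p) for p in PROMPTS))
        for prompt, answer in zip(PROMPTS, answers):
            print(f"{prompt}\n -> {answer}\n")

asyncio.run(main())
```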
Describe the solution you'd like
It seems like llama.cpp already has this feature in its server example: "-np N, --parallel N: Set the number of slots for process requests. Default: 1" (https://github.com/ggerganov/llama.cpp/tree/master/examples/server). A rough sketch of how that is used upstream follows below.
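For comparison, this is roughly how the upstream llama.cpp server exposes the feature today; the binary path, model path, and port are placeholders for my local build, and -np/--parallel is the upstream flag quoted above:

```python
# Launch the upstream llama.cpp server example with 4 parallel slots.
# Binary and model paths are assumptions for illustration only.
import subprocess

subprocess.run(
    [
        "./server",                     # llama.cpp examples/server binary (placeholder path)
        "-m", "models/7B/model.gguf",   # placeholder model path
        "-c", "4096",                   # total context, shared across the slots
        "-np", "4",                     # 4 slots -> up to 4 requests decoded in parallel
        "--host", "0.0.0.0",
        "--port", "8080",
    ],
    check=True,
)
```

As I understand it, this request is essentially about exposing an equivalent option (e.g. --n_parallel, as in the title) in llama-cpp-python's server and wiring it through to the underlying batching/KV-cache setup.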
cc @dorsey-crogl
Any update on this?
@abetlen would this be possible? I would really need the parallel processing feature...