
Enable / Add llama-cpp-python server --parallel | -np N | --n_parallel N

Open · eshafaq1 opened this issue 1 year ago · 3 comments

Is your feature request related to a problem? Please describe.

When executing chat completions, I need to wait for one prompt to complete before submitting a new one. I'd like to be able to execute multiple prompts at the same time. Right now my GPU utilization on a g4dn.2xlarge instance is at most 65-80% (model loaded in GPU memory). I've tinkered with n_batch, ctx, and a few other parameters.
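
For context, a minimal sketch of how to reproduce the serial behavior, assuming the bundled OpenAI-compatible server was started with `python -m llama_cpp.server --model <path>` and is listening on the default port 8000 (the prompt text and `max_tokens` value here are just placeholders):

```bash
# Fire two chat completion requests at the llama-cpp-python server at once.
# With only a single processing slot, the second request waits for the first
# to finish instead of being handled in parallel.
for i in 1 2; do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Write a haiku about GPUs."}], "max_tokens": 64}' &
done
wait
```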

Describe the solution you'd like
It seems like llama.cpp already has this feature: `-np N, --parallel N`: set the number of slots for processing requests (default: 1). https://github.com/ggerganov/llama.cpp/tree/master/examples/server
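
For reference, a rough sketch of how the upstream llama.cpp example server is launched with multiple slots (flag names taken from the linked README; the binary name, model path, and context size are assumptions for illustration):

```bash
# Upstream llama.cpp example server (built from examples/server) with 4 slots.
# -c sets the total context size, which is divided across the -np slots.
./server -m models/llama-2-7b.Q4_K_M.gguf -c 4096 -np 4
```

Exposing an equivalent `--n_parallel` option through the llama-cpp-python server would let concurrent chat completion requests be processed simultaneously instead of queuing.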

eshafaq1 · Apr 05, 2024

cc @dorsey-crogl

eshafaq1 · Apr 05, 2024

Any update on this?

AnonymousVibrate · Jun 25, 2024

@abetlen would this be possible? I would really need the parallel processing feature...

Backendmagier · Sep 04, 2024