Enable/add llama-cpp-python server --parallel | -np N | --n_parallel N
Is your feature request related to a problem? Please describe.
When executing chat completions, I need to wait for one prompt to complete before submitting the next. I'd like to be able to execute multiple prompts at the same time. Right now my GPU utilization on a g4dn.2xlarge instance peaks at 65-80% (model fully loaded in GPU memory), and I've already tinkered with n_batch, ctx, and a few other parameters. A sketch of the workload I have in mind is below.
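To make the workload concrete, here is a minimal sketch of what I'm trying to run, assuming the llama-cpp-python server is listening locally on port 8000 with its OpenAI-compatible chat completions endpoint (the URL, port, model name, and prompts are placeholders for my setup):

```python
# Fire several chat completion requests concurrently against the local server.
# Today they end up being processed one after another; with N parallel slots
# they could be decoded together and keep the GPU busy.
import asyncio
import httpx

BASE_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
PROMPTS = [
    "Summarize the plot of Hamlet in one sentence.",
    "Explain what a B-tree is.",
    "Write a haiku about GPUs.",
    "List three uses of vector databases.",
]

async def ask(client: httpx.AsyncClient, prompt: str) -> str:
    resp = await client.post(
        BASE_URL,
        json={
            "model": "local-model",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120.0,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

async def main() -> None:
    async with httpx.AsyncClient() as client:
        answers = await asyncio.gather(*(ask(client, p) for p in PROMPTS))
        for prompt, answer in zip(PROMPTS, answers):
            print(f"{prompt}\n -> {answer}\n")

asyncio.run(main())
```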
Describe the solution you'd like
It seems like llama.cpp already has this feature in its server example: "-np N, --parallel N: Set the number of slots for process requests. Default: 1" (https://github.com/ggerganov/llama.cpp/tree/master/examples/server). A rough sketch of how that is used upstream follows below.
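For comparison, this is roughly how the upstream llama.cpp server exposes the feature today; the binary path, model path, and port are placeholders for my local build, and -np/--parallel is the upstream flag quoted above:

```python
# Launch the upstream llama.cpp server example with 4 parallel slots.
# Binary and model paths are assumptions for illustration only.
import subprocess

subprocess.run(
    [
        "./server",                     # llama.cpp examples/server binary (placeholder path)
        "-m", "models/7B/model.gguf",   # placeholder model path
        "-c", "4096",                   # total context, shared across the slots
        "-np", "4",                     # 4 slots -> up to 4 requests decoded in parallel
        "--host", "0.0.0.0",
        "--port", "8080",
    ],
    check=True,
)
```

As I understand it, this request is essentially about exposing an equivalent option (e.g. --n_parallel, as in the title) in llama-cpp-python's server and wiring it through to the underlying batching/KV-cache setup.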
cc @dorsey-crogl
Any update on this?
@abetlen would this be possible? I would really need the parallel processing feature...