[Feature Request] OpenAI-compatible `stop` param
Using the latest official Docker image, openmmlab/lmdeploy:v0.4.2, I served a Llama 2 model, and sent a request with the stop parameter of the /v1/completions endpoint set to ["\n\n"]. But the generation didn't stop at a double newline. It generated lots of paragraphs with double newlines between them and kept going until it reached the maximum generation length.
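A request along these lines reproduces it (just a sketch; the host/port and model name are placeholders for my setup):

```python
# Reproduction sketch: the stop parameter is sent but generation does not
# stop at "\n\n". Assumes an LMDeploy api_server at localhost:23333 serving
# a model registered as "llama2" (both are placeholders).
import requests

resp = requests.post(
    "http://localhost:23333/v1/completions",
    json={
        "model": "llama2",
        "prompt": "Write a few paragraphs about llamas.",
        "max_tokens": 512,
        "stop": ["\n\n"],  # expected: stop at the first blank line; observed: ignored
    },
)
print(resp.json()["choices"][0]["text"])
```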
I then saw in the docs that the stop param "Only accepts stop words that are encoded to one token index."
Not being able to stop at something as simple as "\n\n" with a Llama 2 model is a pretty serious flaw that makes it hard to use this in a production setting. It would be great if the stop param accepted arbitrary stop strings, the way OpenAI's does.
(Also, while I'm here, it would be very useful to have an include_stop_str_in_output option like vLLM's.)
Thanks!
Thank you for your feedback. I will take a look at include_stop_str_in_output. Since it is not clear whether OpenAI uses 'stop str' or 'stop token id', I will look into the behavior of their APIs with and without streaming.
Re: "it is not clear whether OpenAI uses 'stop str' or 'stop token id'": I forgot to mention that ["the"] works correctly as a stop param, so the fact that ["\n\n"] does not suggests this issue comes down to exact token matching/alignment.
ref https://help.openai.com/en/articles/5072263-how-do-i-use-stop-sequences-in-the-openai-api
ref vLLM include_stop_str_in_output
https://github.com/vllm-project/vllm/blob/c96fc067479453b02e92d9378eeeaebb6b3816de/vllm/sampling_params.py#L176-L183
Also, with vLLM, when finish_reason is "stop" because of a stop parameter such as stop: ["\n"], the EventStream message JSON also carries a stop_reason field, like this:
...
"finish_reason":"stop",
"stop_reason":"\n",
...
I.e., it indicates which stop string triggered the stop. That's a handy feature, and nice for compatibility with vLLM (it eases the transition), but not strictly necessary if the include_stop_str_in_output feature is implemented.
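Incidentally, a client can use that field to approximate include_stop_str_in_output on its own side. A rough sketch, assuming the finish_reason/stop_reason fields sit on the choice object as in the snippet above (the exact schema is an assumption on my part):

```python
# Sketch: re-append the stop string reported by the server, approximating
# include_stop_str_in_output client-side. Assumes vLLM-style chunks where
# "finish_reason" and "stop_reason" live under choices[0].
import json

def finalize(chunk_json: str, text_so_far: str) -> str:
    chunk = json.loads(chunk_json)
    choice = chunk["choices"][0]
    if choice.get("finish_reason") == "stop" and choice.get("stop_reason"):
        # The server trimmed the stop string; add it back for downstream use.
        return text_so_far + choice["stop_reason"]
    return text_so_far
```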
For others who are hitting this issue but still want to use LMDeploy: you can of course omit the stop parameter, check the full generated text for your stop strings each time you receive a new token, and manually abort the request when one of them is detected. That's my current workaround.
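Something like this (a sketch, assuming the OpenAI-compatible /v1/completions endpoint and the openai>=1.0 Python client; host/port, model name, and stop strings are placeholders):

```python
# Client-side stop-string workaround: stream without a `stop` param, scan the
# accumulated text for stop strings after every chunk, and abort when one hits.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")
STOP_STRINGS = ["\n\n"]  # multi-token stop sequences LMDeploy currently ignores

def generate(prompt: str, max_tokens: int = 512) -> str:
    text = ""
    stream = client.completions.create(
        model="llama2",
        prompt=prompt,
        max_tokens=max_tokens,
        stream=True,  # note: no `stop` param is sent to the server
    )
    for chunk in stream:
        text += chunk.choices[0].text
        for s in STOP_STRINGS:
            if s in text:
                # Close the HTTP stream; whether generation is aborted
                # server-side depends on how the server handles disconnects.
                stream.close()
                return text.split(s, 1)[0]
    return text

print(generate("Write a few paragraphs about llamas."))
```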
In LMDeploy, each word in the stop_words list is supposed to tokenize to ONE token id.
Words that tokenize into multiple tokens are not supported as stop words right now.
We have plans to resolve this, but it will take a while.
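A quick way to check which side of that constraint a given stop word falls on is to run it through the model's tokenizer. A sketch using transformers (the checkpoint name is only an example; any Llama 2 tokenizer works):

```python
# Sketch: check how candidate stop words tokenize. Per the comment above,
# a word only works as an LMDeploy stop word if it maps to a single token id.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

for word in ["the", "\n\n"]:
    ids = tok.encode(word, add_special_tokens=False)
    status = "usable as a stop word" if len(ids) == 1 else "currently ignored"
    print(f"{word!r} -> {ids} ({status})")
```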
Looking forward to this feature.
@lvhan028 Do we have plans for this feature now?