`completionEndpoint` should support `"stream": true`
This is currently not implemented, so the endpoint does not follow the OpenAI spec.
Investigation and discovery are in the discord here, but the crux of the matter is that `completionEndpoint` in `api/openai.go` simply doesn't support streaming.
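For context, when `"stream": true` is set, the OpenAI completions API returns the response incrementally as server-sent events: each event is a `data: {...}` line carrying a partial `text_completion` chunk, and the stream ends with `data: [DONE]`. Here is a minimal Go client sketch of that expected behavior; the endpoint URL, port, and model name are placeholder assumptions for a local deployment, not anything mandated by the spec:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Placeholder endpoint and model; adjust to your local deployment.
	body := `{"model":"ggml-model-q4_0.bin","prompt":"a long time ago in a galaxy far, far away","stream":true}`
	resp, err := http.Post("http://localhost:8080/v1/completions", "application/json", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// With streaming, the body is a sequence of SSE lines of the form
	// "data: {json chunk}", terminated by "data: [DONE]".
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data: ") {
			continue
		}
		payload := strings.TrimPrefix(line, "data: ")
		if payload == "[DONE]" {
			break
		}
		fmt.Println(payload) // each chunk carries a partial completion in choices[0].text
	}
}
```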
Discord thread copy/pasted below for accessibility:
athousandcups: I'm using my model that is `llama.cpp`-based, and sending `"stream": true`, but it's not streaming - you can see that it receives `"stream": true` but then sets `Stream:false` in the `Parameter Config`:
```
1:05AM DBG Request received: {"model":"ggml-model-q4_0.bin","file":"","language":"","response_format":"","size":"","prompt":"a long time ago in a galaxy far, far away","instruction":"","input":null,"stop":null,"messages":null,"stream":true,"echo":false,"top_p":0,"top_k":0,"temperature":0,"max_tokens":632,"n":0,"batch":0,"f16":false,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"mirostat_eta":0,"mirostat_tau":0,"mirostat":0,"seed":0,"mode":0,"step":0}
1:05AM DBG Parameter Config: &{OpenAIRequest:{Model:ggml-model-q4_0.bin File: Language: ResponseFormat: Size: Prompt:<nil> Instruction: Input:<nil> Stop:<nil> Messages:[] Stream:false Echo:false TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:632 N:0 Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 Seed:0 Mode:0 Step:0} Name: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 F16:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Completion: Chat: Edit:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 ImageGenerationAssets: PromptStrings:[a long time ago in a galaxy far, far away] InputStrings:[] InputToken:[]}
```
mudler: can you paste more logs? do you see a "Stream request received" message in the debug calls? token stream works here (see the chatbot-ui example)
athousandcups: I don't see that - here's the full relevant log:
```
1:05AM DBG Request received: {"model":"ggml-model-q4_0.bin","file":"","language":"","response_format":"","size":"","prompt":"a long time ago in a galaxy far, far away","instruction":"","input":null,"stop":null,"messages":null,"stream":true,"echo":false,"top_p":0,"top_k":0,"temperature":0,"max_tokens":632,"n":0,"batch":0,"f16":false,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"mirostat_eta":0,"mirostat_tau":0,"mirostat":0,"seed":0,"mode":0,"step":0}
1:05AM DBG Parameter Config: &{OpenAIRequest:{Model:ggml-model-q4_0.bin File: Language: ResponseFormat: Size: Prompt:<nil> Instruction: Input:<nil> Stop:<nil> Messages:[] Stream:false Echo:false TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:632 N:0 Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 Seed:0 Mode:0 Step:0} Name: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 F16:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Completion: Chat: Edit:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 ImageGenerationAssets: PromptStrings:[a long time ago in a galaxy far, far away] InputStrings:[] InputToken:[]}
1:05AM DBG Loading model 'ggml-model-q4_0.bin' greedly
1:05AM DBG Model 'ggml-model-q4_0.bin' already loaded
llama_print_timings: load time = 27870.98 ms
llama_print_timings: sample time = 440.84 ms / 632 runs ( 0.70 ms per token)
llama_print_timings: prompt eval time = 59073.37 ms / 263 tokens ( 224.61 ms per token)
llama_print_timings: eval time = 146611.34 ms / 630 runs ( 232.72 ms per token)
llama_print_timings: total time = 3903657.38 ms
1:08AM DBG Response: {"object":"text_completion","model":"ggml-model-q4_0.bin","choices":[{"text":" | but actually just yesterday... | in my head at least | I'm not really sure how to a
```
athousandcups: looking at the code, in `completionInput` we first call `readInput`, which correctly prints the `Request received:` log as having `"stream":true` - so somewhere between there and the `Parameter Config` debug, `Stream` either doesn't get parsed or gets reset to `false`.
ok I added a log `log.Debug().Msgf("input: %+v", input)` and it shows `Stream:true`.
it looks like what's happening is `readConfig` isn't finding a config file, so it's creating a new `&Config` with `OpenAIRequest: defaultRequest(modelFile)`, and `defaultRequest` seems to instantiate `Stream:false`, and then `updateConfig` doesn't seem to check if `input.Stream` is `true`.
oh wait! I think the problem here is that I'm using the `completionEndpoint`, but `input.Stream` is only checked in the `chatEndpoint` - so streaming is only supported in the `chatEndpoint`?
mudler: yes, correct, please open up a new issue - streaming should be supported on `completionEndpoint` too
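To make the diagnosis above concrete, here is a minimal, framework-agnostic sketch of what honoring `input.Stream` in a completion endpoint could look like. This is not LocalAI's actual code: the request/response types, the `net/http` plumbing, and the stand-in token generator are all illustrative assumptions; only the branch on `input.Stream` and the SSE `data:`/`[DONE]` framing mirror the behavior discussed in the thread.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal stand-in for the request fields relevant here.
type openAIRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"`
}

type choice struct {
	Text string `json:"text"`
}

// Shape modelled on the "text_completion" response visible in the logs above.
type completionResponse struct {
	Object  string   `json:"object"`
	Model   string   `json:"model"`
	Choices []choice `json:"choices"`
}

func completionEndpoint(w http.ResponseWriter, r *http.Request) {
	var input openAIRequest
	if err := json.NewDecoder(r.Body).Decode(&input); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Stand-in generator: a real implementation would run model inference
	// and deliver tokens on the channel as they are produced.
	tokens := make(chan string)
	go func() {
		defer close(tokens)
		for _, t := range []string{"a ", "long ", "time ", "ago"} {
			tokens <- t
		}
	}()

	// The crux of the issue: branch on input.Stream, as chatEndpoint does.
	if !input.Stream {
		var full string
		for t := range tokens {
			full += t
		}
		json.NewEncoder(w).Encode(completionResponse{
			Object: "text_completion", Model: input.Model,
			Choices: []choice{{Text: full}},
		})
		return
	}

	// Streaming path: emit each token as an SSE "data:" event and finish
	// with "data: [DONE]", matching the OpenAI streaming format.
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	for t := range tokens {
		payload, _ := json.Marshal(completionResponse{
			Object: "text_completion", Model: input.Model,
			Choices: []choice{{Text: t}},
		})
		fmt.Fprintf(w, "data: %s\n\n", payload)
		flusher.Flush() // push each chunk to the client immediately
	}
	fmt.Fprint(w, "data: [DONE]\n\n")
	flusher.Flush()
}

func main() {
	http.HandleFunc("/v1/completions", completionEndpoint)
	http.ListenAndServe(":8080", nil)
}
```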
This seems to be a duplicate of #406. I have implemented it and raised a PR. Please let me know if there are any changes to be made @samm81