`completionEndpoint` should support `"stream": true`
This is currently not implemented, so the endpoint does not follow the OpenAI spec.
Investigation and discovery are in the discord here, but the crux of the matter is that `completionEndpoint` in `api/openai.go` simply doesn't support streaming.
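For context, when `"stream": true` is set, the OpenAI completions API returns the response incrementally as server-sent events: each event is a `data: {...}` line carrying a partial `text_completion` chunk, and the stream ends with `data: [DONE]`. Here is a minimal Go client sketch of that expected behavior; the endpoint URL, port, and model name are placeholder assumptions for a local deployment, not anything mandated by the spec:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Placeholder endpoint and model; adjust to your local deployment.
	body := `{"model":"ggml-model-q4_0.bin","prompt":"a long time ago in a galaxy far, far away","stream":true}`
	resp, err := http.Post("http://localhost:8080/v1/completions", "application/json", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// With streaming, the body is a sequence of SSE lines of the form
	// "data: {json chunk}", terminated by "data: [DONE]".
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data: ") {
			continue
		}
		payload := strings.TrimPrefix(line, "data: ")
		if payload == "[DONE]" {
			break
		}
		fmt.Println(payload) // each chunk carries a partial completion in choices[0].text
	}
}
```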
Discord thread copy/pasted below for accessibility:
athousandcups: I'm using my model that is `llama.cpp`-based, and sending `"stream": true`, but it's not streaming - you can see that it receives `"stream": true` but then sets `Stream:false` in the `Parameter Config`:
```
1:05AM DBG Request received: {"model":"ggml-model-q4_0.bin","file":"","language":"","response_format":"","size":"","prompt":"a long time ago in a galaxy far, far away","instruction":"","input":null,"stop":null,"messages":null,"stream":true,"echo":false,"top_p":0,"top_k":0,"temperature":0,"max_tokens":632,"n":0,"batch":0,"f16":false,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"mirostat_eta":0,"mirostat_tau":0,"mirostat":0,"seed":0,"mode":0,"step":0}
1:05AM DBG Parameter Config: &{OpenAIRequest:{Model:ggml-model-q4_0.bin File: Language: ResponseFormat: Size: Prompt:<nil> Instruction: Input:<nil> Stop:<nil> Messages:[] Stream:false Echo:false TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:632 N:0 Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 Seed:0 Mode:0 Step:0} Name: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 F16:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Completion: Chat: Edit:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 ImageGenerationAssets: PromptStrings:[a long time ago in a galaxy far, far away] InputStrings:[] InputToken:[]}
```
mudler: can you paste more logs? do you see a "Stream request received" message in the debug calls? token stream works here (see the chatbot-ui example)
athousandcups: I don't see that - here's the full relevant log:
```
1:05AM DBG Request received: {"model":"ggml-model-q4_0.bin","file":"","language":"","response_format":"","size":"","prompt":"a long time ago in a galaxy far, far away","instruction":"","input":null,"stop":null,"messages":null,"stream":true,"echo":false,"top_p":0,"top_k":0,"temperature":0,"max_tokens":632,"n":0,"batch":0,"f16":false,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"mirostat_eta":0,"mirostat_tau":0,"mirostat":0,"seed":0,"mode":0,"step":0}
1:05AM DBG Parameter Config: &{OpenAIRequest:{Model:ggml-model-q4_0.bin File: Language: ResponseFormat: Size: Prompt:<nil> Instruction: Input:<nil> Stop:<nil> Messages:[] Stream:false Echo:false TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:632 N:0 Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 Seed:0 Mode:0 Step:0} Name: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 F16:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Completion: Chat: Edit:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 ImageGenerationAssets: PromptStrings:[a long time ago in a galaxy far, far away] InputStrings:[] InputToken:[]}
1:05AM DBG Loading model 'ggml-model-q4_0.bin' greedly
1:05AM DBG Model 'ggml-model-q4_0.bin' already loaded
llama_print_timings: load time = 27870.98 ms
llama_print_timings: sample time = 440.84 ms / 632 runs ( 0.70 ms per token)
llama_print_timings: prompt eval time = 59073.37 ms / 263 tokens ( 224.61 ms per token)
llama_print_timings: eval time = 146611.34 ms / 630 runs ( 232.72 ms per token)
llama_print_timings: total time = 3903657.38 ms
1:08AM DBG Response: {"object":"text_completion","model":"ggml-model-q4_0.bin","choices":[{"text":" | but actually just yesterday... | in my head at least | I'm not really sure how to a
```
athousandcups: looking at the code, in `completionInput` we first call `readInput`, which correctly prints the `Request received:` log as having `"stream":true` - so somewhere between there and the `Parameter Config` debug, `Stream` either doesn't get parsed or gets reset to `false`.
ok I added a log `log.Debug().Msgf("input: %+v", input)` and it shows `Stream:true`.
it looks like what's happening is `readConfig` isn't finding a config file, so it's creating a new `&Config` with `OpenAIRequest: defaultRequest(modelFile)`, and `defaultRequest` seems to instantiate `Stream:false`, and then `updateConfig` doesn't seem to check if `input.Stream` is `true`.
oh wait! I think the problem here is that I'm using the `completionEndpoint`, but `input.Stream` is only checked in the `chatEndpoint` - so streaming is only supported in the `chatEndpoint`?
mudler: yes, correct, please open up a new issue - streaming should be supported on `completionEndpoint` too
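To make the diagnosis above concrete, here is a minimal, framework-agnostic sketch of what honoring `input.Stream` in a completion endpoint could look like. This is not LocalAI's actual code: the request/response types, the `net/http` plumbing, and the stand-in token generator are all illustrative assumptions; only the branch on `input.Stream` and the SSE `data:`/`[DONE]` framing mirror the behavior discussed in the thread.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal stand-in for the request fields relevant here.
type openAIRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"`
}

type choice struct {
	Text string `json:"text"`
}

// Shape modelled on the "text_completion" response visible in the logs above.
type completionResponse struct {
	Object  string   `json:"object"`
	Model   string   `json:"model"`
	Choices []choice `json:"choices"`
}

func completionEndpoint(w http.ResponseWriter, r *http.Request) {
	var input openAIRequest
	if err := json.NewDecoder(r.Body).Decode(&input); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Stand-in generator: a real implementation would run model inference
	// and deliver tokens on the channel as they are produced.
	tokens := make(chan string)
	go func() {
		defer close(tokens)
		for _, t := range []string{"a ", "long ", "time ", "ago"} {
			tokens <- t
		}
	}()

	// The crux of the issue: branch on input.Stream, as chatEndpoint does.
	if !input.Stream {
		var full string
		for t := range tokens {
			full += t
		}
		json.NewEncoder(w).Encode(completionResponse{
			Object: "text_completion", Model: input.Model,
			Choices: []choice{{Text: full}},
		})
		return
	}

	// Streaming path: emit each token as an SSE "data:" event and finish
	// with "data: [DONE]", matching the OpenAI streaming format.
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	for t := range tokens {
		payload, _ := json.Marshal(completionResponse{
			Object: "text_completion", Model: input.Model,
			Choices: []choice{{Text: t}},
		})
		fmt.Fprintf(w, "data: %s\n\n", payload)
		flusher.Flush() // push each chunk to the client immediately
	}
	fmt.Fprint(w, "data: [DONE]\n\n")
	flusher.Flush()
}

func main() {
	http.HandleFunc("/v1/completions", completionEndpoint)
	http.ListenAndServe(":8080", nil)
}
```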
This seems to be a duplicate of #406. I have implemented it and raised a PR. Please let me know if there are any changes to be made @samm81