
Response streaming support for ollama

Open · abdonkov opened this issue 10 months ago • 1 comment

I think response streaming would be a great addition overall, but especially for the ollama integration. Since Gemini and OpenAI run in the cloud on beefy hardware, they are pretty fast and the waiting time is short. When running ollama locally, however, responses are usually slower, which can be a bit frustrating.

Specifically, when summarizing something you have to wait for the response to complete, and even with decent hardware the wait is not fun.

For example, on my machine I can easily reach 50-60 tokens per second, which is more than enough, since I definitely can't read that fast. But when generating a summary of something longer, I still have to wait 10-15 seconds for the full response. What I currently end up doing is copying the text, starting a new conversation in Open WebUI, and running my summarization prompt there; because it streams the response, it is still faster despite all the extra steps.

So I think implementing streaming would be a huge usability win: you would see the response immediately and could start reading while it is still generating.

I see that you are already using the ollama Python library for the integration, so the implementation should be easy, as their API is pretty straightforward.
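For reference, a minimal sketch of what this looks like with the ollama Python library: passing `stream=True` to `chat()` turns the response into an iterator of chunks (the model name and prompt below are just placeholders):

```python
# Streaming with the ollama Python library: pass stream=True and the
# call returns an iterator of partial responses instead of one blob.
import ollama

stream = ollama.chat(
    model="llama3.2",  # placeholder: any locally pulled model
    messages=[{"role": "user", "content": "Summarize the following text: ..."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries the next piece of the generated text.
    print(chunk["message"]["content"], end="", flush=True)
```

Each chunk arrives as soon as the model produces it, so the text can be displayed incrementally instead of after a 10-15 second wait.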

abdonkov · Mar 10 '25 14:03

Thanks for reaching out! I’ll work on adding response streaming to the pop-up window responses in the next (or an upcoming) version :)
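A hypothetical sketch of how streamed chunks could be fed into a pop-up window as they arrive, assuming a Qt-based UI via PySide6; the class name `StreamWorker` and the widget choice are illustrative assumptions, not taken from the WritingTools codebase:

```python
# Hypothetical sketch: a worker thread streams chunks from ollama and
# a Qt signal delivers each chunk to the GUI thread for display.
import sys

import ollama
from PySide6.QtCore import QThread, Signal
from PySide6.QtWidgets import QApplication, QPlainTextEdit


class StreamWorker(QThread):
    chunk_ready = Signal(str)  # emitted once per streamed chunk

    def __init__(self, model: str, prompt: str):
        super().__init__()
        self.model = model
        self.prompt = prompt

    def run(self):
        stream = ollama.chat(
            model=self.model,
            messages=[{"role": "user", "content": self.prompt}],
            stream=True,
        )
        for chunk in stream:
            self.chunk_ready.emit(chunk["message"]["content"])


if __name__ == "__main__":
    app = QApplication(sys.argv)
    view = QPlainTextEdit()
    view.setReadOnly(True)
    view.show()

    worker = StreamWorker("llama3.2", "Summarize: ...")  # placeholder model/prompt
    # insertPlainText appends at the cursor, so chunks accumulate in order.
    worker.chunk_ready.connect(view.insertPlainText)
    worker.start()
    sys.exit(app.exec())
```

Running the request on a worker thread and emitting one signal per chunk keeps the UI responsive; Qt queues the cross-thread signal onto the main thread, so the widget is only ever touched from the GUI thread.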

theJayTea · Mar 10 '25 19:03