Sticky Sessions

Open mcharytoniuk opened this issue 1 year ago • 3 comments

It should be possible to always direct requests to a specific slot and distribute them among all the observed servers.

Once a request is issued, all subsequent requests in that session should land in the same slot.

It can be implemented with a cookie.
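A minimal sketch of the cookie idea, not Paddler's actual implementation: the cookie name `paddler_slot`, the `Slot` type, and the `StickyRouter` class are all hypothetical names made up for illustration. The first request gets a slot by round-robin and a cookie naming it; later requests carrying that cookie are sent back to the same slot.

```python
from dataclasses import dataclass
from http.cookies import SimpleCookie
from itertools import cycle

COOKIE_NAME = "paddler_slot"  # hypothetical cookie name, not defined by Paddler


@dataclass(frozen=True)
class Slot:
    server: str   # name or address of an observed llama.cpp server
    slot_id: int  # slot number on that server


class StickyRouter:
    def __init__(self, slots):
        # Key each slot by "server#slot_id" so the cookie can name it.
        self._slots = {f"{s.server}#{s.slot_id}": s for s in slots}
        self._next_key = cycle(self._slots)  # round-robin over slot keys

    def route(self, cookie_header):
        """Return (slot, set_cookie); set_cookie is None if already pinned."""
        cookie = SimpleCookie(cookie_header or "")
        morsel = cookie.get(COOKIE_NAME)
        if morsel is not None and morsel.value in self._slots:
            # Known session: keep it on the slot it was given earlier.
            return self._slots[morsel.value], None
        # New (or unknown) session: pick the next slot and pin it.
        key = next(self._next_key)
        return self._slots[key], f"{COOKIE_NAME}={key}; Path=/; HttpOnly"
```

A balancer would call `route()` with the incoming `Cookie` header and, when the second return value is not `None`, send it back as a `Set-Cookie` header. This sketch ignores whether the pinned slot is busy or still available, which a real implementation would need to handle.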

See also:

  • https://github.com/trollkotze/llama.cpp/tree/control-vector-server
  • https://github.com/ggerganov/llama.cpp/pull/6289

Aug 06 '24 15:08 mcharytoniuk

I can take a crack at this in the next few days. @mcharytoniuk, I just had a couple questions about this:

  1. I'm a bit confused what this has to do with adding control vectors in the server. What is the relationship between control vectors and sticky sessions?
  2. What benefit would sticky sessions really provide? The slots are released after completion so prompts are processed independently and are stateless. Is this supposed to be some kind of performance optimization?

Sep 16 '24 02:09 VJHack

> I can take a crack at this in the next few days.

That would be awesome @VJHack . :)

> 1. I'm a bit confused what this has to do with adding control vectors in the server. What is the relationship between control vectors and sticky sessions?
> 2. What benefit would sticky sessions really provide? The slots are released after completion so prompts are processed independently and are stateless. Is this supposed to be some kind of performance optimization?

According to that llama.cpp PR about control vectors, once it is merged into llama.cpp it should be possible to configure each slot with a different control vector. There can be a scenario where a user configures several llama.cpp servers in the same way, with a specific control vector at the same slot number on each of them.

Then, when a specific cookie (or something similar) is present in the request, that request can be balanced only between slots with that specific control vector.

Overall, I have been thinking either about using cookies or about creating some Paddler-specific endpoints that allow tagging specific slots, and then issuing requests to slots that are configured with that specific tag.
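As a rough illustration of the tagging variant, assuming each slot carries a set of tags: `TaggedSlot`, `pick_slot`, and the tag names used with them are hypothetical and do not correspond to any existing Paddler or llama.cpp API. A request carrying a tag is balanced only among idle slots with that tag, preferring the server that currently has the most idle slots.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TaggedSlot:
    server: str                    # name or address of a llama.cpp server
    slot_id: int                   # slot number on that server
    tags: frozenset = frozenset()  # e.g. the control vector loaded in the slot
    busy: bool = False             # True while the slot is processing a request


def pick_slot(slots, tag):
    """Return an idle slot carrying `tag`, or None if there is none."""
    # Count idle slots per server so the least loaded server is preferred.
    idle_per_server = Counter(s.server for s in slots if not s.busy)
    candidates = [s for s in slots if tag in s.tags and not s.busy]
    if not candidates:
        return None
    return max(candidates, key=lambda s: idle_per_server[s.server])
```

The tag could come from a cookie, a header, or a Paddler-specific endpoint; the selection step stays the same either way.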

Sep 16 '24 11:09 mcharytoniuk

Sorry, I'd love to work on this but I'm quite busy at the moment. If someone else wants to take it on, they can. I'll come back and revisit it if it's still open later.

Sep 18 '24 03:09 VJHack

KV cache improvements are planned for 2.1

Aug 08 '25 20:08 mcharytoniuk