Configurable rate-limiting / queue policy for sequence batcher
**Is your feature request related to a problem? Please describe.**
As documented here, the dynamic batcher can be configured with a queue policy. There doesn't seem to be any equivalent for sequence batching, which means queueing cannot be disabled. As a result, a busy Triton server will keep accepting requests from clients even when it is currently unable to perform inference, which makes load balancing across multiple Triton servers non-trivial.
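For reference, the dynamic batcher's queue policy is set in the model configuration. A minimal sketch of what that looks like, based on my reading of the model config protobuf (exact field names and values should be checked against the docs), is:

```protobuf
# config.pbtxt (dynamic batcher) - reject requests rather than letting
# them queue indefinitely
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100

  default_queue_policy {
    # Reject requests whose queue time exceeds the timeout, instead of delaying them
    timeout_action: REJECT
    default_timeout_microseconds: 100000
    # Cap on how many requests may wait in the queue
    max_queue_size: 8
  }
}
```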
**Describe the solution you'd like**
Ideally, the sequence batcher would expose configurable options to control queueing behaviour, including disabling queueing and rejecting new connections.
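Purely to illustrate the ask, a hypothetical sketch is below; the `queue_policy` block under `sequence_batching` does not exist today, and its fields are made up by analogy with the dynamic batcher's `default_queue_policy`:

```protobuf
# HYPOTHETICAL config.pbtxt sketch - sequence_batching has no queue_policy
# today; this mirrors the dynamic batcher's default_queue_policy
sequence_batching {
  max_sequence_idle_microseconds: 5000000

  queue_policy {
    # Reject new requests instead of queueing them when the server is busy
    timeout_action: REJECT
    # Illustrative only: cap (or effectively disable) queueing of pending requests
    max_queue_size: 1
  }
}
```

With something like this, a client would get an immediate rejection it can use to fail over to another server, instead of the request sitting in a queue on an already-busy instance.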
**Describe alternatives you've considered**
- gRPC proxy with connection limits
  - It's not always possible to determine a limit ahead of time
  - Doesn't account for channel reuse
- Retry the connection on the client side if a certain timeout is exceeded while queueing
  - This introduces latency to inferences
Thanks for the feature request, marking it as an enhancement. @jbkyang-nvi FYI: not the same, but this is along the lines of the ask for a server-side timeout in the sequence batcher.
@GuanLuo @aw1cks any solution for this?
None that I'm aware of, unfortunately.