Configurable rate-limiting / queue policy for sequence batcher
**Is your feature request related to a problem? Please describe.**
As documented here, the dynamic batcher can be configured with a queue policy. There doesn't seem to be any equivalent for sequence batching, which means queueing cannot be disabled. As a result, a busy Triton server will keep accepting requests from clients even when it is currently unable to perform inference, which makes load balancing across multiple Triton servers non-trivial.
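For reference, the dynamic batcher's queue policy is set in the model configuration. A minimal sketch of what that looks like, based on my reading of the model config protobuf (exact field names and values should be checked against the docs), is:

```protobuf
# config.pbtxt (dynamic batcher) - reject requests rather than letting
# them queue indefinitely
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100

  default_queue_policy {
    # Reject requests whose queue time exceeds the timeout, instead of delaying them
    timeout_action: REJECT
    default_timeout_microseconds: 100000
    # Cap on how many requests may wait in the queue
    max_queue_size: 8
  }
}
```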
**Describe the solution you'd like**
Ideally, the sequence batcher would expose configurable options to control queueing behaviour, including disabling queueing and rejecting new connections.
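Purely to illustrate the ask, a hypothetical sketch is below; the `queue_policy` block under `sequence_batching` does not exist today, and its fields are made up by analogy with the dynamic batcher's `default_queue_policy`:

```protobuf
# HYPOTHETICAL config.pbtxt sketch - sequence_batching has no queue_policy
# today; this mirrors the dynamic batcher's default_queue_policy
sequence_batching {
  max_sequence_idle_microseconds: 5000000

  queue_policy {
    # Reject new requests instead of queueing them when the server is busy
    timeout_action: REJECT
    # Illustrative only: cap (or effectively disable) queueing of pending requests
    max_queue_size: 1
  }
}
```

With something like this, a client would get an immediate rejection it can use to fail over to another server, instead of the request sitting in a queue on an already-busy instance.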
**Describe alternatives you've considered**
- gRPC proxy with connection limits
  - It's not always possible to determine a limit ahead of time
  - Doesn't account for channel reuse
- Retry the connection on the client side if a certain timeout is exceeded while queueing
  - This introduces latency to inferences
Thanks for the feature request, marking it as an enhancement. @jbkyang-nvi FYI: not the same, but this is along the lines of the ask for a server-side timeout in the sequence batcher.
@GuanLuo @aw1cks any solution for this?
None that I'm aware of, unfortunately.