
Configurable rate-limiting / queue policy for sequence batcher

Open aw1cks opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe. As documented here, the dynamic batcher can be configured with a queue policy. There appears to be no equivalent for sequence batching. This means we cannot disable queueing, so a busy Triton server will keep accepting requests from clients even when it cannot currently perform inference, which makes load balancing across multiple Triton servers non-trivial.
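For reference, this is roughly what the dynamic batcher's queue policy looks like in a model's `config.pbtxt` (field names follow Triton's `model_config.proto`; the values here are illustrative only). The ask is for an equivalent knob under `sequence_batching`:

```proto
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
  default_queue_policy {
    # Reject (rather than delay) requests whose queue time exceeds the limit.
    timeout_action: REJECT
    default_timeout_microseconds: 1000000
    allow_timeout_override: true
    # Cap the queue length; requests beyond this are rejected immediately.
    max_queue_size: 16
  }
}
```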

Describe the solution you'd like Ideally, the sequence batcher would expose configurable options to control its queueing behaviour, including the ability to disable queueing entirely and reject new requests outright.

Describe alternatives you've considered

  • gRPC proxy with connection limits
    • It's not always possible to determine a limit ahead of time
    • Doesn't account for channel reuse
  • Retry connection on client side if a certain timeout is exceeded while queueing
    • This introduces latency to inferences
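The second alternative above could be sketched as a generic retry wrapper. This is a minimal illustration, not Triton-specific: `send_request` is a hypothetical callable standing in for e.g. a Triton client `infer()` call with a client-side timeout, assumed to raise `TimeoutError` when the deadline is exceeded.

```python
import time


def infer_with_retry(send_request, max_attempts=3, timeout_s=1.0, backoff_s=0.1):
    """Retry an inference call that may time out while queued server-side.

    `send_request` is a hypothetical stand-in for a client infer() call;
    it receives a timeout in seconds and raises TimeoutError on expiry.
    """
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return send_request(timeout_s)
        except TimeoutError as exc:
            last_exc = exc
            # Exponential backoff between attempts. Note that every retry
            # adds end-to-end latency -- the drawback noted above.
            time.sleep(backoff_s * (2 ** attempt))
    raise last_exc
```

The drawback remains visible in the sketch: each timed-out attempt plus backoff is pure added latency, which is why a server-side reject policy would be preferable.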

aw1cks avatar Mar 08 '23 12:03 aw1cks

Thanks for the feature request, marking it as an enhancement. @jbkyang-nvi FYI, not the same, but this is along the lines of the ask for a server-side timeout in the sequence batcher.

GuanLuo avatar Mar 10 '23 23:03 GuanLuo

@GuanLuo @aw1cks any solution for this?

rizwanishaq avatar Jan 29 '24 16:01 rizwanishaq

None that I'm aware of, unfortunately

aw1cks avatar Jan 31 '24 11:01 aw1cks