
Re-evaluate use of `repeatWhenEmpty()` in Azure Core.

Open vcolin7 opened this issue 3 years ago • 0 comments

Issues #30466 and #28364 demonstrated that when a client is shared among multiple threads, all of those threads access the same access token cache whenever they perform an operation that communicates with the service. If the cache is empty or the token it contains has expired, the next attempt to read it triggers a request to the authentication service for a new token. If multiple threads attempt this "simultaneously", the first one to get hold of the cache communicates with the service while all the others wait for it to finish and then access the token in a synchronized fashion.

The way this mechanism is set up puts all waiting threads in an async-busy-loop. That would normally not be a problem if the token retrieval operation were guaranteed to be very short, but this is not the case with network operations: the service can sometimes take more than a few seconds to respond, or the connection can be dropped in the middle of the process, to name a few problems. As a result, if the retrieval performed by the first thread takes a long time, an application with a significant number of waiting threads can consume virtually all CPU cycles on a machine or container with limited resources, such as a Docker container or a virtual machine.
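To make the cost concrete, here is a minimal plain-thread analogy of the pattern described above. It is not the Reactor code in Azure Core: `SpinningTokenCache`, `runDemo`, and the `emptyPolls` counter are hypothetical names introduced for illustration, and the slow network call is simulated with a latch. One thread wins the right to fetch the token, and every other thread re-checks the empty cache in a loop with no delay.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch: a cache whose waiters busy-loop while one thread fetches.
final class SpinningTokenCache {
    private final AtomicReference<String> token = new AtomicReference<>();
    private final AtomicBoolean refreshing = new AtomicBoolean(false);
    private final AtomicLong emptyPolls = new AtomicLong();
    private final Supplier<String> slowFetch;

    SpinningTokenCache(Supplier<String> slowFetch) {
        this.slowFetch = slowFetch;
    }

    String getToken() {
        while (true) {
            String cached = token.get();
            if (cached != null) {
                return cached;
            }
            if (refreshing.compareAndSet(false, true)) {
                try {
                    // Re-check: another thread may have finished a fetch
                    // between our read above and winning the flag.
                    String recheck = token.get();
                    if (recheck != null) {
                        return recheck;
                    }
                    String fresh = slowFetch.get();
                    token.set(fresh);
                    return fresh;
                } finally {
                    refreshing.set(false);
                }
            }
            // Someone else is fetching: count the wasted poll and retry at once.
            emptyPolls.incrementAndGet();
        }
    }

    long emptyPolls() {
        return emptyPolls.get();
    }

    // Runs the scenario with threadCount threads; returns {fetchCount, emptyPolls}.
    static long[] runDemo(int threadCount) {
        if (threadCount < 2) {
            throw new IllegalArgumentException("need at least one waiting thread");
        }
        CountDownLatch release = new CountDownLatch(1);
        AtomicLong fetches = new AtomicLong();
        SpinningTokenCache cache = new SpinningTokenCache(() -> {
            fetches.incrementAndGet();
            awaitQuietly(release); // stands in for a slow network call
            return "token";
        });
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(cache::getToken);
            threads[i].start();
        }
        // Keep the "network call" pending until at least one waiter has spun.
        while (cache.emptyPolls() == 0) {
            Thread.yield();
        }
        release.countDown();
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new long[] {fetches.get(), cache.emptyPolls()};
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `SpinningTokenCache.runDemo(4)` performs a single fetch, while the empty-poll count grows for as long as the fetch stays pending. That growth is the CPU cost the issue describes, scaled by the number of waiting threads.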

A similar state can arise when a client is shared among multiple threads that perform long-running operations, because `PollerFlux` also makes use of `Mono.repeatWhenEmpty()`, as seen here. An async-busy-loop is less likely to occur in this scenario because an instance of `PollerFlux` is rarely shared among threads, but it is certainly a possibility.

We should revisit the use of this mechanism and evaluate whether there are better ways to avoid this problem than adding a delay, as done here.
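One possible direction, sketched here with plain `java.util.concurrent` rather than Reactor, is to let waiters share a single in-flight result instead of polling for it. This is only an illustration under stated assumptions, not a proposal for the actual Azure Core change: `SharedRefreshTokenCache` and its `fetch` supplier are hypothetical, tokens are plain strings, and expiry handling is omitted.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch: callers share one pending future instead of re-polling.
final class SharedRefreshTokenCache {
    private final AtomicReference<CompletableFuture<String>> inFlight =
        new AtomicReference<>();
    private final Supplier<CompletableFuture<String>> fetch;

    SharedRefreshTokenCache(Supplier<CompletableFuture<String>> fetch) {
        this.fetch = fetch;
    }

    CompletableFuture<String> getToken() {
        // This loop only retries a lost compare-and-set; it never waits
        // for the fetch itself to complete.
        while (true) {
            CompletableFuture<String> existing = inFlight.get();
            if (existing != null && !existing.isCompletedExceptionally()) {
                return existing; // pending or successful: share it
            }
            CompletableFuture<String> placeholder = new CompletableFuture<>();
            if (inFlight.compareAndSet(existing, placeholder)) {
                try {
                    fetch.get().whenComplete((token, error) -> {
                        if (error != null) {
                            placeholder.completeExceptionally(error);
                        } else {
                            placeholder.complete(token);
                        }
                    });
                } catch (RuntimeException e) {
                    // The supplier failed before returning a future.
                    placeholder.completeExceptionally(e);
                }
                return placeholder;
            }
        }
    }
}
```

With this shape, every caller that arrives during a refresh receives the same future and is notified once when it completes, so the waiting cost does not grow with the duration of the network call. A failed fetch is replaced on the next call rather than cached.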

vcolin7 · Sep 23 '22 01:09