gRPC: some streams get stuck when there are many long-running streams
Describe the bug
I am measuring the performance of YARP as a reverse proxy for gRPC, using the test assets from grpc-dotnet. It turns out that when there are many long-running gRPC streams (I'm using 10 clients * 64 connections * 100 streams), some streams are permanently stuck while others work well.
Specifically, there are 10 gRPC clients and 1 gRPC server. Each client keeps 64 connections * 100 long-running streams to the server, through YARP. In each stream, they communicate in a ping-pong manner: the client sends a frame, the server replies with a frame, the client sends the next frame, and so on (sketched below).
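For illustration, here is a minimal sketch of the per-stream ping-pong loop. The service `PingPong`, method `Echo`, and message type `Frame` are hypothetical stand-ins for whatever the grpc-dotnet test assets actually define:

```csharp
// Minimal sketch of one ping-pong stream, assuming a hypothetical
// duplex-streaming RPC: rpc Echo(stream Frame) returns (stream Frame).
using System.Threading;
using Grpc.Core;
using Grpc.Net.Client;

using var channel = GrpcChannel.ForAddress("https://yarp-proxy:5001");
var client = new PingPong.PingPongClient(channel); // hypothetical generated client

using var call = client.Echo(); // open the bidirectional stream
for (var i = 0; ; i++)
{
    // The client sends one frame...
    await call.RequestStream.WriteAsync(new Frame { Sequence = i });
    // ...and waits for the server's frame before sending the next one.
    if (!await call.ResponseStream.MoveNext(CancellationToken.None))
        break; // server closed the stream
}
await call.RequestStream.CompleteAsync();
```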
Overall, the proxy handles 38k frames per second, which is good. But in my 1-minute experiment, several streams manage to send thousands of frames while the others are still waiting for their first response frame - that is, most streams are totally stuck while YARP keeps processing frames from just a few streams.
I would expect frames in this scenario to be processed in an approximately FIFO manner - a frame should not wait for a very long time while a great many frames that arrived after it (on other streams) are being processed.
Further experiment
I also ran a long experiment lasting 6000 s. Most streams are eventually processed, but with very imbalanced throughput: over 50k requests on some connections but only hundreds on others. On the worst connection there were only about 100 requests over the entire run - all the streams on that connection were totally stuck.
To Reproduce
I am running the proxy-yarp-grpc scenario from https://github.com/aspnet/Benchmarks/blob/main/scenarios/proxy.grpc.benchmarks.yml, using crank.
Further technical details
- The platform: Linux
My Hypothesis
YARP seems to process requests in arrival order, but from the proxy's point of view an entire stream may count as a single "request". If so, YARP may put all its effort into a few streams because it believes it is finishing a "request", while leaving all other "requests" untouched. Since those "requests" are actually long-running gRPC streams, they never complete, so YARP never gets around to the other streams as long as a few streams keep it busy.
@davidfowl @MihaZupan @JamesNK
We should run the gRPC benchmark without YARP as the next step, and if the problem still occurs, then this issue should be transferred to aspnetcore for further investigation.
My guess is that the 100 max concurrent streams limit is being hit. See https://docs.microsoft.com/en-us/aspnet/core/grpc/performance?view=aspnetcore-6.0#connection-concurrency
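For context, that limit is Kestrel's default of 100 concurrent HTTP/2 streams per connection. A sketch of where it would be raised on the proxy, with an illustrative value:

```csharp
// Sketch: raising Kestrel's per-connection HTTP/2 stream limit.
// The default is 100; each active gRPC stream occupies one HTTP/2 stream.
var builder = WebApplication.CreateBuilder(args);
builder.WebHost.ConfigureKestrel(options =>
{
    options.Limits.Http2.MaxStreamsPerConnection = 200; // illustrative value
});
var app = builder.Build();
app.Run();
```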
@JamesNK As I understand it, the 100 concurrent stream limit applies per connection, but there seems to be no limit on the number of connections.
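The connection-concurrency section linked above also describes a client-side counterpart: allowing SocketsHttpHandler to open additional HTTP/2 connections once a connection's stream limit is saturated. Roughly:

```csharp
using System.Net.Http;
using Grpc.Net.Client;

// Sketch: let the client open extra HTTP/2 connections when the
// server's concurrent-stream limit on one connection is reached.
var channel = GrpcChannel.ForAddress("https://yarp-proxy:5001", new GrpcChannelOptions
{
    HttpHandler = new SocketsHttpHandler
    {
        EnableMultipleHttp2Connections = true
    }
});
```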
@adityamandaleeka I ran the experiments without YARP, and the problem does not happen: all frames get responses with reasonable latency.
Thanks @SleepyBag. We'll try to investigate this.