Load balancer gives back same instance under concurrent conditions
Please correct me if I'm missing something...
Based on looking at the code for both LoadBalancerClientFilter and ReactiveLoadBalancerClientFilter, and the LoadBalancer implementations used by each (RibbonLoadBalancerClient and RoundRobinLoadBalancerClient), I believe there may be a subtle concurrency quirk that users might find surprising at first.
Since each of the LoadBalancerClients comes from bean definitions created within an application context specific to that service, we can fairly safely say that the LoadBalancerClient is a singleton within a service boundary. With that in place, if there is sufficient concurrency flowing through that bean in combination with the Retry filter, it's possible for a request to be routed back to the same instance it has just failed on. From a user standpoint, I think most developers would expect the retry to land on the other instance, since that's what happens under low concurrency [as a result of the load balancer using a round-robin strategy by default].
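To illustrate the effect in isolation (this is not the actual Spring Cloud LoadBalancer code, just a simplified model with made-up instance names): when the round-robin position is shared by all requests for a service, a concurrent selection that happens between the failure and the retry can wrap the position back around so the retry lands on the instance that just failed.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for a singleton round-robin load balancer over two instances.
class SharedRoundRobin {
    private final AtomicInteger position = new AtomicInteger(0);
    private final List<String> instances = List.of("instance-A", "instance-B");

    String choose() {
        int pos = Math.abs(position.getAndIncrement());
        return instances.get(pos % instances.size());
    }
}

class RetrySameInstanceDemo {
    public static void main(String[] args) {
        SharedRoundRobin lb = new SharedRoundRobin();

        String first = lb.choose();  // request R1 is routed to instance-A and fails
        lb.choose();                 // a concurrent request R2 advances the shared position
        String retry = lb.choose();  // R1's retry wraps around to instance-A again

        System.out.println(first + " -> retried on " + retry); // instance-A -> retried on instance-A
    }
}
```

With two instances, any even number of interleaved selections between the failure and the retry reproduces this; under low concurrency there is usually nothing in between, so the retry lands on the other instance as expected.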
It'd be good to either document this as an edge case or keep the load balancer instance within these two filters, so that the positional instance selection can be maintained, allowing for a retry-on-next-server style of interaction.
I'll be honest, it's a bit difficult to specifically recreate this issue with a demo project due to its concurrent nature, but if that's necessary I can attempt to put something together.
It's not just a function of concurrency but of the number of instances as well. The load balancer instance is a singleton per service, so I don't think hanging on to it in the filter will make much difference.
Random also won't work well for n=2
Correct, it is a matter of the number of instances. In my situation I've got services with 2 instances, so it comes out as a 50/50 split. With sufficient concurrency, even larger clusters would hit similar situations, just less and less likely as the size of the cluster increases: the probability of the retry landing on the same instance is roughly 1 / (number of instances), so with 2 instances it's about 50%.
Also correct that, since the load balancer holds onto its own state and is itself a singleton, keeping a reference to it doesn't really help either. I know that with Zuul, a unique load balancer instance is created per request with the state of the current position, so the client can retry on the next server within its isolated load balancer.
Which is #1370, right?
Noticed this while coming up with the details around #1370, so they're definitely related. With 1370, it's a matter of being able to evict a server instance because it has been failing: maybe as a result of network segmentation where the gateway isn't able to contact it, or the service instance throwing errors because it can't reach a resource it needs, etc. (insert other failure modes).
This ticket is around that same concept, but from the standpoint of a request and the end user's (in this case, developer's) expectations about which instance would be routed to in the event of a retry. My thought process here was less about the 500-server-error side of the coin and more about a service not responding quickly enough, so that we hit the response timeout and (expected to) retry on the next server.
A secondary item noticed here is that with the default retry policy, it does in fact retry all requests that encountered a timeout, irrespective of HTTP method. The documentation seemed to state that any request other than GET wouldn't be retried in the exceptional case. I think it's pretty dangerous to be retrying POST, PUT, PATCH, etc. requests on a simple timeout (the prime way to reproduce this is to set response-timeout to a low value and use a service that sometimes responds more slowly than that). Though to be honest, this part is probably worthy of a whole separate ticket entirely.
If you all would prefer to collapse the two down into one ticket, that's fine by me. I just happen to see them as two facets of load balancing across instances.
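For reference on the retried-methods point, here's roughly how I'd expect to pin the Retry filter to GET explicitly via the Java route DSL (just a sketch; the route id, path, and lb://slow-service target are made up for illustration):

```java
import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.HttpMethod;
import org.springframework.http.HttpStatus;

@Configuration
public class RetryRouteConfig {

    @Bean
    public RouteLocator retryRoute(RouteLocatorBuilder builder) {
        return builder.routes()
            .route("slow-service-retry", r -> r.path("/slow/**")
                .filters(f -> f.retry(retry -> {
                    retry.setRetries(2);
                    retry.setMethods(HttpMethod.GET);                  // only retry GET requests
                    retry.setSeries(HttpStatus.Series.SERVER_ERROR);   // only retry 5xx responses
                }))
                .uri("lb://slow-service"))
            .build();
    }
}
```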
It shouldn't retry anything but GET by default. Retry in gateway doesn't distinguish between a 500 or a timeout.
Zuul uses a similar mechanism for caching the load balancer client as gateway does, so I'm unsure what happens there besides the ribbon metrics.
> it shouldn't retry anything but GET by default. Retry in gateway doesn't distinguish between a 500 or a timeout.
Here's a sample showing this.
https://github.com/shanman190/spring-cloud-gateway-gh1371
Please open a separate issue for the retry method behavior.
I wonder if we could add some simple logic to the load balancer that queries it again until we get a new instance if it gave us the instance we just tried (obviously with some conditions around this, like only doing it if there is more than one instance and only a couple of times).
I think that could work as a short-term fix, but in all likelihood it might be easier to treat/implement the LoadBalancer that is returned, or create a LoadBalancer, as a stateful one so that it can more easily avoid the instance out of the gate. Remaining stateless and not implementing a fair concurrency model could result in a lot of unnecessary spinning, which would eat into the performance gains of the reactive context beneath SCG without short circuits like you mentioned.
To be honest, this could even be a wrapped LoadBalancer around what is provided from SC LoadBalancer, purpose-built for the semantics of SCG. That would keep the code changes isolated to SCG if LoadBalancer is intended to stay stateless, albeit the decision to take that direction might be odd given that a user may expect the general-purpose LoadBalancer to be stateful in its usage unless it is used as a singleton.
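Roughly what I'm picturing, combining your re-query idea with the wrapper approach (purely a sketch; AvoidPreviousInstanceLoadBalancer isn't an existing class, and how the failed instance id actually gets threaded through from the Retry filter is exactly the open question here):

```java
import org.springframework.cloud.client.ServiceInstance;
import org.springframework.cloud.client.loadbalancer.Request;
import org.springframework.cloud.client.loadbalancer.Response;
import org.springframework.cloud.loadbalancer.core.ReactorServiceInstanceLoadBalancer;
import reactor.core.publisher.Mono;

// Illustration only: wraps a delegate load balancer and re-queries it a bounded
// number of times when it hands back the instance the previous attempt failed on.
public class AvoidPreviousInstanceLoadBalancer implements ReactorServiceInstanceLoadBalancer {

    private static final int MAX_RESELECTIONS = 3;

    private final ReactorServiceInstanceLoadBalancer delegate;

    public AvoidPreviousInstanceLoadBalancer(ReactorServiceInstanceLoadBalancer delegate) {
        this.delegate = delegate;
    }

    @Override
    public Mono<Response<ServiceInstance>> choose(Request request) {
        // Without knowledge of a previously failed instance, just defer to the delegate.
        return delegate.choose(request);
    }

    // The failed instance id is simply passed in here; wiring it through from the
    // Retry filter is the part that still needs a design.
    public Mono<Response<ServiceInstance>> choose(Request request, String failedInstanceId) {
        return chooseAvoiding(request, failedInstanceId, MAX_RESELECTIONS);
    }

    private Mono<Response<ServiceInstance>> chooseAvoiding(Request request, String failedInstanceId,
            int attemptsLeft) {
        return delegate.choose(request).flatMap(response -> {
            // Stop re-selecting if there is nothing to avoid, no instance was returned,
            // or the bounded number of attempts is exhausted.
            if (failedInstanceId == null || !response.hasServer() || attemptsLeft <= 1) {
                return Mono.just(response);
            }
            ServiceInstance chosen = response.getServer();
            if (!failedInstanceId.equals(chosen.getInstanceId())) {
                return Mono.just(response);
            }
            // Same instance we just failed on: ask the delegate again.
            return chooseAvoiding(request, failedInstanceId, attemptsLeft - 1);
        });
    }
}
```

A real version would presumably also short-circuit when only one instance is registered; the bounded re-selection count here is just a crude stand-in for that.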
This is somewhat related to spring-cloud/spring-cloud-commons#659, but there I've rather been thinking of retrying on any instance other than the one chosen previously and not recording stats in the long run; if there are enough compelling arguments, though, it might be better to go with the stats; @spencergibb @ryanjbaxter wdyt? Then we would handle it in the LoadBalancer (only for SC LoadBalancer, though) and could then move this issue over.
Wouldn't Ribbon use something like a ping URL to ping the server to determine whether it is up or not? Couldn't something like that solve this issue?
It could do that. The Netflix Ribbon code did have a ping check, but I believe it was not the default. I have a suspicion that relates pretty directly to the organic vs. synthetic metrics debate, not to mention the cascading failure problem (the immediate server we are talking to is "up", but the next one down the chain isn't, and the immediate one isn't shielding us as the consumer from those errors).
The other part, if I recall my conversation with one of their devs correctly, was that in some cases they would use metrics from their monitoring tool as an indication that the target server instance was possibly not the best one to send the request to, and instead route it to a different instance with fewer active requests.
@ryanjbaxter @shanman190 I'm now working on an issue for adding a ping verification mechanism. This could be used here. However, indeed, it will not give us as broad an understanding as a stats mechanism, and I'm not sure it would add any value at all, as the current base LoadBalancer implementation already retrieves instances from service discovery each time an instance is chosen (which does not happen in Ribbon).
Another issue, for adding stats, has been created. We now need to discuss it within the team. As more questions are appearing about it, we might add it to the backlog and then use it here.
@shanman190 You can see if spring-cloud-circuitbreaker-instances can solve your problem
Using InstancesCircuitBreaker to break the circuit for a single instance, in cooperation with retry so that the request is retried when that instance's circuit is open, could this problem be solved?
Hi @shanman190, I recently faced this issue, and here's what I did. The circuit breaker doesn't switch to other instances if one of them is down; it is implemented that way. But retry works slightly differently: it does allow us to redirect to another instance if one of them is not functioning correctly. I switched the aspect order of the circuit breaker and retry, and provided a fallback to the circuit breaker, and that solved my problem. However, one thing to note here is that a failed attempt by retry will be counted as a failed attempt for the circuit breaker as well!! Code changes:
```properties
resilience4j.circuitbreaker.circuitBreakerAspectOrder=1
resilience4j.retry.retryAspectOrder=2
```
AND
```java
@Retry(name = "demo-retry-config")
@CircuitBreaker(name = "demo-circuitbreaker-config", fallbackMethod = "fallbackDemo1Hello")
```
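For completeness, here's roughly what the annotated method and its fallback look like in a setup like mine (the class name, method, and the http://demo1/hello call are just illustrative):

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class DemoClient {

    private final RestTemplate restTemplate = new RestTemplate();

    // With the aspect orders above, retry wraps the circuit breaker, so a failed
    // attempt is re-tried while still being recorded against the breaker.
    @Retry(name = "demo-retry-config")
    @CircuitBreaker(name = "demo-circuitbreaker-config", fallbackMethod = "fallbackDemo1Hello")
    public String demoHello() {
        return restTemplate.getForObject("http://demo1/hello", String.class);
    }

    // The fallback must match the return type and take the exception as its last parameter.
    public String fallbackDemo1Hello(Throwable t) {
        return "demo1 is currently unavailable";
    }
}
```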
I hope this will resolve your issue!!!