Activator continues to route requests to non-existent pods.
What version of Knative?
1.1.4, 1.10.6
Expected Behavior
If a pod is not reachable, the activator should remove it from the healthy list when serving requests. This would keep a bad pod out of the healthy set in cases where endpoint updates lag behind cluster state, or where some other issue causes the activator's tracked state to be out of date. It would make the activator more robust to these sorts of issues in the sense that it would not keep trying to route to something it cannot connect to. (This of course does not address things like potentially being unaware of new pods, etc.)
NOTE: I took a quick look to see if there were changes in newer versions that might address this; I did not see any changes directly in this codebase.
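To make the idea concrete, here is a rough sketch of the behavior I have in mind. The `healthySet`, `markUnreachable`, and `proxyTo` names are purely illustrative and do not correspond to the actual activator throttler/load-balancer types; this is just the shape of "prune a dest on dial error", not a proposed implementation:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sync"
)

// healthySet tracks the pod dests that would be considered routable.
// (Illustrative only; not the activator's real data structure.)
type healthySet struct {
	mu    sync.RWMutex
	dests map[string]struct{}
}

func newHealthySet(dests ...string) *healthySet {
	s := &healthySet{dests: make(map[string]struct{}, len(dests))}
	for _, d := range dests {
		s.dests[d] = struct{}{}
	}
	return s
}

// markUnreachable drops a dest so later requests are not routed to it, even if
// the endpoints watch has not (yet) reported the pod as gone.
func (s *healthySet) markUnreachable(dest string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.dests, dest)
}

func (s *healthySet) size() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.dests)
}

// proxyTo simulates forwarding a request to dest and pruning it on a dial error.
func proxyTo(s *healthySet, dest string, dial func(network, addr string) (net.Conn, error)) error {
	conn, err := dial("tcp", dest)
	if err != nil {
		// A dial failure (timeout, connection refused, no route) is a strong
		// signal the pod is unusable; stop treating it as healthy instead of
		// waiting for an endpoints update that may never arrive.
		s.markUnreachable(dest)
		return fmt.Errorf("error reverse proxying request: %w", err)
	}
	defer conn.Close()
	return nil
}

func main() {
	set := newHealthySet("10.0.0.12:8012", "10.0.0.13:8012")
	// A dialer that always fails stands in for a pod that no longer exists.
	failingDial := func(network, addr string) (net.Conn, error) {
		return nil, &net.OpError{Op: "dial", Net: network, Err: errors.New("i/o timeout")}
	}
	_ = proxyTo(set, "10.0.0.12:8012", failingDial)
	fmt.Println("healthy dests remaining:", set.size())
}
```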
Actual Behavior
The pod is only removed when the endpoints list updates the pod's status or the pod is added/removed. In instances where a watch "goes bad" and misses updates, this means the activator keeps serving to pods that should have been removed from the healthy set.
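For context, this kind of backend tracking only changes when the endpoints informer fires its callbacks. The simplified client-go sketch below is not the actual Knative wiring (the `endpointsUpdated` function here is just a stand-in for the activator's handler of the same name), but it shows why a watch that silently stops delivering events leaves the tracked set frozen:

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// endpointsUpdated stands in for the activator's handler; it would recompute
// the healthy dest set for a revision from the Endpoints object.
func endpointsUpdated(obj interface{}) {
	ep, ok := obj.(*corev1.Endpoints)
	if !ok {
		return
	}
	_ = ep // recompute healthy dests here
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	epInformer := factory.Core().V1().Endpoints().Informer()
	epInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    endpointsUpdated,
		UpdateFunc: func(_, newObj interface{}) { endpointsUpdated(newObj) },
		DeleteFunc: func(obj interface{}) { /* drop all dests for this revision */ },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	// The healthy set is only ever corrected from the handlers above; if the
	// underlying watch stalls, no events arrive and the state goes stale.
	select {}
}
```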
For instance, this particular case illustrates the behavior well:
Loki query with added logging showing when `func (rbm *revisionBackendsManager) endpointsUpdated()` is called:
Note that one activator suddenly has fewer updates at 4:40.
The corresponding dial errors seen while proxying requests, taken from log entries with the message "error reverse proxying request" and a dial error in the error field:
The activator that shows the drop in updates also observes an increasing dial error rate around that time. At some point the rest of the activators also show similar behavior (though watch update counts are less useful as an indicator at that point, since the whole fleet looks similar).
The dial errors broken down by type:
This shows dial timeout being the most frequently returned error in this case, with "no connection to host" second. Connection-refused errors could be unrelated, but they are also an indication of an unusable endpoint.
Inspecting the logs and correlating the pod IPs to actual pods shows that most of these dial timeout errors align with the activator attempting to use a pod that no longer exists. This can persist for minutes in some cases (and was happening in the particular graphed case above). It is not restricted to just one revision or kservice. In most cases, if the activator removed the pods producing dial errors from the healthy pool, it would at least stop routing to them continually until its internal endpoint state got back into sync with the cluster (if it ever does...).
In these situations, restarting the activator resolves the issue almost all of the time. That is expected, since the endpoint state, which appears to be the cause of getting into this situation, is fully re-acquired with a list before watching again.
Steps to Reproduce the Problem
We have noticed this primarily when watch updates appear to have been missed via some mechanism I am not 100% sure of; it could be network, server, or client side issues. This should be testable/mockable by preventing updates from reaching the endpoints informer.
- Scale a Knative service and verify the backends are working
- Prevent endpoint updates from reaching the processing code in the activator (a sketch of one way to simulate this is below)
- Wait for pods to scale down
- Try using the service and observe the activator continually attempting to route to pods that no longer exist
Alternatively, it may be possible to test this using some other mechanism to slow down endpoint updates so that they lag actual cluster state long enough to allow seeing the behavior.
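As a starting point, here is a rough sketch of the fake-client approach using client-go's fake clientset with a watch reactor that never delivers events. The object names and structure are illustrative and not taken from the Knative test suite; the point is only that the informer sees the initial list but never the later delete:

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	k8stesting "k8s.io/client-go/testing"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Initial cluster state: one Endpoints object with a ready address.
	ep := &corev1.Endpoints{
		ObjectMeta: metav1.ObjectMeta{Name: "my-ksvc-00001", Namespace: "default"},
		Subsets: []corev1.EndpointSubset{{
			Addresses: []corev1.EndpointAddress{{IP: "10.0.0.12"}},
		}},
	}
	client := fake.NewSimpleClientset(ep)

	// Replace the default watch with one that never emits events: changes made
	// to the fake cluster after this point are never observed by the informer,
	// simulating a watch that has "gone bad".
	stalled := watch.NewFake()
	client.PrependWatchReactor("endpoints", k8stesting.DefaultWatchReactor(stalled, nil))

	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Core().V1().Endpoints().Informer()
	var updates atomic.Int32
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, _ interface{}) { updates.Add(1) },
		DeleteFunc: func(_ interface{}) { updates.Add(1) },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// "Scale down": delete the Endpoints object from the fake cluster. With a
	// healthy watch this would trigger DeleteFunc; with the stalled watch it
	// does not, so anything routing on the informer's state stays stale.
	_ = client.CoreV1().Endpoints("default").Delete(context.TODO(), "my-ksvc-00001", metav1.DeleteOptions{})

	time.Sleep(time.Second)
	fmt.Println("updates observed:", updates.Load()) // stays 0
}
```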