Network calls hang forever
Can anyone confirm what the new behavior of watch() is in 1.2.0, given this fix: #2367? I'm on 1.0.0 and I'm seeing my watches randomly hang forever, ignoring timeoutInSeconds and everything else. I've seen multiple pods do that at the same time, so I assume it's triggered by networking issues. What is the new behavior in that scenario in 1.2.0? Is there now a guarantee that the watch will fail within 30 seconds of such a connection breakage? Can I trust it to work now? I almost shipped this to all my customers today and only noticed it at the last moment by pure chance; it happens rarely. @cjihrig @rossanthony @brendandburns
It should break the connection based on #2367 but you should definitely test and validate. Watch behavior on long-hanging connections is very tricky to simulate/test well in general.
Please let us know if you find any issues.
@tokaplan this sounds a bit different from the behavior we were seeing with 1.0.0. What we observed was that the informer callback would keep disconnecting and reconnecting; I think it was timing out after the default fetch timeout of 30 secs. This might not sound like a big deal, but at the time we had it running in various apps across a few of our non-prod clusters, during a period when we were doing some heavy load testing, so we had things scaled out with lots of pods.
One of our systems engineers happened to notice in the EKS control plane dashboard, under a section called "Top Talkers", that the number of watch requests was off the charts. We're talking millions per day.
Since 1.2.0 was cut, I've had it running in an ephemeral environment (scaled out to just 3 pods) and have no issues to report based on what I'm seeing thus far. Based on some debug logging I added to our implementation of an informer that watches for changes to k8s secrets, I can see each pod is making a new watch call once every 30 mins, which aligns with the change in #2367, specifically the socket keep-alive setting here: https://github.com/kubernetes-client/javascript/pull/2367/files#diff-8af1000c89b03ced37f439c61c5696c45e1e83a70cc07182feef6595123f0badR60
Actually, the timeout setting added in #2367 was 30000ms (30 secs), with the keepAlive probe set to an initial delay of 30 secs as well, so I'm not sure what is now causing the watch to reconnect every 30 mins. It's much better anyway, though the socket settings could potentially be tweaked to keep connections open longer. Or maybe it's a max idle timeout on the kube API side that is set to 30 mins.
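For reference, the settings described above correspond roughly to the following at the raw socket level (illustrative only, this is not the actual code from #2367):

import { Socket } from 'node:net';

// Illustrative only: what a 30s TCP keep-alive initial delay plus a 30s idle
// timeout look like as plain Node socket calls.
function tuneWatchSocket(socket: Socket): void {
  // Start sending TCP keep-alive probes after 30s of idle time, so a silently
  // dropped connection is eventually detected by the OS.
  socket.setKeepAlive(true, 30_000);
  // Emit 'timeout' if nothing is received for 30s; destroying the socket then
  // surfaces an error to the watch instead of letting it hang.
  socket.setTimeout(30_000, () => {
    socket.destroy(new Error('watch socket idle for 30s'));
  });
}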
fwiw, often it's some sort of timeout on the long-running TCP connection, or a NAT/load balancer somewhere in the middle recycling and terminating the TCP connection. Long-running connections and networking are really straightforward in theory and ridiculously convoluted and bespoke in practice.
@rossanthony my watches were correctly disconnecting at timeout (if no timeout is specified it fairly randomly disconnects around 5 minutes, which I assume is the default on the Kube API side), but the problem was that, very rarely, the call would just hang forever, not receiving any callbacks from the API and not returning. That was completely breaking us, of course. I'm just trying to ensure this can't happen anymore, and unless someone can tell me that, I'm either moving away from watches or introducing a watchdog that keeps track of the time since the last call and crashes the process (so k8s recreates the pod) if it detects the zombie state. I don't see any other way; I can't have this zombie state happen randomly.
@tokaplan interesting, I've not observed this behavior of it hanging like that. Does it eventually time out the connection and log an error? Curious how I would tell if we have the same issue. The particular implementation I've been testing in our non-prod env with 1.2.0 is not listening to events very often, because it's a mechanism for caching auth0 tokens in etcd. They get refreshed once per hour, then all the pods running in that particular namespace listen to update events and set the refreshed token in their memory cache. I can try running a sustained load test on it tomorrow and see if I can reproduce what you're describing.
@rossanthony It never times out, it hangs forever. The way I noticed is that we have a watchdog running in a background loop that logs the seconds elapsed since our most recent watch started, and that number just started climbing. I've seen two pods go into that state at exactly the same time, so there had to be some common cause.
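The watchdog itself is nothing fancy; roughly something like this (the names, interval, and threshold here are illustrative, and the process-exit part is the option I mentioned above, our real one currently just logs):

// Rough sketch of the watchdog idea.
let lastWatchStartedAt = Date.now();

// Call this every time a new watch() is kicked off.
function markWatchStarted(): void {
  lastWatchStartedAt = Date.now();
}

const MAX_WATCH_AGE_MS = 10 * 60_000; // well beyond any sane watch timeout

setInterval(() => {
  const elapsedSeconds = Math.round((Date.now() - lastWatchStartedAt) / 1000);
  console.log(`seconds since most recent watch started: ${elapsedSeconds}`);
  if (elapsedSeconds * 1000 > MAX_WATCH_AGE_MS) {
    // Zombie state: the watch neither completed nor errored in a reasonable time.
    // Crash so Kubernetes recreates the pod.
    console.error('watch appears hung, exiting so the pod gets recreated');
    process.exit(1);
  }
}, 30_000);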
Here's what I'm seeing (1.0.0):
- with 300 pods running (each constantly watching with a 240-second timeout on the watch: one watch ends or fails, the next one immediately begins), I saw 5 pods (out of 300) hang forever within a span of 48 hours.
- if I change the watch timeout from 240 seconds to 5 seconds, I don't see any pods hang; no repro within 48 hours.
This could be why I've not run into this infinite hanging issue, if it's only happening to ~5 pods out of 300 in a 48hr window. I've been testing with 3 pods; we don't really have the capacity in our cluster, or the need, to scale up that much. Even in prod we're running ~10 pods in each region.
Have you tried with 1.2.0?
@rossanthony not yet, I need to wait a few days to establish baselines. I will soon and report back. Keep in mind though that I did see both pods get into the bad state at the same time, and we only had 2 pods there. So it looks like, under the right conditions, the repro rate could be much higher.
@rossanthony so I was able to reproduce it, and here's what happens. The watch() call does not hang; it immediately returns a Promise<AbortController> as designed. The problem is that if I cut the connection while watch() is doing its thing, it will never trigger the done callback. Since my code relies on that callback to tell me when the watch has finished, either successfully or with an error, my code will wait forever. The timeout value is ignored; the watch doesn't do anything once it's exceeded.
This seems like a bug to me. It should complete one way or another, even if only upon reaching the timeout.
The way I'll fix it is by setting up my own timeout with setTimeout and calling abort() on the AbortController. That seems to work: once you do that, the done callback is finally triggered with the abort error.
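For anyone else hitting this, the workaround looks roughly like this (the path, timeout values, and handler bodies are placeholders; it assumes the 1.x API where watch() resolves to an AbortController, as described above):

import { KubeConfig, Watch } from '@kubernetes/client-node';

async function watchWithBackstop(): Promise<void> {
  const kc = new KubeConfig();
  kc.loadFromDefault();
  const watcher = new Watch(kc);

  let backstop: NodeJS.Timeout | undefined;

  const controller = await watcher.watch(
    '/api/v1/namespaces/default/pods',  // placeholder path
    { timeoutSeconds: 240 },            // server-side timeout; ignored if the connection silently dies
    (type, obj) => {
      // handle ADDED / MODIFIED / DELETED events here
    },
    (err) => {
      // done callback: fires on normal completion, on error, or on our abort() below
      clearTimeout(backstop);
    },
  );

  // Client-side backstop: if the done callback hasn't fired well past the
  // server-side timeout, abort the watch ourselves; the done callback is then
  // invoked with an abort error instead of the call hanging forever.
  backstop = setTimeout(() => controller.abort(), 250_000);
}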
In 1.2.0 this does not seem to reproduce; I'm getting either normal watch callback invocation once the timeout is reached (no error, so I guess it still doesn't know something bad happened), or a callback invocation with Error: read ECONNRESET
With 1.2.0 I do see a permanently hanging request though: listClusterCustomObject just hung on me and never returned. The call was:
crsResult = <ListResponse>await k8sApi.listClusterCustomObject({
  group: K8sWatcher.crdApiGroup,
  version: K8sWatcher.crdApiVersion,
  plural: K8sWatcher.crdNamePlural,
  resourceVersion: latestResourceVersion ?? undefined,
  timeoutSeconds: 30
});
I don't know, folks, it really seems like there's something fundamentally wrong with the underlying library that handles requests. I've never encountered this in any of the other technology stacks I've worked with. I now have to surround all my calls with a manual timeout, which is insane.
I'm on AKS if that matters.
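For concreteness, this is roughly the shape of the manual timeout I mean (the withDeadline helper and the 60-second deadline are mine, not part of the client library; note it only stops my code from waiting, it doesn't cancel the underlying request):

// Hypothetical helper: race a client call against a hard deadline so a hung
// socket cannot block the caller forever.
async function withDeadline<T>(call: Promise<T>, ms: number, label: string) {
  let timer: NodeJS.Timeout | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded ${ms}ms deadline`)), ms);
  });
  try {
    return await Promise.race([call, deadline]);
  } finally {
    clearTimeout(timer);
  }
}

// Usage with the same list call as above (inside the same async context):
crsResult = <ListResponse>await withDeadline(
  k8sApi.listClusterCustomObject({
    group: K8sWatcher.crdApiGroup,
    version: K8sWatcher.crdApiVersion,
    plural: K8sWatcher.crdNamePlural,
    resourceVersion: latestResourceVersion ?? undefined,
    timeoutSeconds: 30,
  }),
  60_000,
  'listClusterCustomObject',
);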
@tokaplan this seems a little odd to me because that is just a straight list, not a watch, so it should absolutely complete in less than 30 seconds. Are you getting any data back? Or is it just hanging?
Is there a chance that you have a stale DNS lookup or something?
@brendandburns Yes, it's just a normal list. It hung, no data back after 5 hours, the function never returned.
I'm on a regular AKS cluster, no funny business. I wasn't doing any network manipulation to reproduce anything at the time, no messing around. Whatever caused it, regular Azure customers would see it too.
I come from a .NET background and have limited experience with Node.js, but I've never in my life seen a request hang forever in .NET, that's for sure.
@tokaplan @brendandburns This sounds like something we have been battling with AKS for years. https://github.com/kubernetes-client/javascript/issues/559#issuecomment-778221716
We used a custom http/2 request to "resolve" this problem. By "resolve" I mean we detect when the connection is broken.
That comment is pretty old; here is the latest version we use:
const http2 = require('node:http2');

class WatchRequest {
  webRequest(options) {
    const { ca, headers, method, qs, uri } = options;
    const url = new URL(uri);
    const session = http2.connect(url, { ca });
    let ping = null;
    let sessionError = '';
    session.on('error', err => (sessionError += err));
    session.on('close', () => clearInterval(ping));
    const stream = session.request(
      {
        ...headers,
        ':method': method,
        ':path': `${url.pathname}?${new URLSearchParams(qs)}`,
        accept: 'application/json',
      },
      { endStream: false },
    );
    stream.setEncoding('utf8');
    // Ping the session every 2s; if a ping fails or the stream has closed,
    // destroy the session so a broken connection is detected quickly instead
    // of hanging forever.
    ping = setInterval(() => {
      session.ping(pingError => {
        if (pingError || stream.closed) {
          clearInterval(ping);
          if (!session.destroyed) {
            session.destroy(pingError || new Error(sessionError || 'stream was closed'));
          } else {
            console.log(`session was already destroyed, ${pingError || sessionError || 'stream was closed'}`);
          }
        }
      });
    }, 2000);
    stream.on('error', () => {
      // no-op: this allows the session 'error' event to be emitted instead of throwing an exception
    });
    stream.on('close', () => {
      clearInterval(ping);
      session.close();
    });
    // The Watch implementation calls abort() on the returned request; stub it out.
    stream.abort = () => {};
    return stream;
  }
}
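For anyone wanting to try this: in the 0.x client the custom request was injected via the optional second argument of the Watch constructor, i.e. something along the lines of new Watch(kubeConfig, new WatchRequest()) (from memory, so treat it as a sketch). As noted below, that hook doesn't appear to exist in 1.x.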
@brendandburns We have yet to move to 1.x. Are we still able to inject a custom request class into Watch?
I just did a quick look and I don't see this option anymore. So it seems we will just need to roll our own Watch-type class, similar to what I put together in the past, that leverages http/2 as opposed to node-fetch.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale