False update events from informer after upgrading from 0.22 to 1.2.0 on AKS clusters
Describe the bug
We noticed that after migrating from version 0.22 to 1.2.0 (and also in 1.1.2), we began receiving false update events from the informer event listener. The object hasn't changed in etcd, yet these events appear roughly every 40 minutes. This behavior did not occur in version 0.22; we verified this through our Datadog logs.
To further confirm the issue, we ran the following command:
kubectl get githubmembership juanjosevazquezgil-68512dbc-257d-48f1-a53a-055f94b8ab7f -n firestartr-github --show-managed-fields=true -o yaml --output-watch-events=true --watch
During the same time period, no changes to the object were shown.
We're deploying the same version of our service in an Amazon EKS cluster, where this behavior does not occur.
Additionally, we use a Kind cluster for local development, and we haven’t observed this issue there either.
Client Version
1.2.0
Server Version
1.30.11
Example Code
import * as k8s from '@kubernetes/client-node';
import Debug from 'debug';
import common from 'catalog_common';
import { getConnection } from './ctl';
import { Store } from './store';

const log = Debug('firestartr:operator:reflector');

/**
 * Execute the callbacks for each item in the store
 * @param {string} plural - Kind to observe
 * @param {string} namespace - Namespace to observe
 * @param {Function} onAdd - Function to call when an item is added
 * @param {Function} onChange - Function to call when an item is modified
 * @param {Function} onDelete - Function to call when an item is deleted
 * @param {Function} onRename - Function to call when an item is renamed
 */
export async function observe(
  plural: string,
  namespace: string,
  onAdd: Function,
  onChange: Function,
  onDelete: Function,
  _onRename: Function,
) {
  const store = new Store(plural);
  const { kc } = await getConnection();
  try {
    const k8sApi = kc.makeApiClient(k8s.CustomObjectsApi);
    const apiGroup = 'firestartr.dev';
    const apiVersion = 'v1';
    const apiPaths = `/apis/${apiGroup}/${apiVersion}/namespaces/${namespace}/${plural}`;
    const listFn = () => {
      return k8sApi.listNamespacedCustomObject({
        group: apiGroup,
        version: apiVersion,
        namespace,
        plural,
      });
    };
    const informer = k8s.makeInformer(
      kc,
      apiPaths,
      listFn as k8s.ListPromise<k8s.KubernetesObject>,
    );
    informer.on('add', (obj: any) => {
      store.add(obj);
      if (store.hasDeletionTimestamp(obj)) {
        log(
          `informer (on add): The item ${obj.metadata.name} has a deletion timestamp, marking for deletion`,
        );
        store.markToDelete(obj);
        onDelete(obj);
      } else {
        log(`informer (on add): The item ${obj.metadata.name} has been added`);
        onAdd(obj);
      }
    });
    informer.on('update', (obj: any) => {
      log(
        `informer (on update): The item ${obj.kind}: ${obj.metadata.name} has been updated`,
      );
      if (
        !store.getItem(obj).markedToDelete &&
        store.hasDeletionTimestamp(obj) &&
        (store.hasBeenMarkedToDelete(obj) || store.modified(obj))
      ) {
        log(
          `informer (on update): The item ${obj.kind}: ${obj.metadata.name} has a deletion timestamp, marking for deletion`,
        );
        store.markToDelete(obj);
        onDelete(obj);
      } else if (store.modified(obj)) {
        log(
          `informer (on update): The item ${obj.kind}: ${obj.metadata.name} has been modified`,
        );
        onChange(obj);
      }
    });
    informer.on('delete', (obj: any) => {
      log(
        `informer (on delete): The item ${obj.kind}: ${obj.metadata.name} has been deleted from etcd`,
      );
      store.remove(obj);
    });
    informer.on('error', (err: any) => {
      log('informer (on error): ERROR %O', err);
      // Restart the informer after a short back-off whenever the watch errors out.
      setTimeout(() => {
        void informer.start();
      }, 5000);
    });
    void informer.start();
  } catch (err) {
    console.error(`Observing: ${plural}: ${err}`);
    common.io.writeFunctionLog(
      'observer',
      `Error Observing: ${plural}: ${err}`,
    );
    throw `Observing: ${plural}: ${err}`;
  }
}
Environment (please complete the following information):
- OS: Linux
- Node.js version: 20
- Cloud runtime: Azure AKS v1.30.11
Not sure whether this could be related in any way to this commit; there don't seem to be many changes going on there.
So my guess is that this is happening on re-list when the connection drops and there is a reconnect. After we list, we do test to make sure we haven't already seen the same resourceVersion:
https://github.com/kubernetes-client/javascript/blob/main/src/cache.ts#L303
It would be interesting if you could log the resourceVersion for the updated resource and see if it has in fact changed, or if it is the same.
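For reference, a minimal sketch of that diagnostic (assuming the informer built in the example code above; the resourceVersion log line is the only addition, and the helper name is made up for illustration):

import * as k8s from '@kubernetes/client-node';

// Sketch: log the resourceVersion on every 'update' so replays of an
// already-seen object state can be spotted (same name, same resourceVersion).
function logResourceVersionOnUpdate(
  informer: k8s.Informer<k8s.KubernetesObject>,
) {
  informer.on('update', (obj: k8s.KubernetesObject) => {
    console.log(
      `update: ${obj.metadata?.name} resourceVersion=${obj.metadata?.resourceVersion}`,
    );
  });
}

If the logged resourceVersion repeats across the 40-minute events, the informer is replaying cached states rather than reflecting a real change.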
I logged the resource version; it is triggering a set of events with previous versions of the object.
I will log more details if you need them.
So, from looking at the code I'm really not sure how this could happen unless the entire informer is being recreated somehow. I wonder if the onError handler is being called multiple times which would then create multiple informers.
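One way to rule that out on the application side would be to guard the restart. A minimal sketch (an illustration under that assumption, not library code; the function name is hypothetical):

import * as k8s from '@kubernetes/client-node';

// Sketch: only one restart may be pending at a time, no matter how many
// 'error' events fire while the watch is down.
function attachGuardedRestart(
  informer: k8s.Informer<k8s.KubernetesObject>,
  delayMs = 5000,
) {
  let restartPending = false;
  informer.on('error', (err: unknown) => {
    console.error('informer error:', err);
    if (restartPending) {
      return; // a restart is already scheduled, do not stack another one
    }
    restartPending = true;
    setTimeout(() => {
      restartPending = false;
      void informer.start();
    }, delayMs);
  });
}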
We always call addOrUpdateObject when we see a resource, and that always checks !isSameVersion
https://github.com/kubernetes-client/javascript/blob/main/src/cache.ts#L204
And that code checks the resource version for equality: https://github.com/kubernetes-client/javascript/blob/main/src/cache.ts#L30
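Paraphrasing that check (a sketch of the idea only, not the actual cache.ts code): an update should only be propagated when the incoming object's metadata.resourceVersion differs from the cached copy's.

import * as k8s from '@kubernetes/client-node';

// Sketch of the dedupe idea: an identical resourceVersion means the object
// state has already been seen, so no 'update' should be emitted for it.
function isSameVersionSketch(
  cached: k8s.KubernetesObject | undefined,
  incoming: k8s.KubernetesObject,
): boolean {
  const cachedVersion = cached?.metadata?.resourceVersion;
  const incomingVersion = incoming.metadata?.resourceVersion;
  return cachedVersion !== undefined && cachedVersion === incomingVersion;
}

Which is why the duplicated events are surprising: a replayed object with an unchanged resourceVersion should be filtered out at that point.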
Sorry for intruding; I'm a workmate of @tonirvega.
We don't see this behavior with 0.22.3, which we had been using on AKS for the past few months until we refactored for ESM.
Maybe it has nothing to do with it, but while reviewing the diff of the cache.ts file between 0.22.3 and 1.1.2, I noticed some commits that are in 0.22.3 but that I don't see in 1.1.2 (maybe they no longer apply?).
Especially https://github.com/kubernetes-client/javascript/commit/d12ec2629a61da2e1cbde81113021c4bb38bfb25, which comes from https://github.com/kubernetes-client/javascript/pull/781, but also https://github.com/kubernetes-client/javascript/commit/eb4d246a93f373d51b109bc122120514e5bd6c93, from https://github.com/kubernetes-client/javascript/pull/965.
@alambike thanks for the pointers. I will try cherry-picking those forward into the 1.x series and we can test.
So I think I figured out part of the problem. On relist, we reset the resource version to the resource version in the list:
https://github.com/kubernetes-client/javascript/blob/main/src/cache.ts#L165
But when I tested against a real server, that resource version is always empty, so we probably shouldn't do that. I still don't see how you get duplicate events, but I'm working on improving the tests.
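For illustration (a sketch of the described fix with a hypothetical helper, not the actual cache.ts change): after a relist, the cached resourceVersion would only be replaced when the list response actually carries a non-empty one.

import * as k8s from '@kubernetes/client-node';

// Sketch: keep the last known resourceVersion when the relist response does
// not include one, instead of resetting it to an empty string.
function nextResourceVersion(
  current: string,
  list: k8s.KubernetesListObject<k8s.KubernetesObject>,
): string {
  const listed = list.metadata?.resourceVersion;
  return listed !== undefined && listed !== '' ? listed : current;
}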
I can confirm this is not happening in 0.22.3; there are no duplicated events for the same resourceVersion.
Just to add a note that we're also experiencing this, and from what I can see it does seem that when the informer disconnects and reconnects again, we receive multiple older versions of the resource, which our informer then handles. We did not have this issue prior to upgrading to v1.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
No; in the previous 0.22.x version we did not experience this issue. We still need to validate this again with the latest v1 versions, as it seems that changes have been introduced to the watch feature since version 1.2.0.