
False update events from informer after upgrading from 0.22 to 1.2.0 on AKS clusters

tonirvega opened this issue 8 months ago • 9 comments

Describe the bug

We noticed that after migrating from version 0.22 to 1.2.0 (and also in 1.1.2), we began receiving false update events from the informer event listener. The object has not changed in etcd, yet these events occur approximately every 40 minutes. This behavior did not occur in version 0.22; we verified this through our logs in Datadog.

[Screenshots: Datadog logs showing the spurious informer update events]

To further confirm the issue, we ran the following command:

kubectl get githubmembership juanjosevazquezgil-68512dbc-257d-48f1-a53a-055f94b8ab7f -n firestartr-github --show-managed-fields=true -o yaml --output-watch-events=true --watch

During the same time period, no changes to the object were shown.

We're deploying the same version of our service in an Amazon EKS cluster, where this behavior does not occur.

Additionally, we use a Kind cluster for local development, and we haven’t observed this issue there either.

Client Version 1.2.0

Server Version 1.30.11

Example Code

import { getConnection } from './ctl';
import Debug from 'debug';
import common from 'catalog_common';
const log = Debug('firestartr:operator:reflector');
import * as k8s from '@kubernetes/client-node';
import { Store } from './store';

/**
 * Observe a namespaced custom resource and invoke the callbacks as events arrive
 * @param {string} plural - Plural name of the resource to observe
 * @param {string} namespace - Namespace to observe
 * @param {Function} onAdd - Function to call when an item is added
 * @param {Function} onChange - Function to call when an item is modified
 * @param {Function} onDelete - Function to call when an item is deleted
 * @param {Function} onRename - Function to call when an item is renamed (currently unused)
 */
export async function observe(
  plural: string,
  namespace: string,
  onAdd: Function,
  onChange: Function,
  onDelete: Function,
  _onRename: Function,
) {
  const store = new Store(plural);

  const { kc } = await getConnection();

  try {
    const k8sApi = kc.makeApiClient(k8s.CustomObjectsApi);

    const apiGroup = 'firestartr.dev';
    const apiVersion = 'v1';
    const apiPaths = `/apis/${apiGroup}/${apiVersion}/namespaces/${namespace}/${plural}`;

    const listFn = () => {
      return k8sApi.listNamespacedCustomObject({
        group: apiGroup,
        version: apiVersion,
        namespace,
        plural,
      });
    };

    const informer = k8s.makeInformer(
      kc,
      apiPaths,
      listFn as k8s.ListPromise<k8s.KubernetesObject>,
    );

    informer.on('add', (obj: any) => {
      store.add(obj);

      if (store.hasDeletionTimestamp(obj)) {
        log(
          `informer (on add): The item ${obj.metadata.name} has a deletion timestamp, marking for deletion`,
        );
        store.markToDelete(obj);
        onDelete(obj);
      } else {
        log(`informer (on add): The item ${obj.metadata.name} has been added`);
        onAdd(obj);
      }
    });

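    // An update may actually be a deletion (deletion timestamp set); onChange
    // only fires when the store considers the object modified.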
    informer.on('update', (obj: any) => {
      log(
        `informer (on update): The item ${obj.kind}: ${obj.metadata.name} has been updated`,
      );

      if (
        !store.getItem(obj).markedToDelete &&
        store.hasDeletionTimestamp(obj) &&
        (store.hasBeenMarkedToDelete(obj) || store.modified(obj))
      ) {
        log(
          `informer (on update): The item ${obj.kind}: ${obj.metadata.name} has a deletion timestamp, marking for deletion`,
        );
        store.markToDelete(obj);
        onDelete(obj);
      } else if (store.modified(obj)) {
        log(
          `informer (on update): The item ${obj.kind}: ${obj.metadata.name} has been modified`,
        );
        onChange(obj);
      }
    });

    informer.on('delete', (obj: any) => {
      log(
        `informer (on delete): The item ${obj.kind}: ${obj.metadata.name} has been deleted from etcd`,
      );
      store.remove(obj);
    });

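    // On any watch error, restart the informer after a 5 second delay.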
    informer.on('error', (err: any) => {
      log('informer (on error): ERROR %O', err);
      setTimeout(() => {
        void informer.start();
      }, 5000);
    });

    void informer.start();
  } catch (err) {
    console.error(`Observing: ${plural}: ${err}`);

    common.io.writeFunctionLog(
      'observer',
      `Error Observing: ${plural}: ${err}`,
    );

    throw new Error(`Observing: ${plural}: ${err}`);
  }
}

Environment (please complete the following information):

  • OS: Linux
  • Node.js version: 20
  • Cloud runtime: Azure AKS v1.30.11

tonirvega avatar May 07 '25 16:05 tonirvega

Not sure if it could be related in any way to this commit. There don't seem to be many changes going on there.

jalvarezit avatar May 07 '25 16:05 jalvarezit

So my guess is that this is happening on re-list when the connection drops and there is a reconnect. After we list, we do test to make sure we haven't already seen the same resourceVersion:

https://github.com/kubernetes-client/javascript/blob/main/src/cache.ts#L303

It would be interesting if you could log the resourceVersion for the updated resource and see if it has in fact changed, or if it is the same.
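For reference, in the example code above that could look something like this (a sketch, reusing the informer and log from the snippet):

informer.on('update', (obj: any) => {
  // Log resourceVersion alongside the name so re-delivered versions
  // are visible in the logs.
  log(
    `informer (on update): ${obj.metadata.name} resourceVersion=${obj.metadata.resourceVersion}`,
  );
  // ...existing update handling...
});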

brendandburns avatar May 07 '25 16:05 brendandburns

So my guess is that this is happening on re-list when the connection drops and there is a reconnect. After we list, we do test to make sure we haven't already seen the same resourceVersion:

https://github.com/kubernetes-client/javascript/blob/main/src/cache.ts#L303

It would be interesting if you could log the resourceVersion for the updated resource and see if it has in fact changed, or if it is the same.

I logged the resourceVersion; it is triggering a set of events with previous versions of the object.

[Screenshot: logged resourceVersions showing events re-delivered with previous versions of the object]

I will log more details if you need them.

tonirvega avatar May 07 '25 18:05 tonirvega

So, from looking at the code I'm really not sure how this could happen unless the entire informer is being recreated somehow. I wonder if the onError handler is being called multiple times, which would then create multiple informers.
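If that turns out to be the case, one consumer-side guard is to make the restart single-flight, so overlapping error events cannot schedule multiple starts (a sketch against the example code above, not a library fix):

let restartScheduled = false;
informer.on('error', (err: any) => {
  log('informer (on error): ERROR %O', err);
  if (restartScheduled) return; // don't stack multiple pending restarts
  restartScheduled = true;
  setTimeout(() => {
    restartScheduled = false;
    void informer.start();
  }, 5000);
});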

We always call addOrUpdateObject when we see a resource, and that always checks !isSameVersion

https://github.com/kubernetes-client/javascript/blob/main/src/cache.ts#L204

And that code checks the resource version for equality: https://github.com/kubernetes-client/javascript/blob/main/src/cache.ts#L30
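Roughly, that equality check amounts to the following (a paraphrase for illustration, not the library's verbatim source):

function isSameVersion(
  lhs: k8s.KubernetesObject,
  rhs: k8s.KubernetesObject,
): boolean {
  // resourceVersion is an opaque string; only equality is meaningful.
  return (
    lhs.metadata?.resourceVersion !== undefined &&
    lhs.metadata?.resourceVersion === rhs.metadata?.resourceVersion
  );
}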

brendandburns avatar May 08 '25 15:05 brendandburns

Sorry for intruding; I'm a workmate of @tonirvega.

We didn't see this behavior with 0.22.3, which we had been using on AKS for the past few months until we refactored for ESM.

Maybe it has nothing to do with it, but reviewing the diff of the cache.ts file between 1.1.2 and 0.22.3, I noticed some commits that were in 0.22.3 and that I don't see in 1.1.2 (maybe they no longer apply?).

In particular https://github.com/kubernetes-client/javascript/commit/d12ec2629a61da2e1cbde81113021c4bb38bfb25, which comes from https://github.com/kubernetes-client/javascript/pull/781, but also https://github.com/kubernetes-client/javascript/commit/eb4d246a93f373d51b109bc122120514e5bd6c93, from https://github.com/kubernetes-client/javascript/pull/965.

alambike avatar May 08 '25 18:05 alambike

@alambike thanks for the pointers. I will try cherry-picking those forward into the 1.x series and we can test.

brendandburns avatar May 09 '25 14:05 brendandburns

So I think I figured out part of the problem. On relist, we reset the resource version to the resource version in the list:

https://github.com/kubernetes-client/javascript/blob/main/src/cache.ts#L165

But when I tested against a real server, that resource version is always empty, so we probably shouldn't do that. I still don't see how you get duplicate events, but I'm working on improving the tests.
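The kind of guard being described would look roughly like this (a sketch only, with hypothetical names; not the actual patch):

// Sketch: only adopt the list's resourceVersion when the server returned one;
// otherwise keep the version last observed from the watch ("list" is hypothetical here).
const listVersion = list.metadata?.resourceVersion;
if (listVersion !== undefined && listVersion !== '') {
  this.resourceVersion = listVersion;
}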

brendandburns avatar May 09 '25 16:05 brendandburns

I can confirm this is not happening in 0.22.3: no duplicated events for the same resourceVersion.

[Screenshot: 0.22.3 logs showing no duplicated events for the same resourceVersion]

tonirvega avatar May 26 '25 07:05 tonirvega

Just to add a note that we're also experiencing this. From what I can see, it does seem that when the informer disconnects and reconnects, we receive multiple older versions of the resource, which our informer then handles. We did not have this issue prior to upgrading to v1.
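A consumer-side mitigation is to skip events whose resourceVersion matches the last one seen for that object (a sketch, not part of the library; resourceVersion is opaque, so only equality checks are valid):

const lastSeen = new Map<string, string>();

function isRepeat(obj: k8s.KubernetesObject): boolean {
  const key = `${obj.metadata?.namespace}/${obj.metadata?.name}`;
  const rv = obj.metadata?.resourceVersion ?? '';
  if (lastSeen.get(key) === rv) {
    return true; // same version delivered again; skip it
  }
  lastSeen.set(key, rv);
  return false;
}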

AlexC avatar Jun 13 '25 07:06 AlexC

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 11 '25 08:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 11 '25 09:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 10 '25 10:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 10 '25 10:11 k8s-ci-robot

Just to add a note that we're also experiencing this. From what I can see, it does seem that when the informer disconnects and reconnects, we receive multiple older versions of the resource, which our informer then handles. We did not have this issue prior to upgrading to v1.

No, in the previous 0.22.x versions we did not experience this issue. We still need to validate this again with the latest v1 releases, as changes seem to have been introduced to the watch feature since version 1.2.0.

alambike avatar Nov 10 '25 11:11 alambike