Service endpoints are not updated / removed after upgrade to Kubernetes 1.28
What version of Knative?
0.15.2
Expected Behavior
Endpoints should be updated properly on scale-down and pod deletion.
Actual Behavior
Endpoints for a Service are not being updated on scale-down operations or pod deletes. This leaves a large number of stale addresses in the Endpoints object, and the stale state propagates to the public service as well.
% kubectl -n detection get endpoints my-app-00112-private
NAME                   ENDPOINTS                                                               AGE
my-app-00112-private   10.32.101.40:9091,10.32.101.41:9091,10.32.101.43:9091 + 5997 more...   136m
% kubectl -n detection get deploy my-app-00112-deployment
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
my-app-00112-deployment   2/2     2            2           136m
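For anyone trying to confirm the same mismatch, comparing the address count in the private Endpoints with the number of running pods for the revision makes it obvious. This is just a sketch assuming the standard serving.knative.dev/revision pod label; substitute your own namespace and revision name:

# addresses still recorded in the (stale) private Endpoints
% kubectl -n detection get endpoints my-app-00112-private -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}' | wc -l
# pods actually running for the revision
% kubectl -n detection get pods -l serving.knative.dev/revision=my-app-00112 --field-selector=status.phase=Running -o name | wc -l

Here the first command reports roughly 6000 addresses while the deployment has only 2 ready pods.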
I was able to get logs like this from SKS:
{
  jsonPayload: {
    apiVersion: "v1"
    eventTime: null
    involvedObject: {
      apiVersion: "networking.internal.knative.dev/v1alpha1"
      kind: "ServerlessService"
      name: "my-app-00112"
      namespace: "detection"
      resourceVersion: "6779758389"
      uid: "f6ed0598-0171-43ff-bf7a-c45069fdcbe2"
    }
    kind: "Event"
    lastTimestamp: "2024-09-14T15:38:13Z"
    message: "SKS: my-app-00112 does not own Service: my-app-00112-private"
    metadata: {
      creationTimestamp: "2024-09-14T15:38:13Z"
      managedFields: [1]
      name: "my-app-00112.17f5266fbfda92c2"
      namespace: "detection"
      resourceVersion: "3317050884"
      uid: "20dcc671-4abb-490c-aff8-7404dfdf8063"
    }
    reason: "InternalError"
    reportingComponent: "serverlessservice-controller"
    reportingInstance: ""
    source: {
      component: "serverlessservice-controller"
    }
    type: "Warning"
  }
  logName: "projects/my-project-92384924/logs/events"
  receiveTimestamp: "2024-09-14T15:38:13.778779952Z"
  resource: {
    labels: {
      cluster_name: "my-cluster-192132"
      location: "us-central1-c"
      project_id: "my-project-92384924"
    }
    type: "k8s_cluster"
  }
  severity: "WARNING"
  timestamp: "2024-09-14T15:38:13Z"
}
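The "does not own Service" warning suggests the SKS reconciler's ownership check on the private Service is failing. As a hedged diagnostic (not a fix), the ownerReferences on the private Service can be inspected directly:

# for a healthy revision this should print ServerlessService/my-app-00112
% kubectl -n detection get svc my-app-00112-private -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"\n"}{end}'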
Steps to Reproduce the Problem
This happens with all of our ksvcs that scale up and then down, or that have pods removed (via delete/evict).
I'm pretty sure this is an upstream bug, and have opened this: https://github.com/kubernetes/kubernetes/issues/127370
In the SKS update process, it is the private service's Endpoints that feed SKS. Is there any plan to read from EndpointSlices (stable since Kubernetes 1.21) and move away from the legacy Endpoints API? From the docs:
The EndpointSlice API is the recommended replacement for Endpoints.
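For comparison, here is a rough sketch of what consuming EndpointSlices looks like from the CLI. The kubernetes.io/service-name label is the standard association between a slice and its Service; names are taken from the example above:

# slices backing the private service (sharded, max 100 endpoints each by default)
% kubectl -n detection get endpointslices -l kubernetes.io/service-name=my-app-00112-private
# list the first address of every endpoint across all slices
% kubectl -n detection get endpointslices -l kubernetes.io/service-name=my-app-00112-private -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\n"}{end}'

A controller consuming slices would have to watch and merge all slices carrying that label rather than a single object, which is the main change compared to the legacy Endpoints path.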
Yep, this seems like the upstream issue, so there's not much we can do here. For EndpointSlices, check the discussion here.
Upstream fix: https://github.com/kubernetes/kubernetes/pull/127417
We've just been affected by this in our environment on Knative 1.16 in Google Cloud. For reference, for anyone experiencing this in GKE: although the current stable channel version is 1.30.5, it is 1.30.6 and above that contain the fix.
(and can confirm that once the fix is in, the endpoints behave normally again)
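To confirm whether a GKE cluster already has the fix, check the control plane version (the endpoints controller runs in kube-controller-manager on the control plane, so the master version is the one that matters). Cluster name and zone below are taken from the log entry above; substitute your own:

% gcloud container clusters describe my-cluster-192132 --zone us-central1-c --format='value(currentMasterVersion)'

Per the comment above, 1.30.6 and later include the fix.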
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
Closing this out, since the upstream fix has been released.