Service endpoints are not updated / removed after upgrade to Kubernetes 1.28
What version of Knative?
0.15.2
Expected Behavior
Endpoints should be updated properly on scale-down and pod deletion.
Actual Behavior
Endpoints for a Service are not being updated on scale-down operations or pod deletes. This leaves a large number of stale addresses in the Endpoints object, and the stale state propagates to the public service as well.
% kubectl -n detection get endpoints my-app-00112-private
NAME                   ENDPOINTS                                                               AGE
my-app-00112-private   10.32.101.40:9091,10.32.101.41:9091,10.32.101.43:9091 + 5997 more...   136m
% kubectl -n detection get deploy my-app-00112-deployment
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
my-app-00112-deployment   2/2     2            2           136m
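For anyone trying to confirm the same mismatch, comparing the address count in the private Endpoints with the number of running pods for the revision makes it obvious. This is just a sketch assuming the standard serving.knative.dev/revision pod label; substitute your own namespace and revision name:

# addresses still recorded in the (stale) private Endpoints
% kubectl -n detection get endpoints my-app-00112-private -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}' | wc -l
# pods actually running for the revision
% kubectl -n detection get pods -l serving.knative.dev/revision=my-app-00112 --field-selector=status.phase=Running -o name | wc -l

Here the first command reports roughly 6000 addresses while the deployment has only 2 ready pods.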
I was able to get logs like this from SKS:
{
  jsonPayload: {
    apiVersion: "v1"
    eventTime: null
    involvedObject: {
      apiVersion: "networking.internal.knative.dev/v1alpha1"
      kind: "ServerlessService"
      name: "my-app-00112"
      namespace: "detection"
      resourceVersion: "6779758389"
      uid: "f6ed0598-0171-43ff-bf7a-c45069fdcbe2"
    }
    kind: "Event"
    lastTimestamp: "2024-09-14T15:38:13Z"
    message: "SKS: my-app-00112 does not own Service: my-app-00112-private"
    metadata: {
      creationTimestamp: "2024-09-14T15:38:13Z"
      managedFields: [1]
      name: "my-app-00112.17f5266fbfda92c2"
      namespace: "detection"
      resourceVersion: "3317050884"
      uid: "20dcc671-4abb-490c-aff8-7404dfdf8063"
    }
    reason: "InternalError"
    reportingComponent: "serverlessservice-controller"
    reportingInstance: ""
    source: {
      component: "serverlessservice-controller"
    }
    type: "Warning"
  }
  logName: "projects/my-project-92384924/logs/events"
  receiveTimestamp: "2024-09-14T15:38:13.778779952Z"
  resource: {
    labels: {
      cluster_name: "my-cluster-192132"
      location: "us-central1-c"
      project_id: "my-project-92384924"
    }
    type: "k8s_cluster"
  }
  severity: "WARNING"
  timestamp: "2024-09-14T15:38:13Z"
}
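The "does not own Service" warning suggests the SKS reconciler's ownership check on the private Service is failing. As a hedged diagnostic (not a fix), the ownerReferences on the private Service can be inspected directly:

# for a healthy revision this should print ServerlessService/my-app-00112
% kubectl -n detection get svc my-app-00112-private -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"\n"}{end}'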
Steps to Reproduce the Problem
This happens with all of our ksvcs that scale up and then down, or that have pods removed (via delete/evict).
I'm pretty sure this is an upstream bug, and have opened this: https://github.com/kubernetes/kubernetes/issues/127370
In the SKS update process, it is the private service's Endpoints that feed SKS. Is there any plan to read from EndpointSlices (stable since Kubernetes 1.21) and move away from the legacy Endpoints API? From the docs:
The EndpointSlice API is the recommended replacement for Endpoints.
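For comparison, here is a rough sketch of what consuming EndpointSlices looks like from the CLI. The kubernetes.io/service-name label is the standard association between a slice and its Service; names are taken from the example above:

# slices backing the private service (sharded, max 100 endpoints each by default)
% kubectl -n detection get endpointslices -l kubernetes.io/service-name=my-app-00112-private
# list the first address of every endpoint across all slices
% kubectl -n detection get endpointslices -l kubernetes.io/service-name=my-app-00112-private -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\n"}{end}'

A controller consuming slices would have to watch and merge all slices carrying that label rather than a single object, which is the main change compared to the legacy Endpoints path.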
Yep, this seems like the upstream issue, so there's not much we can do here. For EndpointSlices, check the discussion here.
Upstream fix: https://github.com/kubernetes/kubernetes/pull/127417
We've just been affected by this in our environment on Knative 1.16 in Google Cloud. For reference, for anyone experiencing this in GKE: although the current stable channel version is 1.30.5, it is 1.30.6 and above that contain the fix.
(and can confirm that once the fix is in, the endpoints behave normally again)
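To confirm whether a GKE cluster already has the fix, check the control plane version (the endpoints controller runs in kube-controller-manager on the control plane, so the master version is the one that matters). Cluster name and zone below are taken from the log entry above; substitute your own:

% gcloud container clusters describe my-cluster-192132 --zone us-central1-c --format='value(currentMasterVersion)'

Per the comment above, 1.30.6 and later include the fix.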
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
Closing this out, since the upstream fix has been released.