Add progress deadline support for SDeps
What this PR does / why we need it:
Currently, Seldon Deployments do not support setting progressDeadlineSeconds, which is an optional setting used to control when a deployment is marked as failed.
This field is defined for deployments, not pods or containers, thus users cannot set this manually by overriding component specs in an SDep.
As progressDeadlineSeconds is optional in Kubernetes, with a default of 600 seconds, this change is backwards compatible -- any SDep not specifying this will use the Kubernetes default automatically.
Which issue(s) this PR fixes:
Fixes #4224
Special notes for your reviewer:
Testing these changes was a bit of a convoluted process but did confirm the intended behaviour.
For reference, I did the following:
- Created a fresh kind cluster called
ansible(using ansible-k8s-collection) - Generated new CRDs and a corresponding controller/operator image
$ make -C operator/ manifests
$ make -C operator/ kind-image-install
- Ensured the Seldon Core Helm charts were up to date locally
$ make -C operator/helm/ create
- Released the local Helm chart
$ helm upgrade --install seldon-core ./helm-charts/seldon-core-operator/ -n seldon-system
- Modified the
mean_classifierexample to delay for a long time (1000 seconds) in its__init__()call
$ git diff -U1 examples/models/mean_classifier/MeanClassifier.py
diff --git a/examples/models/mean_classifier/MeanClassifier.py b/examples/models/mean_classifier/MeanClassifier.py
index f3bbf2421..f899b0d39 100644
--- a/examples/models/mean_classifier/MeanClassifier.py
+++ b/examples/models/mean_classifier/MeanClassifier.py
@@ -10,3 +10,3 @@ class MeanClassifier(object):
- def __init__(self, intValue=0, delaySecs=0):
+ def __init__(self, intValue=0, delaySecs=1000):
self.class_names = ["proba"]
@@ -21,2 +21,4 @@ class MeanClassifier(object):
logging.info("loading model here")
+ time.sleep(self.delay_secs)
+ logging.info("loaded model after delay")
- Built and loaded this modified model into my kind cluster
$ git diff -U0 examples/models/mean_classifier/Makefile
diff --git a/examples/models/mean_classifier/Makefile b/examples/models/mean_classifier/Makefile
index 5d620a154..173c36b75 100644
--- a/examples/models/mean_classifier/Makefile
+++ b/examples/models/mean_classifier/Makefile
@@ -17 +17 @@ kind_load: build
- kind load -v 3 docker-image ${IMAGE_BASE}:${VERSION}
+ kind load -v 3 docker-image ${IMAGE_BASE}:${VERSION} --name=ansible
$ make -C examples/models/mean_classifier/ kind_load
s2i build -E environment . seldonio/seldon-core-s2i-python38-ubi8:1.15.0-dev seldonio/mock_classifier:1.15.0-dev
...
kind load -v 3 docker-image seldonio/mock_classifier:1.15.0-dev --name=ansible
Image: "seldonio/mock_classifier:1.15.0-dev" with ID "sha256:7dbdc1af0116fa15612f5027c677440b024c7cd88ccb09d7ab4ce0d90f348569" not yet present on node "ansible-control-plane", loading...
- Created an SDep with a 60-second progress deadline
$ git diff -U1 notebooks/resources/model.yaml
diff --git a/notebooks/resources/model.yaml b/notebooks/resources/model.yaml
index a1e066e8e..d65aae10c 100644
--- a/notebooks/resources/model.yaml
+++ b/notebooks/resources/model.yaml
@@ -10,3 +10,3 @@ spec:
containers:
- - image: seldonio/mock_classifier:1.9.1
+ - image: seldonio/mock_classifier:1.15.0-dev
name: classifier
@@ -19 +19,2 @@ spec:
replicas: 1
+ progressDeadlineSeconds: 60
$ kubectl apply -f notebooks/resources/model.yaml
- Waited a minute and confirmed the SDep and its underlying deployment were marked as failed
$ k -n seldon get deployments.apps seldon-model-example-0-classifier -ojsonpath={..status} | jq
{
"conditions": [
{
"lastTransitionTime": "2022-07-28T17:07:39Z",
"lastUpdateTime": "2022-07-28T17:07:39Z",
"message": "Deployment does not have minimum availability.",
"reason": "MinimumReplicasUnavailable",
"status": "False",
"type": "Available"
},
{
"lastTransitionTime": "2022-07-28T17:08:40Z",
"lastUpdateTime": "2022-07-28T17:08:40Z",
"message": "ReplicaSet \"seldon-model-example-0-classifier-58bfd5d7b\" has timed out progressing.",
"reason": "ProgressDeadlineExceeded",
"status": "False",
"type": "Progressing"
}
],
"observedGeneration": 1,
"replicas": 1,
"unavailableReplicas": 1,
"updatedReplicas": 1
}
$ k -n seldon get sdep seldon-model -o jsonpath={..status} | jq
{
"address": {
"url": "http://seldon-model-example.seldon.svc.cluster.local:8000/api/v1.0/predictions"
},
"conditions": [
{
"lastTransitionTime": "2022-07-28T17:07:39Z",
"message": "Deployment does not have minimum availability.",
"reason": "MinimumReplicasUnavailable",
"status": "False",
"type": "DeploymentsReady"
},
{
"lastTransitionTime": "2022-07-28T17:07:39Z",
"reason": "No HPAs defined",
"status": "True",
"type": "HpasReady"
},
{
"lastTransitionTime": "2022-07-28T17:07:39Z",
"reason": "No KEDA resources defined",
"status": "True",
"type": "KedaReady"
},
{
"lastTransitionTime": "2022-07-28T17:07:39Z",
"reason": "No PDBs defined",
"status": "True",
"type": "PdbsReady"
},
{
"lastTransitionTime": "2022-07-28T17:08:40Z",
"message": "Deployment does not have minimum availability.",
"reason": "MinimumReplicasUnavailable",
"status": "False",
"type": "Ready"
},
{
"lastTransitionTime": "2022-07-28T17:07:39Z",
"reason": "Not all services created",
"status": "False",
"type": "ServicesReady"
}
],
"deploymentStatus": {
"seldon-model-example-0-classifier": {
"replicas": 1
}
},
"description": "Deployment is no longer progressing and not available.",
"replicas": 1,
"state": "Failed"
}
Nice one @agrski ! Will test from branch and review. Seems executor test failed but we need to validate if its just the flaky one.
/text integration
Nice one @agrski ! Will test from branch and review. Seems executor test failed but we need to validate if its just the flaky one.
/text integration
From checking GHA, looks like both failures are flaky. Note this is still in draft as I also need to check in some changes to manifests -- just trying to fix my controller-gen setup as it's giving me a weird version and some changes I didn't expect.
/test integration
/test integration
Not sure why the logs / updates from the CI tests are not appearing, but looking at the logs it seems there were a few failed tests, although it does seem like there may be an issue in the CI so I will have a look

/test integration
Force pushed after rebasing onto master
/test integration
Doesn't look like integration tests want to run -- I'm not seeing anything for this PR in JX
Edit: the UI was lying. I can see the logs for the pod corresponding to this PR are progressing, but they're very slow and full of failed connections to endpoints, so I'm guessing things are a bit slow to spin up in the tests.
@agrski there's actually an issue with the UI, more details here https://github.com/SeldonIO/seldon-core/issues/4262
Running integration for retest of CI /test integration
I've stopped the pipeline manually with jx stop pipeline to explore the webhook issue separate to this PR and look towards unblocking
/test integration
/test notebooks
Failed test is TestNotebooks.test_protocol_examples which seems flaky, rerunning to validate
/test notebooks

/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: axsaucedo
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [axsaucedo]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment