seldon-core icon indicating copy to clipboard operation
seldon-core copied to clipboard

Add progress deadline support for SDeps

Open agrski opened this issue 3 years ago • 15 comments

What this PR does / why we need it: Currently, Seldon Deployments do not support setting progressDeadlineSeconds, which is an optional setting used to control when a deployment is marked as failed.

This field is defined for deployments, not pods or containers, thus users cannot set this manually by overriding component specs in an SDep.

As progressDeadlineSeconds is optional in Kubernetes, with a default of 600 seconds, this change is backwards compatible -- any SDep not specifying this will use the Kubernetes default automatically.

Which issue(s) this PR fixes:

Fixes #4224

Special notes for your reviewer:

agrski avatar Jul 28 '22 17:07 agrski

Testing these changes was a bit of a convoluted process but did confirm the intended behaviour.

For reference, I did the following:

  • Created a fresh kind cluster called ansible (using ansible-k8s-collection)
  • Generated new CRDs and a corresponding controller/operator image
$ make -C operator/ manifests
$ make -C operator/ kind-image-install
  • Ensured the Seldon Core Helm charts were up to date locally
$ make -C operator/helm/ create
  • Released the local Helm chart
$ helm upgrade --install seldon-core ./helm-charts/seldon-core-operator/ -n seldon-system
  • Modified the mean_classifier example to delay for a long time (1000 seconds) in its __init__() call
$ git diff -U1 examples/models/mean_classifier/MeanClassifier.py
diff --git a/examples/models/mean_classifier/MeanClassifier.py b/examples/models/mean_classifier/MeanClassifier.py
index f3bbf2421..f899b0d39 100644
--- a/examples/models/mean_classifier/MeanClassifier.py
+++ b/examples/models/mean_classifier/MeanClassifier.py
@@ -10,3 +10,3 @@ class MeanClassifier(object):
 
-    def __init__(self, intValue=0, delaySecs=0):
+    def __init__(self, intValue=0, delaySecs=1000):
         self.class_names = ["proba"]
@@ -21,2 +21,4 @@ class MeanClassifier(object):
         logging.info("loading model here")
+        time.sleep(self.delay_secs)
+        logging.info("loaded model after delay")
  • Built and loaded this modified model into my kind cluster
$ git diff -U0 examples/models/mean_classifier/Makefile 
diff --git a/examples/models/mean_classifier/Makefile b/examples/models/mean_classifier/Makefile
index 5d620a154..173c36b75 100644
--- a/examples/models/mean_classifier/Makefile
+++ b/examples/models/mean_classifier/Makefile
@@ -17 +17 @@ kind_load: build
-       kind load -v 3 docker-image ${IMAGE_BASE}:${VERSION}
+       kind load -v 3 docker-image ${IMAGE_BASE}:${VERSION} --name=ansible

$ make -C examples/models/mean_classifier/ kind_load
s2i build -E environment . seldonio/seldon-core-s2i-python38-ubi8:1.15.0-dev seldonio/mock_classifier:1.15.0-dev
...
kind load -v 3 docker-image seldonio/mock_classifier:1.15.0-dev --name=ansible
Image: "seldonio/mock_classifier:1.15.0-dev" with ID "sha256:7dbdc1af0116fa15612f5027c677440b024c7cd88ccb09d7ab4ce0d90f348569" not yet present on node "ansible-control-plane", loading...
  • Created an SDep with a 60-second progress deadline
$ git diff -U1 notebooks/resources/model.yaml 
diff --git a/notebooks/resources/model.yaml b/notebooks/resources/model.yaml
index a1e066e8e..d65aae10c 100644
--- a/notebooks/resources/model.yaml
+++ b/notebooks/resources/model.yaml
@@ -10,3 +10,3 @@ spec:
         containers:
-        - image: seldonio/mock_classifier:1.9.1
+        - image: seldonio/mock_classifier:1.15.0-dev
           name: classifier
@@ -19 +19,2 @@ spec:
     replicas: 1
+    progressDeadlineSeconds: 60

$ kubectl apply -f notebooks/resources/model.yaml 
  • Waited a minute and confirmed the SDep and its underlying deployment were marked as failed
$ k -n seldon get deployments.apps seldon-model-example-0-classifier -ojsonpath={..status} | jq
{
  "conditions": [
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "lastUpdateTime": "2022-07-28T17:07:39Z",
      "message": "Deployment does not have minimum availability.",
      "reason": "MinimumReplicasUnavailable",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2022-07-28T17:08:40Z",
      "lastUpdateTime": "2022-07-28T17:08:40Z",
      "message": "ReplicaSet \"seldon-model-example-0-classifier-58bfd5d7b\" has timed out progressing.",
      "reason": "ProgressDeadlineExceeded",
      "status": "False",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 1,
  "replicas": 1,
  "unavailableReplicas": 1,
  "updatedReplicas": 1
}

$ k -n seldon get sdep seldon-model -o jsonpath={..status} | jq
{
  "address": {
    "url": "http://seldon-model-example.seldon.svc.cluster.local:8000/api/v1.0/predictions"
  },
  "conditions": [
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "message": "Deployment does not have minimum availability.",
      "reason": "MinimumReplicasUnavailable",
      "status": "False",
      "type": "DeploymentsReady"
    },
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "reason": "No HPAs defined",
      "status": "True",
      "type": "HpasReady"
    },
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "reason": "No KEDA resources defined",
      "status": "True",
      "type": "KedaReady"
    },
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "reason": "No PDBs defined",
      "status": "True",
      "type": "PdbsReady"
    },
    {
      "lastTransitionTime": "2022-07-28T17:08:40Z",
      "message": "Deployment does not have minimum availability.",
      "reason": "MinimumReplicasUnavailable",
      "status": "False",
      "type": "Ready"
    },
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "reason": "Not all services created",
      "status": "False",
      "type": "ServicesReady"
    }
  ],
  "deploymentStatus": {
    "seldon-model-example-0-classifier": {
      "replicas": 1
    }
  },
  "description": "Deployment is no longer progressing and not available.",
  "replicas": 1,
  "state": "Failed"
}

agrski avatar Jul 29 '22 10:07 agrski

Nice one @agrski ! Will test from branch and review. Seems executor test failed but we need to validate if its just the flaky one.

/text integration

axsaucedo avatar Jul 29 '22 10:07 axsaucedo

Nice one @agrski ! Will test from branch and review. Seems executor test failed but we need to validate if its just the flaky one.

/text integration

From checking GHA, looks like both failures are flaky. Note this is still in draft as I also need to check in some changes to manifests -- just trying to fix my controller-gen setup as it's giving me a weird version and some changes I didn't expect.

agrski avatar Jul 29 '22 10:07 agrski

/test integration

axsaucedo avatar Jul 30 '22 14:07 axsaucedo

/test integration

axsaucedo avatar Aug 02 '22 08:08 axsaucedo

Not sure why the logs / updates from the CI tests are not appearing, but looking at the logs it seems there were a few failed tests, although it does seem like there may be an issue in the CI so I will have a look

image

axsaucedo avatar Aug 02 '22 08:08 axsaucedo

/test integration

axsaucedo avatar Aug 02 '22 09:08 axsaucedo

Force pushed after rebasing onto master

agrski avatar Aug 05 '22 16:08 agrski

/test integration

axsaucedo avatar Aug 08 '22 09:08 axsaucedo

Doesn't look like integration tests want to run -- I'm not seeing anything for this PR in JX

Edit: the UI was lying. I can see the logs for the pod corresponding to this PR are progressing, but they're very slow and full of failed connections to endpoints, so I'm guessing things are a bit slow to spin up in the tests.

agrski avatar Aug 08 '22 09:08 agrski

@agrski there's actually an issue with the UI, more details here https://github.com/SeldonIO/seldon-core/issues/4262

axsaucedo avatar Aug 08 '22 10:08 axsaucedo

Running integration for retest of CI /test integration

axsaucedo avatar Aug 08 '22 11:08 axsaucedo

I've stopped the pipeline manually with jx stop pipeline to explore the webhook issue separate to this PR and look towards unblocking

axsaucedo avatar Aug 08 '22 11:08 axsaucedo

/test integration

axsaucedo avatar Sep 05 '22 10:09 axsaucedo

/test notebooks

axsaucedo avatar Sep 05 '22 10:09 axsaucedo

Failed test is TestNotebooks.test_protocol_examples which seems flaky, rerunning to validate /test notebooks

axsaucedo avatar Sep 07 '22 16:09 axsaucedo

image

/approve

axsaucedo avatar Sep 08 '22 07:09 axsaucedo

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: axsaucedo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

seldondev avatar Sep 08 '22 07:09 seldondev