Deploy fails unnecessarily on temporary image pull error
Bug report
Due to a bug in Azure AKS around stale tokens, it can happen that pulling an image for a pod only succeeds on the second try. The failed first pull attempt does not count as a container restart, so the pull is retried even on task pods that have a restartPolicy of Never.
The problem is that krane's monitoring of the pod still interprets this as a failed pod and immediately exits with an error, even though the pod goes on to run and execute its task.
The problematic part of the code seems to be at https://github.com/Shopify/krane/blob/master/lib/krane/kubernetes_resource/pod.rb#L225
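For illustration, the kind of check that makes a single transient pull error look fatal is roughly the following. This is a simplified sketch of my own, not krane's actual code; the method and constant names are made up:

```ruby
# Simplified sketch of a check that treats any image pull error as terminal.
# Names and structure are illustrative only, not taken from krane.
FATAL_WAITING_REASONS = %w[ErrImagePull ImagePullBackOff].freeze

# container_state mirrors the shape of a Kubernetes containerStatus.state hash,
# e.g. {"waiting" => {"reason" => "ErrImagePull", "message" => "..."}}
def doomed?(container_state)
  reason = container_state.dig("waiting", "reason")
  # Treating a pull error as fatal aborts the deploy immediately, even though
  # the kubelet retries the pull and the pod can still start successfully.
  FATAL_WAITING_REASONS.include?(reason)
end

puts doomed?({ "waiting" => { "reason" => "ErrImagePull" } }) # => true
```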
Here is anonymised event output for the task pod (kubectl get events):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 93s default-scheduler Successfully assigned xyz/db-migration-1619185424 to aks-nodepool1-10965871-vmss0000a8
Warning Failed 108s kubelet Failed to pull image "xyz.azurecr.io/xyz:master-574": [rpc error: code = NotFound desc = failed to pull and unpack image "xyz.azurecr.io/xyz:master-574": failed to resolve reference "xyz.azurecr.io/xyz:master-574": xyz.azurecr.io/xyz:master-574: not found, rpc error: code = Unknown desc = failed to pull and unpack image "xyz.azurecr.io/xyz:master-574": failed to resolve reference "xyz.azurecr.io/xyz:master-574": failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized]
Warning Failed 108s kubelet Error: ErrImagePull
Normal BackOff 107s kubelet Back-off pulling image "xyz.azurecr.io/xyz:master-574"
Warning Failed 107s kubelet Error: ImagePullBackOff
Normal Pulling 95s (x2 over 108s) kubelet Pulling image "xyz.azurecr.io/xyz:master-574"
Normal Pulled 91s kubelet Successfully pulled image "xyz.azurecr.io/xyz:master-574" in 4.0215664s
Normal Created 89s kubelet Created container db-migration
Normal Started 89s kubelet Started container db-migration
Expected behavior: The monitoring of the pod ignores temporary image pull problems and keeps the deploy running.
Version(s) affected: Newest / all / 2.1.8
I believe we are running into a similar problem with the latest version:
We have istio deployed in our cluster and the istio-proxy sidecar seems to have some trouble with its initial start...
If krane checks at the wrong moment, the deployment is marked as failed, although after 2-3 retries the pod is healthy:
25m Normal Pulled pod/abc-7b588d4779-g9tgj Successfully pulled image "rancher/mirrored-istio-proxyv2:1.8.3" in 717.6511ms
25m Normal Created pod/abc-7b588d4779-g9tgj Created container istio-proxy
25m Normal Started pod/abc-7b588d4779-g9tgj Started container istio-proxy
24m Normal Pulled pod/abc-7b588d4779-g9tgj Container image "hub.domain.net/dev/abc:stagingdeploymenttest-9912299e" already present on machine
25m Warning Unhealthy pod/abc-7b588d4779-g9tgj Readiness probe failed: Get "http://10.42.14.98:15021/healthz/ready": dial tcp 10.42.14.98:15021: connect: connection refused
24m Warning BackOff pod/abc-7b588d4779-g9tgj Back-off restarting failed container
21m Normal Killing pod/abc-7b588d4779-g9tgj Stopping container istio-proxy
21m Normal Killing pod/abc-7b588d4779-g9tgj Stopping container payment
21m Warning Unhealthy pod/abc-7b588d4779-g9tgj Readiness probe failed: HTTP probe failed with statuscode: 503
25m Normal Scheduled pod/abc-7b588d4779-whzx7 Successfully assigned web-store-525-staging/abc-7b588d4779-whzx7 to ranch04
25m Normal Pulling pod/abc-7b588d4779-whzx7 Pulling image "rancher/mirrored-istio-proxyv2:1.8.3"
25m Normal Pulled pod/abc-7b588d4779-whzx7 Successfully pulled image "rancher/mirrored-istio-proxyv2:1.8.3" in 727.980297ms
25m Normal Created pod/abc-7b588d4779-whzx7 Created container istio-init
25m Normal Started pod/abc-7b588d4779-whzx7 Started container istio-init
25m Normal Pulled pod/abc-7b588d4779-whzx7 Container image "hub.domain.net/dev/abc:stagingdeploymenttest-9912299e" already present on machine
25m Normal Created pod/abc-7b588d4779-whzx7 Created container payment
25m Normal Started pod/abc-7b588d4779-whzx7 Started container payment
25m Normal Pulling pod/abc-7b588d4779-whzx7 Pulling image "rancher/mirrored-istio-proxyv2:1.8.3"
25m Normal Pulled pod/abc-7b588d4779-whzx7 Successfully pulled image "rancher/mirrored-istio-proxyv2:1.8.3" in 925.865518ms
25m Normal Created pod/abc-7b588d4779-whzx7 Created container istio-proxy
25m Normal Started pod/abc-7b588d4779-whzx7 Started container istio-proxy
[FATAL][2021-07-02 12:03:27 +0000] The following containers are in a state that is unlikely to be recoverable:
[FATAL][2021-07-02 12:03:27 +0000] > payment: Crashing repeatedly (exit 1). See logs for more information.
[FATAL][2021-07-02 12:03:27 +0000]
[FATAL][2021-07-02 12:03:27 +0000] - Final status: 3 replicas, 1 updatedReplica, 2 availableReplicas, 1 unavailableReplica
[FATAL][2021-07-02 12:03:27 +0000] - Events (common success events excluded):
[FATAL][2021-07-02 12:03:27 +0000] [Deployment/staging-payment] ScalingReplicaSet: Scaled up replica set staging-payment-76bcf4795c to 1 (1 events)
[FATAL][2021-07-02 12:03:27 +0000] [Deployment/staging-payment] ScalingReplicaSet: Scaled down replica set staging-payment-69465f84f to 1 (1 events)
[FATAL][2021-07-02 12:03:27 +0000] [Deployment/staging-payment] ScalingReplicaSet: Scaled up replica set staging-payment-76bcf4795c to 2 (1 events)
[FATAL][2021-07-02 12:03:27 +0000] [Deployment/staging-payment] ScalingReplicaSet: Scaled down replica set staging-payment-69465f84f to 0 (1 events)
[FATAL][2021-07-02 12:03:27 +0000] [Pod/staging-payment-76bcf4795c-zh5hw] Unhealthy: Readiness probe failed: Get "http://10.42.13.96:15021/healthz/ready": dial tcp 10.42.13.96:15021: connect: connection refused (1 events)
[FATAL][2021-07-02 12:03:27 +0000] [Pod/staging-payment-76bcf4795c-zh5hw] BackOff: Back-off restarting failed container (3 events)
[FATAL][2021-07-02 12:03:27 +0000] - Logs from container 'payment':
[FATAL][2021-07-02 12:03:27 +0000] 2021/07/02 14:02:38 instana: ERROR: Cannot announce sensor. Scheduling retry.
[FATAL][2021-07-02 12:03:27 +0000] 2021/07/02 14:02:38 instana: ERROR: Cannot announce sensor. Scheduling retry.
(10.42.13.96:15021 is the istio-proxy health check)
Maybe some kind of retry or timeout can be adjusted? Although I think this might be more of an istio problem, it might be useful to implement something like this in krane too.
regards, strowi
Thanks for the issue @perbernhardt! I'll take a look at this early next week. Hopefully it won't be too difficult of a fix; thanks for doing the background digging already.
So there's no easy way for Krane to get the number of failed image pulls, since we don't observe Event objects in the namespace, only the Status fields of the resources being deployed. However, the ResourceWatcher has a sync delay of 3 seconds, so we can use that as a reliable time measurement. With that, we can count how many syncs observe an image pull failure and only raise an error after a suitable threshold.
@perbernhardt do you have an idea of what such an acceptable time window would be? Would it be sufficient to also check for 401 unauthorized in the error message so that we can still fail fast if the image is indeed a 404?
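Roughly what I have in mind, as a hedged sketch (the class name, the threshold value, and the method names are placeholders, not actual krane code):

```ruby
# Sketch of counting consecutive syncs that observe an image pull failure and
# only failing once a grace window has elapsed. Values and names are placeholders.
SYNC_INTERVAL_SECONDS = 3
IMAGE_PULL_GRACE_SECONDS = 120

class ImagePullFailureTracker
  def initialize
    @failed_syncs = 0
  end

  # Called once per sync with whether an image pull failure is currently observed.
  def record(image_pull_failing)
    @failed_syncs = image_pull_failing ? @failed_syncs + 1 : 0
  end

  # Treat the failure as fatal only once it has persisted past the grace window.
  def fatal?
    @failed_syncs * SYNC_INTERVAL_SECONDS >= IMAGE_PULL_GRACE_SECONDS
  end
end

tracker = ImagePullFailureTracker.new
10.times { tracker.record(true) }
puts tracker.fatal? # => false, ~30s of failures is still inside the grace window
```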
@timothysmith0609 We also got feedback from our cloud provider on this: the underlying cause is that we use a regionally replicated container registry (Azure ACR). Basically we build a new image and push it to the registry before we start deploying with Krane. But we deploy to different regions (e.g. Canada & Australia), and we were told it can take "some minutes" before the image is available in the target region. There is also no way of differentiating between "not there" and "not yet there". So I would still wait for a successful retry in case of any 404, and perhaps we could go with something like two minutes?
By default, I would prefer us to fail fast in the event of an image pull failure. That said, I do sympathize with the situation you have, where waiting on replication lag throws a wrench in your deploy process. I'm wondering how best to implement this. From what you say, it really only matters when you're running raw Pods (as opposed to a Deployment, StatefulSet, etc.)? If that's the case, maybe explicitly setting this condition when we discover resources is sufficient. What do you think?
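Roughly, something like this at discovery time (purely illustrative; the class and attribute names are invented for this sketch, not krane's actual internals):

```ruby
# Illustrative only: tolerate transient image pull errors solely for pods that
# are deployed directly, i.e. have no managing Deployment/StatefulSet/Job.
# Class and attribute names are invented for this sketch.
DiscoveredPod = Struct.new(:name, :owner_kind, keyword_init: true) do
  def unmanaged?
    owner_kind.nil?
  end

  # Only unmanaged task pods would get the relaxed image pull handling.
  def tolerate_image_pull_errors?
    unmanaged?
  end
end

task_pod = DiscoveredPod.new(name: "db-migration-1619185424", owner_kind: nil)
web_pod  = DiscoveredPod.new(name: "abc-7b588d4779-g9tgj", owner_kind: "ReplicaSet")
puts task_pod.tolerate_image_pull_errors? # => true
puts web_pod.tolerate_image_pull_errors?  # => false
```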
Yes, that sounds reasonable. The problem is really only for "raw" pods / tasks.
Ok, I think I know how I'll tackle this. Hope to get something done shortly.
Take a look at https://github.com/Shopify/krane/pull/834 when you have a moment, I think it should solve this issue.
I still feel it's a bit cumbersome to put this check in the code, but the change is minor enough that I'm not too bothered by it. That said, are you certain there's nothing you can do to wait for the image to be replicated across regions before invoking the deploy process?
Thanks for the fix. To be honest, I completely agree that it is cumbersome; quite a lot of code for this odd situation. Right now we mitigate it by simply disabling the replication feature of the registry. This slows down repeated rollouts / scalings of the deployment, but at least it works predictably.
Hi @perbernhardt , sorry for the delay in response, I was quite ill for the last little while. I've been thinking about this PR and have ultimately decided that it's a bit too cumbersome to merge. Ultimately, I'd like to take an opinionated approach one way or the other, without needing a command-line flag to discriminate. As you say, image pull failures in and of themselves don't appear to mark a pod as failed or doomed, so perhaps the right approach is to wait for timeouts on image pull problems. What do you think of that?
Hi @timothysmith0609, no worries, and sorry to hear that you were ill. So what you describe is to simply ignore any image-pull-related problems and rely on the deployment eventually failing with a timeout if the image is really missing, right? I quite like that, because it solves the problem more conceptually. The only thing that gets "worse" is that if you really do specify a non-existent image, it will take until the deployment times out to notice, which might seem quite long. But then again, deploying a manifest that refers to a non-existent image is not something that occurs "from time to time"; it just means you have a broken pipeline (e.g. a typo), which we probably do not need to optimise for 👍
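Just to make sure we mean the same thing, here is a minimal sketch of that behaviour (the names and the timeout value are placeholders, not krane code):

```ruby
# Minimal sketch of the timeout-only approach: image pull problems are never
# treated as immediately fatal, and a genuinely missing image simply surfaces
# as the resource hitting its deploy timeout. Names and values are placeholders.
DEPLOY_TIMEOUT_SECONDS = 300

def deploy_failed?(started_at, waiting_reason)
  # Waiting reasons like "ErrImagePull" or "ImagePullBackOff" are deliberately
  # ignored here; only elapsed time decides failure.
  _ = waiting_reason
  Time.now - started_at >= DEPLOY_TIMEOUT_SECONDS
end

puts deploy_failed?(Time.now - 30, "ErrImagePull")  # => false, keep waiting
puts deploy_failed?(Time.now - 400, "ErrImagePull") # => true, timed out
```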
Closing as won't fix. Feel free to reopen the issue if you come up with a workable implementation