
Deploy fails unnecessarily on temporary image pull error

Open · perbernhardt opened this issue on Apr 28, 2021 · 11 comments

Bug report

Due to a bug in Azure AKS around stale tokens, it can happen that pulling an image for a pod only succeeds on the second try. The first failed pull attempt does not count as a container restart, so the pull is retried even on task pods that have a restartPolicy of Never. The problem is that krane's monitoring of the pod still treats this as a failed pod and immediately exits with an error, even though the pod will still run and execute its task. The problematic part of the code seems to be at https://github.com/Shopify/krane/blob/master/lib/krane/kubernetes_resource/pod.rb#L225
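
For illustration, here is a minimal sketch of the kind of classification that seems to cause this (hypothetical code, not krane's actual pod.rb; the method name and constant are made up):

```ruby
# Hypothetical sketch, NOT krane's actual code: a container whose image pull
# just failed is immediately classified as doomed, even though the kubelet
# will retry the pull and the pod may still run its task successfully.
IMAGE_PULL_ERRORS = %w[ErrImagePull ImagePullBackOff].freeze

def doomed_containers(pod_status)
  pod_status.fetch("containerStatuses", []).select do |cs|
    reason = cs.dig("state", "waiting", "reason")
    IMAGE_PULL_ERRORS.include?(reason) # a single transient failure counts as fatal
  end
end
```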

Here is anonymised output for the task pod (kubectl get events):

Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  93s                 default-scheduler  Successfully assigned xyz/db-migration-1619185424 to aks-nodepool1-10965871-vmss0000a8
  Warning  Failed     108s                kubelet            Failed to pull image "xyz.azurecr.io/xyz:master-574": [rpc error: code = NotFound desc = failed to pull and unpack image "xyz.azurecr.io/xyz:master-574": failed to resolve reference "xyz.azurecr.io/xyz:master-574": xyz.azurecr.io/xyz:master-574: not found, rpc error: code = Unknown desc = failed to pull and unpack image "xyz.azurecr.io/xyz:master-574": failed to resolve reference "xyz.azurecr.io/xyz:master-574": failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized]
  Warning  Failed     108s                kubelet            Error: ErrImagePull
  Normal   BackOff    107s                kubelet            Back-off pulling image "xyz.azurecr.io/xyz:master-574"
  Warning  Failed     107s                kubelet            Error: ImagePullBackOff
  Normal   Pulling    95s (x2 over 108s)  kubelet            Pulling image "xyz.azurecr.io/xyz:master-574"
  Normal   Pulled     91s                 kubelet            Successfully pulled image "xyz.azurecr.io/xyz:master-574" in 4.0215664s
  Normal   Created    89s                 kubelet            Created container db-migration
  Normal   Started    89s                 kubelet            Started container db-migration

Expected behavior: The monitoring of the pod ignores temporary image pull problems and keeps the deploy running.

Version(s) affected: Newest / all / 2.1.8

perbernhardt avatar Apr 28 '21 18:04 perbernhardt

I believe we are running into a similar problem with the latest version:

We have istio deployed in our cluster, and the istio-proxy sidecar seems to have some trouble with its initial start...

If krane checks at the wrong moment, the deployment is marked as failed, although after 2-3 retries the pod is healthy:

25m         Normal    Pulled                   pod/abc-7b588d4779-g9tgj                       Successfully pulled image "rancher/mirrored-istio-proxyv2:1.8.3" in 717.6511ms
25m         Normal    Created                  pod/abc-7b588d4779-g9tgj                       Created container istio-proxy
25m         Normal    Started                  pod/abc-7b588d4779-g9tgj                       Started container istio-proxy
24m         Normal    Pulled                   pod/abc-7b588d4779-g9tgj                       Container image "hub.domain.net/dev/abc:stagingdeploymenttest-9912299e" already present on machine
25m         Warning   Unhealthy                pod/abc-7b588d4779-g9tgj                       Readiness probe failed: Get "http://10.42.14.98:15021/healthz/ready": dial tcp 10.42.14.98:15021: connect: connection refused
24m         Warning   BackOff                  pod/abc-7b588d4779-g9tgj                       Back-off restarting failed container
21m         Normal    Killing                  pod/abc-7b588d4779-g9tgj                       Stopping container istio-proxy
21m         Normal    Killing                  pod/abc-7b588d4779-g9tgj                       Stopping container payment
21m         Warning   Unhealthy                pod/abc-7b588d4779-g9tgj                       Readiness probe failed: HTTP probe failed with statuscode: 503
25m         Normal    Scheduled                pod/abc-7b588d4779-whzx7                       Successfully assigned web-store-525-staging/abc-7b588d4779-whzx7 to ranch04
25m         Normal    Pulling                  pod/abc-7b588d4779-whzx7                       Pulling image "rancher/mirrored-istio-proxyv2:1.8.3"
25m         Normal    Pulled                   pod/abc-7b588d4779-whzx7                       Successfully pulled image "rancher/mirrored-istio-proxyv2:1.8.3" in 727.980297ms
25m         Normal    Created                  pod/abc-7b588d4779-whzx7                       Created container istio-init
25m         Normal    Started                  pod/abc-7b588d4779-whzx7                       Started container istio-init
25m         Normal    Pulled                   pod/abc-7b588d4779-whzx7                       Container image "hub.domain.net/dev/abc:stagingdeploymenttest-9912299e" already present on machine
25m         Normal    Created                  pod/abc-7b588d4779-whzx7                       Created container payment
25m         Normal    Started                  pod/abc-7b588d4779-whzx7                       Started container payment
25m         Normal    Pulling                  pod/abc-7b588d4779-whzx7                       Pulling image "rancher/mirrored-istio-proxyv2:1.8.3"
25m         Normal    Pulled                   pod/abc-7b588d4779-whzx7                       Successfully pulled image "rancher/mirrored-istio-proxyv2:1.8.3" in 925.865518ms
25m         Normal    Created                  pod/abc-7b588d4779-whzx7                       Created container istio-proxy
25m         Normal    Started                  pod/abc-7b588d4779-whzx7                       Started container istio-proxy
[FATAL][2021-07-02 12:03:27 +0000]  The following containers are in a state that is unlikely to be recoverable:
[FATAL][2021-07-02 12:03:27 +0000]  > payment: Crashing repeatedly (exit 1). See logs for more information.
[FATAL][2021-07-02 12:03:27 +0000]  
[FATAL][2021-07-02 12:03:27 +0000]    - Final status: 3 replicas, 1 updatedReplica, 2 availableReplicas, 1 unavailableReplica
[FATAL][2021-07-02 12:03:27 +0000]    - Events (common success events excluded):
[FATAL][2021-07-02 12:03:27 +0000]        [Deployment/staging-payment]  ScalingReplicaSet: Scaled up replica set staging-payment-76bcf4795c to 1 (1 events)
[FATAL][2021-07-02 12:03:27 +0000]        [Deployment/staging-payment]  ScalingReplicaSet: Scaled down replica set staging-payment-69465f84f to 1 (1 events)
[FATAL][2021-07-02 12:03:27 +0000]        [Deployment/staging-payment]  ScalingReplicaSet: Scaled up replica set staging-payment-76bcf4795c to 2 (1 events)
[FATAL][2021-07-02 12:03:27 +0000]        [Deployment/staging-payment]  ScalingReplicaSet: Scaled down replica set staging-payment-69465f84f to 0 (1 events)
[FATAL][2021-07-02 12:03:27 +0000]        [Pod/staging-payment-76bcf4795c-zh5hw]  Unhealthy: Readiness probe failed: Get "http://10.42.13.96:15021/healthz/ready": dial tcp 10.42.13.96:15021: connect: connection refused (1 events)
[FATAL][2021-07-02 12:03:27 +0000]        [Pod/staging-payment-76bcf4795c-zh5hw]  BackOff: Back-off restarting failed container (3 events)
[FATAL][2021-07-02 12:03:27 +0000]    - Logs from container 'payment':
[FATAL][2021-07-02 12:03:27 +0000]        2021/07/02 14:02:38 instana: ERROR: Cannot announce sensor. Scheduling retry.
[FATAL][2021-07-02 12:03:27 +0000]        2021/07/02 14:02:38 instana: ERROR: Cannot announce sensor. Scheduling retry.

(10.42.13.96:15021 is the istio-proxy health check)

Maybe some kind of retry or timeout could be added? Although I think this is more of an istio problem, it might be useful to implement something like this in krane too.

regards, strowi

strowi avatar Jul 08 '21 09:07 strowi

Thanks for the issue @perbernhardt! I'll take a look at this early next week. Hopefully it won't be too difficult of a fix, thanks for doing the background digging already.

timothysmith0609 avatar Jul 22 '21 17:07 timothysmith0609

So there's no easy way for Krane to get the number of failed image pulls, since we don't observe Event objects in the namespace, just the Status fields of the resources being deployed. However, the ResourceWatcher has a sync delay of 3 seconds, so we can use that to get a reliable time measurement. Using that, we can count how many times we observe an image pull failure and only raise an error after a suitable threshold.
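
Roughly along these lines (a sketch only, not actual krane code; the method name, counter, and threshold are placeholders):

```ruby
# Sketch of the idea above (placeholders, not actual krane code). The
# ResourceWatcher syncs roughly every 3 seconds, so counting how often we
# observe a pull failure approximates how long the failure has persisted.
SYNC_DELAY_SECONDS = 3
IMAGE_PULL_GRACE_SECONDS = 120 # placeholder; see the question below about an acceptable window

def image_pull_doomed?(waiting_reason)
  return false unless %w[ErrImagePull ImagePullBackOff].include?(waiting_reason)

  @pull_failure_observations = (@pull_failure_observations || 0) + 1
  # Only treat the pod as doomed once the failure outlives the grace window.
  @pull_failure_observations * SYNC_DELAY_SECONDS > IMAGE_PULL_GRACE_SECONDS
end
```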

@perbernhardt do you have an idea of what an acceptable time window would be? Would it be sufficient to also check for 401 Unauthorized in the error message, so that we can still fail fast if the image is indeed a 404?

timothysmith0609 avatar Aug 09 '21 19:08 timothysmith0609

@timothysmith0609 So we also got feedback from our cloud provider on this: the underlying cause is that we use a regionally replicated container registry (Azure ACR). Basically, we build a new image and push it to the registry before we start deploying with Krane. But we deploy to different regions (e.g. Canada & Australia), and we were told it can take "some minutes" before the image is available in the target region. There is also no way to differentiate between "not there" and "not yet there". So I would still wait for a successful retry in case of any 404, and perhaps we could go with something like two minutes?

perbernhardt avatar Aug 10 '21 08:08 perbernhardt

By default, I would prefer us to fail fast in the event of an image pull failure. That said, I do sympathize with the situation you have, where waiting on replication lag throws a wrench into your deploy process. I'm wondering how best to implement this. From what you say, it really only matters when you're trying to run raw Pods (as opposed to a Deployment, StatefulSet, etc.)? If that's the case, maybe just explicitly setting this condition when we discover resources is sufficient. What do you think?

timothysmith0609 avatar Aug 10 '21 16:08 timothysmith0609

Yes, that sounds reasonable. The problem is really only for "raw" pods / tasks.

perbernhardt avatar Aug 11 '21 06:08 perbernhardt

Ok, I think I know how I'll tackle this. Hope to get something done shortly.

timothysmith0609 avatar Aug 11 '21 13:08 timothysmith0609

Take a look at https://github.com/Shopify/krane/pull/834 when you have a moment, I think it should solve this issue.

I still feel it's a bit cumbersome to put this check in the code, but the change is minor enough that I'm not too bothered by it. That said, are you certain there's nothing you can do to wait for the image to be replicated across regions before invoking the deploy process?
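
For example, a pre-deploy step along these lines might work (just a sketch; it assumes the pipeline can shell out to docker, that docker manifest inspect is already authenticated against the registry, and that the check resolves to the same regional replica the cluster pulls from):

```ruby
# Hypothetical pre-deploy wait step, outside of krane: poll the registry until
# the tag resolves, then start the deploy. Assumes `docker manifest inspect`
# is available and already logged in to the target registry.
def wait_for_image(image, timeout: 300, interval: 10)
  deadline = Time.now + timeout
  until system("docker", "manifest", "inspect", image, out: File::NULL, err: File::NULL)
    raise "#{image} still not available after #{timeout}s" if Time.now > deadline
    sleep interval
  end
end

wait_for_image("xyz.azurecr.io/xyz:master-574") # example tag from the report above
```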

timothysmith0609 avatar Aug 18 '21 18:08 timothysmith0609

Thanks for the fix. To be honest, I am totally with you that it is cumbersome: quite a lot of code for this weird situation. Right now we mitigate it by simply disabling the replication feature of the registry. This slows down repeated rollouts / scalings of the deployment, but at least it works predictably.

perbernhardt avatar Aug 19 '21 05:08 perbernhardt

Hi @perbernhardt , sorry for the delay in responding; I was quite ill for the last little while. I've been thinking about this PR and have decided that it's a bit too cumbersome to merge. Ultimately, I'd like to take an opinionated approach one way or the other, without needing a command-line flag to discriminate. As you say, image pull failures in and of themselves don't appear to mark a pod as failed or doomed, so perhaps the right approach is to wait for timeouts on image pull problems. What do you think of that?
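
Conceptually (again just a sketch, not actual krane code; names are made up), the classification would shift from "image pull error means doomed" to "image pull error means still waiting", and the deploy's existing timeout would catch genuinely missing images:

```ruby
# Conceptual sketch, not actual krane code: image pull problems are treated as
# transient waiting states rather than fatal ones, so a pod with a temporarily
# unpullable image simply stays "not ready" until the deploy timeout fires.
TRANSIENT_WAITING_REASONS = %w[ContainerCreating ErrImagePull ImagePullBackOff].freeze
FATAL_WAITING_REASONS     = %w[CrashLoopBackOff CreateContainerConfigError].freeze

def container_doomed?(waiting_reason)
  return false if TRANSIENT_WAITING_REASONS.include?(waiting_reason)

  FATAL_WAITING_REASONS.include?(waiting_reason)
end
```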

timothysmith0609 avatar Sep 14 '21 13:09 timothysmith0609

Hi @timothysmith0609 , no worries, and sorry to hear that you were ill. So what you describe is to simply ignore any image-pull-related problems and rely on the deployment eventually timing out if the image really is missing, right? I quite like that, because it solves the problem more conceptually. The only downside is that if you really do specify a non-existent image, it will take until the deployment's timeout to notice, which might feel quite long. But then again, deploying a manifest that refers to a non-existent image is not something that happens regularly; it usually just means a broken pipeline (e.g. a typo), which we probably do not need to optimise for 👍

perbernhardt avatar Sep 15 '21 04:09 perbernhardt

Closing as won't fix. Feel free to reopen the issue if you come up with a workable implementation.

timothysmith0609 avatar Oct 17 '22 13:10 timothysmith0609