datree icon indicating copy to clipboard operation
datree copied to clipboard

Argo workflow retrypolicy should not retry when error code = 1

Open nielstenboom opened this issue 3 years ago • 7 comments

Describe the bug There's a defaultrule that ensures that argo workflows only catches relevant errors (code).

It specifically says to set the following expression in the retryPolicy of Argo Workflows:

lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) not in [0])

But with this expression, it means that application exceptions resulting in a error code of 1 will always retry, and the retries will fail with the same error code.

Should the expression not be the following?:

lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) not in [0,1])

Cheers!

nielstenboom avatar Jul 15 '22 08:07 nielstenboom

Hey @nielstenboom! Nice to e-meet you. What an interesting question.

This rule aims to ensure we don't miss any important errors and always retry if something has gone wrong, i.e exit code not 0. Check out our docs, we explained it in more detail. I'm a little bit confused with your suggestion, doesn't include not in [0,1] will cause retry to only when exist code is NOT 0 nor 1? I'm afraid this expression will miss failures.

Help me out here. Thank :)

noaabarki avatar Jul 17 '22 13:07 noaabarki

Hi @noaabarki, thank you for your swift response, nice to e-meet you as well!

Let me attempt to explain what I mean with an example.

The docs on this mention the following:

"retryPolicy=Always is too much. The user only wants to retry on system-level errors (eg, the node dying or being preempted), but not on errors occurring in user-level code since these failures indicate a bug. In addition, this option is more suitable for long-running containers than workflows which are jobs."

The docs mention that you do no want to retry on errors occurring in user-level code. So say you have the following workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: scripts-python-
spec:
  entrypoint: python-script-example
  templates:
  - name: python-script-example
    steps:
    - - name: generate
        template: exception

  - name: exception
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        raise Exception("general exception")

This workflow will now be retried with the expression recommended by Datree because the exit code = 1, but these retries will all result in the same error. This should be the same with more complex user-code steps as well. Therefore I wondered if you should retry steps that failed with error code = 1?

nielstenboom avatar Jul 18 '22 08:07 nielstenboom

Another example to illustrate my point:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: scripts-python-
spec:
  entrypoint: python-script-example
  templates:
  - name: python-script-example
    steps:
    - - name: generate
        template: out-of-memory

  - name: out-of-memory
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        x = bytearray(1024*1024*1000*1000)
        print("if it reaches this print, it worked")

This script will simulate an OOM error, I've tested it and this actually fails with a 137 exit code. And an OOM error is something you DO want to retry as opposed to a user-code bug.

image

nielstenboom avatar Jul 18 '22 08:07 nielstenboom

Hey @nielstenboom, I'm still a little bit confused. First, the examples you added will fall under the lastRetry.status == "Error" condition, and therefore it doesn't matter what is the exit code. Second, I still don't understand what is the added value adding [0,1] under the lastRetry.status == "Failed" condition.

noaabarki avatar Jul 19 '22 10:07 noaabarki

Hey @nielstenboom! Thought to ping you regarding this issue so we could fix or close it :)

noaabarki avatar Aug 01 '22 14:08 noaabarki

Hey @nielstenboom, I'm still a little bit confused. First, the examples you added will fall under the lastRetry.status == "Error" condition, and therefore it doesn't matter what is the exit code.

From the tests that I did this seems to not be the case 🤔 the examples fail with a "Failed" condition as it's not an argo workflow controller error. Only on argo workflow controller errors will the state become "Error" (https://argoproj.github.io/argo-workflows/retries/#retry-policies).

Second, I still don't understand what is the added value adding [0,1] under the lastRetry.status == "Failed" condition.

The docs on this describe that you do not want to retry errors that result from user-level code as these indicate a bug. Therefor I think you should not retry steps that failed with exit code = 1.

image

I ended up using the following expression for our argo workflows deployment:

lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) not in [-1,0,1])

It needs a -1 as well as for some steps the exit code is not available.

nielstenboom avatar Aug 01 '22 15:08 nielstenboom

Thank you @nielstenboom, for taking the time and replying with such detailed answer. You gave me a lot to think about, I'll dig into this and keep you updated I hope in the next 2-3 days :)

noaabarki avatar Aug 07 '22 12:08 noaabarki

Hi @nielstenboom,

I reviewed your examples. Thank you for the detailed response. It seems that the expression also suits the use-case you pointed and I understand and accept your concern that retrying on "Failed" status with errorCode 1 might be too much. However, I still think the expression should stay as it is sine Pod with "Failed" phase means that one or more of the pod’s containers terminated with an error, and Pod with "Error" phase status means argo workflow controller dictated errors. In both cases, we want to allow the user to be notified.

noaabarki avatar Sep 25 '22 07:09 noaabarki