
Setting `ALWAYS_OFFLOAD_NODE_STATUS` without setting `nodeStatusOffLoad` results in workflow errors

Open · abhijeetviswa opened this issue 2 years ago · 3 comments

Pre-requisites

  • [X] I have double-checked my configuration
  • [X] I can confirm the issue exists when I tested with :latest
  • [X] I have searched existing issues and could not find a match for this bug
  • [ ] I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

I was in the process of enabling node status offloading. As part of this, I set the ALWAYS_OFFLOAD_NODE_STATUS environment variable but missed setting the nodeStatusOffLoad property in the configmap to true.

This resulted in the workflows failing with the error offload node status is not supported (CLI) and Workflow operation error (UI).

While this is definitely a configuration issue, it would be nice for the controller to ignore the ALWAYS_OFFLOAD_NODE_STATUS flag when nodeStatusOffLoad is set to false, so that users don't have to disable both settings in order to turn off node status offloading.
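For reference, a sketch of the two places that need to agree. Field names follow the Argo Workflows persistence configuration; the database details are placeholders, not a working setup:

```yaml
# workflow-controller-configmap (sketch): offloading must be enabled here
# in addition to the env var on the controller
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  persistence: |
    nodeStatusOffLoad: true    # must be true when ALWAYS_OFFLOAD_NODE_STATUS is set
    postgresql:                # placeholder: offloading also needs a database backend
      host: postgres
      port: 5432
      database: argo
      tableName: argo_workflows
---
# workflow-controller Deployment (excerpt): the env var that triggered this issue
env:
  - name: ALWAYS_OFFLOAD_NODE_STATUS
    value: "true"
```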

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

# This is an issue with any workflow, so pasting the default workflow here.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: fantastic-python
  labels:
    example: 'true'
spec:
  arguments:
    parameters:
      - name: message
        value: hello argo
  entrypoint: argosay
  templates:
    - name: argosay
      inputs:
        parameters:
          - name: message
            value: '{{workflow.parameters.message}}'
      container:
        name: main
        image: 'argoproj/argosay:v2'
        command:
          - /argosay
        args:
          - echo
          - '{{inputs.parameters.message}}'
  ttlStrategy:
    secondsAfterCompletion: 300
  podGC:
    strategy: OnPodCompletion

Logs from the workflow controller

Type     Reason           Age   From                 Message
----     ------           ----  ----                 -------
Normal   WorkflowRunning  65s   workflow-controller  Workflow Running
Warning  WorkflowFailed   65s   workflow-controller  offload node status is not supported

Logs from in your workflow's wait container

This was not done :(

abhijeetviswa avatar Jan 22 '24 13:01 abhijeetviswa

You have asked for something to happen and not configured it. The controller is invalidly configured.

I wouldn't support the controller remaining functional in this case. We could instead just refuse to start the controller, but that feels less helpful. The CLI error is being as useful as it can be. Blindly carrying on in the face of bad configuration might be sane for some kinds of applications, but I disagree with doing it for a controller like Argo Workflows.

I'd support an improvement to the UI error message in this case, but otherwise think the current behavior is correct.

Joibel avatar Jan 22 '24 13:01 Joibel

but we could just refuse to start the controller instead

Yea I think this would be the proper route, or at the very least have the Controller log out a critical error

Logs from the workflow controller

Notably, these logs are actually missing from the issue

and Workflow operation error (ui).

This is actually coming from the Controller, it's a message on the entry node

offload node status is not supported (cli)

this one is a bit more interesting, as the error is mentioned in the docs, but this scenario is not listed there as a possible cause.

This error message comes from the DB code. I'm not sure exactly how the CLI chose this error to show specifically. Which command in the CLI did you use to get that? argo list?

agilgur5 avatar Jan 22 '24 18:01 agilgur5
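The fail-fast behavior proposed above (refuse to start, or at least log a critical error, when the two settings disagree) could be sketched roughly like this. This is a hypothetical illustration, not the actual Argo Workflows code; `validateOffloadConfig` and its parameters are made-up names:

```go
package main

import (
	"fmt"
	"os"
)

// validateOffloadConfig is a hypothetical startup check: if the
// ALWAYS_OFFLOAD_NODE_STATUS env var is set but persistence.nodeStatusOffLoad
// is disabled, report the mismatch instead of failing every workflow later.
func validateOffloadConfig(alwaysOffloadEnv string, nodeStatusOffLoad bool) error {
	if alwaysOffloadEnv == "true" && !nodeStatusOffLoad {
		return fmt.Errorf("ALWAYS_OFFLOAD_NODE_STATUS is set but persistence.nodeStatusOffLoad is disabled")
	}
	return nil
}

func main() {
	// Exiting here at startup surfaces the misconfiguration immediately,
	// rather than surfacing it as per-workflow failures.
	if err := validateOffloadConfig(os.Getenv("ALWAYS_OFFLOAD_NODE_STATUS"), false); err != nil {
		fmt.Fprintln(os.Stderr, "fatal: invalid controller configuration:", err)
		os.Exit(1)
	}
	fmt.Println("configuration ok")
}
```

The trade-off is the one Joibel raises: failing to start is stricter and arguably less "helpful" than running, but it makes the bad configuration impossible to miss.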

I'm not aware of the command that was run to get the logs, since they were given to me by a team member. From what I understood, it was a kubectl describe on the controller itself.

I'll try reproducing this again with a local argo setup over the weekend and come up with a more concrete set of logs and commands to reproduce this issue.

I'm not sure exactly how the CLI chose this error to show specifically.

From my very limited understanding of the controller, and from memory of looking at its kubectl logs, it comes up when a node is "hydrated"/"dehydrated". This if condition succeeds since alwaysOffloadNodeStatus is true, while h.offLoadNodeStatusRepo still has its default value of ExplosiveOffloadNodeStatusRepo because nodeStatusOffLoad is false.

abhijeetviswa avatar Jan 23 '24 06:01 abhijeetviswa
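The failure path described in the last comment can be modeled with a small sketch. This is a simplified, hypothetical model, not the actual Argo Workflows source; the names only loosely follow the identifiers mentioned above:

```go
package main

import (
	"errors"
	"fmt"
)

var errOffloadNotSupported = errors.New("offload node status is not supported")

// nodeStatusRepo stands in for the controller's offload repository interface.
type nodeStatusRepo interface {
	Save(nodes string) error
}

// explosiveRepo mimics ExplosiveOffloadNodeStatusRepo: the default repo used
// when persistence is not configured, whose every call fails.
type explosiveRepo struct{}

func (explosiveRepo) Save(string) error { return errOffloadNotSupported }

// hydrator mimics the dehydration step described in the comment above.
type hydrator struct {
	alwaysOffloadNodeStatus bool           // mirrors ALWAYS_OFFLOAD_NODE_STATUS
	offLoadNodeStatusRepo   nodeStatusRepo // explosive unless nodeStatusOffLoad is true
}

func (h hydrator) Dehydrate(nodes string) error {
	// With ALWAYS_OFFLOAD_NODE_STATUS=true this branch is always taken, so the
	// explosive repo's error surfaces on every workflow operation.
	if h.alwaysOffloadNodeStatus {
		return h.offLoadNodeStatusRepo.Save(nodes)
	}
	return nil // status stays inline in the Workflow object
}

func main() {
	h := hydrator{alwaysOffloadNodeStatus: true, offLoadNodeStatusRepo: explosiveRepo{}}
	if err := h.Dehydrate("node-status"); err != nil {
		fmt.Println("workflow fails with:", err)
	}
}
```

This illustrates why every workflow fails under this configuration: the always-offload branch is unconditional, and the only repo available to it is the one that exists precisely to reject offloading.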