Setting `ALWAYS_OFFLOAD_NODE_STATUS` without setting `nodeStatusOffLoad` results in workflow errors
Pre-requisites
- [X] I have double-checked my configuration
- [X] I can confirm the issue exists when I tested with :latest
- [X] I have searched existing issues and could not find a match for this bug
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what did you expect to happen?
I was in the process of enabling node status offloading. As part of this I set the `ALWAYS_OFFLOAD_NODE_STATUS` environment variable but missed setting the `nodeStatusOffLoad` property in the ConfigMap to true.
This resulted in workflows failing with the error `offload node status is not supported` (CLI) and `Workflow operation error` (UI).
While this is definitely a configuration issue, it would be nice for the controller to ignore the `ALWAYS_OFFLOAD_NODE_STATUS` flag when `nodeStatusOffLoad` is set to false, so that users don't have to disable both settings when they need to disable node status offloading.
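For reference, the two settings live in different places, which makes it easy to change one without the other. A sketch of the intended matching configuration (key names follow the standard `workflow-controller-configmap` persistence section; the exact Deployment layout may differ in your install):

```yaml
# workflow-controller-configmap (persistence section)
persistence:
  nodeStatusOffLoad: true
```

```yaml
# workflow-controller Deployment (container env)
env:
  - name: ALWAYS_OFFLOAD_NODE_STATUS
    value: "true"
```

Setting the env var while leaving `nodeStatusOffLoad` false reproduces the error described here.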
Version
latest
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
```yaml
# This is an issue with any workflow, so pasting the default workflow here.
metadata:
  name: fantastic-python
  labels:
    example: 'true'
spec:
  arguments:
    parameters:
      - name: message
        value: hello argo
  entrypoint: argosay
  templates:
    - name: argosay
      inputs:
        parameters:
          - name: message
            value: '{{workflow.parameters.message}}'
      container:
        name: main
        image: 'argoproj/argosay:v2'
        command:
          - /argosay
        args:
          - echo
          - '{{inputs.parameters.message}}'
  ttlStrategy:
    secondsAfterCompletion: 300
  podGC:
    strategy: OnPodCompletion
```
Logs from the workflow controller
```
Type     Reason           Age  From                 Message
----     ------           ---  ----                 -------
Normal   WorkflowRunning  65s  workflow-controller  Workflow Running
Warning  WorkflowFailed   65s  workflow-controller  offload node status is not supported
```
Logs from your workflow's wait container
This was not done :(
You have asked for something to happen and not configured it. The controller is invalidly configured.
I wouldn't support remaining functional in this case. We could just refuse to start the controller instead, but that feels less helpful. The CLI error is as useful as it can be. Blindly carrying on in the face of bad configuration might be sane for some kinds of applications, but I disagree with doing that for a controller like Argo Workflows.
I'd support an improvement to the UI error message in this case, but otherwise think the current behavior is correct.
> we could just refuse to start the controller instead

Yeah, I think this would be the proper route, or at the very least have the Controller log out a critical error.
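The fail-fast startup check proposed here could look something like the following sketch. The function and parameter names are assumptions for illustration, not Argo's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// validateOffloadConfig (hypothetical name) refuses to start when the env
// var demands offloading but persistence was never configured, instead of
// failing every workflow at runtime.
func validateOffloadConfig(alwaysOffload, persistenceOffloadEnabled bool) error {
	if alwaysOffload && !persistenceOffloadEnabled {
		return errors.New("ALWAYS_OFFLOAD_NODE_STATUS is set but persistence.nodeStatusOffLoad is false: refusing to start")
	}
	return nil
}

func main() {
	alwaysOffload := os.Getenv("ALWAYS_OFFLOAD_NODE_STATUS") == "true"
	// nodeStatusOffLoad would normally come from the ConfigMap; hard-coded
	// to false here to mirror the misconfiguration from the report.
	if err := validateOffloadConfig(alwaysOffload, false); err != nil {
		fmt.Println("fatal:", err)
	}
}
```

Either exiting non-zero or logging a critical error at this point would surface the problem once, at startup, rather than on every workflow.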
> Logs from the workflow controller

Notably, these logs are actually missing from the issue.

And:

> Workflow operation error (UI)

This is actually coming from the Controller; it's a message on the entry node.
> offload node status is not supported (CLI)

This one is a bit more interesting, as it's mentioned in the docs, but this scenario is not listed as a possible cause.
This error message comes from the DB code. I'm not sure exactly how the CLI chose to show this error specifically. Which CLI command did you use to get that? `argo list`?
I'm not aware of the command that was run to get the logs since it was given to me by a team member. From what I understood it was a `kubectl describe` on the controller itself.
I'll try reproducing this again with a local argo setup over the weekend and come up with a more concrete set of logs and commands to reproduce this issue.
> I'm not sure exactly how the CLI chose this error to show specifically.

From my very limited understanding of the controller log, and from memory of looking at the `kubectl logs` of the controller, it comes up when a node is hydrated/dehydrated. This if condition succeeds since `alwaysOffloadNodeStatus` is true, while `h.offLoadNodeStatusRepo` has the default value of `ExplosiveOffloadNodeStatusRepo` since `nodeStatusOffLoad` is false.
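The interaction described above can be sketched in isolation. This is an illustrative reconstruction of the pattern, not the actual Argo source; the type and method names are assumptions, and only the error string is taken from the logs in this issue:

```go
package main

import (
	"errors"
	"fmt"
)

// offloadRepo stands in for the node-status persistence interface.
type offloadRepo interface {
	Get(uid, version string) (string, error)
}

// explosiveRepo mirrors the role of ExplosiveOffloadNodeStatusRepo: the
// default repo when persistence is not configured, which fails on any use.
type explosiveRepo struct{}

func (explosiveRepo) Get(uid, version string) (string, error) {
	return "", errors.New("offload node status is not supported")
}

type hydrator struct {
	alwaysOffload bool // set by ALWAYS_OFFLOAD_NODE_STATUS
	repo          offloadRepo
}

func (h hydrator) hydrate(uid, version string) error {
	// With alwaysOffload true, this branch is taken even though the repo
	// was never configured, so every hydration fails.
	if h.alwaysOffload {
		_, err := h.repo.Get(uid, version)
		return err
	}
	return nil
}

func main() {
	h := hydrator{alwaysOffload: true, repo: explosiveRepo{}}
	fmt.Println(h.hydrate("some-uid", "v1"))
}
```

This is consistent with the behavior reported: the env var alone is enough to route every workflow through the unconfigured repo, which then fails each operation.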