Clear past errors from workflow state
Tracking issue
https://github.com/flyteorg/flyte/issues/4569
Why are the changes needed?
Reduce unneeded information stored in etcd when using `failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE`. This allows Flyte to scale to larger workflows before hitting etcd size limits.
What changes were proposed in this pull request?
- Clears old errors whenever new ones occur.
- This is controlled by the `node-config.enable-cr-debug-metadata` config option. Set this to `true` to restore the previous behaviour.
- Behaviour is only changed when running workflows with `FAIL_AFTER_EXECUTABLE_NODES_COMPLETE`. Without this, the workflow fails as soon as there is one failure, so there can never be more than one error regardless of this PR.
- Update docstring for the `enable-cr-debug-metadata` config option.
- Update `TestWorkflowExecutor_HandleFlyteWorkflow_Failing` to have sub-test cases for combinations of `FAIL_AFTER_EXECUTABLE_NODES_COMPLETE` and `enable-cr-debug-metadata`.
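For operators who want to keep the old behaviour, a minimal propeller config sketch (the key layout under `node-config` is assumed from the option name `node-config.enable-cr-debug-metadata` referenced above):

```yaml
# FlytePropeller configuration fragment (sketch, not a full config).
node-config:
  # Set to true to restore the previous behaviour of retaining all
  # historical errors in the workflow CR stored in etcd. Defaults to
  # false after this PR, so stale errors are cleared as new ones occur.
  enable-cr-debug-metadata: true
```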
How was this patch tested?
Updated unit tests. I have also been running something very similar to this in our prod deployment for some time.
Check all the applicable boxes
- [x] I updated the documentation accordingly. I would like to make a follow-up PR to update https://docs.flyte.org/en/latest/deployment/configuration/performance.html#improving-etcd-performance
- [x] All new and existing tests passed.
- [x] All commits are signed-off.
Related PRs
Follow up to https://github.com/flyteorg/flyte/pull/4596
Codecov Report
Attention: 10 lines in your changes are missing coverage. Please review.
Comparison is base (`0c8dc61`) 58.20% compared to head (`2cea075`) 58.07%. Report is 1 commit behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| flytepropeller/pkg/controller/nodes/executor.go | 11.11% | 8 Missing :warning: |
| ...ler/pkg/apis/flyteworkflow/v1alpha1/node_status.go | 33.33% | 2 Missing :warning: |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #4624      +/-   ##
==========================================
- Coverage   58.20%   58.07%   -0.14%
==========================================
  Files         626      476     -150
  Lines       53800    38119   -15681
==========================================
- Hits        31316    22138    -9178
+ Misses      19976    14056    -5920
+ Partials     2508     1925     -583
```
| Flag | Coverage Δ |
|---|---|
| unittests | 58.26% <23.07%> (+0.05%) :arrow_up: |
It looks like I'm missing a little bit of code coverage. I'll try to find some time to fix that.
If I understand this logic, the goal is to delete all the errors but the last one, right? I'm trying to wrap my head around the logic of iterating over downstream nodes but need to dive deeper. Is there determinism in the ordering, or is there a scenario here where we delete all of the error messages? For example, with two nodes (n0 and n1): if the first time we iterate the order is n0, n1, we clear the error from n0; if the second time we iterate the order is n1, n0, we clear the error from n1 and have just cleared all of our errors.
That's a good question. I guess I assumed that `stateOnComplete` would only get updated when there are new errors and would not get randomly set to a different error on each round. I think that assumption is correct, given that we've been using this in prod and nobody has noticed any problems, but I will need to read the code again to understand how that is achieved.
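To make the concern above concrete, here is a small, hypothetical sketch (plain Python, not the actual flytepropeller Go code; `clear_stale_errors` and the node names are invented for illustration). It shows a "keep only the currently selected error" rule, and why a non-deterministic selection across rounds could clear every error:

```python
def clear_stale_errors(node_errors, selected_node):
    """Keep only the error for the currently selected node; clear the rest.

    node_errors: dict mapping node id -> error message.
    selected_node: the node whose error is treated as "current".
    Returns a dict of the errors that were cleared.
    """
    cleared = {}
    for node_id in list(node_errors):
        if node_id != selected_node:
            cleared[node_id] = node_errors.pop(node_id)
    return cleared

# Round 1: selection is n1, so n0's error is cleared; n1's survives.
errors = {"n0": "oom", "n1": "timeout"}
clear_stale_errors(errors, selected_node="n1")
assert errors == {"n1": "timeout"}

# Round 2: if the selection flipped to n0 (whose error is already gone),
# the remaining error on n1 would also be cleared -- the failure mode
# raised in the review. A stable selection across rounds avoids this.
clear_stale_errors(errors, selected_node="n0")
assert errors == {}
```

The sketch suggests the safety of the change hinges on the selected error being stable between reconciliation rounds, which matches the author's reply about `stateOnComplete` only updating when new errors occur.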
Unfortunately I've been very distracted from this recently but I do plan to come back to it.