Clear past errors from workflow state
Tracking issue
https://github.com/flyteorg/flyte/issues/4569
Why are the changes needed?
Reduce unneeded information stored in etcd when using `failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE`. This allows Flyte to scale to larger workflows before hitting etcd size limits.
What changes were proposed in this pull request?
- Clears old errors whenever new ones occur.
- This is controlled by the `node-config.enable-cr-debug-metadata` config option. Set this to `true` to restore the previous behaviour.
- Behaviour is only changed when running workflows with `FAIL_AFTER_EXECUTABLE_NODES_COMPLETE`. Without this, the workflow fails as soon as there is one failure, so there can never be more than one error regardless of this PR.
- Update docstring for the `enable-cr-debug-metadata` config option.
- Update `TestWorkflowExecutor_HandleFlyteWorkflow_Failing` to have sub-test cases for combinations of `FAIL_AFTER_EXECUTABLE_NODES_COMPLETE` and `enable-cr-debug-metadata`.
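For operators who want to keep the old behaviour, a minimal propeller config sketch (the key layout under `node-config` is assumed from the option name `node-config.enable-cr-debug-metadata` referenced above):

```yaml
# FlytePropeller configuration fragment (sketch, not a full config).
node-config:
  # Set to true to restore the previous behaviour of retaining all
  # historical errors in the workflow CR stored in etcd. Defaults to
  # false after this PR, so stale errors are cleared as new ones occur.
  enable-cr-debug-metadata: true
```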
How was this patch tested?
Updated unit tests. I have also been running something very similar to this in our prod deployment for some time.
Check all the applicable boxes
- [x] I updated the documentation accordingly. I would like to make a follow-up PR to update https://docs.flyte.org/en/latest/deployment/configuration/performance.html#improving-etcd-performance
- [x] All new and existing tests passed.
- [x] All commits are signed-off.
Related PRs
Follow up to https://github.com/flyteorg/flyte/pull/4596
Codecov Report
Attention: 10 lines in your changes are missing coverage. Please review.
Comparison is base (`0c8dc61`) 58.20% compared to head (`2cea075`) 58.07%. Report is 1 commit behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| flytepropeller/pkg/controller/nodes/executor.go | 11.11% | 8 Missing :warning: |
| ...ler/pkg/apis/flyteworkflow/v1alpha1/node_status.go | 33.33% | 2 Missing :warning: |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #4624      +/-   ##
==========================================
- Coverage   58.20%   58.07%   -0.14%
==========================================
  Files         626      476     -150
  Lines       53800    38119   -15681
==========================================
- Hits        31316    22138    -9178
+ Misses      19976    14056    -5920
+ Partials     2508     1925     -583
```
| Flag | Coverage Δ |
|---|---|
| unittests | 58.26% <23.07%> (+0.05%) :arrow_up: |
It looks like I'm missing a little bit of code coverage. I'll try to find some time to fix that.
If I understand this logic, the goal is to delete all the errors but the last one, right? I'm trying to wrap my head around the logic of iterating over downstream nodes but need to dive deeper. Is there determinism in the ordering, or is there a scenario here where we delete all of the error messages? For example, with two nodes (n0 and n1): if the first time we iterate the order is n0, n1, we clear the error from n0; if the second time we iterate the order is n1, n0, we clear the error from n1 and have just cleared all of our errors.
That's a good question. I guess I assumed that `stateOnComplete` would only get updated when there are new errors and would not get randomly set to a different error on each round. I think that assumption is correct, given that we've been using this in prod and nobody has noticed any problems, but I will need to read the code again to understand how that is achieved.
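To make the concern above concrete, here is a small, hypothetical sketch (plain Python, not the actual flytepropeller Go code; `clear_stale_errors` and the node names are invented for illustration). It shows a "keep only the currently selected error" rule, and why a non-deterministic selection across rounds could clear every error:

```python
def clear_stale_errors(node_errors, selected_node):
    """Keep only the error for the currently selected node; clear the rest.

    node_errors: dict mapping node id -> error message.
    selected_node: the node whose error is treated as "current".
    Returns a dict of the errors that were cleared.
    """
    cleared = {}
    for node_id in list(node_errors):
        if node_id != selected_node:
            cleared[node_id] = node_errors.pop(node_id)
    return cleared

# Round 1: selection is n1, so n0's error is cleared; n1's survives.
errors = {"n0": "oom", "n1": "timeout"}
clear_stale_errors(errors, selected_node="n1")
assert errors == {"n1": "timeout"}

# Round 2: if the selection flipped to n0 (whose error is already gone),
# the remaining error on n1 would also be cleared -- the failure mode
# raised in the review. A stable selection across rounds avoids this.
clear_stale_errors(errors, selected_node="n0")
assert errors == {}
```

The sketch suggests the safety of the change hinges on the selected error being stable between reconciliation rounds, which matches the author's reply about `stateOnComplete` only updating when new errors occur.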
Unfortunately I've been very distracted from this recently but I do plan to come back to it.