
[Core feature] Delete terminated workflows in chunks during garbage collection


Motivation: Why do you think this is important?

A large number of FlyteWorkflow objects may overwhelm Flyte's garbage collection routine. This is because the garbage collector works by first listing all objects in the respective namespaces, and this listing times out when a namespace contains a large number of objects.

FlytePropeller also watches for new or updated FlyteWorkflow objects in the namespaces assigned to it, so when any of these namespaces holds a large number of objects, the ListAndWatch operation times out as well. This causes the whole workflow engine to grind to a halt when the number of FlyteWorkflow objects grows beyond what the garbage collector can handle! See the log below:

I0124 16:52:23.295697       1 trace.go:205] Trace[1562460260]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167 (24-Jan-2022 16:51:53.294) (total time: 30000ms):
Trace[1562460260]: [30.000707142s] [30.000707142s] END
E0124 16:52:23.295723       1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1alpha1.FlyteWorkflow: failed to list *v1alpha1.FlyteWorkflow: Get "https://192.168.3.1:443/apis/flyte.lyft.com/v1alpha1/flyteworkflows?labelSelector=termination-status+notin+%28terminated%29&limit=500&resourceVersion=0": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Goal: What should the final outcome look like, ideally?

The garbage collector should limit the number of terminated workflows it lists/deletes on every tick. This avoids timeouts when a namespace contains a large number of objects, so garbage collection can complete successfully, even if it takes longer.

Describe alternatives you've considered

We considered working around this issue with a cron job that deletes terminated workflows in chunks, but we believe this is better fixed in FlytePropeller.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

jeevb avatar Feb 12 '22 16:02 jeevb

@jeevb to recap: there are a large number of workflows, and FlytePropeller's management of them overwhelms the k8s API server, resulting in timeouts. These affect not only the informer watch on new FlyteWorkflow CRDs but also the deletion of terminated CRDs. The problem is cyclic: the issue will persist until deletion succeeds, and deletion keeps failing because of the issue.

Do you have the specific GC error? The GitHub permalink you provided looks like it lists the namespaces so that it can iterate over each one and perform the actual deletion. Given the timeouts you're experiencing, either of these calls could be timing out, but it would make things easier if we knew that only one of them is.

I think your idea of limiting the number of CRDs deleted on each tick makes sense. However, the delete API does not support the limit field from ListOptions, so we can either delete all CRDs matching a label selector query or delete a single one at a time. This makes things a little more difficult.
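
For illustration, here is a minimal sketch of what chunked deletion could look like with the dynamic client: list at most a fixed number of terminated workflows per tick and delete them individually, since a collection delete cannot be limited. The flyteworkflows GVR, the label selector, and the function shape are assumptions, not the actual FlytePropeller code.

```go
// Sketch only: chunked garbage collection of terminated FlyteWorkflows.
// The GVR and "termination-status=terminated" selector are assumptions.
package gc

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var flyteWorkflowGVR = schema.GroupVersionResource{
	Group:    "flyte.lyft.com",
	Version:  "v1alpha1",
	Resource: "flyteworkflows",
}

// deleteTerminatedChunk lists at most `limit` terminated workflows in a
// namespace and deletes them one by one, returning the number deleted.
func deleteTerminatedChunk(ctx context.Context, client dynamic.Interface, namespace string, limit int64) (int, error) {
	list, err := client.Resource(flyteWorkflowGVR).Namespace(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "termination-status=terminated", // assumed label
		Limit:         limit,                           // cap work done per GC tick
	})
	if err != nil {
		return 0, fmt.Errorf("listing terminated workflows: %w", err)
	}

	deleted := 0
	for _, item := range list.Items {
		if err := client.Resource(flyteWorkflowGVR).Namespace(namespace).Delete(ctx, item.GetName(), metav1.DeleteOptions{}); err != nil {
			return deleted, fmt.Errorf("deleting workflow %s: %w", item.GetName(), err)
		}
		deleted++
	}
	return deleted, nil
}
```

With this shape, each tick touches at most `limit` objects, so a single call never has to scan the whole namespace; remaining terminated workflows are picked up on subsequent ticks (or by following the continue token returned by the list).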

In looking for a solution I discovered k8s FlowSchema. It allows assigning priorities to API calls based on simple matching rules, and it is enabled by default in v1.20 (it can be enabled manually on earlier versions). The idea is that we could create a FlowSchema that gives FlyteWorkflow CRD deletion API calls a high priority; those calls would then experience fewer (hopefully no) timeouts, breaking the cycle you're seeing. Do you think this is something that might work? I would be happy to run some tests.
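
For reference, a hypothetical FlowSchema along those lines could be created as sketched below, using the typed client-go flowcontrol/v1beta1 API. The schema name, the `workload-high` priority level, and the `flytepropeller` service account in the `flyte` namespace are all assumptions.

```go
// Sketch only: a FlowSchema that raises the priority of FlytePropeller's
// FlyteWorkflow API calls. Names and priority level are assumptions.
package gc

import (
	"context"

	flowcontrolv1beta1 "k8s.io/api/flowcontrol/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createFlyteWorkflowFlowSchema(ctx context.Context, clientset kubernetes.Interface) error {
	fs := &flowcontrolv1beta1.FlowSchema{
		ObjectMeta: metav1.ObjectMeta{Name: "flyteworkflow-gc"},
		Spec: flowcontrolv1beta1.FlowSchemaSpec{
			// Reuse a built-in priority level; a dedicated one could also be defined.
			PriorityLevelConfiguration: flowcontrolv1beta1.PriorityLevelConfigurationReference{Name: "workload-high"},
			MatchingPrecedence:         1000,
			Rules: []flowcontrolv1beta1.PolicyRulesWithSubjects{{
				// Match requests made by FlytePropeller's service account (assumed name/namespace).
				Subjects: []flowcontrolv1beta1.Subject{{
					Kind: flowcontrolv1beta1.SubjectKindServiceAccount,
					ServiceAccount: &flowcontrolv1beta1.ServiceAccountSubject{
						Name:      "flytepropeller",
						Namespace: "flyte",
					},
				}},
				// Prioritize list/watch/delete calls on FlyteWorkflow CRDs in every namespace.
				ResourceRules: []flowcontrolv1beta1.ResourcePolicyRule{{
					Verbs:      []string{"list", "watch", "delete", "deletecollection"},
					APIGroups:  []string{"flyte.lyft.com"},
					Resources:  []string{"flyteworkflows"},
					Namespaces: []string{"*"},
				}},
			}},
		},
	}
	_, err := clientset.FlowcontrolV1beta1().FlowSchemas().Create(ctx, fs, metav1.CreateOptions{})
	return err
}
```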

hamersaw avatar Mar 02 '22 16:03 hamersaw

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar Aug 28 '23 00:08 github-actions[bot]

Commenting to keep open.

hamersaw avatar Aug 30 '23 15:08 hamersaw

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar May 28 '24 00:05 github-actions[bot]

I do not think this is valid anymore, as we now simply apply label-based deletions and let them propagate in the background.
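
For context, a minimal sketch of that approach, assuming the same flyteworkflows GVR and an illustrative label selector (the actual selector used by FlytePropeller may differ):

```go
// Sketch only: label-based bulk deletion with background propagation.
package gc

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

func deleteTerminatedCollection(ctx context.Context, client dynamic.Interface, namespace string) error {
	gvr := schema.GroupVersionResource{Group: "flyte.lyft.com", Version: "v1alpha1", Resource: "flyteworkflows"}
	policy := metav1.DeletePropagationBackground
	// A single DeleteCollection call: the API server carries out the deletions,
	// and dependent objects are cleaned up in the background.
	return client.Resource(gvr).Namespace(namespace).DeleteCollection(ctx,
		metav1.DeleteOptions{PropagationPolicy: &policy},
		metav1.ListOptions{LabelSelector: "termination-status=terminated"}, // assumed label
	)
}
```

This keeps FlytePropeller's side of the work to a single request per namespace, with the API server handling the individual deletions.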

kumare3 avatar May 30 '24 04:05 kumare3

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar Feb 25 '25 00:02 github-actions[bot]

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar May 15 '25 00:05 github-actions[bot]