
[BUG] invalid cron date schedule creates infinite loop in flytescheduler

Open bthari opened this issue 8 months ago • 5 comments

Describe the bug

We were experiencing a memory issue with our flytescheduler: memory kept climbing until the pod died with OOMKilled. Every time the pod restarted, it tried to “catch up” on the same schedule over and over again, even though a log from the previous cycle said "caught up successfully on the schedule". We then found that the snapshot in schedule_entities_snapshots wasn't being updated.

We then realized that some launchplans never got a “successfully caught up” log. These launchplans were scheduled to 0 2 31 * * or to every 31st of February (our user set this deliberately, expecting the schedule never to be executed). It turned out that this was the schedule that started the memory issue.

If you check this code, the loop quits once scheduledTime > now: https://github.com/flyteorg/flyte/blob/8bf1de63810474f3ed6bbc4c71caf841012383fe/flyteadmin/scheduler/core/gocron_scheduler.go#L224-L239. But the cron library has an interesting behavior: if the cron expression is valid but the date is invalid, it does not return an error and instead returns a default time.Time{}, i.e. 0001-01-01, as the date. As a result, this block of code never quits, because scheduledTime is always less than now, and it exhausts flytescheduler's memory.

Because this block never quit, the snapshotter initialization here was never executed: the initialization waits for the missed schedules to either catch up or fail. Therefore no more snapshots were taken, and our flytescheduler kept trying to catch up on the schedule from the same point.

Expected behavior

flytescheduler should not try to catch up on a schedule with an invalid cron date

Additional context to reproduce

No response

Screenshots

The log where flytescheduler tried to catch up on the same schedule from the same time and got "caught up successfully on the schedule": Image

Our flytescheduler memory:

Image

Are you sure this issue hasn't been raised already?

  • [x] Yes

Have you read the Code of Conduct?

  • [x] Yes

bthari avatar May 27 '25 17:05 bthari

Thank you for opening your first issue here! 🛠

welcome[bot] avatar May 27 '25 17:05 welcome[bot]

It seems likely that Flyte's scheduler can enter an infinite loop and exhaust memory if a launch plan is scheduled with a cron expression that is syntactically valid but semantically impossible (e.g., "0 2 31 * *" for the 31st day of every month, which doesn't exist for all months). The scheduler's catch-up logic repeatedly tries to compute the next scheduled time, but the underlying cron library returns a zero time (0001-01-01T00:00:00Z) for such cases, causing the loop to never exit since this value is always less than the current time. As a result, the snapshot is never updated, and the scheduler keeps retrying, leading to memory issues. This behavior is visible in the code at gocron_scheduler.go#L224-L239 and schedule_executor.go#L85-L100.

My understanding is that Flyte currently does not validate for impossible cron dates at registration or catch-up time, so such schedules can cause the described infinite loop. A workaround would be to avoid using impossible cron expressions, and a fix would require Flyte to detect and handle this case, possibly by checking for zero time returns from the cron parser and aborting the catch-up. Would you like a step-by-step breakdown of the relevant code paths and possible mitigation strategies?

This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @runllm to ask follow-up questions. See runllm.com for more!

runllm[bot] avatar May 27 '25 17:05 runllm[bot]

#take

popojk avatar Jul 29 '25 09:07 popojk

Hi @bthari, I’m currently looking into this issue. I'm wondering what the use case is where users intentionally input an invalid cron, expecting it to disable the schedule. Personally, I think the Flyte scheduler should validate the cron expression and date, and raise an error if either is invalid.

popojk avatar Aug 04 '25 02:08 popojk

@popojk you're right, they were expecting the schedule not to run. We have asked them to deactivate the launchplan instead to avoid this, but it would be nice if Flyte threw an error, or at least a warning, for this case, so we know which part of our config needs changes.

bthari avatar Aug 07 '25 06:08 bthari