[BUG] invalid cron date schedule creates infinite loop in flytescheduler
Describe the bug
We were experiencing a memory issue with our flytescheduler: memory kept climbing until the pod died from OOMKilled. Every time the pod restarted, it tried to "catch up" the same schedule over and over, even though the previous cycle had logged "caught up successfully on the schedule". We then found that the snapshot in schedule_entities_snapshots wasn't being updated.
We then realized that some launch plans never got a "successfully caught up": these launch plans were scheduled to 0 2 31 * *, or to every 31st of February (our user set this deliberately, expecting the schedule to never execute). This turned out to be the schedule that triggered the memory issue.
If you check this code, the loop only quits once scheduledTime > now:
https://github.com/flyteorg/flyte/blob/8bf1de63810474f3ed6bbc4c71caf841012383fe/flyteadmin/scheduler/core/gocron_scheduler.go#L224-L239
However, the cron library has an interesting behavior: if the cron expression is syntactically valid but the date it describes never occurs, it does not return an error; instead it returns the zero time.Time{} (0001-01-01) as the next occurrence. This causes the block of code above to never quit, because scheduledTime is always less than now, which eventually exhausts flytescheduler's memory.
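A minimal standalone reproduction of that behavior, assuming the github.com/robfig/cron/v3 library (which I believe is what the gocron scheduler wraps); the expression 0 2 31 2 * (02:00 on 31 February) is used here purely as an example of an impossible date:

```go
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

func main() {
	// Standard 5-field parser: minute, hour, day-of-month, month, day-of-week.
	parser := cron.NewParser(cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow)

	// "0 2 31 2 *" (02:00 on 31 February) is syntactically valid, so Parse
	// succeeds even though the date can never occur.
	sched, err := parser.Parse("0 2 31 2 *")
	if err != nil {
		panic(err) // not reached: no error is returned for the impossible date
	}

	// Next gives up after scanning a few years ahead and returns the zero
	// time instead of an error, so the result is always "before now".
	next := sched.Next(time.Now())
	fmt.Println(next, next.IsZero()) // 0001-01-01 00:00:00 +0000 UTC true
}
```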
Because that catch-up block never quits, the snapshotter initialization here was never executed, since initialization waits for the missed schedules to either catch up or fail. As a result, no new snapshots were taken, and our flytescheduler kept trying to catch up the schedule from the same point.
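One possible mitigation, sketched here with illustrative names rather than the actual flyteadmin code: treat a zero time from Next as "this schedule can never fire again" and abort the catch-up instead of comparing it against now:

```go
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

// catchUp is an illustrative catch-up loop that refuses to spin forever:
// a zero time from Next means the schedule has no future occurrence, so we
// abort instead of comparing it against `now` (it would always be earlier).
func catchUp(sched cron.Schedule, from, now time.Time, fire func(time.Time)) error {
	for t := sched.Next(from); ; t = sched.Next(t) {
		if t.IsZero() {
			return fmt.Errorf("schedule has no future occurrences, aborting catch-up")
		}
		if t.After(now) {
			return nil // fully caught up
		}
		fire(t)
	}
}

func main() {
	parser := cron.NewParser(cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow)
	sched, _ := parser.Parse("0 2 31 2 *") // impossible date: 31 February
	err := catchUp(sched, time.Now().Add(-48*time.Hour), time.Now(),
		func(t time.Time) { fmt.Println("would fire at", t) })
	fmt.Println(err) // aborts instead of looping forever
}
```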
Expected behavior
flytescheduler should not try to catch up on a schedule with an invalid cron date.
Additional context to reproduce
No response
Screenshots
The log where flytescheduler tried to catch up the same schedule from the same time and got "caught up successfully on the schedule":
Our flytescheduler memory:
Are you sure this issue hasn't been raised already?
- [x] Yes
Have you read the Code of Conduct?
- [x] Yes
It seems likely that Flyte's scheduler can enter an infinite loop and exhaust memory if a launch plan is scheduled with a cron expression that is syntactically valid but semantically impossible (e.g., a date such as 31 February that never occurs). The scheduler's catch-up logic repeatedly tries to compute the next scheduled time, but the underlying cron library returns the zero time (0001-01-01T00:00:00Z) for such cases, so the loop never exits, since this value is always earlier than the current time. As a result, the snapshot is never updated and the scheduler keeps retrying, leading to memory exhaustion. This behavior is visible in the code at gocron_scheduler.go#L224-L239 and schedule_executor.go#L85-L100.
My understanding is that Flyte currently does not validate for impossible cron dates at registration or catch-up time, so such schedules can cause the described infinite loop. A workaround would be to avoid using impossible cron expressions, and a fix would require Flyte to detect and handle this case, possibly by checking for zero time returns from the cron parser and aborting the catch-up. Would you like a step-by-step breakdown of the relevant code paths and possible mitigation strategies?
#take
Hi @bthari, I'm currently looking into this issue. I'm wondering what the use case is where users intentionally input an invalid cron date expecting it to disable the schedule? Personally, I think the Flyte scheduler should validate both the cron expression and the date, and raise an error if either is invalid.
@popojk you're right, they were expecting the schedule not to run. We have asked them to deactivate the launch plan instead to avoid this, but it would be nice if Flyte threw an error, or at least a warning, for this case, so we know which part of our config needs changes.
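For what it's worth, a registration-time check could be as simple as rejecting any expression whose next occurrence is the zero time; a rough sketch with a hypothetical helper (not Flyte's actual validation code):

```go
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

// validateCronSchedule is a hypothetical registration-time check: accept the
// expression only if it parses AND can actually fire in the future (a zero
// Next means the date can never occur, e.g. 31 February).
func validateCronSchedule(expr string) error {
	parser := cron.NewParser(cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow)
	sched, err := parser.Parse(expr)
	if err != nil {
		return fmt.Errorf("invalid cron expression %q: %w", expr, err)
	}
	if sched.Next(time.Now()).IsZero() {
		return fmt.Errorf("cron expression %q can never fire", expr)
	}
	return nil
}

func main() {
	fmt.Println(validateCronSchedule("0 2 31 2 *")) // cron expression "0 2 31 2 *" can never fire
	fmt.Println(validateCronSchedule("0 2 * * *"))  // <nil>
}
```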