incubator-devlake icon indicating copy to clipboard operation
incubator-devlake copied to clipboard

[Bug][Cronjobs] Devlake engine does not respect cronjob schedules

Open AlexPerkbox opened this issue 1 year ago • 14 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

Cronjobs are no longer considered and blueprints do not run recurringly based on a schedule.

E.g. This blueprint is supposed to run daily at midnight (3:00AM my local time) but it did not on June 14, 17, and 19. { "name": "DevOps-Demo-Blueprint", "projectName": "DevOps-Demo", "cronConfig": "0 0 * * *", "createdAt": "2024-06-12T09:50:01.797Z", "updatedAt": "2024-06-12T09:50:27.466Z" } image

Other than that, there are some 400 odd blueprints and when changing cronConfig (using API) only a handful of them actually get triggered (although they return 200 OK).

Running devlake-1.0.0-beta11 on AWS ECS/RDS and byog (bring your own grafana)

What do you expect to happen

Whenever a cronjob is updated a pipeline should trigger at the specified date and time.

How to reproduce

Update cronConfig setting either via UI or API and monitor behaviour.

Anything else

No response

Version

devlake-1.0.0-beta11

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

AlexPerkbox avatar Jun 20 '24 12:06 AlexPerkbox

Hey @AlexPerkbox! I'm here to assist you with any bugs, questions, or contributions. Let's work together to resolve your issue while we wait for a human maintainer to assist.

The issue you're experiencing with cronjob schedules not being respected in DevLake could be due to improper configuration of the synchronization policy. Specifically, the sync frequency or cron code might not be set correctly. Ensure that the sync frequency or cron code is correctly configured in the synchronization policy to enable recurring runs of the blueprints. Additionally, verify that the "Skip failed tasks" option is appropriately set to prevent interruptions in the data collection process [1][2][3][4].

There are also known performance issues with the devlake-1.0.0-beta11 version that could affect the execution of cronjobs in high-load scenarios. For instance, MySQL InnoDB might fail with the message "The total number of locks exceeds the lock table size" when dealing with large repositories or boards. To mitigate this, you can increase the innodb_buffer_pool_size in your MySQL configuration [5][6][7].

Additionally, there have been similar issues reported and resolved in the past:

  1. [Bug][framework] The blueprint cronjobs get schedule multiple times - This issue was closed by a pull request: fix: ReloadBlueprints is not safe.
  2. [Bug][Config-ui] Manual blueprint still has some info that should be in daily mode - This issue was resolved by pull request #4047 [8][9].

For further assistance, you can refer to the Apache DevLake Config UI tutorial, which provides detailed steps on setting up data connections, creating projects, and configuring synchronization policies. If issues persist, consider checking the troubleshooting section or creating a bug report on the GitHub repository with relevant details such as error messages and logs.

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Jun 20 '24 12:06 dosubot[bot]

@AlexPerkbox Thank you for your feedback.

  1. About the cronjob. It's weird. Can you provide more details about what you have done(step by step)? we need a stable reproduction.
  2. About the trigger. a. You have many blueprints(projects) b. You update their conjob configs via API, them all return status code 200 OK c. Only a handful of them are triggered successfully(when the cronjob time schedule is reached) Am I right ?

d4x1 avatar Jun 21 '24 02:06 d4x1

@AlexPerkbox You can set environment PIPELINE_MAX_PARALLEL a bigger number and see what will happen.

d4x1 avatar Jun 21 '24 03:06 d4x1

@d4x1 Thank you for your response.

As part of the initial setup, there were a few hundred blueprints created in the form of one blueprint = one gitlab project. It was observed that a complete run to fetch all data from gitlab takes approximately 2 days so the best time to kick-off (schedule) all jobs is Friday afternoon. Bulk-setting cronConfig in the form of

curl --location --request PATCH "https://devlake.my.endpoint.com/api/blueprints/[1-500]" \
--header 'Content-Type: application/json' \
--data '{
    "cronConfig": "10 15 * * 5"
}' | jq

updates all blueprints fine, but when the time comes there is nothing scheduled/created/pending in the pipelines tab.

P.s Increasing PIPELINE_MAX_PARALLEL to 5 does not make a difference.

AlexPerkbox avatar Jun 21 '24 06:06 AlexPerkbox

I suspect that the ECS is the cause of the problem. According to the document on https://aws.amazon.com/ecs/ , containers are fully managed by AWS, maybe it would put containers to sleep when it is not active(processing requests, or using any CPU time). I could not find any concrete proof at this point, but I might do it to save resources(money) if I were AWS.

klesh avatar Jun 24 '24 06:06 klesh

I have been seeing this same kind of problem. I have quite a few pipelines, all configured to run daily at 01:00, I find only 5 of them seem to run. there are no obvious errors or reasons why the others do not execute. the 5 that do run don't seem to be consistent. I'm running inside kubernetes

wouldd avatar Jun 24 '24 10:06 wouldd

We are also using k8s (EKS) as well, and all pipelines get executed as expected, we also have some k3s clusters for testing and other purposes, and they are all working fine 😂. The chance of this being fixed is slim unless someone who is having the problem could dig deep into the root.

klesh avatar Jun 25 '24 01:06 klesh

@klesh if you could give me a couple of pointers at the code which is responsible for triggering these scheduled jobs etc. then I'd be happy to dig in, maybe add some daignostic logging to my local fork.

wouldd avatar Jun 25 '24 08:06 wouldd

@wouldd Sure thing, we are using github.com/robfig/cron for scheduling jobs and here is the code calling the lib when creating/updating any blueprint, while the ReloadBlueprints function is called on the startup to load all blueprints from the db.

here is where the cronManager gets initialized after migrations were done, nothing fancy.

Let me know if I missed anything.

klesh avatar Jun 25 '24 13:06 klesh

ah, well now that I know what logs I should be looking for. what I see is this: time="2024-06-25 00:00:04" level=error msg=" [blueprint] created cron job failed on blueprint:[1][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec4fd8 0xc002ec4ff0 0xc002ec5008] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] run cron job failed on blueprint:[1][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec4fd8 0xc002ec4ff0 0xc002ec5008] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] created cron job failed on blueprint:[21][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec5068 0xc002ec5080 0xc002ec5098] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] run cron job failed on blueprint:[21][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec5068 0xc002ec5080 0xc002ec5098] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] created cron job failed on blueprint:[3][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ef53f8 0xc002ef5410 0xc002ef5428] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] run cron job failed on blueprint:[3][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ef53f8 0xc002ef5410 0xc002ef5428] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] created cron job failed on blueprint:[2][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ef55f0 0xc002ef5608 0xc002ef5620] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] run cron job failed on blueprint:[2][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ef55f0 0xc002ef5608 0xc002ef5620] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] created cron job failed on blueprint:[12][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec5728 0xc002ec5740 0xc002ec5758] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] run cron job failed on blueprint:[12][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec5728 0xc002ec5740 0xc002ec5758] (504)"

I have 25 blueprints total. the UI shows them all 'scheduled' to run at the next expected schedule, however clearly they mostly fail to actually create in the cron scheduler due to table locks.

wouldd avatar Jun 25 '24 14:06 wouldd

I would confirm the same behaviour as @wouldd, here is some logs from AWS CloudWatch

2024-06-25T11:30:00.107+03:00
time="2024-06-25 08:30:00" level=error msg=" [blueprint] run cron job failed on blueprint:[495][<gitlab-project-slug>]
caused by: Error 1040: Too many connections (500)
Wraps: (2) Error 1040: Too many connections
Error types: (1) *hintdetail.withDetail (2) *mysql.MySQLError"

AlexPerkbox avatar Jun 25 '24 15:06 AlexPerkbox

I would confirm the same behaviour as @wouldd, here is some logs from AWS CloudWatch

2024-06-25T11:30:00.107+03:00
time="2024-06-25 08:30:00" level=error msg=" [blueprint] run cron job failed on blueprint:[495][<gitlab-project-slug>]
caused by: Error 1040: Too many connections (500)
Wraps: (2) Error 1040: Too many connections
Error types: (1) *hintdetail.withDetail (2) *mysql.MySQLError"

@AlexPerkbox You can check AWS's config, turn up the value of something like max_connection_count.

d4x1 avatar Jun 28 '24 02:06 d4x1

@wouldd Aah, I see. Too many bps are trying to create pipelines simultaneously, causing some table lockers to timeout.

klesh avatar Jun 28 '24 08:06 klesh

@wouldd Could you spread them to different times and see if the problem goes away?

klesh avatar Jun 28 '24 08:06 klesh

@klesh I spred about half of them out last week and all those ran fine. I've now spread the remainder out. it's obviously a bit of a pain to have to do this manually. but certainly seems to work around the issue.

wouldd avatar Jul 01 '24 16:07 wouldd

@wouldd Thanks for the information. Will fix it on v1.1

klesh avatar Jul 02 '24 03:07 klesh

@wouldd @AlexPerkbox

We have a new release, https://github.com/apache/incubator-devlake/releases/tag/v1.0.1-beta2, you can have a try.

d4x1 avatar Jul 16 '24 02:07 d4x1

@wouldd @AlexPerkbox How about the latest release ?

d4x1 avatar Aug 09 '24 03:08 d4x1

Hey @wouldd, I am about to give the latest beta release a try, will let you all know how it goes. Cheers

AlexPerkbox avatar Aug 13 '24 08:08 AlexPerkbox

I can confirm that all 400+ blueprints are scheduled and thus created (pending) at the same time, no issues whatsoever. Noticed that all pipeline pages display results now, it used to be up to 4th before. Well done team, much appreciated!

AlexPerkbox avatar Aug 15 '24 07:08 AlexPerkbox