fix(taskframework): job is not started in resource idle node when deployed multi-nodes

Open krihy opened this issue 1 year ago • 5 comments

What type of PR is this?

type-bug

What this PR does / why we need it:

In a multi-node deployment, a job may not be started even though some node has enough resources. When the JobStore on a node acquires a trigger but that node has no resources available, we should release the trigger and give other nodes a chance to acquire it.
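The acquire-then-release idea described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the code in this PR and not the Quartz API: `SharedTriggerStore`, `Node`, and `poll` are hypothetical names, and the shared queue merely stands in for the clustered trigger table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical stand-in for the trigger table shared by all nodes. */
class SharedTriggerStore {
    private final ConcurrentLinkedQueue<String> waiting = new ConcurrentLinkedQueue<>();

    void add(String trigger) {
        waiting.add(trigger);
    }

    /** Hands a waiting trigger to exactly one caller, if any is waiting. */
    Optional<String> acquire() {
        return Optional.ofNullable(waiting.poll());
    }

    /** Puts an acquired trigger back so that another node can pick it up. */
    void release(String trigger) {
        waiting.add(trigger);
    }
}

/** Hypothetical node; resourceAvailable models the result of a resource check. */
class Node {
    final String name;
    final boolean resourceAvailable;
    final List<String> started = new ArrayList<>();

    Node(String name, boolean resourceAvailable) {
        this.name = name;
        this.resourceAvailable = resourceAvailable;
    }

    /** One scheduling pass. Returns true only if this node started a job. */
    boolean poll(SharedTriggerStore store) {
        Optional<String> trigger = store.acquire();
        if (!trigger.isPresent()) {
            return false; // nothing to do
        }
        if (!resourceAvailable) {
            // Assumption of this sketch: a node without resources gives the trigger back
            store.release(trigger.get());
            return false;
        }
        started.add(trigger.get());
        return true;
    }
}
```

In this model, if a node created with `resourceAvailable = false` polls first, the trigger goes back to the store and a later `poll` from an idle node starts the job. Without the `release` call the trigger would be consumed by the busy node and never started anywhere.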

Which issue(s) this PR fixes:

Special notes for your reviewer:

Test case with two nodes: tasks are started evenly across both nodes. (image)

Additional documentation e.g., usage docs, etc.:


krihy avatar May 11 '24 11:05 krihy

Please add more info to tell the reviewer how this issue happens and how you solve it. I can get the issue and solution by your code @krihy

yhilmare avatar May 13 '24 09:05 yhilmare

I cannot accept your solution:

  1. StartPreparingJob is a daemon job on every node; we'd better not disable it on a specific node.
  2. We already have a rate limiter, MonitorProcessRateLimiter; why does it work?
  3. Will other nodes be affected if you call releaseAcquiredTrigger in ResourceDetectJobStore?

yhilmare avatar May 13 '24 09:05 yhilmare

In cluster mode, all Quartz nodes compete for the trigger, and it is fired exactly once. If a node with no available resources acquires the trigger and fires it, the job will not be started by StartPreparingJob in the QuartzSchedulerThread while loop, and the other nodes with enough resources will not acquire the trigger.

krihy avatar May 13 '24 10:05 krihy

As we discussed before, in cluster mode each node should run StartPreparingJob.

yizhouxw avatar May 13 '24 10:05 yizhouxw

So, why don't we rely on MonitorProcessRateLimiter instead of disabling StartPreparingJob?

yhilmare avatar May 13 '24 11:05 yhilmare