
[Improvement][MasterWorker] Self-recovery when master or worker lost connection from registry center

Open caishunfeng opened this issue 4 years ago • 4 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Background

Currently, if a master or worker loses its ZooKeeper (zk) connection, it does not discover this immediately; it only finds out when it checks the dead server list. Once it learns that it has been judged a dead server, it stops itself, with no self-recovery.

Proposal

When the master loses its zk connection:

  1. update the current server state to "waiting to reconnect"
  2. send a "server lost connection" alert
  3. keep Quartz working (Quartz and the database ensure that scheduling continues normally)
  4. stop accepting new requests
  5. stop handling commands and process instances, and clear the local running process instances (they will be taken over by other masters)
  6. wait to reconnect
  7. on successful reconnect, send a "server recovered" alert, update the server state to normal, and resume working
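
The master-side steps above could be sketched roughly as the small state machine below. This is only an illustration of the proposal, not DolphinScheduler code: `MasterState`, `MasterRecoverySketch`, and all method names are hypothetical, alerts are recorded in a list instead of being sent, and step 6 (waiting) is assumed to be driven by a reconnect callback from the registry client.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical server states; the names are illustrative only. */
enum MasterState { RUNNING, WAITING_RECONNECT }

/** Hypothetical sketch of the proposed master behaviour on registry disconnect. */
class MasterRecoverySketch {
    private MasterState state = MasterState.RUNNING;
    private boolean acceptingRequests = true;
    private final List<Integer> localProcessInstanceIds = new ArrayList<>();
    private final List<String> alerts = new ArrayList<>();

    /** Accepts a command only while the master is in the normal state. */
    synchronized boolean acceptCommand(int processInstanceId) {
        if (!acceptingRequests) {
            return false;
        }
        localProcessInstanceIds.add(processInstanceId);
        return true;
    }

    /** Assumed to be called by the registry client when the connection drops. */
    synchronized void onConnectionLost() {
        if (state == MasterState.WAITING_RECONNECT) {
            return; // already handling a disconnect
        }
        state = MasterState.WAITING_RECONNECT;   // step 1
        alerts.add("server lost connection");    // step 2 (stand-in for a real alert)
        // step 3: Quartz is deliberately left untouched
        acceptingRequests = false;               // step 4
        localProcessInstanceIds.clear();         // step 5: other masters take over
        // step 6: nothing to do here; we wait for onReconnected()
    }

    /** Assumed to be called by the registry client after a successful reconnect. */
    synchronized void onReconnected() {
        if (state != MasterState.WAITING_RECONNECT) {
            return;
        }
        alerts.add("server recovered");          // step 7
        state = MasterState.RUNNING;
        acceptingRequests = true;
    }

    synchronized MasterState state() { return state; }
    synchronized int localProcessInstanceCount() { return localProcessInstanceIds.size(); }
    synchronized List<String> alerts() { return new ArrayList<>(alerts); }
}
```

Making both callbacks idempotent means repeated disconnect or reconnect notifications from the registry client would not send duplicate alerts.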

When a worker loses its zk connection:

  1. update the current server state to "waiting to reconnect"
  2. send a "server lost connection" alert
  3. kill the running tasks (they will be taken over by the master and rerun)
  4. stop accepting new requests
  5. wait to reconnect within a certain time
  6. if the reconnect times out, stop itself
  7. on successful reconnect, send a "server recovered" alert, update the server state to normal, and resume working
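
The worker-side steps differ mainly in the reconnect timeout, which could be sketched as follows. Again this is a hypothetical illustration, not DolphinScheduler code: `WorkerState`, `WorkerRecoverySketch`, and the `reconnectTimeout` parameter are assumed names, killing tasks is represented by clearing a list, and the current time is passed in explicitly so the timeout logic stays easy to test.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical worker states; the names are illustrative only. */
enum WorkerState { RUNNING, WAITING_RECONNECT, STOPPED }

/** Hypothetical sketch of the proposed worker behaviour on registry disconnect. */
class WorkerRecoverySketch {
    private final Duration reconnectTimeout; // assumed to be a configurable value
    private final List<Integer> runningTaskIds = new ArrayList<>();
    private final List<String> alerts = new ArrayList<>();
    private WorkerState state = WorkerState.RUNNING;
    private Instant lostAt;

    WorkerRecoverySketch(Duration reconnectTimeout) {
        this.reconnectTimeout = reconnectTimeout;
    }

    /** Accepts a task only while the worker is in the normal state (step 4). */
    synchronized boolean submitTask(int taskId) {
        if (state != WorkerState.RUNNING) {
            return false;
        }
        runningTaskIds.add(taskId);
        return true;
    }

    /** Assumed to be called by the registry client when the connection drops. */
    synchronized void onConnectionLost(Instant now) {
        if (state != WorkerState.RUNNING) {
            return;
        }
        state = WorkerState.WAITING_RECONNECT;  // step 1
        alerts.add("server lost connection");   // step 2 (stand-in for a real alert)
        runningTaskIds.clear();                 // step 3: stand-in for killing tasks
        lostAt = now;                           // step 5: start the reconnect window
    }

    /** Assumed to be called periodically; stops the worker once the window has elapsed. */
    synchronized void checkTimeout(Instant now) {
        if (state == WorkerState.WAITING_RECONNECT
                && Duration.between(lostAt, now).compareTo(reconnectTimeout) > 0) {
            state = WorkerState.STOPPED;        // step 6
        }
    }

    /** Assumed to be called by the registry client after a successful reconnect. */
    synchronized void onReconnected() {
        if (state != WorkerState.WAITING_RECONNECT) {
            return; // a stopped worker does not come back
        }
        alerts.add("server recovered");         // step 7
        state = WorkerState.RUNNING;
        lostAt = null;
    }

    synchronized WorkerState state() { return state; }
    synchronized int runningTaskCount() { return runningTaskIds.size(); }
    synchronized List<String> alerts() { return new ArrayList<>(alerts); }
}
```

In this sketch `STOPPED` is terminal: a reconnect that arrives after the timeout is ignored, which matches step 6 coming before step 7 in the list.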

Originally posted by @caishunfeng in https://github.com/apache/dolphinscheduler/discussions/6643#discussioncomment-1706255

Related issues

#7004

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

caishunfeng avatar Nov 27 '21 04:11 caishunfeng

Hi:

  • Thank you for your feedback. We have received your issue; please wait patiently for a reply.
  • To help us understand your request as quickly as possible, please provide detailed information, the version, or pictures.
  • If you haven't received a reply for a long time, you can subscribe to the developer mailing list (subscription steps: https://dolphinscheduler.apache.org/en-us/community/development/subscribe.html), then write the issue URL in the email body and send your question to [email protected].

github-actions[bot] avatar Nov 27 '21 04:11 github-actions[bot]

please assign to me

shangeyao avatar Dec 07 '21 02:12 shangeyao

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Jan 07 '22 00:01 github-actions[bot]

This issue has been closed because it has not received a response for too long. You can reopen it if you encounter similar problems in the future.

github-actions[bot] avatar Jan 15 '22 00:01 github-actions[bot]