[Improvement][MasterWorker] Self-recovery when a master or worker loses connection to the registry center
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Background
Currently, if a master or worker loses its zk connection, it does not discover this immediately; it only finds out when it checks the dead server list. And once it learns that it has been judged a dead server, it stops itself without attempting self-recovery.
Proposal

When master lost zk connection:
- update current server state to `wait reconnect`
- send server lost connection alert
- keep quartz working (scheduling continues to work normally through quartz and the db)
- stop accepting new requests
- stop handling commands and process instances, and clear the local running process instances (they will be taken over by another master)
- wait to reconnect
- when it reconnects successfully, send server recover alert, update server state to `normal` and resume working
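
The master steps above can be sketched as a small state machine. This is only an illustration of the proposal, not DolphinScheduler code: `MasterState`, `MasterRecoverySketch` and all of its methods are hypothetical names, alerts are just recorded in a list, and the registry callbacks (`onConnectionLost`, `onReconnected`) are assumed to be invoked by whatever connection listener the registry client provides.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical names throughout; this is not DolphinScheduler's actual API.
enum MasterState { NORMAL, WAIT_RECONNECT }

class MasterRecoverySketch {
    private MasterState state = MasterState.NORMAL;
    // Ids of process instances this master is currently running locally.
    private final Set<Integer> runningProcessInstances = new HashSet<>();
    // Stand-in for an alert sender: alerts are only recorded here.
    private final List<String> alerts = new ArrayList<>();

    // Accept a new process instance only while the state is NORMAL.
    synchronized boolean submit(int processInstanceId) {
        if (state != MasterState.NORMAL) {
            return false;
        }
        runningProcessInstances.add(processInstanceId);
        return true;
    }

    // Assumed to be called by the registry client when the connection drops.
    synchronized void onConnectionLost() {
        if (state == MasterState.WAIT_RECONNECT) {
            return; // already waiting, do not alert twice
        }
        state = MasterState.WAIT_RECONNECT;
        alerts.add("server lost connection");
        // Drop local work; another master is expected to take it over.
        runningProcessInstances.clear();
    }

    // Assumed to be called by the registry client after a successful reconnect.
    synchronized void onReconnected() {
        if (state == MasterState.NORMAL) {
            return;
        }
        alerts.add("server recovered");
        state = MasterState.NORMAL;
    }

    synchronized MasterState state() { return state; }
    synchronized int runningCount() { return runningProcessInstances.size(); }
    synchronized List<String> alerts() { return new ArrayList<>(alerts); }
}
```

Quartz is deliberately absent from the sketch, since the proposal keeps it running regardless of the connection state.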

When worker lost zk connection:
- update current server state to `wait reconnect`
- send server lost connection alert
- kill the running tasks (they will be taken over by the master and rerun)
- stop accepting new requests
- wait to reconnect within a certain time
- if the reconnect times out, stop itself
- when it reconnects successfully, send server recover alert, update server state to `normal` and resume working
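
The worker differs from the master mainly in the reconnect timeout. A minimal sketch of that behaviour, under the same caveats: `WorkerState`, `WorkerRecoverySketch`, the `reconnectTimeoutMillis` parameter and every method name are hypothetical, killing tasks is reduced to clearing a set, and the clock is passed in explicitly so the timeout logic is easy to follow.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical names throughout; this is not DolphinScheduler's actual API.
enum WorkerState { NORMAL, WAIT_RECONNECT, STOPPED }

class WorkerRecoverySketch {
    private final long reconnectTimeoutMillis;
    private WorkerState state = WorkerState.NORMAL;
    private long lostAtMillis;
    private final Set<Integer> runningTasks = new HashSet<>();
    // Stand-in for an alert sender: alerts are only recorded here.
    private final List<String> alerts = new ArrayList<>();

    WorkerRecoverySketch(long reconnectTimeoutMillis) {
        this.reconnectTimeoutMillis = reconnectTimeoutMillis;
    }

    // Accept a new task only while the state is NORMAL.
    synchronized boolean accept(int taskId) {
        if (state != WorkerState.NORMAL) {
            return false;
        }
        runningTasks.add(taskId);
        return true;
    }

    // Assumed to be called when the registry connection drops.
    synchronized void onConnectionLost(long nowMillis) {
        if (state != WorkerState.NORMAL) {
            return;
        }
        state = WorkerState.WAIT_RECONNECT;
        lostAtMillis = nowMillis;
        alerts.add("server lost connection");
        // Stands in for killing tasks; the master is expected to rerun them.
        runningTasks.clear();
    }

    // Assumed to be called periodically while waiting for a reconnect.
    synchronized void checkTimeout(long nowMillis) {
        if (state == WorkerState.WAIT_RECONNECT
                && nowMillis - lostAtMillis >= reconnectTimeoutMillis) {
            state = WorkerState.STOPPED;
        }
    }

    // Assumed to be called after a successful reconnect.
    synchronized void onReconnected(long nowMillis) {
        checkTimeout(nowMillis); // a reconnect after the deadline does not revive the worker
        if (state != WorkerState.WAIT_RECONNECT) {
            return;
        }
        alerts.add("server recovered");
        state = WorkerState.NORMAL;
    }

    synchronized WorkerState state() { return state; }
    synchronized int runningCount() { return runningTasks.size(); }
    synchronized List<String> alerts() { return new ArrayList<>(alerts); }
}
```

Treating a late reconnect as a stop rather than a recovery is a choice made for this sketch; the proposal itself only says the worker stops itself on timeout.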
Originally posted by @caishunfeng in https://github.com/apache/dolphinscheduler/discussions/6643#discussioncomment-1706255
Related issues
#7004
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Hi:
- Thank you for your feedback. We have received your issue; please wait patiently for a reply.
- So that we can understand your request as soon as possible, please provide detailed information, the version, or pictures.
- If you haven't received a reply for a long time, you can subscribe to the developer mailing list (subscription steps: https://dolphinscheduler.apache.org/en-us/community/development/subscribe.html), then write the issue URL in the email content and send your question to [email protected].
please assign to me
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.