dolphinscheduler [BUG] Master server accidently shutdown every day

Describe the bug When zookeeper connection timeout and reconnect to zookeeper, master always down. but zookeeper can be connected (you can see the successful connection in the log.

From the log, it seems master is using the existing zookpeer session object to do the reconnection but those session objects has expired, this lead to master consider zk can not be connected and then shutdown itself. and after that, master open the new zookeeper session which is connected successfully but master still shutdown itself.

Expected behavior I feel this is a bug due to reuse the expired zookeeper session.

Screenshots Please see the log in detail

dolphinscheduler-master.log the bug is.

Which version of Dolphin Scheduler: -[1.2.1]

Jun 18 '20 01:06 GabrielWithTina

BTW. worker server has no any zookeeper session expired log and run with no any issues at the same time.

Jun 18 '20 01:06 GabrielWithTina

For each expired session, the log behavor:

close socket and reconnect [WARN] 2020-06-18 02:20:34.110 org.apache.zookeeper.ClientCnxn:[1108] - Client session timed out, have not heard from server in 61353ms for sessionid 0x572ab3af9972208 [INFO] 2020-06-18 02:20:34.318 org.apache.zookeeper.ClientCnxn:[1156] - Client session timed out, have not heard from server in 61353ms for sessionid 0x572ab3af9972208, closing socket connection and attempting reconnect
unable to reconnect due to session expired [INFO] 2020-06-18 02:20:36.107 org.apache.zookeeper.ClientCnxn:[879] - Socket connection established to hadoop243/192.192.192.243:2181, initiating session [WARN] 2020-06-18 02:20:36.108 org.apache.zookeeper.ClientCnxn:[1285] - Unable to reconnect to ZooKeeper service, session 0x572ab3af9972208 has expired [INFO] 2020-06-18 02:20:36.108 org.apache.curator.framework.state.ConnectionStateManager:[228] - State change: LOST [WARN] 2020-06-18 02:20:36.108 org.apache.curator.ConnectionState:[336] - Session expired event received [INFO] 2020-06-18 02:20:36.108 org.apache.zookeeper.ClientCnxn:[1154] - Unable to reconnect to ZooKeeper service, session 0x572ab3af9972208 has expired, closing socket connection [INFO] 2020-06-18 02:20:36.200 org.apache.zookeeper.ClientCnxn:[522] - EventThread shut down for session: 0x572ab3af9972208
open new sesion to reconnect [INFO] 2020-06-18 02:20:37.118 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server hadoop243/192.192.192.243:2181. Will not attempt to authenticate using SASL (unknown error) [INFO] 2020-06-18 02:20:37.118 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server hadoop242/192.192.192.242:2181. Will not attempt to authenticate using SASL (unknown error) [INFO] 2020-06-18 02:20:37.119 org.apache.zookeeper.ClientCnxn:[879] - Socket connection established to hadoop242/192.192.192.242:2181, initiating session [INFO] 2020-06-18 02:20:37.119 org.apache.zookeeper.ClientCnxn:[879] - Socket connection established to hadoop243/192.192.192.243:2181, initiating session [INFO] 2020-06-18 02:20:37.417 org.apache.zookeeper.ClientCnxn:[1299] - Session establishment complete on server hadoop243/192.192.192.243:2181, sessionid = 0x472c1a29a0602fd, negotiated timeout = 60000 [INFO] 2020-06-18 02:20:37.417 org.apache.curator.framework.state.ConnectionStateManager:[228] - State change: RECONNECTED [INFO] 2020-06-18 02:20:38.882 org.apache.zookeeper.ClientCnxn:[1299] - Session establishment complete on server hadoop242/192.192.192.242:2181, sessionid = 0x372ab3af9952528, negotiated timeout = 60000
master shutdown [INFO] 2020-06-18 02:20:38.882 org.apache.dolphinscheduler.server.master.MasterServer:[180] - master server is stopping ..., cause : i was judged to death, release resources and stop myself

Jun 18 '20 01:06 GabrielWithTina

here is my zookeeper settting: #dolphinscheduler failover directory zookeeper.session.timeout=60000 zookeeper.connection.timeout=60000 zookeeper.retry.base.sleep=2000 zookeeper.retry.max.sleep=120000 zookeeper.retry.maxtime=29

Jun 18 '20 02:06 GabrielWithTina

@lenboo

Jun 18 '20 09:06 xingchun-chen

i have the same issue. about 2:15am everyday,the master and worker node shut down.

Nov 13 '20 06:11 Sam--Shan

any progress?

Sep 16 '21 09:09 HarborZeng

Does this fix in dev branch @lenboo @caishunfeng

Jan 04 '22 10:01 zhongjiajie

I meet the same error with 2.0.5, any idea? thx.

May 21 '22 07:05 bluerocly

I meet the same error with 2.0.2

Aug 04 '22 00:08 dik111

Me too. In 1.3.6

Aug 04 '22 03:08 ryze124

BTW, zookeeper log: 2022-08-03 22:09:49,833 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:231) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203) at java.lang.Thread.run(Thread.java:748)

Aug 04 '22 03:08 ryze124

Me too. In 1.3.6。

Sep 19 '22 01:09 chengqiZ

Can you see the load of the OS at the point of Master Server shutdown? If the OS has a high load for a long time and the master cannot send heartbeat to zk, the master will think that it has lost contact and kill itself.

Sep 20 '22 12:09 brave-lee

Since this is a historical issue and there is no specific details, I close this issue. If there is more detailed information, you can reopen it or create new one.

Sep 26 '22 01:09 brave-lee