[BUG] Master server accidently shutdown every day
Describe the bug When zookeeper connection timeout and reconnect to zookeeper, master always down. but zookeeper can be connected (you can see the successful connection in the log.
From the log, it seems master is using the existing zookpeer session object to do the reconnection but those session objects has expired, this lead to master consider zk can not be connected and then shutdown itself. and after that, master open the new zookeeper session which is connected successfully but master still shutdown itself.
Expected behavior I feel this is a bug due to reuse the expired zookeeper session.
Screenshots Please see the log in detail
dolphinscheduler-master.log the bug is.
Which version of Dolphin Scheduler: -[1.2.1]
BTW. worker server has no any zookeeper session expired log and run with no any issues at the same time.
For each expired session, the log behavor:
-
close socket and reconnect [WARN] 2020-06-18 02:20:34.110 org.apache.zookeeper.ClientCnxn:[1108] - Client session timed out, have not heard from server in 61353ms for sessionid 0x572ab3af9972208 [INFO] 2020-06-18 02:20:34.318 org.apache.zookeeper.ClientCnxn:[1156] - Client session timed out, have not heard from server in 61353ms for sessionid 0x572ab3af9972208, closing socket connection and attempting reconnect
-
unable to reconnect due to session expired [INFO] 2020-06-18 02:20:36.107 org.apache.zookeeper.ClientCnxn:[879] - Socket connection established to hadoop243/192.192.192.243:2181, initiating session [WARN] 2020-06-18 02:20:36.108 org.apache.zookeeper.ClientCnxn:[1285] - Unable to reconnect to ZooKeeper service, session 0x572ab3af9972208 has expired [INFO] 2020-06-18 02:20:36.108 org.apache.curator.framework.state.ConnectionStateManager:[228] - State change: LOST [WARN] 2020-06-18 02:20:36.108 org.apache.curator.ConnectionState:[336] - Session expired event received [INFO] 2020-06-18 02:20:36.108 org.apache.zookeeper.ClientCnxn:[1154] - Unable to reconnect to ZooKeeper service, session 0x572ab3af9972208 has expired, closing socket connection [INFO] 2020-06-18 02:20:36.200 org.apache.zookeeper.ClientCnxn:[522] - EventThread shut down for session: 0x572ab3af9972208
-
open new sesion to reconnect [INFO] 2020-06-18 02:20:37.118 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server hadoop243/192.192.192.243:2181. Will not attempt to authenticate using SASL (unknown error) [INFO] 2020-06-18 02:20:37.118 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server hadoop242/192.192.192.242:2181. Will not attempt to authenticate using SASL (unknown error) [INFO] 2020-06-18 02:20:37.119 org.apache.zookeeper.ClientCnxn:[879] - Socket connection established to hadoop242/192.192.192.242:2181, initiating session [INFO] 2020-06-18 02:20:37.119 org.apache.zookeeper.ClientCnxn:[879] - Socket connection established to hadoop243/192.192.192.243:2181, initiating session [INFO] 2020-06-18 02:20:37.417 org.apache.zookeeper.ClientCnxn:[1299] - Session establishment complete on server hadoop243/192.192.192.243:2181, sessionid = 0x472c1a29a0602fd, negotiated timeout = 60000 [INFO] 2020-06-18 02:20:37.417 org.apache.curator.framework.state.ConnectionStateManager:[228] - State change: RECONNECTED [INFO] 2020-06-18 02:20:38.882 org.apache.zookeeper.ClientCnxn:[1299] - Session establishment complete on server hadoop242/192.192.192.242:2181, sessionid = 0x372ab3af9952528, negotiated timeout = 60000
-
master shutdown [INFO] 2020-06-18 02:20:38.882 org.apache.dolphinscheduler.server.master.MasterServer:[180] - master server is stopping ..., cause : i was judged to death, release resources and stop myself
here is my zookeeper settting: #dolphinscheduler failover directory zookeeper.session.timeout=60000 zookeeper.connection.timeout=60000 zookeeper.retry.base.sleep=2000 zookeeper.retry.max.sleep=120000 zookeeper.retry.maxtime=29
@lenboo
i have the same issue. about 2:15am everyday,the master and worker node shut down.

any progress?
Does this fix in dev branch @lenboo @caishunfeng
I meet the same error with 2.0.5, any idea? thx.
I meet the same error with 2.0.2
Me too. In 1.3.6
BTW, zookeeper log: 2022-08-03 22:09:49,833 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:231) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203) at java.lang.Thread.run(Thread.java:748)
Me too. In 1.3.6。
Can you see the load of the OS at the point of Master Server shutdown? If the OS has a high load for a long time and the master cannot send heartbeat to zk, the master will think that it has lost contact and kill itself.
Since this is a historical issue and there is no specific details, I close this issue. If there is more detailed information, you can reopen it or create new one.