ZOOKEEPER-4409 NullPointerException in SendAckRequestProcessor
Thanks @zouyonghao, the diff looks good to me, do you mind to share the context where you see this happen? also, please add a simple unit test for this.
Hi @lvfangmin , I got this with a fault injection test including file and network errors. And a simple test case added.
I got this with a fault injection test including file and network errors.
@zouyonghao What injection tool did you use? How can we reproduce this issue? Finding a approach to reproduce this issue will give us a deeper understanding for this bug. One possible transmission mechanism may be(A guess):
the network from follower to leader has broken ---> Follower#followLeader ---> Learner#closeSockSync(set sock = null) ---> (at the same time, the write requests issued to follower) ---> SendAckRequestProcessor#processRequest ---> learner.writePacket(throw IOException due to network issue) ---> learner.sock.close()
@maoling The tool is developed in a research project which has not been published, and this bug can be reproduced by this tool.
I checked the log and it seems your guess is correct.
2021-11-30 16:57:02,955 [myid:2] - WARN [QuorumPeer[myid=2](plain=0.0.0.0:2182)(secure=disabled):Follower@131] - Exception when following the leader
java.net.SocketException: Invalid argument (Write failed)
at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at org.apache.zookeeper.server.quorum.Learner.writePacketNow(Learner.java:203)
at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:192)
at org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:850)
at org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:163)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:128)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1514)
2021-11-30 16:57:02,958 [myid:2] - INFO [QuorumPeer[myid=2](plain=0.0.0.0:2182)(secure=disabled):Follower@145] - Disconnected from leader (with address: /127.0.1.1:2892). Was connected for 2036ms. Sync state: true
2021-11-30 16:57:02,959 [myid:2] - WARN [SyncThread:2:SendAckRequestProcessor@65] - Closing connection to leader, exception during packet send
java.net.SocketException: Socket closed
at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at org.apache.zookeeper.server.quorum.Learner.writePacketNow(Learner.java:203)
at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:192)
at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:63)
at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:248)
at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:169)
2021-11-30 16:57:02,961 [myid:2] - ERROR [SyncThread:2:ZooKeeperCriticalThread@49] - Severe unrecoverable error, from thread : SyncThread:2
java.lang.NullPointerException
at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:67)
at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:248)
at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:169)
@zouyonghao In fact, it's not easy for me to reproduce this issue. Sometimes NPE is a symptom not the root cause(e.g., concurrent issue or timing disorder). If you can provide a public, relativity stable approach to reproduce this issue, we can merge this PR asap.
Assuming the cause is as described in https://github.com/apache/zookeeper/pull/1774#issuecomment-993437244 , then I'm not sure this patch would completely prevent NPEs. With this patch, we could have 2 independent threads concurrently calling Learner#closeSockSync():
void closeSockSync() {
try {
long startTime = Time.currentElapsedTime();
if (sock != null) {
sock.close();
sock = null;
}
ServerMetrics.getMetrics().SOCKET_CLOSING_TIME.add(Time.currentElapsedTime() - startTime);
} catch (IOException e) {
LOG.warn("Ignoring error closing connection to leader", e);
}
}
I don't believe there is any locking in effect around this operation, so it would be possible that both thread 1 and thread 2 make it past the if (sock != null) check, thread 1 closes and nulls the variable, and then thread 2 gets the NPE when it attempts sock.close().
@cnauroth You are right, maybe we should make this method synchronized?
Hi @maolin and all in this thread, my fuzzing framework now supports record/replay with the support of rr debugger! I have recorded this bug with my modified rr version, you can download the file and try it with rr replay rr_rec_0_0 on Ubuntu 22.04.2 LTS.
rr_rec_0_0.zip
By the way, I want to know your opinions about this feature :)
Hi,
I wonder if someone can continue the review of this issue which has been stuck for a while.
Besides, is it possible to request a CVE for this issue, which would benefit my research greatly.
Maybe it's not a threads concurrently issue,just network broken from follower to leader (As @maoling guessed,I think so too),learner.sock has been closed or assigned to null in Follower#followLeader or somewhere,It is possible so i think just check if (learner.sock != null) in SendAckRequestProcessor#flush,It's compatible and not bad more due to SendAckRequestProcessor#flush has been catched the NPE,but an exceptions always confusing. BTW,make Learner#closeSockSync() is synchronized doesn't make any sense(not sure Learner with in race condition so far)
Ok @w6et , it looks fine now.
@anmolnar fyi
Hey guys @eolivelli @lvfangmin @maoling @anmolnar , can we merge this?
I've run into the same issue with my clickhouse cluster, occasionally I'll get a few of these from the same zookeeper instance
I have seen this problem today in some productrion cluster.
@zouyonghao I haven't noticed that the patch is not targeting the "master" branch. Could you please create a new PR against the "master" branch ? Then we will cherry-pick to the active branches
@eolivelli Hi, I created a new PR https://github.com/apache/zookeeper/pull/2163
Hi @eolivelli , I was wondering if you actually got a repro for it and could try applying the patch in a test environment. I think last time we looked at this patch, no one had a repro.
Closing this since it's merged into master https://github.com/apache/zookeeper/pull/2163