zookeeper ZOOKEEPER-4409 NullPointerException in SendAckRequestProcessor

Nov 01 '21 09:11 zouyonghao

Thanks @zouyonghao, the diff looks good to me, do you mind to share the context where you see this happen? also, please add a simple unit test for this.

Nov 05 '21 18:11 lvfangmin

Hi @lvfangmin , I got this with a fault injection test including file and network errors. And a simple test case added.

Nov 23 '21 08:11 zouyonghao

I got this with a fault injection test including file and network errors.

@zouyonghao What injection tool did you use? How can we reproduce this issue? Finding a approach to reproduce this issue will give us a deeper understanding for this bug. One possible transmission mechanism may be(A guess):

the network from follower to leader has broken ---> Follower#followLeader ---> Learner#closeSockSync(set sock = null) ---> (at the same time, the write requests issued to follower) ---> SendAckRequestProcessor#processRequest ---> learner.writePacket(throw IOException due to network issue) ---> learner.sock.close()

Dec 14 '21 11:12 maoling

@maoling The tool is developed in a research project which has not been published, and this bug can be reproduced by this tool.

I checked the log and it seems your guess is correct.

2021-11-30 16:57:02,955 [myid:2] - WARN  [QuorumPeer[myid=2](plain=0.0.0.0:2182)(secure=disabled):Follower@131] - Exception when following the leader
java.net.SocketException: Invalid argument (Write failed)
        at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
        at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
        at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
        at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
        at org.apache.zookeeper.server.quorum.Learner.writePacketNow(Learner.java:203)
        at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:192)
        at org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:850)
        at org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:163)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:128)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1514)
2021-11-30 16:57:02,958 [myid:2] - INFO  [QuorumPeer[myid=2](plain=0.0.0.0:2182)(secure=disabled):Follower@145] - Disconnected from leader (with address: /127.0.1.1:2892). Was connected for 2036ms. Sync state: true
2021-11-30 16:57:02,959 [myid:2] - WARN  [SyncThread:2:SendAckRequestProcessor@65] - Closing connection to leader, exception during packet send
java.net.SocketException: Socket closed
        at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
        at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
        at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
        at org.apache.zookeeper.server.quorum.Learner.writePacketNow(Learner.java:203)
        at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:192)
        at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:63)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:248)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:169)
2021-11-30 16:57:02,961 [myid:2] - ERROR [SyncThread:2:ZooKeeperCriticalThread@49] - Severe unrecoverable error, from thread : SyncThread:2
java.lang.NullPointerException
        at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:67)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:248)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:169)

Dec 15 '21 09:12 zouyonghao

@zouyonghao In fact, it's not easy for me to reproduce this issue. Sometimes NPE is a symptom not the root cause(e.g., concurrent issue or timing disorder). If you can provide a public, relativity stable approach to reproduce this issue, we can merge this PR asap.

Dec 24 '21 10:12 maoling

Assuming the cause is as described in https://github.com/apache/zookeeper/pull/1774#issuecomment-993437244 , then I'm not sure this patch would completely prevent NPEs. With this patch, we could have 2 independent threads concurrently calling Learner#closeSockSync():

    void closeSockSync() {
        try {
            long startTime = Time.currentElapsedTime();
            if (sock != null) {
                sock.close();
                sock = null;
            }
            ServerMetrics.getMetrics().SOCKET_CLOSING_TIME.add(Time.currentElapsedTime() - startTime);
        } catch (IOException e) {
            LOG.warn("Ignoring error closing connection to leader", e);
        }
    }

I don't believe there is any locking in effect around this operation, so it would be possible that both thread 1 and thread 2 make it past the if (sock != null) check, thread 1 closes and nulls the variable, and then thread 2 gets the NPE when it attempts sock.close().

Oct 28 '22 22:10 cnauroth

@cnauroth You are right, maybe we should make this method synchronized?

Apr 14 '23 08:04 zouyonghao

Hi @maolin and all in this thread, my fuzzing framework now supports record/replay with the support of rr debugger! I have recorded this bug with my modified rr version, you can download the file and try it with rr replay rr_rec_0_0 on Ubuntu 22.04.2 LTS. rr_rec_0_0.zip

By the way, I want to know your opinions about this feature :)

Jun 07 '23 16:06 zouyonghao

Hi,

I wonder if someone can continue the review of this issue which has been stuck for a while.

Besides, is it possible to request a CVE for this issue, which would benefit my research greatly.

Jun 28 '23 03:06 zouyonghao

Maybe it's not a threads concurrently issue,just network broken from follower to leader (As @maoling guessed，I think so too)，learner.sock has been closed or assigned to null in Follower#followLeader or somewhere,It is possible so i think just check if (learner.sock != null) in SendAckRequestProcessor#flush,It's compatible and not bad more due to SendAckRequestProcessor#flush has been catched the NPE,but an exceptions always confusing. BTW,make Learner#closeSockSync() is synchronized doesn't make any sense(not sure Learner with in race condition so far)

Jun 29 '23 18:06 w6et

Ok @w6et , it looks fine now.

Jul 17 '23 11:07 zouyonghao

@anmolnar fyi

Jul 20 '23 22:07 eolivelli

Hey guys @eolivelli @lvfangmin @maoling @anmolnar , can we merge this?

Oct 20 '23 13:10 zouyonghao

I've run into the same issue with my clickhouse cluster, occasionally I'll get a few of these from the same zookeeper instance

Dec 03 '23 22:12 Savid

I have seen this problem today in some productrion cluster.

@zouyonghao I haven't noticed that the patch is not targeting the "master" branch. Could you please create a new PR against the "master" branch ? Then we will cherry-pick to the active branches

May 01 '24 07:05 eolivelli

@eolivelli Hi, I created a new PR https://github.com/apache/zookeeper/pull/2163

May 01 '24 09:05 zouyonghao

Hi @eolivelli , I was wondering if you actually got a repro for it and could try applying the patch in a test environment. I think last time we looked at this patch, no one had a repro.

May 01 '24 17:05 cnauroth

Closing this since it's merged into master https://github.com/apache/zookeeper/pull/2163

May 02 '24 11:05 zouyonghao