
HDFS-17357. NioInetPeer.close() should close socket connection.

Status: Open. LiuGuH opened this issue 2 years ago • 6 comments

Description of PR

JIRA: HDFS-17357

NioInetPeer.close() currently does not close the socket connection.

I found more than 30,000 leaked connections on a DataNode, along with many warning messages like the following:

2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: hostname:50010:DataXceiverServer
java.io.IOException: Xceiver count 8198 exceeds the limit of concurrent xcievers: 8192
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:234)
        at java.lang.Thread.run(Thread.java:748)

When any exception is caught in DataXceiverServer, it closes the peer with IOUtils.closeStream(), which results in the following call chain:

IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()

However, NioInetPeer.close() does not close the socket connection, and this leads to connection leakage.
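To make this failure path concrete, here is a simplified, self-contained Java model of it. It is not the actual DataXceiverServer code: FakeSocket, LeakyPeer, ClosingPeer and AcceptLoopModel are hypothetical names, closeQuietly only imitates the exception-swallowing behaviour of IOUtils.closeStream, and the numbers 8198 and 8192 are taken from the log above. The model shows that when a quiet close is the only cleanup on the rejection path, anything Peer.close() does not release stays open.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for a TCP connection; it only tracks whether it is open.
final class FakeSocket implements Closeable {
  static int openCount = 0; // FakeSocket instances not yet closed
  private boolean open = true;

  FakeSocket() {
    openCount++;
  }

  @Override
  public void close() {
    if (open) {
      open = false;
      openCount--;
    }
  }
}

// Models a peer whose close() would release its streams but NOT the socket.
final class LeakyPeer implements Closeable {
  final FakeSocket socket = new FakeSocket();

  @Override
  public void close() {
    // streams would be closed here; the socket is left open
  }
}

// Models a peer whose close() also closes the socket.
final class ClosingPeer implements Closeable {
  final FakeSocket socket = new FakeSocket();

  @Override
  public void close() {
    socket.close();
  }
}

final class AcceptLoopModel {
  // Same message shape as the DataNode warning; the limit is just a parameter here.
  static void checkXceiverCount(int current, int max) throws IOException {
    if (current > max) {
      throw new IOException("Xceiver count " + current
          + " exceeds the limit of concurrent xcievers: " + max);
    }
  }

  // Quiet close in the spirit of IOUtils.closeStream: IOExceptions are swallowed.
  static void closeQuietly(Closeable c) {
    if (c == null) {
      return;
    }
    try {
      c.close();
    } catch (IOException ignored) {
      // swallowed, so nobody else gets a chance to release the resource
    }
  }

  // Simulates `attempts` rejected connections; returns how many remain open.
  static int rejectAll(int attempts, boolean leaky) {
    int before = FakeSocket.openCount;
    for (int i = 0; i < attempts; i++) {
      Closeable peer = leaky ? new LeakyPeer() : new ClosingPeer();
      try {
        checkXceiverCount(8198, 8192); // always over the limit in this model
      } catch (IOException e) {
        closeQuietly(peer); // the only cleanup the rejected peer gets
      }
    }
    return FakeSocket.openCount - before;
  }
}
```

In this model, AcceptLoopModel.rejectAll(100, true) leaves 100 connections open, while rejectAll(100, false) leaves none.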

The close() methods of the other Peer subclasses (EncryptedPeer, DomainPeer, BasicInetPeer) are implemented with socket.close().
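A close() that ends with socket.close(), as described for the other Peer subclasses, might look like the sketch below. SocketClosingPeer and its fields are hypothetical and built on a plain java.net.Socket rather than Hadoop's NIO socket streams; the point is the nested try/finally, which guarantees that socket.close() runs even if closing one of the streams throws.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

// Hypothetical peer used only to illustrate the close() ordering.
final class SocketClosingPeer implements Closeable {
  private final Socket socket;
  private final InputStream in;
  private final OutputStream out;

  SocketClosingPeer(Socket socket) throws IOException {
    this.socket = socket;
    this.in = socket.getInputStream();
    this.out = socket.getOutputStream();
  }

  Socket socket() {
    return socket;
  }

  @Override
  public void close() throws IOException {
    // Close the streams first; the nested finally blocks guarantee that
    // socket.close() still runs if either stream close throws.
    try {
      in.close();
    } finally {
      try {
        out.close();
      } finally {
        socket.close();
      }
    }
  }
}
```

With a plain java.net.Socket, closing the socket's input stream already closes the socket, and closing an already closed socket has no effect; the explicit socket.close() makes the release independent of that side effect.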

The problem can be reproduced as follows:
(1) A client writes data to HDFS.
(2) The DataNode's Xceiver count reaches the limit set by DFS_DATANODE_MAX_RECEIVER_THREADS_KEY, so the new Xceiver fails and throws an IOException, and its socket is not released.
(3) The client crashes, so no new data is written and client.close() is never executed.
(4) The socket connection between the DataNodes leaks.

The leaked connection looks like this on the two DataNodes:

dn1  dn1:57042  dn2:50010  ESTABLISHED
dn2  dn2:50010  dn1:57042  ESTABLISHED

LiuGuH commented on Jan 26 '24

:broken_heart: -1 overall

Vote Subsystem Runtime Comment
+0 :ok: reexec 0m 20s Docker mode activated.
_ Prechecks _
+1 :green_heart: dupname 0m 0s No case conflicting files found.
+0 :ok: codespell 0m 0s codespell was not available.
+0 :ok: detsecrets 0m 0s detect-secrets was not available.
+1 :green_heart: @author 0m 0s The patch does not contain any @author tags.
-1 :x: test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 :green_heart: mvninstall 31m 13s trunk passed
+1 :green_heart: compile 0m 32s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 :green_heart: compile 0m 28s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 :green_heart: checkstyle 0m 21s trunk passed
+1 :green_heart: mvnsite 0m 33s trunk passed
+1 :green_heart: javadoc 0m 31s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 :green_heart: javadoc 0m 28s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 :green_heart: spotbugs 1m 25s trunk passed
+1 :green_heart: shadedclient 20m 8s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 :green_heart: mvninstall 0m 29s the patch passed
+1 :green_heart: compile 0m 28s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 :green_heart: javac 0m 28s the patch passed
+1 :green_heart: compile 0m 26s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 :green_heart: javac 0m 26s the patch passed
+1 :green_heart: blanks 0m 0s The patch has no blanks issues.
+1 :green_heart: checkstyle 0m 12s the patch passed
+1 :green_heart: mvnsite 0m 27s the patch passed
+1 :green_heart: javadoc 0m 21s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 :green_heart: javadoc 0m 22s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 :green_heart: spotbugs 1m 24s the patch passed
+1 :green_heart: shadedclient 19m 55s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 :green_heart: unit 1m 50s hadoop-hdfs-client in the patch passed.
+1 :green_heart: asflicense 0m 24s The patch does not generate ASF License warnings.
Total runtime: 84m 12s
Subsystem Report/Notes
Docker ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6502/1/artifact/out/Dockerfile
GITHUB PR https://github.com/apache/hadoop/pull/6502
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux c26e1bbd7a7a 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / a1c48f705b7f6726982b19caf2737a38ed936c68
Default Java Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6502/1/testReport/
Max. process+thread count 551 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs-client U: hadoop-hdfs-project/hadoop-hdfs-client
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6502/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

hadoop-yetus commented on Jan 26 '24

@slfan1989 Hi, sir. Do you have time to review this? Thanks!

LiuGuH commented on Jan 26 '24

@LiuGuH Thanks for your report. Would you mind adding a unit test to cover this case?

Hexiaoqiao commented on Jan 26 '24

@LiuGuH Hi, sir. I have one question: could you explain in more detail how the connection leaks? I see that we invoke setKeepAlive(true) in DataStreamer#createSocketForPipeline and DataXceiver#writeBlock. Thanks a lot.

hfutatzhanghb commented on Jan 26 '24

Thanks. Also, the problem this change fixes is not limited to EC.

@Hexiaoqiao OK, this will need some time. The test case will only cover peer release.
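A peer-release test along those lines could check two things: the peer's socket reports itself closed, and the remote end observes end-of-stream. The sketch below does this over a loopback connection with plain JDK sockets; PeerReleaseCheck and MinimalPeer are hypothetical stand-ins, and a real test would substitute NioInetPeer for MinimalPeer.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class PeerReleaseCheck {
  // Hypothetical stand-in for the peer under test.
  static final class MinimalPeer implements Closeable {
    final Socket socket;

    MinimalPeer(Socket socket) {
      this.socket = socket;
    }

    @Override
    public void close() throws IOException {
      socket.close();
    }
  }

  // Returns true if closing the server-side peer closes its socket and the
  // client then reads end-of-stream, i.e. the connection was really released.
  static boolean peerCloseReleasesConnection() throws IOException {
    InetAddress loopback = InetAddress.getLoopbackAddress();
    try (ServerSocket server = new ServerSocket(0, 1, loopback);
         Socket client = new Socket(loopback, server.getLocalPort())) {
      MinimalPeer peer = new MinimalPeer(server.accept());
      peer.close();
      // Fail with SocketTimeoutException instead of hanging if nothing arrives.
      client.setSoTimeout(5000);
      InputStream in = client.getInputStream();
      // read() returns -1 once the remote side has closed the connection.
      return peer.socket.isClosed() && in.read() == -1;
    }
  }
}
```

The read timeout makes the check fail with an exception rather than hang if the close never reaches the other end of the connection.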

LiuGuH commented on Jan 26 '24

Client (crashed) -> DN1 -> DN2 (Xceiver count is full, so it throws an IOException, but the server Peer stays alive) -> DN3

In this case, the connection between DN1 <-> DN2 will never be released. Thanks

LiuGuH commented on Jan 26 '24

Thanks for the review, @zhangshuyan0.
The problem may occur on DataNodes under heavy I/O load, but I cannot reproduce it in a unit test. It is not related to DFS_DATANODE_MAX_RECEIVER_THREADS_KEY.

On DataNodes under heavy I/O load, in.close() and out.close() may also throw an IOException when close() is invoked, in which case the socket may not actually be closed.
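If stream close() calls can themselves fail under load, one defensive option is to attempt every close and report failures afterwards. In the sketch below, CloseUnderFailure, FailingStream, TrackedSocket and closeAll are hypothetical names; the technique is collecting the first IOException and attaching later ones with Throwable.addSuppressed, so a failing in.close() can no longer prevent the socket from being closed.

```java
import java.io.Closeable;
import java.io.IOException;

final class CloseUnderFailure {
  // Hypothetical stream whose close() fails, standing in for in/out under heavy load.
  static final class FailingStream implements Closeable {
    @Override
    public void close() throws IOException {
      throw new IOException("simulated close failure");
    }
  }

  // Hypothetical socket stand-in that only records whether close() ran.
  static final class TrackedSocket implements Closeable {
    boolean closed;

    @Override
    public void close() {
      closed = true;
    }
  }

  // Closes every resource in order, even if earlier ones throw.
  // The first IOException is rethrown; later ones are attached as suppressed.
  static void closeAll(Closeable... resources) throws IOException {
    IOException first = null;
    for (Closeable c : resources) {
      try {
        c.close();
      } catch (IOException e) {
        if (first == null) {
          first = e;
        } else {
          first.addSuppressed(e);
        }
      }
    }
    if (first != null) {
      throw first;
    }
  }
}
```

Called as closeAll(in, out, socket), the socket is closed even when both stream closes fail, and the caller still sees the first failure.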

LiuGuH commented on Mar 25 '24

:broken_heart: -1 overall

Vote Subsystem Runtime Comment
_ Prechecks _
+1 :green_heart: dupname 0m 00s No case conflicting files found.
+0 :ok: spotbugs 0m 00s spotbugs executables are not available.
+0 :ok: codespell 0m 01s codespell was not available.
+0 :ok: detsecrets 0m 01s detect-secrets was not available.
+1 :green_heart: @author 0m 00s The patch does not contain any @author tags.
-1 :x: test4tests 0m 00s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 :green_heart: mvninstall 85m 14s trunk passed
+1 :green_heart: compile 5m 07s trunk passed
+1 :green_heart: checkstyle 4m 21s trunk passed
+1 :green_heart: mvnsite 5m 11s trunk passed
+1 :green_heart: javadoc 4m 36s trunk passed
+1 :green_heart: shadedclient 139m 27s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 :green_heart: mvninstall 3m 01s the patch passed
+1 :green_heart: compile 2m 35s the patch passed
+1 :green_heart: javac 2m 35s the patch passed
+1 :green_heart: blanks 0m 00s The patch has no blanks issues.
+1 :green_heart: checkstyle 2m 05s the patch passed
+1 :green_heart: mvnsite 2m 47s the patch passed
+1 :green_heart: javadoc 2m 15s the patch passed
+1 :green_heart: shadedclient 147m 52s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 :green_heart: asflicense 5m 07s The patch does not generate ASF License warnings.
Total runtime: 396m 17s
Subsystem Report/Notes
GITHUB PR https://github.com/apache/hadoop/pull/6502
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname MINGW64_NT-10.0-17763 0a31c1737811 3.4.10-87d57229.x86_64 2024-02-14 20:17 UTC x86_64 Msys
Build tool maven
Personality /c/hadoop/dev-support/bin/hadoop.sh
git revision trunk / a1c48f705b7f6726982b19caf2737a38ed936c68
Default Java Azul Systems, Inc.-1.8.0_332-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6502/1/testReport/
modules C: hadoop-hdfs-project/hadoop-hdfs-client U: hadoop-hdfs-project/hadoop-hdfs-client
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6502/1/console
versions git=2.44.0.windows.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

hadoop-yetus commented on Apr 25 '24