amazon-genomics-cli: Explore TCP kernel flags to improve worker EC2 I/O reliability

Description

Under high network load, a server may get a "connection reset by peer" error. This is most often seen with aws s3 cp operations during workflow "scatter" steps, and on smaller EC2 instances that have burstable rather than fixed network throughput.

Proposed Solution

From the S3 team:

"You may be exhausting the socket pool and attempting to reuse sockets before they are fully closed and/or the sockets may be timing out before the requests complete. If you want to experiment with some client-side settings that may help alleviate the issue, try:

expanding the ephemeral port range net.ipv4.ip_local_port_range
decreasing the TCP FIN timeout net.ipv4.tcp_fin_timeout
enable sockets in TIME_WAIT state to be reused with net.ipv4.tcp_tw_recycle=1 and net.ipv4.tcp_tw_reuse=1 "
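
As a starting point for experiments, the suggestions above could be written out as a sysctl drop-in file. The sketch below only renders the text of such a file. The specific values are illustrative assumptions, not numbers from the S3 team, and `render_sysctl_conf` is a hypothetical helper name. Note that net.ipv4.tcp_tw_recycle was removed in Linux 4.12, so it is left out here; on newer kernels only net.ipv4.tcp_tw_reuse is available.

```python
# Illustrative values only (assumptions); tune them by experiment.
TCP_SETTINGS = {
    # Widen the ephemeral port range (low and high port, space separated).
    "net.ipv4.ip_local_port_range": "1024 65535",
    # Shorten the time a socket spends in FIN-WAIT-2 (seconds).
    "net.ipv4.tcp_fin_timeout": "30",
    # Allow reuse of TIME_WAIT sockets for new outgoing connections.
    "net.ipv4.tcp_tw_reuse": "1",
    # net.ipv4.tcp_tw_recycle is omitted: it was removed in Linux 4.12.
}


def render_sysctl_conf(settings):
    """Render a dict of kernel parameters as 'key = value' lines."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


if __name__ == "__main__":
    # Print the file body, e.g. for a file under /etc/sysctl.d/.
    print(render_sysctl_conf(TCP_SETTINGS), end="")
```

Keeping the settings in one dict makes it easy to try one change at a time and see which of them, if any, reduces the resets.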

Other kernel settings we could try are described at https://www.ibm.com/docs/en/linux-on-systems?topic=tuning-tcpip-ipv4-settings

These settings would be used in the LaunchTemplate of the EC2 worker nodes.
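
One way this might look in practice is a user data script that writes the settings to a sysctl drop-in file and reloads them at boot. The sketch below builds such a script and base64-encodes it, since the UserData field of a launch template expects base64. The file path, the values, and the helper names are all assumptions for illustration. If the workers are launched through AWS Batch, the user data may also need to be wrapped in MIME multi-part format, which this sketch does not do.

```python
import base64

# Illustrative settings (assumptions), already in sysctl.conf syntax.
SYSCTL_CONF = (
    "net.ipv4.ip_local_port_range = 1024 65535\n"
    "net.ipv4.tcp_fin_timeout = 30\n"
    "net.ipv4.tcp_tw_reuse = 1\n"
)

# Hypothetical drop-in file name; any name under /etc/sysctl.d/ would do.
CONF_PATH = "/etc/sysctl.d/99-worker-tcp.conf"


def build_user_data(conf_text, conf_path=CONF_PATH):
    """Return a bash script that writes the settings and reloads sysctl."""
    return (
        "#!/bin/bash\n"
        f"cat > {conf_path} <<'EOF'\n"
        f"{conf_text}"
        "EOF\n"
        # Reload settings from all sysctl configuration files.
        "sysctl --system\n"
    )


def encode_user_data(script):
    """Base64-encode the script for a launch template UserData field."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    print(encode_user_data(build_user_data(SYSCTL_CONF)))
```

Writing a drop-in file rather than calling `sysctl -w` for each key means the settings survive a reboot of the worker.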

Other information

Mar 23 '22 20:03 markjschreiber

Greetings! Sorry to say but this is a very old issue that is probably not getting as much attention as it deserves. We encourage you to check if this is still an issue in the latest release and if you find that this is still a problem, please feel free to open a new one.

Jun 22 '22 00:06 github-actions[bot]

Worth keeping

Jun 23 '22 01:06 markjschreiber

No longer a priority

Oct 27 '22 20:10 markjschreiber