
Explore TCP kernel flags to improve worker EC2 I/O reliability

Open markjschreiber opened this issue 3 years ago • 2 comments

Description

Under high network load, a server may get a "connection reset by peer" error. This is most often seen with aws s3 cp operations during workflow "scatter" steps, particularly on smaller EC2 instances that have burstable rather than fixed throughput.

Proposed Solution

From the S3 team:

"You may be exhausting the socket pool and attempting to reuse sockets before they are fully closed and/or the sockets may be timing out before the requests complete. If you want to experiment with some client-side settings that may help alleviate the issue, try:

  • expanding the ephemeral port range net.ipv4.ip_local_port_range
  • decreasing the TCP FIN timeout net.ipv4.tcp_fin_timeout
  • enable sockets in TIME_WAIT state to be reused with net.ipv4.tcp_tw_recycle=1 and net.ipv4.tcp_tw_reuse=1 "
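Before changing anything, it may help to record what a worker currently uses. The following is a minimal sketch, assuming a Linux host where these settings are exposed under /proc/sys; the helper names sysctl_path and read_sysctl are hypothetical. Note that net.ipv4.tcp_tw_recycle was removed in Linux 4.12 (and was known to break clients behind NAT), so on newer kernels it should show up as not present and only tcp_tw_reuse remains available.

```python
from pathlib import Path
from typing import Optional

# Keys named in the S3 team's suggestions above.
KEYS = [
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.tcp_tw_recycle",  # removed in Linux 4.12; absent on newer kernels
]


def sysctl_path(key: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl key to its file under /proc/sys (hypothetical helper)."""
    return Path(root).joinpath(*key.split("."))


def read_sysctl(key: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the current value as text, or None if the key cannot be read."""
    try:
        return sysctl_path(key, root).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for key in KEYS:
        value = read_sysctl(key)
        print(f"{key} = {value if value is not None else '(not present)'}")
```

Running this on a worker before and after a change gives a simple baseline to compare against when checking whether the resets become less frequent.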

Other kernel settings we could try are listed in IBM's TCP/IP IPv4 tuning documentation: https://www.ibm.com/docs/en/linux-on-systems?topic=tuning-tcpip-ipv4-settings

These settings would be used in the LaunchTemplate of the EC2 worker nodes.
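One way this could look is a user data script that writes a sysctl.d drop-in and reloads it at boot. This is only a sketch under stated assumptions: the values below are illustrative and were not proposed in this issue, the file name 99-worker-tcp.conf and the function names are hypothetical, and tcp_tw_recycle is left out because it no longer exists on kernels 4.12 and later. The base64 step applies when user data is passed directly to the EC2 launch template API; higher-level tooling may do the encoding itself.

```python
import base64

# Illustrative values only (assumptions, not taken from the issue).
# Measure on real workloads before adopting any of them.
SETTINGS = {
    "net.ipv4.ip_local_port_range": "1024 65535",  # wider ephemeral port range
    "net.ipv4.tcp_fin_timeout": "30",              # lower than the usual default of 60
    "net.ipv4.tcp_tw_reuse": "1",                  # allow reuse of TIME_WAIT sockets
}


def render_user_data(settings, conf_path="/etc/sysctl.d/99-worker-tcp.conf"):
    """Build a bash user data script that persists and applies the settings."""
    body = "\n".join(f"{key} = {value}" for key, value in sorted(settings.items()))
    return (
        "#!/bin/bash\n"
        f"cat > {conf_path} <<'EOF'\n"
        f"{body}\n"
        "EOF\n"
        "sysctl --system\n"  # reload all sysctl configuration files
    )


def encode_user_data(script):
    """Base64-encode the script, as required for raw launch template user data."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    print(render_user_data(SETTINGS))
```

Writing a drop-in file rather than calling sysctl -w for each key keeps the settings in place across reboots of the same instance.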

Other information

markjschreiber • Mar 23 '22 20:03

Greetings! Sorry to say but this is a very old issue that is probably not getting as much attention as it deserves. We encourage you to check if this is still an issue in the latest release and if you find that this is still a problem, please feel free to open a new one.

github-actions[bot] • Jun 22 '22 00:06

Worth keeping

markjschreiber • Jun 23 '22 01:06

No longer a priority

markjschreiber • Oct 27 '22 20:10