amazon-linux-2023

[Bug] - Occasional freezing of AL2023 servers (dhcp timeout issue with slowed processor)

Open john-forrest opened this issue 1 year ago • 8 comments

Describe the bug

We are running AL2023 on several T3 EC2 instances. They replaced some instances that ran CentOS 7 (now past EOL), and we also have instances that run AL2 with similar items; we have never seen this issue before. In particular, we have two instances that regularly appear to freeze up: their status is "running", but we can no longer connect to them over TCP and can only recover by rebooting. These instances run SonarQube and GitLab in Docker containers. We don't get much control over the internals, although for GitLab at least we have followed the instructions for running with fewer resources as far as possible, thinking this might help.

Our theory is that we are seeing the same issue as described in https://gist.github.com/raggi/1f8d0b9f45c5b62e7131b03e6e2ffe68. That report is for Ubuntu, but it definitely sounds the same. We raised this with AWS Support and were told the cause was that we had used all our CPU credits and the instance had therefore been slowed down. From our viewpoint, though, the real issue is the way AL2023 handles that.

I had hoped that, especially since there is apparently a fix for this, AL2023 would itself be fixed, but there is no sign of that. We've added monitors that reboot the instances automatically should they stop responding, but this is not ideal, and we still get job failures etc. when a freeze happens.
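For readers who need a similar stopgap, here is a minimal sketch of the kind of check such a monitor might make. It is not the monitor described above: the function names, the number of attempts, and the delay are all assumptions for illustration, and the reboot action itself is left out because it depends on the monitoring setup in use.

```python
import socket
import time
from typing import Callable


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def should_reboot(probe: Callable[[], bool], attempts: int = 3, delay_s: float = 0.0) -> bool:
    """Return True only if `probe` fails `attempts` times in a row.

    Requiring several consecutive failures avoids rebooting on one dropped probe.
    """
    for i in range(attempts):
        if probe():
            return False
        if i < attempts - 1:
            time.sleep(delay_s)
    return True
```

A caller might run `should_reboot(lambda: tcp_reachable("10.0.0.5", 22), attempts=3, delay_s=20)` on a schedule, where the address and port are placeholders. Because the instance status stays "running" during a freeze, the probe has to test actual connectivity rather than instance state.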

To Reproduce

Steps to reproduce the behavior:

  1. Run an app on the instance that has occasional peaks in CPU.
  2. At some point the system will freeze.
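As an illustration of step 1 only, the sketch below generates a load with occasional CPU peaks on a single core. It is a stand-in, not the application from this report, and there is no claim that it triggers the freeze; the burst and idle durations are hypothetical parameters.

```python
import time


def busy_loop(seconds: float) -> int:
    """Spin one core for about `seconds`; return the number of loop iterations."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def bursty_load(cycles: int, burst_s: float, idle_s: float) -> float:
    """Alternate CPU bursts with idle periods; return total elapsed seconds."""
    start = time.monotonic()
    for _ in range(cycles):
        busy_loop(burst_s)   # CPU peak
        time.sleep(idle_s)   # quiet period between peaks
    return time.monotonic() - start
```

For example, `bursty_load(cycles=12, burst_s=240, idle_s=60)` would run for roughly an hour. On a burstable instance, sustained bursts above the baseline consume CPU credits, which is the condition AWS Support pointed to.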

Expected behavior

We should never hit this scenario.

Screenshots

image (2)

Additional context

We did the original analysis a few months ago, but the issue is still occurring. We are keeping up to date with AL2023 releases.

john-forrest avatar Aug 06 '24 09:08 john-forrest

I am not convinced by the explanation about using up CPU resources. It does look like DHCP is timing out, which looks more like a networking problem to me. Unless, of course, the CPU is slowed down so much that systemd fails to receive the DHCP responses; that sounds far-fetched to me, but the bug you linked does seem to indicate it is a possibility.

We will attempt to get to the bottom of it

ozbenh avatar Aug 09 '24 03:08 ozbenh

@ozbenh Any update on this?

john-forrest avatar Aug 30 '24 10:08 john-forrest

I'm seeing the same thing, on t4g.small instances in an autoscale group, created with the AMI al2023-ami-ecs-hvm-2023.0.20250121-kernel-6.1-arm64 in us-east-2. They'll run fine for a day or two, hit the same networking error as listed above, and then lose networking connectivity. The SSM agent disconnects, the ASG thinks the instance is still healthy, and ECS gets very confused. The only fix is to terminate or reboot the instances. Two separate instances in the past two days:

Jan 28 05:49:48 ip-10-100-17-18.us-east-2.compute.internal systemd-networkd[1408]: ens5: Could not set DHCPv4 address: Connection timed out
Jan 28 05:49:52 ip-10-100-17-18.us-east-2.compute.internal systemd-networkd[1408]: ens5: Failed

and

Jan 29 09:34:02 ip-10-100-48-33.us-east-2.compute.internal systemd-networkd[1409]: ens5: Could not set DHCPv4 address: Connection timed out
Jan 29 09:34:29 ip-10-100-48-33.us-east-2.compute.internal systemd-networkd[1409]: ens5: Failed
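Both excerpts share the same systemd-networkd message, so affected instances can be picked out by scanning journal output for it. The following is a minimal sketch that assumes the message format stays exactly as in the lines above; the function name and the returned tuple shape are illustrative choices.

```python
import re
from typing import Iterable, List, Tuple

# Matches the failure line shown above and captures the PID and interface name.
DHCP_TIMEOUT = re.compile(
    r"systemd-networkd\[(?P<pid>\d+)\]: (?P<iface>\S+): "
    r"Could not set DHCPv4 address: Connection timed out"
)


def find_dhcp_timeouts(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (interface, pid) for every line reporting the DHCPv4 timeout."""
    hits = []
    for line in lines:
        match = DHCP_TIMEOUT.search(line)
        if match:
            hits.append((match.group("iface"), int(match.group("pid"))))
    return hits
```

Journal lines could be fed in from saved console output or a log export. Once networking is lost the instance cannot report remotely, so this is mainly useful after a reboot or for logs shipped before the failure.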

Here's the CPU graph for the first; the hash marks show when it lost networking.

Image

And the second

Image

The first is definitely maxed out on CPU, the second not, which is ... interesting. In both cases they're nowhere close to running out of CPU credits.

rtkmhart avatar Jan 29 '25 17:01 rtkmhart

Is there any progress on this issue? I have the same problem described above!

bmaluff-willdom avatar Mar 24 '25 18:03 bmaluff-willdom

Amazon Linux is currently investigating. I have not tested this theory, but the behavior could be explained by this upstream systemd issue, where a workaround has also been posted: https://github.com/systemd/systemd/issues/25441

joeysk2012 avatar Apr 01 '25 19:04 joeysk2012

I was able to reliably reproduce this issue by starting hundreds of nano instances and then running a stress test via stress-ng for 2 hours. Several instances became non-contactable after the test and unable to recover DHCP due to connection timeout. Amazon Linux is working on a fix.

joeysk2012 avatar Apr 07 '25 17:04 joeysk2012

we updated our instances from AL2 to AL2023 yesterday and started encountering occasional socket TimeoutErrors as well :(

flow3d avatar Jul 22 '25 17:07 flow3d

Hi, a new version of systemd, systemd-252.23-6, with support for SYSTEMD_NETLINK_DEFAULT_TIMEOUT was released with 2023.8.20250808. Can you please try this release and see if it helps with the issue?
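For anyone trying this, one possible way to set the variable is a systemd drop-in for systemd-networkd. The sketch below only renders the drop-in text. It rests on several assumptions: that systemd-networkd picks the variable up from its service environment, that the value accepts a time span such as `60s`, and that 60 seconds is a sensible value. The drop-in path is a hypothetical file name under the standard drop-in directory.

```python
# Hypothetical file name; the directory is the standard drop-in location for this unit.
DROPIN_PATH = "/etc/systemd/system/systemd-networkd.service.d/10-netlink-timeout.conf"


def render_networkd_dropin(timeout: str = "60s") -> str:
    """Render drop-in text that sets SYSTEMD_NETLINK_DEFAULT_TIMEOUT for the service.

    The default of "60s" is an assumed example value, not a recommendation.
    """
    return (
        "[Service]\n"
        f"Environment=SYSTEMD_NETLINK_DEFAULT_TIMEOUT={timeout}\n"
    )
```

After the rendered text is written to `DROPIN_PATH`, `systemctl daemon-reload` followed by a restart of systemd-networkd would normally be needed for the change to take effect. Whether a longer timeout actually avoids the freeze is the open question in this thread.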

rajibade avatar Sep 02 '25 21:09 rajibade