[Bug] Failure during vcpu run: Out of memory on Ubuntu 6.8.0-x kernels

Open Manciukic opened this issue 9 months ago • 7 comments

[!CAUTION] Ubuntu 6.8.x kernels are not officially supported by Firecracker. If you need a stable environment, please use 6.1 or 5.10 kernels from amazon-linux, which are not impacted by this issue.

Description

We are aware of an issue impacting Firecracker on the latest Ubuntu 6.8.x kernels (starting from 6.8.0-58-generic and 6.8.0-1027-aws). In rare cases, the Firecracker process crashes with an "out of memory" error when it attempts to run the vcpu for the first time, even though the system has enough available memory.

2025-04-28T15:52:22.965639404 [951827fb-9384-45a4-9bbe-1816fe18de87:fc_vcpu 0] Failure during vcpu run: Out of memory (os error 12)
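For tooling that scans Firecracker logs for this failure, a minimal sketch of a matcher could look like the following. The line layout (timestamp, then `[instance-id:thread]`, then the message) is inferred from the single sample above and is an assumption, as are the helper names `parse_log_line` and `is_vcpu_enomem`.

```python
import re
from typing import NamedTuple, Optional


class LogLine(NamedTuple):
    timestamp: str
    instance_id: str
    thread: str
    message: str


# Assumed layout, inferred from the one sample line in this issue:
#   <timestamp> [<instance-id>:<thread>] <message>
_LOG_RE = re.compile(r"^(\S+) \[([^:\]]+):([^\]]+)\] (.*)$")

# Exact message text as reported in this issue.
ENOMEM_MESSAGE = "Failure during vcpu run: Out of memory (os error 12)"


def parse_log_line(line: str) -> Optional[LogLine]:
    """Split a log line into its fields, or return None if it does not match."""
    match = _LOG_RE.match(line.strip())
    if match is None:
        return None
    return LogLine(*match.groups())


def is_vcpu_enomem(line: str) -> bool:
    """Return True if the line reports the vcpu-run ENOMEM failure."""
    parsed = parse_log_line(line)
    return parsed is not None and parsed.message == ENOMEM_MESSAGE
```

A match only says that the symptom occurred; it does not distinguish this kernel bug from a genuine out-of-memory condition.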

Cause

The issue is triggered by a race condition: the vmm thread sends a SIGRTMIN to the vcpu thread while the vcpu thread is starting the NX huge page recovery thread. The signal makes the thread creation fail, but due to a bug in the kernel, the failure is reported as an ENOMEM instead of an ERESTARTNOINTR, which would be retried.

More details are available in the upstream patch fixing the issue: https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=916b7f42b3b3b539a71c204a9b49fdc4ca92cd82

Fix

There is no known workaround at the moment, other than retrying the VM creation.
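A minimal sketch of the retry approach, assuming the caller already has some way to create a VM and surfaces this failure as an `OSError` with `errno.ENOMEM`. Both `create_vm` and `retry_on_enomem` are hypothetical names; Firecracker itself does not provide them, and how the failure actually reaches your code depends on how you drive the Firecracker process.

```python
import errno
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_on_enomem(
    create_vm: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 0.1,
) -> T:
    """Call create_vm, retrying only when it fails with ENOMEM.

    create_vm is a hypothetical caller-supplied function that raises
    OSError(errno.ENOMEM) when the vcpu fails to run as described here.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error: Optional[OSError] = None
    for attempt in range(1, attempts + 1):
        try:
            return create_vm()
        except OSError as err:
            # Any other error is not this bug, so do not mask it.
            if err.errno != errno.ENOMEM:
                raise
            last_error = err
            if attempt < attempts:
                time.sleep(delay_s)

    assert last_error is not None
    raise last_error
```

The retry count is bounded because a genuine out-of-memory condition produces the same error code and should not be retried forever.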

For a complete fix, Ubuntu should back-port cb380909ae3b and 916b7f42b3b3 from upstream.

The resolution is being tracked by Ubuntu in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109859

Impact

To the best of our knowledge, this only impacts Ubuntu 6.8.x kernels starting from 6.8.0-58-generic and 6.8.0-1027-aws, because the patch introducing the bug was back-ported to them without the related fix. It may also impact the 6.13 kernel (non-LTS), as the fix was merged in 6.14 and back-ported to 6.12 upstream.

No officially supported kernels (5.10 and 6.1), nor the mainline kernel, are impacted by this bug.
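As a rough aid, the version ranges above can be turned into a check on the output of `uname -r`. This is a sketch under assumptions: the lower bounds come from this description, while the upper bounds (6.8.0-62 for generic, 6.8.0-1030 for aws) are taken from later comments in this thread and are not an official statement from Ubuntu. Other Ubuntu kernel flavours are not covered, and the name `may_be_affected` is made up for illustration.

```python
import re

# flavour -> (first affected ABI number, first ABI number believed fixed).
# Lower bounds are from this issue; upper bounds are ASSUMPTIONS based on
# comments in this thread (6.8.0-62.65 generic verified, 6.8.0-1030-aws
# passing CI).
AFFECTED_ABI_RANGES = {
    "generic": (58, 62),
    "aws": (1027, 1030),
}

# Matches Ubuntu 6.8 release strings such as "6.8.0-58-generic".
_RELEASE_RE = re.compile(r"^6\.8\.0-(\d+)-([a-z0-9-]+)$")


def may_be_affected(release: str) -> bool:
    """Return True if a `uname -r` string falls in a known affected range.

    Returns False for anything this sketch does not recognise, which means
    "not known to be affected", not "known to be safe".
    """
    match = _RELEASE_RE.match(release.strip())
    if match is None:
        return False
    abi, flavour = int(match.group(1)), match.group(2)
    bounds = AFFECTED_ABI_RANGES.get(flavour)
    if bounds is None:
        return False
    first_affected, first_fixed = bounds
    return first_affected <= abi < first_fixed
```

On a running host, the check could be called as `may_be_affected(platform.release())` after `import platform`.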

How to reproduce

The bug is reproducible by running the test_cycled_snapshot_restore test at high concurrency:

./tools/devtool test -- -n16 integration_tests/functional/test_snapshot_basic.py::test_cycled_snapshot_restore

Manciukic avatar Apr 30 '25 17:04 Manciukic

I have verified that the mentioned patch-series fixes the issue in our integration tests, after applying it on top of Ubuntu-aws-6.8.0-1027.29.

Manciukic avatar Apr 30 '25 18:04 Manciukic

There is a launchpad issue for this now: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109859

Is there an LKML issue tracking the backport of the fix to the 6.8.x kernel?

andrewla avatar May 05 '25 19:05 andrewla

The patch that introduced the bug hasn't been backported to 6.8.x upstream. It's only present in the downstream Ubuntu fork.

Manciukic avatar May 06 '25 09:05 Manciukic

I can see from Launchpad that the fix has been committed and is queued for the next Ubuntu kernel release: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109859

Manciukic avatar May 21 '25 12:05 Manciukic

@Manciukic, hi. The 6.8.0-62.65 generic kernel with the fix is now in the -proposed pocket. Would you be able to test that this kernel solves the issue? There are instructions here https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109859/comments/1

magalilemes avatar May 21 '25 13:05 magalilemes

@magalilemes Thanks for the update! I verified that the 6.8.0-62.65 generic (linux-generic/noble-proposed) fixes the issue by running the reproducer on a c5n.metal instance:

ubuntu@ip-172-31-0-29:~/firecracker$ uname -r                                                                         
6.8.0-62-generic

ubuntu@ip-172-31-0-29:~/firecracker$ sudo apt list --installed | grep linux-generic
linux-generic/noble-proposed,now 6.8.0-62.65 amd64 [installed]

ubuntu@ip-172-31-0-29:~/firecracker$ ./tools/devtool test -- -n16 integration_tests/functional/test_snapshot_basic.py::test_cycled_snapshot_restore
[Firecracker devtool 2025-05-21T17:40:40+00:00] Fetching CI artifacts from S3                          
[...]
=========================================== 84 passed in 133.64s (0:02:13) ===========================================
[Firecracker devtool 2025-05-21T17:43:17+00:00] Finished test run ...

Manciukic avatar May 21 '25 17:05 Manciukic

@Manciukic, awesome! Thank you for testing!

magalilemes avatar May 21 '25 18:05 magalilemes

Testing with kernel version 6.8.0-1030-aws, our CI is now passing. Marking the issue as resolved.

JackThomson2 avatar Jun 24 '25 16:06 JackThomson2