[Bug] Failure during vcpu run: Out of memory on Ubuntu 6.8.0-x kernels
[!CAUTION] Ubuntu 6.8.x kernels are not officially supported by Firecracker. If you need a stable environment, please use 6.1 or 5.10 kernels from amazon-linux, which are not impacted by this issue.
Description
We are aware of an issue impacting Firecracker on recent Ubuntu 6.8.x kernels (starting from 6.8.0-58-generic and 6.8.0-1027-aws). In rare cases, the Firecracker process crashes with an "out of memory" error when it attempts to run the vcpu for the first time, despite the system having enough available memory.
2025-04-28T15:52:22.965639404 [951827fb-9384-45a4-9bbe-1816fe18de87:fc_vcpu 0] Failure during vcpu run: Out of memory (os error 12)
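For anyone scanning Firecracker logs for this failure, a minimal sketch of a matcher for the line above follows. The function name `is_vcpu_enomem` and the regular expression are illustrative assumptions based only on the single log line quoted here, not on any Firecracker API.

```python
import errno
import re

# ASSUMPTION: the pattern is derived from the one log line quoted in this
# issue; other Firecracker versions may format the message differently.
_VCPU_RUN_FAILURE = re.compile(r"Failure during vcpu run: .*\(os error (\d+)\)")


def is_vcpu_enomem(log_line: str) -> bool:
    """Return True if the line reports a vcpu run failure with ENOMEM."""
    match = _VCPU_RUN_FAILURE.search(log_line)
    # "os error 12" in the log corresponds to ENOMEM on Linux.
    return match is not None and int(match.group(1)) == errno.ENOMEM


sample = (
    "2025-04-28T15:52:22.965639404 "
    "[951827fb-9384-45a4-9bbe-1816fe18de87:fc_vcpu 0] "
    "Failure during vcpu run: Out of memory (os error 12)"
)
print(is_vcpu_enomem(sample))
```

A match on its own does not prove this bug is the cause, since a genuine memory shortage would produce the same message.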
Cause
The issue is triggered by a race condition: the vmm thread sends a SIGRTMIN to the vcpu thread while the vcpu thread is starting the NX huge page recovery thread. The signal makes the thread creation fail, but due to a bug in the kernel the failure is reported as ENOMEM instead of ERESTARTNOINTR, an error that should be retried.
More details are available in the upstream patch fixing the issue: https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=916b7f42b3b3b539a71c204a9b49fdc4ca92cd82
Fix
There is no known work-around at the moment, other than retrying the VM creation.
For a complete fix, Ubuntu should back-port cb380909ae3b and 916b7f42b3b3 from upstream.
The resolution is being tracked by Ubuntu in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109859
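The retry workaround mentioned above could be sketched roughly as follows. This is only an illustration: `VcpuOutOfMemory`, `retry_vm_creation`, and the `create_vm` callable are hypothetical names, and how a caller detects the failure (for example by inspecting the Firecracker log) is left to the caller.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class VcpuOutOfMemory(RuntimeError):
    """Hypothetical error raised by the caller's code when VM creation
    fails with 'Failure during vcpu run: Out of memory'."""


def retry_vm_creation(
    create_vm: Callable[[], T],
    max_attempts: int = 3,
    delay_s: float = 0.1,
) -> T:
    """Call create_vm(), retrying only on the spurious ENOMEM failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return create_vm()
        except VcpuOutOfMemory:
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            time.sleep(delay_s)  # brief pause before trying again
    raise AssertionError("unreachable")
```

The attempt count and delay are arbitrary defaults. Because the failure is described as rare, a small number of retries is likely enough, but a real out-of-memory condition would also be retried by this sketch, so the retry budget should stay small.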
Impact
To the best of our knowledge, this only impacts Ubuntu 6.8.x kernels starting from 6.8.0-58-generic and 6.8.0-1027-aws, as the patch introducing the bug was back-ported without the related fix. It may also impact the 6.13 kernel (non-LTS), as the fix was merged in 6.14 and back-ported to 6.12 upstream.
Neither the officially supported kernels (5.10 and 6.1) nor the mainline kernel are impacted by this bug.
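As a rough aid, the version ranges named in this issue can be turned into a check against the output of `uname -r`. The sketch below is an assumption-laden illustration: the lower bounds come from the description above, while the upper bounds are the first releases reported as passing further down in this thread (6.8.0-62-generic and 6.8.0-1030-aws). Intermediate releases were not individually tested there, and other Ubuntu flavours are not covered at all.

```python
import re

# First affected ABI numbers, taken from the issue description.
FIRST_AFFECTED = {"generic": 58, "aws": 1027}
# ASSUMPTION: first ABI numbers reported as passing in this thread; earlier
# releases might already carry the fix.
FIRST_REPORTED_FIXED = {"generic": 62, "aws": 1030}

_RELEASE_RE = re.compile(r"^6\.8\.0-(\d+)-(\w+)$")


def in_reported_affected_range(release: str) -> bool:
    """Return True if `release` (as printed by `uname -r`) falls in the
    ranges named in this issue. False means "not in those ranges", not
    "known to be safe"."""
    match = _RELEASE_RE.match(release)
    if match is None:
        return False
    abi, flavour = int(match.group(1)), match.group(2)
    if flavour not in FIRST_AFFECTED:
        return False
    return FIRST_AFFECTED[flavour] <= abi < FIRST_REPORTED_FIXED[flavour]
```

On a running system, the input can be obtained with `platform.release()` from the Python standard library, which returns the same string as `uname -r`.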
How to reproduce
The bug is reproducible by running the test_cycled_snapshot_restore test at high concurrency:
./tools/devtool test -- -n16 integration_tests/functional/test_snapshot_basic.py::test_cycled_snapshot_restore
I have verified that the patch series mentioned above fixes the issue in our integration tests, after applying it on top of Ubuntu-aws-6.8.0-1027.29.
There is a launchpad issue for this now: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109859
Is there an LKML issue tracking the backport of the fix to the 6.8.x kernel?
The patch that introduced the bug hasn't been backported to 6.8.x upstream. It's only present in the downstream Ubuntu fork.
I can see from Launchpad that the fix has been committed and is queued for the next Ubuntu kernel release: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109859
@Manciukic, hi. The 6.8.0-62.65 generic kernel with the fix is now in the -proposed pocket. Would you be able to test whether this kernel solves the issue? There are instructions here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2109859/comments/1
@magalilemes Thanks for the update!
I verified that the 6.8.0-62.65 generic kernel (linux-generic/noble-proposed) fixes the issue by running the reproducer on a c5n.metal instance:
ubuntu@ip-172-31-0-29:~/firecracker$ uname -r
6.8.0-62-generic
ubuntu@ip-172-31-0-29:~/firecracker$ sudo apt list --installed | grep linux-generic
linux-generic/noble-proposed,now 6.8.0-62.65 amd64 [installed]
ubuntu@ip-172-31-0-29:~/firecracker$ ./tools/devtool test -- -n16 integration_tests/functional/test_snapshot_basic.py::test_cycled_snapshot_restore
[Firecracker devtool 2025-05-21T17:40:40+00:00] Fetching CI artifacts from S3
[...]
=========================================== 84 passed in 133.64s (0:02:13) ===========================================
[Firecracker devtool 2025-05-21T17:43:17+00:00] Finished test run ...
@Manciukic, awesome! Thank you for testing!
Our CI is now passing with kernel version 6.8.0-1030-aws. Marking the issue as resolved.