Cloud-init 24.1 fails if cloud-init local runs before the network adapter is loaded
Bug report
Cloud-init 24.1 fails to bootstrap on t3.medium AWS instances:
cloud-init local runs too early and fails to configure the instance.
This worked with cloud-init 23.3, before mid-February 2024.
Steps to reproduce the problem
Boot a t3.medium instance from an Ubuntu 22.04 AMI that uses cloud-init 24.1.
Environment details
- Cloud-init version: 24.1
- Operating System Distribution: Ubuntu 22.04
- Cloud provider, platform or installer type: AWS t3.medium instances
cloud-init logs
journalctl --system --boot
May 13 12:10:02 ip-10-251-75-102 cloud-init[171]: Cloud-init v. 24.1.3-0ubuntu1~22.04.1 running 'init-local' at Mon, 13 May 2024 11:10:02 +0000. Up 9.30 seconds.
May 13 12:10:02 ip-10-251-75-102 cloud-init[171]: 2024-05-13 11:10:02,461 - distros[WARNING]: Did not find a fallback interface on distro: ubuntu.
May 13 12:10:02 ip-10-251-75-102 cloud-init[171]: 2024-05-13 11:10:02,469 - distros[WARNING]: Did not find a fallback interface on distro: ubuntu.
May 13 12:10:02 ip-10-251-75-102 kernel: cryptd: max_cpu_qlen set to 1000
May 13 12:10:02 ip-10-251-75-102 cloud-init[171]: 2024-05-13 11:10:02,472 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceEc2.DataSourceEc2Local'> failed
May 13 12:10:02 ip-10-251-75-102 kernel: ena 0000:00:05.0: ENA device version: 0.10
May 13 12:10:02 ip-10-251-75-102 kernel: ena 0000:00:05.0: ENA controller version: 0.0.1 implementation version 1
May 13 12:10:02 ip-10-251-75-102 kernel: ena 0000:00:05.0: LLQ is not supported Fallback to host mode policy.
May 13 12:10:02 ip-10-251-75-102 kernel: ena 0000:00:05.0: Elastic Network Adapter (ENA) found at mem c0400000, mac addr 06:b3:3b:e2:7d:d1
May 13 12:10:02 ip-10-251-75-102 kernel: parport_pc 00:03: reported by Plug and Play ACPI
May 13 12:10:02 ip-10-251-75-102 systemd[1]: Found device /dev/ttyS0.
May 13 12:10:02 ip-10-251-75-102 kernel: AVX2 version of gcm_enc/dec engaged.
May 13 12:10:02 ip-10-251-75-102 kernel: AES CTR mode by8 optimization enabled
May 13 12:10:02 ip-10-251-75-102 systemd[1]: Finished Initial cloud-init job (pre-networking).
May 13 12:10:02 ip-10-251-75-102 systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
May 13 12:10:02 ip-10-251-75-102 systemd-udevd[160]: Using default interface naming scheme 'v249'.
May 13 12:10:02 ip-10-251-75-102 systemd-udevd[158]: nvme0n1: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/nvme0n1' failed with exit code 1.
May 13 12:10:02 ip-10-251-75-102 kernel: ppdev: user-space parallel port driver
May 13 12:10:02 ip-10-251-75-102 kernel: ena 0000:00:05.0 ens5: renamed from eth0
/var/log/cloud-init.log
2024-05-13 16:03:33,099 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceEc2.DataSourceEc2Local'> failed
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 1028, in find_source
if s.update_metadata_if_supported(
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 914, in update_metadata_if_supported
result = self.get_data()
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceEc2.py", line 724, in get_data
return super(DataSourceEc2Local, self).get_data()
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 460, in get_data
return_value = self._check_and_get_data()
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 392, in _check_and_get_data
return self._get_data()
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceEc2.py", line 146, in _get_data
with EphemeralIPNetwork(
File "/usr/lib/python3/dist-packages/cloudinit/net/ephemeral.py", line 407, in __enter__
EphemeralIPv6Network(
File "/usr/lib/python3/dist-packages/cloudinit/net/ephemeral.py", line 222, in __init__
raise ValueError("Cannot init network on {0}".format(interface))
ValueError: Cannot init network on None
2024-05-13 16:03:33,107 - main.py[DEBUG]: No local datasource found
@misdoro does this reproduce? Could you please share which interfaces are available after boot and which drivers are loaded?
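One way to collect the requested interface and driver information is to read it from sysfs. The following is a minimal sketch, not part of cloud-init: the function name `list_interfaces` is hypothetical, and it assumes the usual Linux layout in which each entry under `/sys/class/net` has a `device/driver` symlink when a kernel driver is bound to it.

```python
import os
from pathlib import Path


def list_interfaces(sys_class_net="/sys/class/net"):
    """Return {interface_name: driver_name_or_None} read from sysfs.

    Hypothetical helper for illustration; not a cloud-init API.
    """
    result = {}
    for entry in sorted(Path(sys_class_net).iterdir()):
        # Skip plain files such as bonding_masters; interfaces are directories.
        if not entry.is_dir():
            continue
        driver_link = entry / "device" / "driver"
        # Virtual devices such as lo have no device/driver symlink.
        if driver_link.is_symlink():
            driver = os.path.basename(os.readlink(driver_link))
        else:
            driver = None
        result[entry.name] = driver
    return result
```

Calling `list_interfaces()` after boot on the affected instance and pasting the resulting dictionary into the report would answer both questions at once.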
For us it is 100% reproducible on an AMI we build internally, when started on t3-class AWS instances.
The only non-lo network adapter is managed by the ena driver, and the kernel recognizes it a few seconds after cloud-init local starts.
For the moment we've implemented a workaround that delays cloud-init local until the network adapter is recognized, but I'm wondering whether cloud-init could have a more official way to handle network adapters that appear late in the boot process.
The workaround in question:
/etc/systemd/system/cloud-init-local.service.d/10-wait-for-net-device.conf
# cloud-init-local must wait for at least one network interface device to exist
# before attempting to download EC2 instance metadata.
#
# These systemd unit directives implement this policy along with
# /etc/udev/rules.d/10-ec2imds.rules
[Unit]
Requires=dev-ec2imds.device
After=dev-ec2imds.device
/etc/udev/rules.d/10-ec2imds.rules
# cloud-init-local must wait for at least one network interface device to exist
# before attempting to download EC2 instance metadata.
#
# These udev rules implement this policy along with
# /etc/systemd/system/cloud-init-local.service.d/10-wait-for-net-device.conf
ACTION!="remove", SUBSYSTEM=="net", KERNEL!="lo", DRIVERS=="ena|vif", TAG+="systemd", ENV{SYSTEMD_ALIAS}+="/dev/ec2imds"
For us it is 100% reproducible on an AMI we build internally, when started on t3-class AWS instances.
The only non-lo network adapter is managed by the ena driver, and the kernel recognizes it a few seconds after cloud-init local starts.
Good to know, thank you. How would one reproduce this? How can you ensure that only an ena interface is available?
For the moment we've implemented a workaround that delays cloud-init local until the network adapter is recognized, but I'm wondering whether cloud-init could have a more official way to handle network adapters that appear late in the boot process.
Cloud-init should handle this better. Can you please share more of the log? The whole cloud-init.log would be best. If you need to redact it, it would still be good to know what a line like the following says, if the log contains one:
2024-05-14 14:50:46,817 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'type': 'physical', 'name': 'enp5s0', 'subnets': [{'type': 'dhcp', 'control': 'auto'}]}]}
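If sharing the whole log is not an option, a short script can pull out just that line. This is a sketch under the assumption that the message looks like the sample above, that is, the marker text followed by a Python dict repr. The names `find_net_config` and `configured_names` are hypothetical helpers, not cloud-init functions.

```python
import ast

MARKER = "applying net config names for "


def find_net_config(log_lines):
    """Return the network config dicts found in cloud-init.log lines."""
    configs = []
    for line in log_lines:
        _, found, payload = line.partition(MARKER)
        if not found:
            continue
        try:
            # The payload is assumed to be a dict repr; literal_eval parses
            # it without executing arbitrary code.
            configs.append(ast.literal_eval(payload.strip()))
        except (ValueError, SyntaxError):
            # Skip lines that were truncated or redacted.
            continue
    return configs


def configured_names(config):
    """List the physical interface names in a version 1 network config."""
    return [
        item["name"]
        for item in config.get("config", [])
        if item.get("type") == "physical"
    ]
```

On the affected instance this could be run as `find_net_config(open("/var/log/cloud-init.log"))`, and the interface names compared with what the kernel actually exposes.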
@misdoro Thanks again for reporting. If you can share any additional data about your image (more complete logs, reproducer), that would be extremely helpful.
For EC2, and probably other datasources as well, cloud-init-local.service needs to wait until at least one interface is available prior to proceeding into ephemeral network setup.
Current state
Cloud-init already does something similar, but with a different intent and outcome. When a network configuration is available, cloud-init polls for the configured interfaces and waits for them to exist. Once they are available, cloud-init renames the interfaces itself.
Problems
- Interface rename shouldn't actually be required in many cases (netplan, systemd, and friends are capable of doing rename). This logic predates current network backends.
- The Local service doesn't wait for physical devices to exist before attempting to bring up an ephemeral interface. This seems to work when kernel drivers are loaded by initramfs as a module or built into the kernel.
Proposed fix
- short term: add a poll for a single interface[1][2]
- long term: only do interface rename in renderers which require it (possibly eni, ifconfig, sysconfig?). Initially we should retain current functionality for untested renderers and potentially add an opt-out flag to allow testing the different network back ends for working rename support.
[1] not wanted for LXD, None, NoCloud, and any other datasources which do not require an interface to be available in Local stage
[2] udevadm settle causes unnecessary waiting. Polling at some frequency would probably be more appropriate.
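The short-term idea of polling for a single interface at a fixed frequency, instead of calling udevadm settle, could look roughly like the sketch below. It is an illustration only and not cloud-init's implementation: the function name `wait_for_interface`, the default timeout and interval, and the choice to treat any non-lo entry under `/sys/class/net` as usable are all assumptions.

```python
import os
import time


def wait_for_interface(timeout=10.0, interval=0.1,
                       sys_class_net="/sys/class/net"):
    """Poll sysfs until a non-loopback interface exists.

    Returns the first interface name in sorted order, or None if no
    interface appeared before the timeout. Hypothetical helper.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            entries = sorted(os.listdir(sys_class_net))
        except FileNotFoundError:
            entries = []
        # Interfaces are directories; ignore lo and plain files.
        names = [
            name for name in entries
            if name != "lo"
            and os.path.isdir(os.path.join(sys_class_net, name))
        ]
        if names:
            return names[0]
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

A datasource that needs an interface in the Local stage would call this before setting up the ephemeral network and fall through to its existing error path on `None`. Datasources covered by footnote [1] would skip the call entirely.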
Related Fedora bug report: https://bugzilla.redhat.com/show_bug.cgi?id=2329833