CoreOS 1855.4.0 AWS EBS Mount Lockup
Issue Report
Bug
Ignition crashes system if storage.filesystems is specified
CT Input
storage:
  filesystems:
    - name: data
      mount:
        device: /dev/sdb
        format: ext4
        wipe_filesystem: true
        label: DATA
Converted to user data with ct < test.ct:
{"ignition":{"config":{},"security":{"tls":{}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"filesystems":[{"mount":{"device":"/dev/sdb","format":"ext4","label":"DATA","wipeFilesystem":true},"name":"data"}]},"systemd":{}}
Container Linux Version
CoreOS-stable-1855.4.0-hvm (ami-086eb64b7f4485a72) ct v0.9.0
Environment
AWS
Expected Behavior
At a minimum, format my block device.
Actual Behavior
The system does not boot. Can't log in, so can't get logs. Screenshot: https://www.evernote.com/l/AE__MLODCjJN_p8vv9G_LkqC2nBnb6BbAqI
Reproduction Steps
- Create an EC2 instance, attach an 80GB EBS volume at /dev/sdb, add the user data, and boot; the system crashes
Other Information
Worked before on older CoreOS-stable-1688.5.3-hvm (ami-a2b6a2de)
Manually booting without CT/Ignition allows manual format/mounting of /dev/sdb (mounting by label is also no problem)
Thanks for the report. This probably isn't an Ignition bug but rather a kernel bug since Ignition didn't change between 1855.3.0 and 1855.4.0. Can you repro on alpha?
Will check tomorrow :)
@ajeddeloh - Same issue on alpha: CoreOS-alpha-1925.0.0-hvm (ami-01d20d68c856200cc)
Also please note previous working version was much older: CoreOS-stable-1688.5.3-hvm (ami-a2b6a2de)
This happens for me as well on the latest-gen instance types. Switching an instance from t2 to t3 results in the system hanging on a systemd unit that's waiting for the device /dev/xvdg.
Perhaps this has something to do with the switch to the NVMe device names that t3 instances use.
Fetched one of the logs from the machines, I see lots of these messages:
(1 of 3) A start job is running for dev-xvdg.device (4s / 1min 30s)
(1 of 3) A start job is running for dev-xvdg.device (5s / 1min 30s)
(1 of 3) A start job is running for dev-xvdg.device (5s / 1min 30s)
(2 of 3) A start job is running for Ignition (disks) (10s / no limit)
(2 of 3) A start job is running for Ignition (disks) (11s / no limit)
(2 of 3) A start job is running for Ignition (disks) (11s / no limit)
(3 of 3) A start job is running for …mapper-usr.device (12s / no limit)
(3 of 3) A start job is running for …mapper-usr.device (12s / no limit)
(3 of 3) A start job is running for …mapper-usr.device (13s / no limit)
(1 of 3) A start job is running for dev-xvdg.device (9s / 1min 30s)
(1 of 3) A start job is running for dev-xvdg.device (9s / 1min 30s)
(1 of 3) A start job is running for dev-xvdg.device (10s / 1min 30s)
(2 of 3) A start job is running for Ignition (disks) (15s / no limit)
(2 of 3) A start job is running for Ignition (disks) (15s / no limit)
(2 of 3) A start job is running for Ignition (disks) (16s / no limit)
(3 of 3) A start job is running for …mapper-usr.device (16s / no limit)
(3 of 3) A start job is running for …mapper-usr.device (17s / no limit)
[ 24.010121] systemd-networkd[242]: eth0: Configured
(3 of 3) A start job is running for …mapper-usr.device (17s / no limit)
(1 of 3) A start job is running for dev-xvdg.device (13s / 1min 30s)
(1 of 3) A start job is running for dev-xvdg.device (14s / 1min 30s)
(1 of 3) A start job is running for dev-xvdg.device (14s / 1min 30s)
(2 of 3) A start job is running for Ignition (disks) (19s / no limit)
It eventually times out:
[ 101.111108] systemd[1]: Timed out waiting for device dev-xvdg.device.
[FAILED] Failed to start Ignition (disks).
[ 101.154243] ignition[415]: disks: createFilesystems: op(1): [failed] waiting for devices [/dev/xvdg]: device unit dev-xvdg.device timeout
See 'systemctl status ignition-disks.service' for details.
[ 101.159042] systemd[1]: dev-xvdg.device: Job dev-xvdg.device/start failed with result 'timeout'.
@enieuw Hey... how did you get that system log? My AWS EC2 system logs are blank :(
It takes a while but eventually they show up under "Instance Settings -> Get system log".
Which instance type are you running by the way?
For this debug session I was running t2/t3/m5 (can't remember the exact size)
Creating a VM as a t2 instance and then changing the instance type to t3 works, I actually see the symlinks working:
Container Linux by CoreOS stable (1855.4.0)
core@ip-10-14-30-4 ~ $ systemctl status dev-xvdg.device
● dev-xvdg.device - Amazon Elastic Block Store
Follow: unit currently follows state of sys-devices-pci0000:00-0000:00:1f.0-nvme-nvme1-nvme1n1.device
Loaded: loaded
Active: active (plugged) since Fri 2018-10-19 06:49:00 UTC; 1min 8s ago
Device: /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
Oct 19 06:49:00 ip-10-14-30-4 systemd[1]: Found device Amazon Elastic Block Store.
core@ip-10-14-30-4 ~ $ date
Fri Oct 19 06:50:14 UTC 2018
core@ip-10-14-30-4 ~ $ ls -al /dev/xvdg
lrwxrwxrwx. 1 root root 7 Oct 19 06:49 /dev/xvdg -> nvme1n1
core@ip-10-14-30-4 ~ $
Creating a fresh T3 instance results in the hanging behaviour
Ah, I'm betting that's because Ignition doesn't run on the second boot when the instance type is changed.
Yeah, most likely. It doesn't trigger the wait for the systemd unit, so the system continues booting.
If I specify /dev/nvme1n1 in my Ignition file it does boot properly. Perhaps the call to systemd is made before udev has created the aliases added by #2399.
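For reference, a minimal CT sketch of that workaround, assuming the EBS volume shows up as /dev/nvme1n1 (the NVMe name can vary by attachment order and instance type, so this is illustrative, not a general fix):

```yaml
storage:
  filesystems:
    - name: data
      mount:
        # NVMe-based instance types (t3, m5, ...) expose EBS volumes
        # as /dev/nvmeXnY instead of the name chosen in the console.
        device: /dev/nvme1n1
        format: ext4
        wipe_filesystem: true
        label: DATA
```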
@enieuw I think you are waiting for https://github.com/coreos/bootengine/pull/149 to do that.
Okay, this looks like a mismatch between assigning the EBS volume to /dev/sdb in the AWS console and it appearing as /dev/xvdb in Linux.
ap-northeast-1, t2.micro, CoreOS-stable-1855.4.0-hvm (ami-086eb64b7f4485a72); root device: /dev/xvda; block devices: /dev/xvda, /dev/sdb
{"ignition":{"config":{},"security":{"tls":{}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"filesystems":[{"mount":{"device":"/dev/sdb","format":"ext4","label":"DATA","wipeFilesystem":true},"name":"data"}]},"systemd":{}}
Results in:
disks: createFilesystems: op(1): [started] waiting for devices [/dev/sdb]
disks: createFilesystems: op(1): [failed] waiting for devices [/dev/sdb]: device unit dev-sdb.device timeout
disks: failed to create filesystems: failed to wait on filesystems devs: device unit dev-sdb.device timeout
Updating the config from sdb -> xvdb finishes the boot.
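Concretely, on this Xen-based t2.micro the config only boots when it references the kernel's name for the device; a sketch with only the device field changed from the original CT input:

```yaml
storage:
  filesystems:
    - name: data
      mount:
        # On Xen-based instances (t2, m4, ...) the console name
        # /dev/sdb appears inside the guest as /dev/xvdb.
        device: /dev/xvdb
        format: ext4
        wipe_filesystem: true
        label: DATA
```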
Is there already a ticket for sdb vs xvdb? I think on some systems (can't remember which) /dev/sdb shows up instead.
As a follow-on thought: it seems EBS volumes (add-on disks on AWS) sometimes show up as /dev/sdb and sometimes as /dev/xvdb. That makes Ignition configs fail when mismatched, and makes it difficult to use the same config on various servers.
Is there any guidance on /dev/sdb vs /dev/xvdb going forward in CoreOS? Perhaps following such guidance would have prevented this ticket.
@pctj101 this is an unfortunate choice on AWS side, see https://github.com/coreos/bugs/issues/2399#issuecomment-422345804. Their volumes/instances/names grid is documented here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/device_naming.html
@lucab - Yes, I too have seen my EC2 launch spec and the CoreOS device path mismatch. I think it's related to this item on the same page you linked:
Depending on the block device driver of the kernel, the device could be attached with a different name than you specified. For example, if you specify a device name of /dev/sdh, your device could be renamed /dev/xvdh or /dev/hdh.
So it seems the kernel configuration (and thus CoreOS) also plays a role. It's not just "it's AWS" but "it's AWS and how CoreOS interacts with it", which is why I'm raising the question. :)
Anyway, yes, I read the other thread you linked. I can share that for device mapping I've abandoned Ignition and resorted to a series of shell scripts to format and mount things properly (surviving instance type changes). I'm not sure that's the long-term way to do it, but I'm pretty sure the discussion either way is lengthy and comes with plenty of ideology. :)
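One way that script-based approach could be expressed in CT itself is a oneshot systemd unit that probes the names an instance type might expose before formatting. This is a rough sketch, not the actual scripts from the thread; the unit name, device list, mount point, and label are all hypothetical (and a proper .mount unit would be cleaner than mounting from ExecStart):

```yaml
systemd:
  units:
    - name: format-mount-data.service
      enabled: true
      contents: |
        [Unit]
        Description=Format and mount the data EBS volume (hypothetical sketch)
        Before=local-fs.target

        [Service]
        Type=oneshot
        # $$d keeps systemd from expanding $d before the shell sees it.
        # Try the Xen name first, then the NVMe name; format only if no
        # filesystem exists yet, then mount.
        ExecStart=/bin/sh -c 'for d in /dev/xvdb /dev/nvme1n1; do \
            [ -b "$$d" ] || continue; \
            blkid "$$d" >/dev/null 2>&1 || mkfs.ext4 -L DATA "$$d"; \
            mkdir -p /mnt/data; \
            mount "$$d" /mnt/data; \
            break; \
          done'

        [Install]
        WantedBy=multi-user.target
```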
When it comes to AWS totally changing device paths for NVMe, even I have trouble justifying automagic resolution with ignition.
It's definitely a usability discussion rather than a bug discussion.
Possibly related: #2481.