
Intermittent AWS AMI corruptions

dongsupark opened this issue 3 years ago

Description

During the recent release process, we hit an unknown issue where os/kola/aws failed only on AWS arm64 for Stable 3227.2.2.

The console log of the Kola test says:

[    4.245983] systemd-fsck[680]: ROOT contains a file system with errors, check forced.
ROOT: fsck 0.0% complete...
[    4.340402] device-mapper: verity: sha256 using implementation "sha256-ce"
ROOT: fsck 81.4% complete...
[    4.316076] systemd-fsck[680]: ROOT: Directory inode 7252, block #0, offset 0: directory corrupted
[    4.317332] systemd-fsck[680]: ROOT: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
[    4.317526] systemd-fsck[680]: (i.e., without -a or -p options)
[    4.324259] systemd-fsck[674]: fsck failed with exit status 4.
[FAILED] Failed to start File Syste…ck on /dev/disk/by-label/ROOT.

Rerunning the specific kola tests did not help. Rerunning the whole vm-matrix to regenerate the AMIs and then running the kola tests again did not help either. As expected, it is also not possible to manually launch an EC2 instance from the problematic AMI.

Impact

AWS kola tests for arm64 cannot run at all.

Environment and steps to reproduce

There is no simple way to reproduce this issue. It happens only in this specific case: not in other channels, and not on other architectures. We saw a similar issue earlier this year, but not in Stable and not on arm64.

dongsupark avatar Sep 01 '22 12:09 dongsupark
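The console log in the issue description ends with "fsck failed with exit status 4". fsck's exit status is a bitmask (per the fsck(8) man page), and bit 4 means errors were left uncorrected, which is why the ROOT mount is refused. As an illustration (the decoder function itself is hypothetical, not part of any Flatcar tooling):

```python
# fsck(8) exit status bits; values can be OR-ed together.
FSCK_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}

def decode_fsck_status(status: int) -> list:
    """Expand an fsck exit status into its component meanings."""
    return [msg for bit, msg in FSCK_BITS.items() if status & bit]
```

For the log above, `decode_fsck_status(4)` yields `["filesystem errors left uncorrected"]`, matching the "UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY" message.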

This is not intermittent for me. All my arm64 instances fail on stable. They run fine on beta.

sdlarsen avatar Oct 12 '22 06:10 sdlarsen

@dongsupark if the AMI is corrupt we should take it down

jepio avatar Oct 13 '22 11:10 jepio

@jepio As described in the comment above, the corrupt AMI was set to private.

dongsupark avatar Oct 13 '22 11:10 dongsupark

I've seen this corruption on stable 3227.2.2 arm64 AMIs in us-east-2 in the last few days. I'd suspect the issue affects the whole set rather than just one AMI (since they're regional).

dghubble avatar Oct 13 '22 15:10 dghubble

I've traced it down to AWS EC2 imports when the VMDK format is used. With the plain (raw) format everything works, but after conversion to VMDK the same image becomes a corrupted AMI, while locally the file still boots fine with QEMU. As a workaround I've prepared a change that creates the Flatcar AMIs from plain image uploads: https://github.com/flatcar/mantle/pull/391

pothos avatar Oct 28 '22 15:10 pothos

For the record: instead of going through vmdk-convert as we do now, I also tried using qemu-img directly to create the streamOptimized VMDK, but it didn't help (qemu-img convert -O vmdk -o subformat=streamOptimized,adapter_type=lsilogic flatcar_production_ami_image.bin flatcar_production_ami_image.vmdk).

pothos avatar Oct 31 '22 12:10 pothos
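When debugging a conversion pipeline like the one above, it can help to confirm which format a local image file actually is. VMDK sparse/streamOptimized extents begin with the magic bytes "KDMV" (0x564d444b little-endian, per the VMDK specification), while a raw disk image starts with its partition-table bytes instead. A minimal sketch (the helper function is illustrative, not part of mantle):

```python
# VMDK sparse extent header magic, "KDMV" at offset 0 of the file.
VMDK_MAGIC = b"KDMV"

def image_format(path: str) -> str:
    """Best-effort guess of a disk image's container format from its header."""
    with open(path, "rb") as f:
        head = f.read(4)
    return "vmdk" if head == VMDK_MAGIC else "raw (or other)"
```

For example, `image_format("flatcar_production_ami_image.vmdk")` should report "vmdk" if the conversion produced a sparse-extent file; the file names are the ones used in this thread.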

Since Oct 31 we have been using the raw flatcar_production_ami_image.bin.bz2 image instead of VMDK, which works around the issue. We tried to reach out to AWS, but need to do so again to get the broken VMDK handling resolved.

pothos avatar Dec 08 '22 09:12 pothos
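For anyone consuming the published artifacts directly, the workaround path means downloading flatcar_production_ami_image.bin.bz2 and decompressing it before import. A minimal sketch of the decompression step using only the Python standard library (the file names come from the thread; the helper itself is illustrative):

```python
import bz2
import shutil

def decompress_image(src: str, dst: str) -> None:
    """Stream-decompress a .bz2 disk image without loading it all into memory."""
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)

# Example (paths as published on the release server):
# decompress_image("flatcar_production_ami_image.bin.bz2",
#                  "flatcar_production_ami_image.bin")
```

Streaming via `shutil.copyfileobj` matters here because the uncompressed raw image is several gigabytes.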

The workaround is in place. We have not seen the issue recently.

dongsupark avatar Sep 08 '23 13:09 dongsupark

My report to AWS didn't get acted on, so maybe we should warn users about the VMDK AMI images, since we still publish them on the release server for download and mention them in the docs.

pothos avatar Sep 08 '23 13:09 pothos

PR https://github.com/flatcar/flatcar-docs/pull/334

dongsupark avatar Sep 08 '23 14:09 dongsupark