oracle-linux

Latest GRUB update breaks booting

Open robertm98 opened this issue 1 year ago • 18 comments

This is a different bug compared to what is described in https://github.com/oracle/oracle-linux/issues/147

When the latest updates are applied and the server is then rebooted, GRUB does not start and appears to be stuck in a busy loop displaying the following message: "error: ../../grub-core/commands/efi/tpm.c:150:unknown TPM error"

Secure Boot is disabled, and there were no previous problems.

Steps to reproduce:

  1. Download and install OL 9.4 x86_64. The first boot is OK.
  2. Apply updates.
  3. Reboot. GRUB then fails to load with the above error message.

As a cross-check, a fresh install was done with GRUB updates excluded by adding exclude=grub* to the /etc/dnf/dnf.conf file.

The non-grub updates were installed and the server rebooted OK.
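For anyone repeating this cross-check, the exclude line can also be added with a script. The following is a minimal Python sketch, not an official tool: the helper name `add_dnf_exclude` is hypothetical, it assumes dnf.conf is a plain INI file with a `[main]` section, and it assumes dnf accepts space- or comma-separated globs in `exclude=` (as dnf.conf(5) describes list options). Note that `configparser` discards comments when it writes the file back, so review the result before replacing /etc/dnf/dnf.conf.

```python
import configparser
import io
import re


def add_dnf_exclude(conf_text: str, pattern: str = "grub*") -> str:
    """Return dnf.conf text with `pattern` added to exclude= in [main]."""
    # interpolation=None so values containing '%' are left untouched
    cp = configparser.ConfigParser(interpolation=None)
    cp.read_string(conf_text)
    if not cp.has_section("main"):
        cp.add_section("main")
    # Split an existing exclude list on commas and whitespace, keeping order
    existing = cp.get("main", "exclude", fallback="")
    current = [p for p in re.split(r"[,\s]+", existing) if p]
    if pattern not in current:
        current.append(pattern)
    cp.set("main", "exclude", " ".join(current))
    out = io.StringIO()
    # space_around_delimiters=False keeps the key=value style dnf.conf uses
    cp.write(out, space_around_delimiters=False)
    return out.getvalue()
```

Calling it twice is harmless: a pattern that is already listed is not added again.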

robertm98, Jun 25 '24 08:06

Hello! Thanks for the report. In fact, the last update issued for the linked issue has zero code changes, though it MIGHT have regenerated a GRUB config for you; maybe that is what triggers the issue. Are you seeing any errors other than "unknown TPM error"? Are you using a BTRFS filesystem and/or BTRFS snapshots?

aburmash, Jun 25 '24 08:06

Never mind, I reproduced it. We are going to pull this update and issue a proper one shortly.

aburmash, Jun 25 '24 09:06

Thank you. For info, the filesystem is XFS. One minor change: I renamed the LVM volume group from "ol" to "olb" so that it does not clash with the volume group name of the previous installation on the original drive when I copy files across. I wondered whether this could be relevant given your questions about the filesystem, but from your last reply probably not. The installation is on a separate SATA drive, and all other drives are disconnected.

robertm98, Jun 25 '24 09:06

@robertm98 once again, thank you very much! I see that it is not related to filesystems, just a broken GRUB config.

aburmash, Jun 25 '24 09:06

Same issue here. Is there any way to fix a broken GRUB / grub.cfg from within the UEFI interactive shell?

m45733r, Jun 25 '24 09:06

The only way I think this could be repaired is to do a recovery boot from the installation media, chroot to /mnt/sysroot (I think), and then possibly use dnf to roll back or edit the config. @aburmash Would it be possible to get the details of the errors in the config and what needs to be done to make things good, please? That is, what needs editing, and what then needs running to apply the config changes.

robertm98, Jun 25 '24 10:06

@robertm98 @m45733r I will provide recovery instructions for the UEFI shell shortly.

aburmash, Jun 25 '24 10:06

@m45733r

  1. If you have already installed the bad update but have not rebooted yet, run: grub2-mkconfig > /boot/grub2/grub.cfg OR grub2-mkconfig > /boot/efi/EFI/redhat/grub.cfg
  2. If you can only work from the UEFI shell:
  • Identify which FS is your ESP (EFI System Partition). To do that, check the displayed partitions one by one; the ESP is usually FS0. For example:
      FS0: Alias(s):HD0a1b:;BLK1:
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)/HD(1,GPT,3AF7074E-C0BB-400D-8FC7-E9EC738AA53F,0x800,0x32000)
     BLK0: Alias(s):
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)
     BLK2: Alias(s):
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)/HD(2,GPT,14BE7023-6C02-4573-8891-9F639B9D936A,0x32800,0x400000)
     BLK3: Alias(s):
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)/HD(3,GPT,E700F071-90A5-40BB-8132-52AF688193B7,0x432800,0x5900800)
fs0:
ls

If you see the EFI directory, you are in the right place.

cd EFI/redhat
rm grub.cfg
grubx64.efi

You will be dropped to the GRUB command line. Run ls to display the list of available disks. Then find the disk that holds the /boot directory, or identify the /boot partition, by running ls <disk>/ on each one, for example: ls (hd0,gpt2)/. When you have found /boot, you will see something like:

grub> ls (hd0,gpt2)/
./ ../ efi/ grub2/ loader/ vmlinuz-5.14.0-427.16.1.el9_4.x86_64
System.map-5.14.0-427.16.1.el9_4.x86_64 config-5.14.0-427.16.1.el9_4.x86_64
.vmlinuz-5.14.0-427.16.1.el9_4.x86_64.hmac
symvers-5.14.0-427.16.1.el9_4.x86_64.gz
initramfs-5.14.0-427.16.1.el9_4.x86_64.img
vmlinuz-5.15.0-206.153.7.el9uek.x86_64
System.map-5.15.0-206.153.7.el9uek.x86_64 config-5.15.0-206.153.7.el9uek.x86_64
.vmlinuz-5.15.0-206.153.7.el9uek.x86_64.hmac
symvers-5.15.0-206.153.7.el9uek.x86_64.gz
initramfs-5.15.0-206.153.7.el9uek.x86_64.img
initramfs-0-rescue-36703c3cdc50ff74e863e867384f6a8a.img
vmlinuz-0-rescue-36703c3cdc50ff74e863e867384f6a8a
initramfs-5.15.0-206.153.7.el9uek.x86_64kdump.img 

Now check the boot entries for your kernel. Run ls (hd0,gpt2)/loader/entries/ and you will see something like:

grub> ls (hd0,gpt2)/loader/entries/
./ ../ 8c622b7d13354f7fbe5eee50d3f340bd-5.14.0-427.16.1.el9_4.x86_64.conf
8c622b7d13354f7fbe5eee50d3f340bd-5.15.0-206.153.7.el9uek.x86_64.conf
36703c3cdc50ff74e863e867384f6a8a-0-rescue.conf

Then print the entry for the kernel you want to boot with cat (hd0,gpt2)/loader/entries/8c622b7d13354f7fbe5eee50d3f340bd-5.15.0-206.153.7.el9uek.x86_64.conf and you will see something like:

title Oracle Linux Server (5.15.0-206.153.7.el9uek.x86_64 with Unbreakable Enterprise Kernel) 9.4
version 5.15.0-206.153.7.el9uek.x86_64
linux /vmlinuz-5.15.0-206.153.7.el9uek.x86_64
initrd /initramfs-5.15.0-206.153.7.el9uek.x86_64.img $tuned_initrd
options root=/dev/mapper/ocivolume-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M LANG=en_US.UTF-8 console=tty0 console=ttyS0,115200 rd.luks=0 rd.md=0 rd.dm=0 rd.lvm.vg=ocivolume rd.lvm.lv=ocivolume/root rd.net.timeout.dhcp=10 rd.net.timeout.carrier=5 netroot=iscsi:169.254.0.2:::1:iqn.2015-02.oracle.boot:uefi rd.iscsi.param=node.session.timeo.replacement_timeout=6000 net.ifnames=1 nvme_core.shutdown_timeout=10 ipmi_si.tryacpi=0 ipmi_si.trydmi=0 libiscsi.debug_libiscsi_eh=1 loglevel=4 crash_kexec_post_notifiers
grub_users $grub_users
grub_arg --unrestricted
grub_class ol

Now, still at the GRUB command line, run:

linux (hd0,gpt2)/vmlinuz-5.15.0-206.153.7.el9uek.x86_64 root=/dev/mapper/ocivolume-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M LANG=en_US.UTF-8 console=tty0 console=ttyS0,115200 rd.luks=0 rd.md=0 rd.dm=0 rd.lvm.vg=ocivolume rd.lvm.lv=ocivolume/root rd.net.timeout.dhcp=10 rd.net.timeout.carrier=5 netroot=iscsi:169.254.0.2:::1:iqn.2015-02.oracle.boot:uefi rd.iscsi.param=node.session.timeo.replacement_timeout=6000 net.ifnames=1 nvme_core.shutdown_timeout=10 ipmi_si.tryacpi=0 ipmi_si.trydmi=0 libiscsi.debug_libiscsi_eh=1 loglevel=4 crash_kexec_post_notifiers
initrd (hd0,gpt2)/initramfs-5.15.0-206.153.7.el9uek.x86_64.img
boot

Here the kernel path comes from the linux line of the entry, the kernel options from its options line, and the initrd from its initrd line.

IMPORTANT: when copying and pasting, VERIFY that the linux command is a single line. If there are newlines or carriage returns in the buffer, everything after them will NOT be applied. So once you have copied the full linux string, paste it into a file first to verify that it is a single line. Also remember that paths are relative to the partition that holds /boot. If /boot is on the root partition rather than a separate partition, you will need to find the disk with the root partition, and your paths will look something like (lvm/volume-root)/boot/.

Once the system has booted, run:

grub2-mkconfig > /boot/grub2/grub.cfg
grub2-mkconfig > /boot/efi/EFI/redhat/grub.cfg
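Because the most error-prone step above is turning a BLS entry into a single-line linux command, here is a minimal Python sketch that does it offline (for example on another machine) before you type the result at the GRUB prompt. It is an illustration under assumptions, not part of GRUB or grubby: the function name `bls_to_grub_commands` is hypothetical, `boot_dev` defaults to the `(hd0,gpt2)` from the example above, and tokens starting with `$` (such as `$tuned_initrd`) are dropped on the assumption that they are unset at a bare GRUB prompt anyway.

```python
def bls_to_grub_commands(entry_text: str, boot_dev: str = "(hd0,gpt2)") -> list[str]:
    """Build single-line GRUB CLI commands (linux, initrd, boot) from a BLS entry.

    boot_dev is the GRUB device that holds /boot, e.g. "(hd0,gpt2)", or
    "(lvm/volume-root)/boot" when /boot is a directory on the root filesystem.
    """
    fields = {}
    for raw in entry_text.splitlines():
        parts = raw.strip().split(None, 1)
        if not parts or parts[0].startswith("#"):
            continue
        # BLS lines are "key value"; keep the first occurrence of each key
        fields.setdefault(parts[0], parts[1] if len(parts) > 1 else "")

    def without_vars(text):
        # Drop GRUB variables such as $tuned_initrd; assumed to be unset at a
        # bare GRUB prompt where no grubenv has been loaded
        return [tok for tok in text.split() if not tok.startswith("$")]

    linux = " ".join(["linux", boot_dev + fields["linux"], *without_vars(fields.get("options", ""))])
    initrd = " ".join(["initrd", *(boot_dev + p for p in without_vars(fields["initrd"]))])
    # Each command is a single string with no embedded newlines, ready to paste
    return [linux, initrd, "boot"]
```

Paste the entry text from the cat output, print the three returned strings, and enter each one separately at the grub> prompt. If /boot lives on the root filesystem, pass something like `boot_dev="(lvm/volume-root)/boot"` instead.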

aburmash, Jun 25 '24 10:06

@robertm98 the problem is that on OL9 the GRUB 2 config was switched to a parent config at /boot/efi/EFI/redhat/grub.cfg, which in turn loads the real config at /boot/grub2/grub.cfg.

For CERTAIN contents of /boot/efi/EFI/redhat/grub.cfg, the fix that was applied for the leapp in-place upgrade, instead of correctly updating the configs (or not touching them), writes /boot/efi/EFI/redhat/grub.cfg into /boot/grub2/grub.cfg, so the parent config ends up loading itself and the system loops.
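A quick way to check whether /boot/grub2/grub.cfg has been overwritten with the parent (stub) config is sketched below in Python. It rests on two assumptions that this thread does not state: that the OL9 stub hands off with a `configfile $prefix/grub.cfg` line (the exact stub contents may differ), and that real grub2-mkconfig output contains `### BEGIN /etc/grub.d/` section markers. The helper name `looks_like_stub` is hypothetical.

```python
import re

# Assumption: the stub hands off to the real config with a line such as
# "configfile $prefix/grub.cfg" (or "${prefix}/grub.cfg")
_HANDOFF = re.compile(r"^\s*configfile\s+\$\{?prefix\}?/grub\.cfg\s*$", re.MULTILINE)
# Assumption: full grub2-mkconfig output contains these section markers
_MKCONFIG_MARKER = "### BEGIN /etc/grub.d/"


def looks_like_stub(cfg_text: str) -> bool:
    """True if cfg_text looks like the small stub rather than a full config.

    If /boot/grub2/grub.cfg looks like the stub, its configfile target is
    that same file, so GRUB would keep loading it again.
    """
    return bool(_HANDOFF.search(cfg_text)) and _MKCONFIG_MARKER not in cfg_text
```

For example, `looks_like_stub(open("/boot/grub2/grub.cfg").read())` returning True would match the loop described above; regenerating the file with grub2-mkconfig, as in the recovery steps, replaces it with a full config.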

aburmash, Jun 25 '24 11:06

Thanks for the instructions. Some remarks from my experience: running grubx64.efi after grub.cfg was deleted did not automatically put me into the GRUB command line; instead it got stuck, and I needed to power-cycle the machine. ls (hd0,gpt1) only shows "Filesystem is fat" or "Filesystem is xfs", not the actual contents. However, ls (hd0,gpt2)/loader/entries succeeded only on the right disk, listing its contents, and showed "not found" on all the others.

Booting was successful, but after logging in, running grub2-mkconfig, and rebooting, it returned to the GRUB command line again :/ After reading your latest comment, I ran grub2-mkconfig with output to /boot/efi/EFI/redhat/grub.cfg, and it seems to work now!

m45733r, Jun 25 '24 11:06

> ls (hd0,gpt1)

Yes, you need a trailing slash to display the contents: ls (hd0,gpt1)/

> boot was successful, but after login + grub2-mkconfig + reboot it would return to grub cmdline again :/

Oh, yes! That is because /boot/efi/EFI/redhat/grub.cfg was removed from the UEFI shell during recovery. I've updated my post to reflect that.

aburmash, Jun 25 '24 11:06

Thank you.

robertm98, Jun 25 '24 11:06

I'm not sure whether this is related to the original issue, but the only thing that is a bit weird now is that grubby shows:

[root@ol9-machine ~]# grubby --default-kernel
/boot/vmlinuz-5.15.0-207.156.6.el9uek.x86_64
[root@ol9-machine ~]# grubby --default-index
3
[root@ol9-machine ~]# grubby --info DEFAULT
index=3
kernel="/boot/vmlinuz-5.15.0-207.156.6.el9uek.x86_64"
args="ro rd.lvm.lv=ol/root rhgb quiet crashkernel=1G-64G:448M,64G-:512M $tuned_params"
root="/dev/mapper/ol-root"
initrd="/boot/initramfs-5.15.0-207.156.6.el9uek.x86_64.img $tuned_initrd"
title="Oracle Linux Server (5.15.0-207.156.6.el9uek.x86_64 with Unbreakable Enterprise Kernel) 9.4"
id="bda9a182a36740ada28baaa218d5c09d-5.15.0-207.156.6.el9uek.x86_64"

And yet, when I reboot, GRUB automatically selects index 0, a kernel that is no longer present in /boot. So the system is usable but wouldn't survive an automated reboot. See the attached screenshot.

[root@ol9-machine ~]# uname -r
5.15.0-207.156.6.el9uek.x86_64
[root@ol9-machine ~]# dnf list installed | grep kernel
kernel.x86_64                         5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-core.x86_64                    5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-modules.x86_64                 5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-modules-core.x86_64            5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-tools.x86_64                   5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-tools-libs.x86_64              5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-uek.x86_64                     5.15.0-207.156.6.el9uek             @ol9_UEKR7
kernel-uek-core.x86_64                5.15.0-207.156.6.el9uek             @ol9_UEKR7
kernel-uek-modules.x86_64             5.15.0-207.156.6.el9uek             @ol9_UEKR7

Any help appreciated.

[screenshot]

m45733r, Jun 25 '24 12:06

Can you please show the output of:

for x in $(find /boot |grep grubenv); do echo $x; cat $x; done

cat /boot/efi/EFI/redhat/grub.cfg |grep grubenv
cat /boot/grub2/grub.cfg |grep grubenv
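The same check can be scripted: read grubenv, take `saved_entry`, and confirm that a matching file exists in /boot/loader/entries (a BLS entry ID is the file name without `.conf`, as grubby's `id=` output above shows). A minimal Python sketch follows; the function names are hypothetical, and it assumes the usual grubenv layout of `key=value` lines followed by `#` padding.

```python
def parse_grubenv(text: str) -> dict[str, str]:
    """Parse a grubenv block into a dict of key=value pairs."""
    env = {}
    for line in text.splitlines():
        # Skip the header comments and the '#' padding that fills the block
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            env[key] = value
    return env


def saved_entry_status(grubenv_text: str, entry_files: list[str]) -> str:
    """Check that saved_entry names an existing BLS entry file."""
    saved = parse_grubenv(grubenv_text).get("saved_entry")
    if saved is None:
        return "no saved_entry set"
    # A BLS entry ID is its file name without the .conf suffix
    ids = {name[: -len(".conf")] for name in entry_files if name.endswith(".conf")}
    if saved in ids:
        return "ok"
    return f"saved_entry {saved} has no matching file in /boot/loader/entries"
```

On a live system this would be called as `saved_entry_status(open("/boot/grub2/grubenv").read(), os.listdir("/boot/loader/entries"))`. An "ok" result only rules out a dangling saved_entry; it does not detect extra, stale entries.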

aburmash, Jun 25 '24 12:06

Sure, here you go:

/boot/grub2/grubenv
# GRUB Environment Block
# WARNING: Do not edit this file by tools other than grub-editenv!!!
saved_entry=bda9a182a36740ada28baaa218d5c09d-5.15.0-207.156.6.el9uek.x86_64
boot_success=1
boot_indeterminate=0


/boot/efi/EFI/redhat/grub.cfg

if [ -f ${config_directory}/grubenv ]; then
  load_env -f ${config_directory}/grubenv
elif [ -s $prefix/grubenv ]; then
# The kernelopts variable should be defined in the grubenv file. But to ensure that menu
# without a grubenv file, define a fallback kernelopts variable if this has not been set.
# The kernelopts variable in the grubenv file can be modified using the grubby tool or by
# the kernelopts variable in the grubenv file and the fallback kernelopts variable.

/boot/grub2/grub.cfg

if [ -f ${config_directory}/grubenv ]; then
  load_env -f ${config_directory}/grubenv
elif [ -s $prefix/grubenv ]; then
# The kernelopts variable should be defined in the grubenv file. But to ensure that menu
# without a grubenv file, define a fallback kernelopts variable if this has not been set.
# The kernelopts variable in the grubenv file can be modified using the grubby tool or by
# the kernelopts variable in the grubenv file and the fallback kernelopts variable.

m45733r, Jun 25 '24 12:06

OK, everything above looks correct. Now please run ls /boot/loader/entries/

It seems you have some redundant entries there.

aburmash, Jun 25 '24 12:06

[root@ol9-machine grub2]# ls -al /boot/loader/entries/
total 28
drwx------. 2 root root 4096 Jun 25 13:34 .
drwxr-xr-x. 3 root root   21 Oct 17  2022 ..
-rw-r--r--. 1 root root  440 May 22 13:59 495620e0609f491080cb4e769e86283d-0-rescue.conf
-rw-r--r--. 1 root root  381 May 22 13:59 495620e0609f491080cb4e769e86283d-5.14.0-284.30.1.el9_2.x86_64.conf
-rw-r--r--. 1 root root  428 May 22 13:59 495620e0609f491080cb4e769e86283d-5.15.0-200.131.27.el9uek.x86_64.conf
-rw-r--r--. 1 root root  405 May 22 13:59 bda9a182a36740ada28baaa218d5c09d-0-rescue.conf
-rw-r--r--. 1 root root  381 Jun 25 10:18 bda9a182a36740ada28baaa218d5c09d-5.14.0-427.22.1.el9_4.x86_64.conf
-rw-r--r--. 1 root root  424 Jun 25 10:19 bda9a182a36740ada28baaa218d5c09d-5.15.0-207.156.6.el9uek.x86_64.conf

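Oh, here's the problem. Sorry for bothering you, and thanks for pointing me in the right direction. It looks like some script or person regenerated the machine-id a few weeks ago, so the entries prefixed with the old machine-id (495620e0609f491080cb4e769e86283d) were left behind, including entries for kernels that are no longer installed.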
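Leftover entries of this kind can be spotted by comparing the 32-hex-digit prefix of each BLS file name with the current /etc/machine-id. The Python sketch below is an illustration, not a grubby feature: `stale_entries` is a hypothetical helper, and it only lists candidates. Review them before deleting anything, since an old rescue entry may still point at a rescue image that exists.

```python
import re

# BLS entry files are named "<machine-id>-<suffix>.conf"; a machine-id is 32 hex digits
_ENTRY = re.compile(r"([0-9a-f]{32})-.+\.conf")


def stale_entries(entry_files: list[str], machine_id: str) -> list[str]:
    """List BLS entry files whose machine-id prefix differs from machine_id."""
    current = machine_id.strip()
    stale = []
    for name in sorted(entry_files):
        m = _ENTRY.fullmatch(name)
        if m and m.group(1) != current:
            stale.append(name)
    return stale
```

On a live system: `stale_entries(os.listdir("/boot/loader/entries"), open("/etc/machine-id").read())`. With the listing above and the current ID bda9a182a36740ada28baaa218d5c09d, it returns the three 495620e0... files.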

m45733r, Jun 25 '24 12:06

For everyone tracking this issue: a grub2 update that does NOT contain the scriptlet bug, and that also resolves the issue for people who installed the broken package but did not reboot, has been published to the public repositories.

The version is 2.06-80.0.3.el9_4.
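To check whether an installed grub2 is at least this version, the epoch:version-release strings shown by `dnf list installed` can be compared using rpm's ordering rules. Below is a simplified Python re-implementation of rpm's segment comparison (rpmvercmp); it is a sketch that ignores rpm's handling of `~` and `^`, so use rpm itself when in doubt. It also assumes the OL9 grub2 packages carry epoch 1, like the OL8 grub2 packages listed later in this thread, so the fixed version is written as `1:2.06-80.0.3.el9_4`.

```python
import re


def rpmvercmp(a: str, b: str) -> int:
    """Simplified rpm segment comparison: returns 1, 0 or -1.

    Sketch only: ignores rpm's special handling of '~' and '^'.
    """
    if a == b:
        return 0
    sa = re.findall(r"[0-9]+|[A-Za-z]+", a)
    sb = re.findall(r"[0-9]+|[A-Za-z]+", b)
    for x, y in zip(sa, sb):
        if x.isdigit() and y.isdigit():
            if int(x) != int(y):
                return 1 if int(x) > int(y) else -1
        elif x.isdigit() != y.isdigit():
            # rpm ranks a numeric segment above an alphabetic one
            return 1 if x.isdigit() else -1
        elif x != y:
            return 1 if x > y else -1
    # All shared segments are equal: the string with more segments is newer
    return (len(sa) > len(sb)) - (len(sa) < len(sb))


def parse_evr(evr: str) -> tuple[int, str, str]:
    """Split "[epoch:]version-release" into its parts (missing epoch = 0)."""
    epoch, _, rest = evr.rpartition(":")
    version, _, release = rest.rpartition("-")
    return int(epoch or 0), version, release


def evr_compare(a: str, b: str) -> int:
    """Compare two epoch:version-release strings: 1, 0 or -1."""
    ea, va, ra = parse_evr(a)
    eb, vb, rb = parse_evr(b)
    if ea != eb:
        return 1 if ea > eb else -1
    return rpmvercmp(va, vb) or rpmvercmp(ra, rb)


# Assumption: epoch 1, as on the OL8 grub2 packages listed in this thread
FIXED_OL9 = "1:2.06-80.0.3.el9_4"
```

A result of -1 from `evr_compare(installed, FIXED_OL9)` would mean the installed package is older than the fixed build.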

aburmash, Jun 28 '24 16:06

I'm running Oracle Linux Server 8.9 and am experiencing the same GRUB boot issue discussed here. After updating, my system gets stuck at the GRUB CLI on reboot.

Current package versions offered in my OL8 repos:

grub2-common.noarch      1:2.02-167.0.1.el8_10
grub2-pc.x86_64          1:2.02-167.0.1.el8_10
grub2-pc-modules.noarch  1:2.02-167.0.1.el8_10
grub2-tools.x86_64       1:2.02-167.0.1.el8_10
grub2-tools-efi.x86_64   1:2.02-167.0.1.el8_10
grub2-tools-extra.x86_64 1:2.02-167.0.1.el8_10
grub2-tools-minimal.x86_64 1:2.02-167.0.1.el8_10
Before the update:

# for x in $(find /boot |grep grubenv); do echo $x; cat $x; done
/boot/grub2/grubenv
# GRUB Environment Block
kernelopts=root=UUID=246acc24-9a5e-4f74-96c2-5a0496303213 ro crashkernel=auto LANG=en_US.UTF-8 console=tty0 console=ttyS0,115200n8 rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0 net.ifnames=1 nvme_core.shutdown_timeout=10 nvme_core.io_timeout=4294967295 ipmi_si.tryacpi=0 ipmi_si.trydmi=0 ipmi_si.trydefaults=0 libiscsi.debug_libiscsi_eh=1 loglevel=4
boot_success=0

# cat /boot/efi/EFI/redhat/grub.cfg |grep grubenv
cat: /boot/efi/EFI/redhat/grub.cfg: No such file or directory
# cat /boot/grub2/grub.cfg |grep grubenv
if [ -f ${config_directory}/grubenv ]; then
  load_env -f ${config_directory}/grubenv
elif [ -s $prefix/grubenv ]; then

After the update (I have not rebooted yet):

# for x in $(find /boot |grep grubenv); do echo $x; cat $x; done
/boot/grub2/grubenv
# GRUB Environment Block
kernelopts=root=UUID=246acc24-9a5e-4f74-96c2-5a0496303213 ro audit=1
boot_success=0

# cat /boot/efi/EFI/redhat/grub.cfg |grep grubenv
cat: /boot/efi/EFI/redhat/grub.cfg: No such file or directory
# cat /boot/grub2/grub.cfg |grep grubenv
if [ -f ${config_directory}/grubenv ]; then
  load_env -f ${config_directory}/grubenv
elif [ -s $prefix/grubenv ]; then
# The kernelopts variable should be defined in the grubenv file. But to ensure that menu
# without a grubenv file, define a fallback kernelopts variable if this has not been set.
# The kernelopts variable in the grubenv file can be modified using the grubby tool or by
# the kernelopts variable in the grubenv file and the fallback kernelopts variable.

Regenerating the GRUB config with the following commands did not fix the issue; after a reboot, the system still gets stuck at the GRUB CLI:

grub2-mkconfig > /boot/grub2/grub.cfg
grub2-mkconfig > /boot/efi/EFI/redhat/grub.cfg 

Any guidance or updates for OL8 users would be appreciated!

sharma-the-karma, Jun 11 '25 19:06

It is very unlikely that this is the same issue.

When you say "stuck at the GRUB CLI", do you mean you are dropped to the GRUB command line? Please run grub2-mkconfig > /boot/efi/EFI/redhat/grub.cfg and attach the resulting grub.cfg here. Also verify that you have entries in the /boot/loader/entries directory and that they are not empty.

Is your system UEFI or legacy BIOS? Run ls /sys/firmware/efi/ to verify.

aburmash, Jun 11 '25 19:06

Thanks for your response.

Yes, by "stuck at the grub CLI," I mean that after rebooting, the system drops to the grub> command prompt rather than showing the normal boot menu.

This is one of the systems on which I ran the update and have not rebooted yet. It is a BIOS system:

# grub2-mkconfig > /boot/efi/EFI/redhat/grub.cfg
Generating grub configuration file ...
done
# ls -la /boot/loader/entries/
total 28
drwx------. 2 root root 4096 Jun 11 19:53 .
drwxr-xr-x. 3 root root   21 Feb  2  2024 ..
-rw-r--r--  1 root root  333 Jun 11 19:51 ec23b0f92923f5903d19560197e45e85-4.18.0-513.5.1.el8_9.x86_64.conf
-rw-r--r--  1 root root  344 Jun 11 19:51 ec23b0f92923f5903d19560197e45e85-4.18.0-553.22.1.el8_10.x86_64.conf
-rw-r--r--  1 root root  344 Jun 11 19:52 ec23b0f92923f5903d19560197e45e85-4.18.0-553.56.1.el8_10.x86_64.conf
-rw-r--r--  1 root root  356 Jun 11 19:51 ec23b0f92923f5903d19560197e45e85-5.15.0-200.131.27.el8uek.x86_64.conf
-rw-r--r--  1 root root  356 Jun 11 19:51 ec23b0f92923f5903d19560197e45e85-5.15.0-300.163.18.el8uek.x86_64.conf
-rw-r--r--  1 root root  384 Jun 11 19:53 ec23b0f92923f5903d19560197e45e85-5.15.0-309.180.4.el8uek.x86_64.conf
# ls -la /sys/firmware/efi
ls: cannot access '/sys/firmware/efi': No such file or directory
# [[ -d /sys/firmware/efi ]] && echo UEFI || echo BIOS
BIOS

Here is the grub.cfg

grub.cfg.txt

sharma-the-karma, Jun 11 '25 20:06

I think I know what the problem might be. Did you run grub2-install after the update? If not, you should.

aburmash, Jun 11 '25 21:06

I did run grub2-install; do I need to regenerate the grub config again after the grub2-install?

# grub2-install
Installing for i386-pc platform.
grub2-install: error: install device isn't specified.
# lsblk
NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0  16G  0 disk
└─nvme0n1p1 259:1    0  16G  0 part /
# grub2-install /dev/nvme0n1
Installing for i386-pc platform.
Installation finished. No error reported.

FYI, the system booted without any issues after grub2-install.

sharma-the-karma, Jun 11 '25 22:06

You should not need to regenerate the GRUB config after a GRUB update. On legacy (BIOS/non-UEFI) systems, however, it is mandatory to rerun grub2-install after grub2 updates, so that the code in core.img matches the actual modules. NOT rerunning grub2-install might be fine for a while if the code does not change much, but from time to time you may face issues like the one you hit.
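On a BIOS system the remaining question is which device to pass to grub2-install, since running it without one fails with "install device isn't specified", as shown above. A minimal Python sketch of that decision follows. It is a heuristic under assumptions: `parent_disk` and `post_update_command` are hypothetical helpers, the name mapping only covers common nvme/mmcblk and sd/vd/xvd naming, and on a real system `lsblk -no PKNAME <partition>` is the more reliable way to find the parent disk. UEFI systems get no command because this thread only calls for the step on legacy BIOS.

```python
import re
from typing import Optional


def parent_disk(partition: str) -> str:
    """Heuristically map a partition name to its whole-disk name.

    nvme0n1p1 -> nvme0n1, mmcblk0p2 -> mmcblk0, sda1 -> sda, xvda1 -> xvda.
    """
    name = partition.rsplit("/", 1)[-1]
    # nvme/mmcblk style: the disk name ends in a digit, then "p<N>"
    m = re.fullmatch(r"(.+\d)p\d+", name)
    if m:
        return m.group(1)
    # sd/vd/xvd style: letters followed directly by the partition number
    m = re.fullmatch(r"([a-z]+)\d+", name)
    if m:
        return m.group(1)
    raise ValueError(f"cannot guess the parent disk of {partition!r}; use lsblk -no PKNAME")


def post_update_command(is_uefi: bool, boot_partition: str) -> Optional[list[str]]:
    """Command to run after a grub2 update, per this thread: grub2-install on BIOS only."""
    if is_uefi:
        return None
    # Legacy BIOS: reinstall GRUB's boot code on the whole disk, not the partition
    return ["grub2-install", "/dev/" + parent_disk(boot_partition)]
```

For the system above, `post_update_command(os.path.isdir("/sys/firmware/efi"), "/dev/nvme0n1p1")` gives `["grub2-install", "/dev/nvme0n1"]`, the same command that fixed the boot.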

aburmash, Jun 12 '25 11:06

Oracle Linux customers, please file your issue at https://support.oracle.com

Thanks for filing an issue with Oracle Linux.

GitHub Issues is not an official support channel and we don't offer product support here. If you're not yet an Oracle Linux customer, consider signing up at https://linux.oracle.com.

Even if you're not a customer, if we can confirm that an issue is a bug we will do our best to fix it and to update this issue once it has been fixed. We don't guarantee a fix or feedback and for now, we will close this issue. If you have Oracle Linux support, please use support.oracle.com to report issues.

YoderExMachina, Jun 12 '25 11:06