systemd-bootchart fails with ENOENT for "/proc/schedstat" when run from initial ramdisk
Executing systemd-bootchart from the initial ramdisk fails when systemd does its switch-root procedure.
This can be reproduced on Fedora 41 with an initial ramdisk updated to include systemd-bootchart.
The systemd-bootchart documentation does not say whether running from the initial ramdisk is supported. Internally, however, systemd-bootchart sets argv[0][0] = '@', which is one way for a process to survive the killing spree that accompanies switch-root, so this appears to have been supported at one point.
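For readers unfamiliar with the '@' convention, here is a minimal sketch of the idea in C. The helper name mark_survive_switch_root is hypothetical and is not taken from the systemd-bootchart source; it only illustrates overwriting the first character of argv[0] so that systemd's process-killing pass skips the process.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the systemd-bootchart source).
 * Overwrites the first character of argv[0] with '@', the marker that
 * systemd checks when deciding which processes to leave alone.
 */
static void mark_survive_switch_root(char **argv) {
    assert(argv != NULL && argv[0] != NULL);

    /* Leave an empty argv[0] untouched so the terminator is preserved. */
    if (argv[0][0] != '\0')
        argv[0][0] = '@';
}
```

A program would call this at the very start of main(), before anything else inspects its own process name.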
The failure happens here:
https://github.com/systemd/systemd-bootchart/blob/a15bcafb60b9a24d866024953e9965316ba73eaf/src/store.c#L191C1-L194C71
An strace log is attached, and more detailed steps to reproduce follow below.
To prepare an initial ramdisk with systemd-bootchart (and strace), you can do this:
sudo -i
mkdir -p initrd/root
cd initrd/root
gunzip --stdout /boot/initramfs-6.11.4-301.fc41.aarch64.img | cpio --extract
cd usr/lib/systemd
cp /usr/lib/systemd/systemd-bootchart .
# you can check that all required libs are already present
# ldd /usr/lib/systemd/systemd-bootchart
cd ../..
cd bin
cp /usr/bin/strace .
# you will need to copy a few libs to support strace
# ldd /usr/bin/strace
cd ../..
find . | cpio -o -H newc --file=../initramfs-xx.cpio
cd ..
gzip --stdout initramfs-xx.cpio > initramfs-xx.img
cp initramfs-xx.img /boot/initramfs-xx.img
Reboot, then edit the GRUB boot entry so that it uses the new initial ramdisk:
initrd ($root)/initramfs-xx.img
Also, add the rd.emergency kernel parameter to boot into emergency mode in the initial ramdisk (I also added enforcing=0):
$ xargs -n1 < /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.11.4-301.fc41.aarch64
root=/dev/mapper/fedora_vbox-root
ro
rd.lvm.lv=fedora_vbox/root
rhgb
enforcing=0
rd.emergency
In the ramdisk emergency shell, run bootchart and then exit to continue booting:
# strace -o /run/log/strace.log /usr/lib/systemd/systemd-bootchart &
# exit
When you log in as normal, systemd-bootchart won't be running.
The strace log shows that systemd-bootchart failed attempting to read /proc/schedstat and then exited:
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=32804124}, NULL) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=1, si_uid=0} ---
restart_syscall(<... resuming interrupted clock_nanosleep ...>) = 0
lseek(4, 0, SEEK_SET) = 0
pread64(5, "nr_free_pages 478087\nnr_zone_ina"..., 4095, 0) = 3531
openat(AT_FDCWD, "/proc/schedstat", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
writev(2, [{iov_base="Unable to read schedstat: No suc"..., iov_len=51}, {iov_base="\n", iov_len=1}], 2) = -1 EIO (Input/output error)
getpid() = 241
close(3) = 0
close(4) = 0
exit_group(1) = ?
+++ exited with 1 +++
You can also reproduce the defect by setting the kernel param rdinit=:
rdinit=/usr/lib/systemd/systemd-bootchart
This certainly wasn't supported.
I think we can support it, though. We would probably have to rewrite all of the proc-opening code so that it opens the "correct" proc directory: somehow detect and fall back to the "new" location of proc and, instead of opening each file by its full path, use openat on an existing proc directory fd. It's likely to be a little messy because, for each process, we will be opening files relative to the proc directory.
That's assuming that it actually works and that the fd for /proc remains accessible after the switch-root.
We have a patch that works the way you suggest: rather than using an absolute path, it holds a file descriptor to the original (pre-switch-root) /proc and opens files relative to that fd.
It seems to work.
I will test it a bit more and then open a PR.
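For anyone following the thread, the approach discussed above (hold a directory fd for /proc and open files relative to it with openat) might look roughly like the sketch below. This is not the actual patch. The helper names proc_dir_open and proc_read_at are hypothetical, and whether the held fd stays usable across switch-root is exactly the assumption that still needs testing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the proc directory once, early on
 * (i.e. before switch-root), and keep the fd for later lookups.
 * Returns the directory fd, or -1 on error.
 */
static int proc_dir_open(const char *path) {
    return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/*
 * Hypothetical helper: read a file given by a path relative to the held
 * proc directory fd (for example "schedstat" or "<pid>/sched") instead of
 * resolving an absolute "/proc/..." path each time.
 * Returns the number of bytes read (buf is NUL-terminated), or -1 on error.
 */
static ssize_t proc_read_at(int proc_fd, const char *name,
                            char *buf, size_t size) {
    assert(size > 0);

    int fd = openat(proc_fd, name, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, size - 1);
    close(fd);
    if (n < 0)
        return -1;

    buf[n] = '\0';
    return n;
}
```

With this shape, a caller would run proc_dir_open("/proc") once at startup and then use proc_read_at(proc_fd, "schedstat", buf, sizeof buf) in the sampling loop, so no absolute path is resolved against the process's root after switch-root.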