core-base icon indicating copy to clipboard operation
core-base copied to clipboard

Failed pstore modules with core22 20220706

Open plars opened this issue 3 years ago • 3 comments

We are testing core22 20220706 in beta and hitting this issue frequently on both amd64 and arm devices. It doesn't happen 100% of the time, but it seems to be pretty easy to hit, and easy to reproduce by simply installing core22 from beta channel and rebooting. After rebooting, it looks like there are quite a few failed systemd units if you run this:

$ systemctl --system --no-ask-password --no-pager list-units --state=failed
Failed units:
  UNIT                             LOAD   ACTIVE SUB    DESCRIPTION
● modprobe@chromeos_pstore.service loaded failed failed Load Kernel Module chromeos_pstore
● modprobe@efi_pstore.service      loaded failed failed Load Kernel Module efi_pstore
● [email protected]       loaded failed failed Load Kernel Module mtdpstore
● modprobe@pstore_blk.service      loaded failed failed Load Kernel Module pstore_blk
● modprobe@pstore_zone.service     loaded failed failed Load Kernel Module pstore_zone
● [email protected]         loaded failed failed Load Kernel Module ramoops

This persists across future reboots, but if I revert back to the previous stable core22, the problem goes away.

plars avatar Jul 20 '22 13:07 plars

In latest systemd update in jammy, these two lines (that come from https://github.com/systemd/systemd/commit/70e74a5997ae2ce7ba72a74ac949c3b2dad1a1d6) are added to systemd-pstore.service:

After=modprobe@efi_pstore.service [email protected] modprobe@chromeos_pstore.service [email protected] modprobe@pstore_zone.service modprobe@pstore_blk.service
Wants=modprobe@efi_pstore.service [email protected] modprobe@chromeos_pstore.service [email protected] modprobe@pstore_zone.service modprobe@pstore_blk.service

The intention is to load any possible pstore provider before trying to start systemd-pstore. Note that actually if the loading of the kernel module fails, that should be fine as - is used as a prefix in [email protected] (this makes the status successful even if modprobe returns with error):

...
[Service]
Type=oneshot
ExecStart=-/sbin/modprobe -abq %I

The interesting thing is that these services do not fail because of a failure to load the module, but because an excessive number of retries:

$ sudo journalctl -u modprobe@chromeos_pstore.service
Jul 20 09:02:37 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Deactivated successfully.
Jul 20 09:02:37 ubuntu systemd[1]: Finished Load Kernel Module chromeos_pstore.
Jul 20 09:02:38 ubuntu systemd[1]: Starting Load Kernel Module chromeos_pstore...
Jul 20 09:02:38 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Deactivated successfully.
Jul 20 09:02:38 ubuntu systemd[1]: Finished Load Kernel Module chromeos_pstore.
Jul 20 09:02:38 ubuntu systemd[1]: Starting Load Kernel Module chromeos_pstore...
Jul 20 09:02:39 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Deactivated successfully.
Jul 20 09:02:39 ubuntu systemd[1]: Finished Load Kernel Module chromeos_pstore.
Jul 20 09:02:39 ubuntu systemd[1]: Starting Load Kernel Module chromeos_pstore...
Jul 20 09:02:39 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Deactivated successfully.
Jul 20 09:02:39 ubuntu systemd[1]: Finished Load Kernel Module chromeos_pstore.
Jul 20 09:02:39 ubuntu systemd[1]: Starting Load Kernel Module chromeos_pstore...
Jul 20 09:02:39 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Deactivated successfully.
Jul 20 09:02:39 ubuntu systemd[1]: Finished Load Kernel Module chromeos_pstore.
Jul 20 09:02:39 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Start request repeated too quickly.
Jul 20 09:02:39 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Failed with result 'start-limit-hit'.
Jul 20 09:02:39 ubuntu systemd[1]: Failed to start Load Kernel Module chromeos_pstore.
Jul 20 09:02:39 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Start request repeated too quickly.
Jul 20 09:02:39 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Failed with result 'start-limit-hit'.
Jul 20 09:02:39 ubuntu systemd[1]: Failed to start Load Kernel Module chromeos_pstore.
Jul 20 09:02:39 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Start request repeated too quickly.
Jul 20 09:02:39 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Failed with result 'start-limit-hit'.
Jul 20 09:02:39 ubuntu systemd[1]: Failed to start Load Kernel Module chromeos_pstore.
Jul 20 09:02:39 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Start request repeated too quickly.
Jul 20 09:02:39 ubuntu systemd[1]: modprobe@chromeos_pstore.service: Failed with result 'start-limit-hit'.
Jul 20 09:02:39 ubuntu systemd[1]: Failed to start Load Kernel Module chromeos_pstore.

And this happens because systemd-pstore.service is being called a few consecutive times (each of these lines is a restart):

$ sudo journalctl -u systemd-pstore.service
Jul 20 09:02:37 ubuntu systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
Jul 20 09:02:39 ubuntu systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
Jul 20 09:02:39 ubuntu systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
Jul 20 09:02:39 ubuntu systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
Jul 20 09:02:39 ubuntu systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
Jul 20 09:02:39 ubuntu systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.

Interestingly, I see the same on my laptop with jammy:

$ sudo journalctl -b 0 -u systemd-pstore.service
jul 20 07:49:39 numancia systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
jul 20 07:49:40 numancia systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
jul 20 07:49:40 numancia systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
jul 20 07:49:41 numancia systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.

It is just that the number of restarts is smaller so it does not trigger errors in the services loading kernel modules because of excessive number of retries.

To summarize, the problem here is that systemd-pstore.service is being tried too many times, we need to understand why that happens. It should be tried once, and as the condition ConditionDirectoryNotEmpty=/sys/fs/pstore fails, it should just stop there. (Trying to load modules is to try to have something in /sys/fs/pstore before starting the service.)

alfonsosanchezbeato avatar Jul 20 '22 14:07 alfonsosanchezbeato

In latest systemd update in jammy, these two lines (that come from systemd/systemd@70e74a5) are added to systemd-pstore.service:

This commit has not yet been released upstream. On jammy, it is applied downstream.

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1978079

Unfortunately this one is not applied: https://github.com/systemd/systemd/commit/9625350e5381a68c1179ae4581e7586c206663e1

I suggest we try with this patch added.

valentindavid avatar Jul 21 '22 08:07 valentindavid

LP bug for systemd: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1982462

alfonsosanchezbeato avatar Jul 21 '22 09:07 alfonsosanchezbeato