sysbox-mgr failed to start
We had a node where sysbox-mgr failed to start resulting in no container able to launch with sysbox.
It looks like sysbox-mgr failed to start on time, I attached the logs but they don't tell me much more. Is there anything else where I can look for more info?
Also would it make sense to have this systemd service restart when failing? I could make a PR for that.
logs:
root@ip-10-0-1-129:/# journalctl -u sysbox-mgr
-- Logs begin at Wed 2023-06-21 09:32:47 UTC, end at Wed 2023-06-21 11:04:30 UTC. --
Jun 21 09:33:54 ip-172-31-47-216 systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Jun 21 09:33:54 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:33:54" level=info msg="Starting ..."
Jun 21 09:33:54 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:33:54" level=info msg="Sysbox data root: /var/lib/sysbox"
Jun 21 09:33:54 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:33:54" level=info msg="Shiftfs module found in kernel: no"
Jun 21 09:33:54 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:33:54" level=info msg="Shiftfs works properly: no"
Jun 21 09:33:54 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:33:54" level=info msg="Shiftfs-on-overlayfs works properly: no"
Jun 21 09:33:54 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:33:54" level=info msg="ID-mapped mounts supported by kernel: yes"
Jun 21 09:33:54 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:33:54" level=info msg="Overlayfs on ID-mapped mounts supported by kernel: no"
Jun 21 09:33:54 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:33:54" level=info msg="Operating in system container mode."
Jun 21 09:33:54 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:33:54" level=info msg="Inner container image preloading enabled."
Jun 21 09:33:54 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:33:54" level=info msg="Listening on /run/sysbox/sysmgr.sock"
Jun 21 09:33:54 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:33:54" level=info msg="Ready ..."
Jun 21 09:33:54 ip-172-31-47-216 systemd[1]: Started sysbox-mgr (part of the Sysbox container runtime).
Jun 21 09:47:18 ip-172-31-47-216 systemd[1]: Stopping sysbox-mgr (part of the Sysbox container runtime)...
Jun 21 09:47:18 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:47:18" level=info msg="Caught OS signal: terminated"
Jun 21 09:47:18 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:47:18" level=info msg="Stopping (gracefully) ..."
Jun 21 09:47:18 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:47:18" level=info msg=Stopped.
Jun 21 09:47:18 ip-172-31-47-216 sysbox-mgr[6506]: time="2023-06-21 09:47:18" level=info msg=Exiting.
Jun 21 09:47:18 ip-172-31-47-216 systemd[1]: sysbox-mgr.service: Succeeded.
Jun 21 09:47:18 ip-172-31-47-216 systemd[1]: Stopped sysbox-mgr (part of the Sysbox container runtime).
-- Reboot --
Jun 21 10:45:27 ip-10-0-1-129 systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Jun 21 10:45:27 ip-10-0-1-129 sysbox-mgr[552]: time="2023-06-21 10:45:27" level=info msg="Starting ..."
Jun 21 10:45:27 ip-10-0-1-129 sysbox-mgr[552]: time="2023-06-21 10:45:27" level=info msg="Sysbox data root: /var/lib/sysbox"
Jun 21 10:46:12 ip-10-0-1-129 systemd[1]: sysbox-mgr.service: start operation timed out. Terminating.
Jun 21 10:46:12 ip-10-0-1-129 systemd[1]: sysbox-mgr.service: Failed with result 'timeout'.
Jun 21 10:46:12 ip-10-0-1-129 systemd[1]: Failed to start sysbox-mgr (part of the Sysbox container runtime).
@stijndehaes, thanks for reporting this one. A few questions below:
- Was this a one-off event? Or are you consistently reproducing the issue in this node?
- Can you enable
debuglogs and try to repro again? See here if need any help. - Please confirm that you're running the latest Sysbox release (v0.6.2).
- Also, which kernel are you running?
@rodnymolina:
- It doesn't happen always. Only every now and then. I just had it happen again
- I will enable this, and the next time it happens I will be able to pass these logs
- I am running v0.6.2
- 5.15.0-1037-aws, I am running the latest EKS 1.26 ubuntu image from here: https://cloud-images.ubuntu.com/docs/aws/eks/
@rodnymolina as you can see in the journalctl logs we prebuild an AMI, and then start from that. The sysbox-mgr logs never show that the shiftfs check passed, it appears to be hanging on that check.
We use the following systemd config:
[Unit]
Description=sysbox-mgr (part of the Sysbox container runtime)
PartOf=sysbox.service
[Service]
Type=simple
Type=notify
#We set --allow-trusted-xattr=false improve performance
#https://github.com/nestybox/sysbox/blob/master/docs/user-guide/configuration.md#speeding-up-sysbox-by-disallowing-trusted-overlay-xattributes
ExecStart=/usr/bin/sysbox-mgr --disable-inner-image-preload --log-level=debug --allow-trusted-xattr=false --log=/var/log/sysbox-mgr.log
TimeoutStartSec=45
TimeoutStopSec=90
StartLimitInterval=0
NotifyAccess=main
OOMScoreAdjust=-500
# The number of files opened by sysbox-mgr is a function of the number of
# containers and the size of the rootfs within them. Thus we set the limit to
# infinite so to prevent "too many open files" errors.
LimitNOFILE=infinity
LimitNPROC=infinity
[Install]
WantedBy=sysbox.service
Sysbox-mgr logs:
time="2023-06-26 09:06:34" level=info msg="Starting ..."
time="2023-06-26 09:06:34" level=info msg="Sysbox data root: /var/lib/sysbox"
time="2023-06-26 09:06:35" level=debug msg="Running shiftfs check on host."
time="2023-06-26 09:06:35" level=debug msg="- shiftfs check: found shiftfs module."
time="2023-06-26 09:06:35" level=debug msg="- shiftfs check: test dir = /var/lib/sysbox/sysbox-shiftfs-check2364347436/test (ext4)"
time="2023-06-26 09:06:35" level=debug msg="- shiftfs check: marked shiftfs on /var/lib/sysbox/sysbox-shiftfs-check2364347436/test"
time="2023-06-26 09:06:35" level=debug msg="- shiftfs check: spawning child process (700) into user-ns"
time="2023-06-26 09:06:35" level=debug msg="- shiftfs check: passed"
time="2023-06-26 09:06:35" level=debug msg="Running shiftfs-on-overlayfs check on host."
time="2023-06-26 09:06:35" level=debug msg="- shiftfs check: found shiftfs module."
time="2023-06-26 09:06:35" level=debug msg="- shiftfs check: test dir = /var/lib/sysbox/sysbox-shiftfs-check762780448/test (ext4)"
time="2023-06-26 09:06:35" level=debug msg="- shiftfs check: mounted overlayfs on /var/lib/sysbox/sysbox-shiftfs-check762780448/test"
time="2023-06-26 09:06:35" level=debug msg="- shiftfs check: marked shiftfs on /var/lib/sysbox/sysbox-shiftfs-check762780448/test"
time="2023-06-26 09:06:35" level=debug msg="- shiftfs check: spawning child process (707) into user-ns"
Journalctl logs:
-- Logs begin at Fri 2023-06-23 23:04:53 UTC, end at Mon 2023-06-26 09:21:46 UTC. --
Jun 23 23:05:33 ip-10-2-151-192 systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Jun 23 23:05:33 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:05:33" level=info msg="Starting ..."
Jun 23 23:05:33 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:05:33" level=info msg="Sysbox data root: /var/lib/sysbox"
Jun 23 23:05:34 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:05:34" level=info msg="Shiftfs module found in kernel: no"
Jun 23 23:05:34 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:05:34" level=info msg="Shiftfs works properly: no"
Jun 23 23:05:34 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:05:34" level=info msg="Shiftfs-on-overlayfs works properly: no"
Jun 23 23:05:34 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:05:34" level=info msg="ID-mapped mounts supported by kernel: yes"
Jun 23 23:05:34 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:05:34" level=info msg="Overlayfs on ID-mapped mounts supported by kernel: no"
Jun 23 23:05:34 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:05:34" level=info msg="Operating in system container mode."
Jun 23 23:05:34 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:05:34" level=info msg="Inner container image preloading enabled."
Jun 23 23:05:34 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:05:34" level=info msg="Listening on /run/sysbox/sysmgr.sock"
Jun 23 23:05:34 ip-10-2-151-192 systemd[1]: Started sysbox-mgr (part of the Sysbox container runtime).
Jun 23 23:05:34 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:05:34" level=info msg="Ready ..."
Jun 23 23:08:46 ip-10-2-151-192 systemd[1]: sysbox-mgr.service: Current command vanished from the unit file, execution of the command l
ist won't be resumed.
Jun 23 23:10:28 ip-10-2-151-192 systemd[1]: Stopping sysbox-mgr (part of the Sysbox container runtime)...
Jun 23 23:10:28 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:10:28" level=info msg="Caught OS signal: terminated"
Jun 23 23:10:28 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:10:28" level=info msg="Stopping (gracefully) ..."
Jun 23 23:10:28 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:10:28" level=info msg=Stopped.
Jun 23 23:10:28 ip-10-2-151-192 sysbox-mgr[6389]: time="2023-06-23 23:10:28" level=info msg=Exiting.
Jun 23 23:10:28 ip-10-2-151-192 systemd[1]: sysbox-mgr.service: Succeeded.
Jun 23 23:10:28 ip-10-2-151-192 systemd[1]: Stopped sysbox-mgr (part of the Sysbox container runtime).
-- Reboot --
Jun 26 09:06:33 ip-10-2-154-88 systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Jun 26 09:07:19 ip-10-2-154-88 systemd[1]: sysbox-mgr.service: start operation timed out. Terminating.
Jun 26 09:07:19 ip-10-2-154-88 systemd[1]: sysbox-mgr.service: Failed with result 'timeout'.
Jun 26 09:07:19 ip-10-2-154-88 systemd[1]: Failed to start sysbox-mgr (part of the Sysbox container runtime).
(Chiming in while on PTO ...)
I've seen this occur sporadically too, and it appears to occur when sysbox-mgr checks if another instance of sysbox-mgr is present (by checking for the presence of the /run/sysbox/sysmgr.pid file).
One way to prove this would be to modify the sysbox-mgr systemd unit file to erase to kill any sysbox-mgr process and erase that file prior to starting sysbox-mgr.
@ctalledo thanks for that feedback, next time I see this I will check for that file. As to how to make sysbox-mgr erase that fill, I am not sure how to best achieve that
Hi @stijndehaes,
As to how to make sysbox-mgr erase that fill, I am not sure how to best achieve that
I think that can be achieved by editing the sysbox-mgr systemd unit file so that it erases the file before starting sysbox-mgr (i.e., modifying the execStart line). Not ideal since it deletes a file that is normally deleted by sysbox-mgr when it terminates, but may provide you a work-around.
@ctalledo I wanted to give you an update. So adding:
ExecStartPre=rm -rf /run/sysbox/sysmgr.pid
to the service resulted in a lot less failures, but even then sometimes it fails to start up:
oot@ip-10-2-139-65:/var/snap/amazon-ssm-agent/6563# systemctl status sysbox-mgr
● sysbox-mgr.service - sysbox-mgr (part of the Sysbox container runtime)
Loaded: loaded (/lib/systemd/system/sysbox-mgr.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Thu 2023-08-03 17:34:10 UTC; 18h ago
Main PID: 551 (code=killed, signal=TERM)
Aug 03 17:33:25 ip-10-2-139-65 systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Aug 03 17:34:10 ip-10-2-139-65 systemd[1]: sysbox-mgr.service: start operation timed out. Terminating.
Aug 03 17:34:10 ip-10-2-139-65 systemd[1]: sysbox-mgr.service: Failed with result 'timeout'.
Aug 03 17:34:10 ip-10-2-139-65 systemd[1]: Failed to start sysbox-mgr (part of the Sysbox container runtime).
In the logs of sysbox-mgr I can only see the following:
root@ip-10-2-139-65:/var/snap/amazon-ssm-agent/6563# cat /var/log/sysbox-mgr.log
time="2023-08-03 17:33:25" level=info msg="Starting ..."
time="2023-08-03 17:33:25" level=info msg="Sysbox data root: /var/lib/sysbox"
time="2023-08-03 17:33:25" level=debug msg="Running shiftfs check on host."
time="2023-08-03 17:33:25" level=debug msg="- shiftfs check: found shiftfs module."
time="2023-08-03 17:33:25" level=debug msg="- shiftfs check: test dir = /var/lib/sysbox/sysbox-shiftfs-check2306717947/test (ext4)"
time="2023-08-03 17:33:25" level=debug msg="- shiftfs check: marked shiftfs on /var/lib/sysbox/sysbox-shiftfs-check2306717947/test"
time="2023-08-03 17:33:25" level=debug msg="- shiftfs check: spawning child process (682) into user-ns"
Is there anything else I can look at to see what the issue might be?
I have left the node running so I can debug it further if needed :)
I still see this issue happening @ctalledo.
A sudo systemctl start sysbox-mgr fixes the problem.
Hi @stijndehaes, thanks for following up and apologies for the delayed response.
Regarding the sysbox-mgr logs you showed earlier:
time="2023-08-03 17:33:25" level=debug msg="- shiftfs check: spawning child process (682) into user-ns"
The fact that it gets stuck there tells me there's some bug in the sysbox-mgr code that checks if shiftfs is working properly.
You can bypass that check by adding the --disable-shiftfs-precheck flag to the sysbox-mgr command line in the systemd unit. So based on what you showed above, it would look like:
ExecStart=/usr/bin/sysbox-mgr --disable-shiftfs-precheck --disable-inner-image-preload --log-level=debug --allow-trusted-xattr=false --log=/var/log/sysbox-mgr.log
Bypassing the check does not mean Sysbox won't use shiftfs, rather it means Sysbox will use it without first empirically verifying whether it works or not.
Another thing that would help me is if you can enable debug logging in sysbox-mgr and reproduce, and then post the output of the sysbox-mgr log. To enable debug logging, you add the --log-level=debug flag to the sysbox-mgr command line. That would produce more information allowing me to better debug the exact cause of the failure.
Thanks!