Startup config sometimes fails to load (tested on x86_64)
Running the test case/ietf-hardware/usb.py in a loop (with kvm disabled), i noticed that sometimes it booted with failure config, 1 of 22 tests.
The USB tests remove the configured USB ports, copy ru st, reboot the unit. So should have nothing to do with USB, since the startup configuration has it removed.
[ OK ] Rebooting ...
reboot: Restarting system
● ● ● Infix — a Network Operating System v23.11.0-189-g565a22c ══════════
[ OK ] Mounting filesystems from /etc/fstab
[ OK ] Restoring system clock (UTC) from RTC
[ OK ] Seeding random number generator
[ OK ] Verifying D-Bus machine UUID
[ OK ] Starting Device event daemon (udev)
[ OK ] Requesting device events
[ OK ] Requesting subsystem events
[ OK ] Waiting for udev to finish
[ OK ] Starting System log daemon
[ OK ] Probing system information
[ OK ] Probing network interfaces
[ OK ] Starting D-Bus message bus daemon
[ OK ] Bootstrapping YANG datastore
[ OK ] Starting Configuration daemon
[ OK ] Starting DHCP/DNS proxy
[ OK ] Starting LED daemon
[ OK ] Verifying SSH host keys
[ OK ] Starting System watchdog daemon
[FAIL] Loading startup-config
[ OK ] Loading failure-config
[ OK ] Starting Status daemon
[ OK ] Saving system time (UTC) to RTC
[ OK ] Saving random seed
[ OK ] Stopping Device event daemon (udev)
[ OK ] Stopping Configuration daemon
[ OK ] Stopping NETCONF server
[ OK ] Stopping DHCP/DNS proxy
[ OK ] Stopping CLI backend daemon
[ OK ] Stopping Software update service
[ OK ] Stopping OpenSSH daemon
[ OK ] Stopping Status daemon
[ OK ] Stopping Zebra routing daemon
[ OK ] Stopping LLDP daemon (IEEE 802.1ab)
[ OK ] Unmounting filesystems ...
[ OK ] Calling hook/svc/down scripts ...
[ OK ] Rebooting ...
reboot: Restarting system
● ● ● Infix — a Network Operating System v23.11.0-189-g565a22c ══════════
Um, does it reboot on its own there (after "Starting Status daemon"), or did you send an RPC?
The test sends an RPC, the test is run in a loop.
Yeah, would have to debug that live to determine what's going on
Bug reproduced, took a long time, was running all day, and left it running over night, and on the morning it had failed: Forcing KVM to off (since i had only seen it in cloud before, unknown if it occurs with KVM):
+++ b/board/x86_64/board.mk
@@ -7,7 +7,7 @@ test-env = $(test-dir)/env \
-p $(BINARIES_DIR)/infix-x86_64.pkg \
$(1) $(2)
-test-env-qeneth = $(call test-env,-q $(test-dir)/virt/quad,$(1))
+test-env-qeneth = $(call test-env,-K -q $(test-dir)/virt/quad,$(1))
test-env-run = $(call test-env,-C -t $(BINARIES_DIR)/qemu.dot,$(1))
.PHONY: test-%
Then running the test misc/case/start_from_startup.py using the following script: while [ true ]; do ./start_from_startup.py || break ; done
use the weekend it takes a very long time, sometimes.
Feb 25 23:56:59 infix dagger[2460]: Evolved to generation 0 Feb 25 23:56:59 infix confd[2460]: [INF] EV LISTEN: "ietf-interfaces" "done" ID 1 priority 65535 success (remaining 0 subscribers). Feb 25 23:56:59 infix confd[2460]: [INF] EV LISTEN: "ietf-system" "done" ID 1 priority 65535 processing (remaining 6 subscribers). Feb 25 23:56:59 infix confd[2460]: New user "admin" created Feb 25 23:57:00 infix confd[2460]: User admin created Feb 25 23:57:04 infix confd[2460]: Password updated for user admin Feb 25 23:57:04 infix confd[2460]: [INF] EV LISTEN: "ietf-system" "done" ID 1 priority 65535 success (remaining 0 subscribers). Feb 25 23:57:04 infix startup[2480]: [ERR] Waiting on a conditional variable failed (_sr_shmsub_notify_wait_wr: Connection timed out). Feb 25 23:57:04 infix startup[2480]: sysrepocfg error: Replace config failed (Timeout expired) Feb 25 23:57:04 infix confd[2460]: [INF] EV LISTEN: "infix-services" "done" ID 1 priority 65535 processing (remaining 1 subscribers). Feb 25 23:57:04 infix confd[2460]: [INF] EV LISTEN: "infix-services" "done" ID 1 priority 65535 processing success (after timeout or earlier error). Feb 25 23:57:04 infix startup[2480]: Feb 25 2024 23:57:04 root[2595]: Failed loading /cfg/startup-config.cfg, reverting to Fail Secure mode! Feb 25 23:57:04 infix root[2595]: Failed loading /cfg/startup-config.cfg, reverting to Fail Secure mode! Feb 25 23:57:04 infix finit[1]: Starting failure[2596] Feb 25 23:57:05 infix failure[2596]: [INF] Connection 7 created. Feb 25 23:57:05 infix failure[2596]: [INF] Session 29 (user "root", CID 7) created.
Either a finit-bug or a sysrepo-bug, test again after sysrepo upgrade.
Do you have any timestamps from the logs?
It could be that the system is so slow that we're hitting this:
/** default timeout for change subscription callback (ms) */
#define SR_CHANGE_CB_TIMEOUT 5000
In that case, it would not really be a bug, per se. Though we might want to increase the timeout I suppose.
Do you have any timestamps from the logs?
It could be that the system is so slow that we're hitting this:
/** default timeout for change subscription callback (ms) */ #define SR_CHANGE_CB_TIMEOUT 5000In that case, it would not really be a bug, per se. Though we might want to increase the timeout I suppose.
I also considered that, very good suggestion to look at when we pick this one up again!
Since this was seen on virtual without KVM the timeout scenario seems very likely.
Tried to reproduce, ran test over night. No failure. Close for now.