infix icon indicating copy to clipboard operation
infix copied to clipboard

Startup config sometimes fails to load (tested on x86_64)

Open mattiaswal opened this issue 1 year ago • 7 comments

Running the test case/ietf-hardware/usb.py in a loop (with kvm disabled), i noticed that sometimes it booted with failure config, 1 of 22 tests.

The USB tests remove the configured USB ports, copy ru st, reboot the unit. So should have nothing to do with USB, since the startup configuration has it removed.

[ OK ] Rebooting ...
       reboot: Restarting system
● ● ●  Infix — a Network Operating System v23.11.0-189-g565a22c ══════════
[ OK ] Mounting filesystems from /etc/fstab
[ OK ] Restoring system clock (UTC) from RTC
[ OK ] Seeding random number generator
[ OK ] Verifying D-Bus machine UUID
[ OK ] Starting Device event daemon (udev)
[ OK ] Requesting device events
[ OK ] Requesting subsystem events
[ OK ] Waiting for udev to finish
[ OK ] Starting System log daemon
[ OK ] Probing system information
[ OK ] Probing network interfaces
[ OK ] Starting D-Bus message bus daemon
[ OK ] Bootstrapping YANG datastore
[ OK ] Starting Configuration daemon
[ OK ] Starting DHCP/DNS proxy
[ OK ] Starting LED daemon
[ OK ] Verifying SSH host keys
[ OK ] Starting System watchdog daemon
[FAIL] Loading startup-config
[ OK ] Loading failure-config
[ OK ] Starting Status daemon

[ OK ] Saving system time (UTC) to RTC
[ OK ] Saving random seed
[ OK ] Stopping Device event daemon (udev)
[ OK ] Stopping Configuration daemon
[ OK ] Stopping NETCONF server
[ OK ] Stopping DHCP/DNS proxy
[ OK ] Stopping CLI backend daemon
[ OK ] Stopping Software update service
[ OK ] Stopping OpenSSH daemon
[ OK ] Stopping Status daemon
[ OK ] Stopping Zebra routing daemon
[ OK ] Stopping LLDP daemon (IEEE 802.1ab)
[ OK ] Unmounting filesystems ...
[ OK ] Calling hook/svc/down scripts ...
[ OK ] Rebooting ...
       reboot: Restarting system
● ● ●  Infix — a Network Operating System v23.11.0-189-g565a22c ══════════

mattiaswal avatar Feb 23 '24 13:02 mattiaswal

Um, does it reboot on its own there (after "Starting Status daemon"), or did you send an RPC?

troglobit avatar Feb 23 '24 13:02 troglobit

The test sends an RPC, the test is run in a loop.

mattiaswal avatar Feb 23 '24 13:02 mattiaswal

Yeah, would have to debug that live to determine what's going on

troglobit avatar Feb 23 '24 13:02 troglobit

Bug reproduced, took a long time, was running all day, and left it running over night, and on the morning it had failed: Forcing KVM to off (since i had only seen it in cloud before, unknown if it occurs with KVM):

+++ b/board/x86_64/board.mk
@@ -7,7 +7,7 @@ test-env = $(test-dir)/env \
        -p $(BINARIES_DIR)/infix-x86_64.pkg \
        $(1) $(2)
 
-test-env-qeneth = $(call test-env,-q $(test-dir)/virt/quad,$(1))
+test-env-qeneth = $(call test-env,-K -q $(test-dir)/virt/quad,$(1))
 test-env-run    = $(call test-env,-C -t $(BINARIES_DIR)/qemu.dot,$(1))
 
 .PHONY: test-%

Then running the test misc/case/start_from_startup.py using the following script: while [ true ]; do ./start_from_startup.py || break ; done

use the weekend it takes a very long time, sometimes.

Feb 25 23:56:59 infix dagger[2460]: Evolved to generation 0
Feb 25 23:56:59 infix confd[2460]: [INF] EV LISTEN: "ietf-interfaces" "done" ID 1 priority 65535 success (remaining 0 subscribers). 
Feb 25 23:56:59 infix confd[2460]: [INF] EV LISTEN: "ietf-system" "done" ID 1 priority 65535 processing (remaining 6 subscribers). 
Feb 25 23:56:59 infix confd[2460]: New user "admin" created 
Feb 25 23:57:00 infix confd[2460]: User admin created
Feb 25 23:57:04 infix confd[2460]: Password updated for user admin
Feb 25 23:57:04 infix confd[2460]: [INF] EV LISTEN: "ietf-system" "done" ID 1 priority 65535 success (remaining 0 subscribers). 
Feb 25 23:57:04 infix startup[2480]: [ERR] Waiting on a conditional variable failed (_sr_shmsub_notify_wait_wr: Connection timed out).
Feb 25 23:57:04 infix startup[2480]: sysrepocfg error: Replace config failed (Timeout expired)
Feb 25 23:57:04 infix confd[2460]: [INF] EV LISTEN: "infix-services" "done" ID 1 priority 65535 processing (remaining 1 subscribers). 
Feb 25 23:57:04 infix confd[2460]: [INF] EV LISTEN: "infix-services" "done" ID 1 priority 65535 processing success (after timeout or earlier error). 
Feb 25 23:57:04 infix startup[2480]: Feb 25 2024 23:57:04 root[2595]: Failed loading /cfg/startup-config.cfg, reverting to Fail Secure mode!
Feb 25 23:57:04 infix root[2595]: Failed loading /cfg/startup-config.cfg, reverting to Fail Secure mode!
Feb 25 23:57:04 infix finit[1]: Starting failure[2596]
Feb 25 23:57:05 infix failure[2596]: [INF] Connection 7 created.
Feb 25 23:57:05 infix failure[2596]: [INF] Session 29 (user "root", CID 7) created.

Either a finit-bug or a sysrepo-bug, test again after sysrepo upgrade.

mattiaswal avatar Feb 26 '24 09:02 mattiaswal

Do you have any timestamps from the logs?

It could be that the system is so slow that we're hitting this:

/** default timeout for change subscription callback (ms) */
#define SR_CHANGE_CB_TIMEOUT 5000

In that case, it would not really be a bug, per se. Though we might want to increase the timeout I suppose.

wkz avatar Feb 26 '24 10:02 wkz

Do you have any timestamps from the logs?

It could be that the system is so slow that we're hitting this:

/** default timeout for change subscription callback (ms) */
#define SR_CHANGE_CB_TIMEOUT 5000

In that case, it would not really be a bug, per se. Though we might want to increase the timeout I suppose.

I also considered that, very good suggestion to look at when we pick this one up again!

troglobit avatar Feb 26 '24 12:02 troglobit

Since this was seen on virtual without KVM the timeout scenario seems very likely.

mattiaswal avatar Feb 26 '24 12:02 mattiaswal

Tried to reproduce, ran test over night. No failure. Close for now.

mattiaswal avatar Aug 27 '24 05:08 mattiaswal