frr
mgmtd crashes in rare cases, and the test fails more often
Description
When running the rip_passive_interface test in parallel I am getting a consistent crash with this decode:
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139778446252480) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=139778446252480) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=139778446252480, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007f20b4842476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#4 0x00007f20b4d4706e in core_handler (signo=6, siginfo=0x7ffe9bf1f130, context=0x7ffe9bf1f000) at lib/sigevent.c:268
#5 <signal handler called>
#6 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139778446252480) at ./nptl/pthread_kill.c:44
#7 __pthread_kill_internal (signo=6, threadid=139778446252480) at ./nptl/pthread_kill.c:78
#8 __GI___pthread_kill (threadid=139778446252480, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#9 0x00007f20b4842476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#10 0x00007f20b48287f3 in __GI_abort () at ./stdlib/abort.c:79
#11 0x00007f20b4d901be in _zlog_assert_failed (xref=0x559847a0ef80 <_xref.73>, extra=0x0) at lib/zlog.c:789
#12 0x00005598479cbeae in mgmt_txn_notify_be_txn_reply (txn_id=4, create=true, success=true, adapter=0x55984bba8230) at mgmtd/mgmt_txn.c:1524
#13 0x00005598479bc266 in be_adapter_handle_native_msg (adapter=0x55984bba8230, msg=0x55984bbd84d8, msg_len=32) at mgmtd/mgmt_be_adapter.c:628
#14 0x00005598479bc67b in mgmt_be_adapter_process_msg (version=1 '\001', data=0x55984bbd84d8 "\016", len=32, conn=0x55984b7a4bf0) at mgmtd/mgmt_be_adapter.c:696
#15 0x00007f20b4cf3070 in mgmt_msg_procbufs (ms=0x55984b7a4bf8, handle_msg=0x5598479bc623 <mgmt_be_adapter_process_msg>, user=0x55984b7a4bf0, debug=false) at lib/mgmt_msg.c:188
#16 0x00007f20b4cf3f7a in msg_conn_proc_msgs (thread=0x7ffe9bf201c0) at lib/mgmt_msg.c:521
#17 0x00007f20b4d62abe in event_call (thread=0x7ffe9bf201c0) at lib/event.c:2005
#18 0x00007f20b4cd51a0 in frr_run (loop=0x55984b6bae10) at lib/libfrr.c:1247
#19 0x00005598479b87f1 in main (argc=7, argv=0x7ffe9bf203f8) at mgmtd/mgmt_main.c:292
(gdb) f 12
#12 0x00005598479cbeae in mgmt_txn_notify_be_txn_reply (txn_id=4, create=true, success=true, adapter=0x55984bba8230) at mgmtd/mgmt_txn.c:1524
1524 assert(txn->commit_cfg_req);
(gdb) p txn->commit_cfg_req
$1 = (struct mgmt_txn_req *) 0x0
(gdb) l
1519 return -1;
1520
1521 if (!create && !txn->commit_cfg_req)
1522 return 0;
1523
1524 assert(txn->commit_cfg_req);
1525 cmtcfg_req = &txn->commit_cfg_req->req.commit_cfg;
1526 if (create) {
1527 if (success) {
1528 /*
(gdb) p create
$2 = true
(gdb)
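To make the failing path explicit, here is a minimal standalone sketch of the guard logic from the listing above. It is not the mgmtd code: the `sketch_*` types and the function name are hypothetical stand-ins, and only the `commit_cfg_req` pointer and the `create` flag are modelled. It shows that the early return on line 1521 only covers the `create == false` case, so a reply with `create == true` arriving when `commit_cfg_req` is NULL falls through to the assert.

```c
#include <assert.h>   /* only needed by callers that check the result */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real mgmtd structures; only the
 * field inspected in the gdb session is modelled. */
struct sketch_txn_req {
	int unused;
};

struct sketch_txn {
	struct sketch_txn_req *commit_cfg_req;
};

enum sketch_outcome {
	SKETCH_RETURN_EARLY, /* lines 1521-1522: !create and no commit request */
	SKETCH_ASSERT_FIRES, /* line 1524: assert(txn->commit_cfg_req) aborts */
	SKETCH_PROCEED,      /* commit request present, reply gets processed */
};

/* Mirrors the order of the checks shown in the gdb listing. */
enum sketch_outcome sketch_classify_reply(const struct sketch_txn *txn,
					  bool create)
{
	if (!create && !txn->commit_cfg_req)
		return SKETCH_RETURN_EARLY;

	if (!txn->commit_cfg_req)
		return SKETCH_ASSERT_FIRES;

	return SKETCH_PROCEED;
}
```

With `commit_cfg_req == NULL` and `create == true`, as in the core dump, this classifies the reply as `SKETCH_ASSERT_FIRES`. Why the commit request is already gone when the create reply arrives is not answered by this sketch.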
In addition I am getting regular test failures for this test. Sometimes mgmtd complains about this:
2025-06-12 09:22:11,888 DEBUG: r2: cmd_status("/bin/bash -c 'vtysh -f /etc/frr/frr.conf'")
2025-06-12 09:22:11,925 WARN: r2: Router(r2): proc failed: rc 1 pid 1312636
args: /usr/bin/nsenter --mount=/proc/1303572/ns/mnt --net=/proc/1303572/ns/net --uts=/proc/1303572/ns/uts -F --wd=/tmp/topotests/rip_passive_interface.test_a/r2 /bin/bash -c vtysh -f /etc/frr/frr.conf
stdout: [1312789|mgmtd] sending configuration
[1312837|ripd] sending configuration
[1312829|zebra] sending configuration
% commit failed session-id 4 on Unknown-FD-13 req-id 3 source-ds: candidate target-ds: running validate-only: 0: reason: 'Failed to create cfgdata: invalid address 0.0.10.1/32' (for MESSAGE_COMMCFG_REQ, client vty-mgmtd-1311739)
[1312789|mgmtd] Configuration file[/etc/frr/frr.conf] processing failure: 1
[1312837|ripd] done
[1312829|zebra] done
[1312968|staticd] sending configuration
[1312968|staticd] done
Waiting for children to finish applying config...
stderr: *empty*
Sometimes it does not.
When the test fails the support bundle for the rip process gives this:
Hello, this is FRRouting (version 10.5-dev).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
r2# show ip rip
% 2025/06/12 10:15:11.691
% RIP instance not found
r2# show ip rip status
% 2025/06/12 10:15:11.692
% RIP instance not found
r2#
Version
latest master
How to reproduce
Run rip_passive_interface in parallel. We are seeing this failure in our CI system (which is why I looked at this more closely).
Expected behavior
No mgmtd crash, and the test should pass.
Actual behavior
see above
Additional context
I like llamas
Checklist
- [x] I have searched the open issues for this bug.
- [x] I have not included sensitive information in this report.
It looks to me like there are several issues here. Let's start with this one, and as we peel the onion we can open more issues as needed.