
mgmtd crashes in rare cases, and the test fails more often

Open donaldsharp opened this issue 10 months ago • 3 comments

Description

When running the rip_passive_interface test in parallel I am getting a consistent crash with this decode:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139778446252480) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139778446252480) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139778446252480, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f20b4842476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f20b4d4706e in core_handler (signo=6, siginfo=0x7ffe9bf1f130, context=0x7ffe9bf1f000) at lib/sigevent.c:268
#5  <signal handler called>
#6  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139778446252480) at ./nptl/pthread_kill.c:44
#7  __pthread_kill_internal (signo=6, threadid=139778446252480) at ./nptl/pthread_kill.c:78
#8  __GI___pthread_kill (threadid=139778446252480, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#9  0x00007f20b4842476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#10 0x00007f20b48287f3 in __GI_abort () at ./stdlib/abort.c:79
#11 0x00007f20b4d901be in _zlog_assert_failed (xref=0x559847a0ef80 <_xref.73>, extra=0x0) at lib/zlog.c:789
#12 0x00005598479cbeae in mgmt_txn_notify_be_txn_reply (txn_id=4, create=true, success=true, adapter=0x55984bba8230) at mgmtd/mgmt_txn.c:1524
#13 0x00005598479bc266 in be_adapter_handle_native_msg (adapter=0x55984bba8230, msg=0x55984bbd84d8, msg_len=32) at mgmtd/mgmt_be_adapter.c:628
#14 0x00005598479bc67b in mgmt_be_adapter_process_msg (version=1 '\001', data=0x55984bbd84d8 "\016", len=32, conn=0x55984b7a4bf0) at mgmtd/mgmt_be_adapter.c:696
#15 0x00007f20b4cf3070 in mgmt_msg_procbufs (ms=0x55984b7a4bf8, handle_msg=0x5598479bc623 <mgmt_be_adapter_process_msg>, user=0x55984b7a4bf0, debug=false) at lib/mgmt_msg.c:188
#16 0x00007f20b4cf3f7a in msg_conn_proc_msgs (thread=0x7ffe9bf201c0) at lib/mgmt_msg.c:521
#17 0x00007f20b4d62abe in event_call (thread=0x7ffe9bf201c0) at lib/event.c:2005
#18 0x00007f20b4cd51a0 in frr_run (loop=0x55984b6bae10) at lib/libfrr.c:1247
#19 0x00005598479b87f1 in main (argc=7, argv=0x7ffe9bf203f8) at mgmtd/mgmt_main.c:292
(gdb) f 12
#12 0x00005598479cbeae in mgmt_txn_notify_be_txn_reply (txn_id=4, create=true, success=true, adapter=0x55984bba8230) at mgmtd/mgmt_txn.c:1524
1524		assert(txn->commit_cfg_req);
(gdb) p txn->commit_cfg_req
$1 = (struct mgmt_txn_req *) 0x0
(gdb) l
1519			return -1;
1520	
1521		if (!create && !txn->commit_cfg_req)
1522			return 0;
1523	
1524		assert(txn->commit_cfg_req);
1525		cmtcfg_req = &txn->commit_cfg_req->req.commit_cfg;
1526		if (create) {
1527			if (success) {
1528				/*
(gdb) p create
$2 = true
(gdb) 
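
The listing above can be read as a small decision table. The following is only a sketch of that guard logic, not the mgmtd code: the struct layouts, the `reply_path` enum and `classify_be_txn_reply` are hypothetical names introduced for illustration, and only the two checks at lines 1521 and 1524 are taken from the gdb output.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the mgmtd types; only the one field
 * inspected in the gdb session is modelled here. */
struct mgmt_txn_req {
	int placeholder;
};

struct mgmt_txn_ctx {
	struct mgmt_txn_req *commit_cfg_req;
};

/* Hypothetical result codes describing which path a reply would take. */
enum reply_path {
	REPLY_IGNORED,      /* early "return 0" at line 1522 */
	REPLY_WOULD_ASSERT, /* assert() at line 1524 would fail */
	REPLY_PROCESSED     /* execution continues past the assert */
};

/* Mirrors the two checks shown in the listing, without aborting. */
enum reply_path classify_be_txn_reply(const struct mgmt_txn_ctx *txn,
				      bool create)
{
	/* Line 1521: only a non-create reply tolerates a missing request. */
	if (!create && !txn->commit_cfg_req)
		return REPLY_IGNORED;

	/* Line 1524: the real code asserts here instead of returning. */
	if (!txn->commit_cfg_req)
		return REPLY_WOULD_ASSERT;

	return REPLY_PROCESSED;
}
```

With `create == true` and `commit_cfg_req == NULL`, as printed in the session, the early return is skipped and the sketch reports `REPLY_WOULD_ASSERT`. This matches frame 12 of the backtrace. Why the commit request is already gone when the create reply arrives is not shown by the sketch.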

In addition I am getting regular test failures for this test. Sometimes mgmtd complains about this:

2025-06-12 09:22:11,888 DEBUG: r2: cmd_status("/bin/bash -c 'vtysh -f /etc/frr/frr.conf'")
2025-06-12 09:22:11,925  WARN: r2: Router(r2): proc failed: rc 1 pid 1312636
        args: /usr/bin/nsenter --mount=/proc/1303572/ns/mnt --net=/proc/1303572/ns/net --uts=/proc/1303572/ns/uts -F --wd=/tmp/topotests/rip_passive_interface.test_a/r2 /bin/bash -c vtysh -f /etc/frr/frr.conf
        stdout: [1312789|mgmtd] sending configuration
[1312837|ripd] sending configuration
[1312829|zebra] sending configuration
% commit failed session-id 4 on Unknown-FD-13 req-id 3 source-ds: candidate target-ds: running validate-only: 0: reason: 'Failed to create cfgdata: invalid address 0.0.10.1/32' (for MESSAGE_COMMCFG_REQ, client vty-mgmtd-1311739)
[1312789|mgmtd] Configuration file[/etc/frr/frr.conf] processing failure: 1
[1312837|ripd] done
[1312829|zebra] done
[1312968|staticd] sending configuration
[1312968|staticd] done
Waiting for children to finish applying config...
        stderr: *empty*

Sometimes it does not.

When the test fails the support bundle for the rip process gives this:

Hello, this is FRRouting (version 10.5-dev).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

r2# show ip rip
% 2025/06/12 10:15:11.691

% RIP instance not found
r2# show ip rip status
% 2025/06/12 10:15:11.692

% RIP instance not found
r2#

Version

latest master

How to reproduce

Run rip_passive_interface in parallel. We are seeing this failure in our CI system (which is why I looked at this more closely).

Expected behavior

mgmtd should not crash, and the test should pass.

Actual behavior

see above

Additional context

I like Llama's

Checklist

  • [x] I have searched the open issues for this bug.
  • [x] I have not included sensitive information in this report.

donaldsharp, Jun 12 '25 14:06

It looks to me like there are several issues here. Let's start with this one, and as we peel the onion we can open more issues as needed.

donaldsharp, Jun 12 '25 14:06