mgmtd crashes on first start on 32-bit ARM
Description
Hello,
I am working on deploying FRR on an ARM based device, and I am investigating an issue with mgmtd systematically crashing on the first start. Here are all the details:
- I am using the Buildroot project on branch 2025.02 to build and install FRR
- so the FRR project version is 10.3
- I am using the provided systemd service
- I can reproduce the issue with the default /etc/frr/daemons configuration file (so no explicit enable on any service daemon)
- the systemd service is not enabled by default
- when manually starting the systemd service with systemctl start frr, the following splat is seen in the main console:
[ 196.675935] Alignment trap: not handling instruction edc40b02 at [<004d4bfc>]
[ 196.685597] 8<--- cut here ---
[ 196.688694] Unhandled fault: alignment exception (0x801) at 0x008697e1
[ 196.696432] [008697e1] *pgd=7fc85831
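For reference, the instruction word in the trap message can be decoded by hand. The sketch below assumes the A32 encoding of the double-precision VSTR/VLDR instructions (bits 27:24 = 1101, bit 21 = 0, bits 11:8 = 1011, U in bit 23, D in bit 22, load/store in bit 20, Rn in bits 19:16, Vd in bits 15:12, imm8 in bits 7:0, byte offset = imm8 * 4). The struct and function names are made up for illustration and are not part of FRR or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded fields of a 64-bit VFP load/store (hypothetical helper type). */
struct vfp_ldst {
	bool is_store;   /* true for VSTR, false for VLDR */
	unsigned dreg;   /* D register number, built from D:Vd */
	unsigned rn;     /* base core register number */
	int32_t offset;  /* signed byte offset applied to the base */
};

/* Returns true and fills *out if insn is a double-precision VSTR/VLDR. */
bool decode_vfp_ldst64(uint32_t insn, struct vfp_ldst *out)
{
	/* cond == 1111 selects the unconditional space, not VSTR/VLDR */
	if ((insn >> 28) == 0xfu)
		return false;
	/* bits 27:24 == 1101, bit 21 == 0, bits 11:8 == 1011 */
	if ((insn & 0x0f200f00u) != 0x0d000b00u)
		return false;

	out->is_store = ((insn >> 20) & 1u) == 0;
	out->dreg = (((insn >> 22) & 1u) << 4) | ((insn >> 12) & 0xfu);
	out->rn = (insn >> 16) & 0xfu;
	out->offset = (int32_t)((insn & 0xffu) * 4u);
	if (((insn >> 23) & 1u) == 0) /* U == 0 means subtract the offset */
		out->offset = -out->offset;
	return true;
}
```

Feeding 0xedc40b02 to this helper yields a store of d16 at [r4, #8], which is the vstr shown at mgmt_fe_adapter_send_notify+920 in the gdb disassembly further down. The trap address 0x004d4bfc and that instruction's address 0x00491bfc also share the same page offset (0xbfc), so the two most likely refer to the same instruction loaded at different base addresses.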
The system logs give the following:
systemd[1]: Starting FRRouting...
frrinit.sh[2793]: Starting watchfrr with command: ' /usr/sbin/watchfrr -d -F traditional zebra mgmtd ospf6d staticd'
watchfrr[2805]: [T83RR-8SM5G] watchfrr 10.3 starting: vty@0
watchfrr[2805]: [ZCJ3S-SPH5S] zebra state -> down : initial connection attempt failed
watchfrr[2805]: [ZCJ3S-SPH5S] mgmtd state -> down : initial connection attempt failed
watchfrr[2805]: [ZCJ3S-SPH5S] ospf6d state -> down : initial connection attempt failed
watchfrr[2805]: [ZCJ3S-SPH5S] staticd state -> down : initial connection attempt failed
watchfrr[2805]: [YFT0P-5Q5YX] Forked background command [pid 2806]: /usr/sbin/watchfrr.sh restart all
frrinit.sh[2817]: 2025/11/20 09:26:05 ZEBRA: [NNACN-54BDA][EC 4043309110] Disabling MPLS support (no kernel support)
kernel: Alignment trap: not handling instruction edc40b02 at [<004fbbfc>]
kernel: 8<--- cut here ---
kernel: Unhandled fault: alignment exception (0x801) at 0x008a08de
kernel: [008a08de] *pgd=7f8e1831
systemd-coredump[2846]: Process 2843 (mgmtd) of user 102 terminated abnormally with signal 7/BUS, processing...
systemd[1]: Created slice Slice /system/systemd-coredump.
systemd[1]: Started Process Core Dump (PID 2846/UID 0).
frrinit.sh[2863]: [2863|mgmtd] sending configuration
frrinit.sh[2864]: [2864|zebra] sending configuration
frrinit.sh[2868]: [2868|ospf6d] sending configuration
zebra[2830]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
frrinit.sh[2864]: [2864|zebra] done
ospf6d[2852]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
frrinit.sh[2868]: [2868|ospf6d] done
frrinit.sh[2878]: [2878|watchfrr] sending configuration
frrinit.sh[2880]: [2880|staticd] sending configuration
staticd[2855]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
frrinit.sh[2858]: Waiting for children to finish applying config...
watchfrr[2805]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
frrinit.sh[2880]: [2880|staticd] done
frrinit.sh[2878]: [2878|watchfrr] done
watchfrr[2805]: [QDG3Y-BY5TN] ospf6d state -> up : connect succeeded
watchfrr[2805]: [QDG3Y-BY5TN] staticd state -> up : connect succeeded
watchfrr[2805]: [QDG3Y-BY5TN] mgmtd state -> up : connect succeeded
watchfrr[2805]: [QDG3Y-BY5TN] zebra state -> up : connect succeeded
watchfrr[2805]: [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
frrinit.sh[2793]: Started watchfrr
systemd[1]: Started FRRouting.
The corresponding crashlog file in /var/tmp/frr/mgmtd.<PID>/ reports the following:
MGMTD: Received signal 7 at 1763629821 (si_addr 0x7b78de); aborting...
MGMTD: /lib/libfrr.so.0(zlog_backtrace_sigsafe+0x5c) [0xb6e68238]
MGMTD: /lib/libfrr.so.0(zlog_signal+0xe0) [0xb6e68428]
MGMTD: /lib/libfrr.so.0(+0x1014f0) [0xb6ebe4f0]
MGMTD: /lib/libc.so.6(__default_rt_sa_restorer+0) [0xb68e6d90]
MGMTD: /usr/sbin/mgmtd(mgmt_fe_adapter_send_notify+0x39c) [0x412c00]
MGMTD: /lib/libfrr.so.0(mgmt_msg_procbufs+0x140) [0xb6e763dc]
MGMTD: /lib/libfrr.so.0(+0xb950c) [0xb6e7650c]
MGMTD: /lib/libfrr.so.0(event_call+0x94) [0xb6ed6844]
MGMTD: /lib/libfrr.so.0(frr_run+0xd4) [0xb6e5d250]
MGMTD: /usr/sbin/mgmtd(main+0x1a4) [0x40a99c]
MGMTD: /lib/libc.so.6(+0x236b0) [0xb68d16b0]
MGMTD: /lib/libc.so.6(__libc_start_main+0x98) [0xb68d1790]
MGMTD: in thread msg_conn_proc_msgs scheduled from lib/mgmt_msg.c:548 msg_conn_sched_proc_msgs()
I have been able to retrieve the corresponding core file and validate this call stack in gdb:
(gdb) bt
#0 0xb6889cc4 in __pthread_kill_implementation () from ../../staging/lib/libc.so.6
#1 0xb6840df4 in raise () from ../../staging/lib/libc.so.6
#2 0xb6e19518 in core_handler (signo=7, siginfo=0xbef756c8, context=<optimized out>) at lib/sigevent.c:268
#3 <signal handler called>
#4 0x00491c00 in mgmt_fe_adapter_send_notify (msg=0x8368d6, msglen=238) at mgmtd/mgmt_fe_adapter.c:1972
#5 0xb6dd13dc in mgmt_msg_procbufs (ms=ms@entry=0x536358, handle_msg=0x48ca5c <mgmt_be_adapter_process_msg>, user=user@entry=0x536350, debug=<optimized out>) at lib/mgmt_msg.c:193
#6 0xb6dd150c in msg_conn_proc_msgs (thread=<optimized out>) at lib/mgmt_msg.c:526
#7 0xb6e31844 in event_call (thread=thread@entry=0xbef75b5c) at lib/event.c:1984
#8 0xb6db8250 in frr_run (master=0x4dec70) at lib/libfrr.c:1246
#9 0x0048999c in main (argc=6, argv=0xbef75db4) at mgmtd/mgmt_main.c:290
(gdb) frame 4
#4 0x00491c00 in mgmt_fe_adapter_send_notify (msg=0x8368d6, msglen=238) at mgmtd/mgmt_fe_adapter.c:1972
1972 }
(gdb) l
1967 }
1968 }
1969 }
1970
1971 msg->refer_id = 0;
1972 }
1973
1974 void mgmt_fe_adapter_lock(struct mgmt_fe_client_adapter *adapter)
1975 {
1976 adapter->refcount++;
(gdb) disas $pc-8,$pc+8
Dump of assembler code from 0x491bf8 to 0x491c08:
0x00491bf8 <mgmt_fe_adapter_send_notify+916>: add r2, pc, r2
0x00491bfc <mgmt_fe_adapter_send_notify+920>: vstr d16, [r4, #8]
=> 0x00491c00 <mgmt_fe_adapter_send_notify+924>: ldr r3, [r2, r3]
0x00491c04 <mgmt_fe_adapter_send_notify+928>: ldr r2, [r3]
End of assembler dump.
(gdb) info reg
r0 0x0 0
r1 0xb67a1e70 3061456496
r2 0x4d9c54 5086292
r3 0x33c 828
r4 0x8368d6 8612054
r5 0x536358 5464920
r6 0xf6 246
r7 0x0 0
r8 0x4d9c54 5086292
r9 0x8368f6 8612086
r10 0xee 238
r11 0xbef75a84 3203881604
r12 0x4d9cc0 5086400
sp 0xbef75a40 0xbef75a40
lr 0x491b10 4791056
pc 0x491c00 0x491c00 <mgmt_fe_adapter_send_notify+924>
cpsr 0x40070010 1074200592
fpscr 0x0 0
The service eventually starts properly and stays up after this transient crash.
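One possible reading of the register dump, offered as an assumption rather than a confirmed diagnosis: r4 holds msg (0x8368d6), and the faulting store is vstr d16, [r4, #8], so the target address is 0x8368de, which is 2-byte aligned but not 4-byte aligned. VSTR needs a word-aligned address, and the "not handling instruction" message indicates the kernel does not fix it up. The sketch below checks that arithmetic and shows the generic memcpy-based way of writing a 64-bit value to an arbitrary address. The helper names are hypothetical, and this is not necessarily how FRR does or should handle the message buffer.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True if addr is a multiple of align (align must be a power of two). */
bool is_aligned(uintptr_t addr, uintptr_t align)
{
	return (addr & (align - 1)) == 0;
}

/*
 * Store a 64-bit value at any byte address. memcpy carries no alignment
 * assumption, so the compiler may not emit an access that requires one.
 */
void store_u64_unaligned(void *dst, uint64_t value)
{
	memcpy(dst, &value, sizeof(value));
}

/* Counterpart of store_u64_unaligned() for reading the value back. */
uint64_t load_u64_unaligned(const void *src)
{
	uint64_t value;

	memcpy(&value, src, sizeof(value));
	return value;
}
```

With the values from the core file, is_aligned(0x8368d6 + 8, 4) is false while is_aligned(0x8368d6 + 8, 2) is true. The fault addresses in the other runs that end in 0x...de (0x008a08de, 0x7b78de) show the same 2-byte alignment.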
Version
FRRouting 10.3 ([redacted]) on Linux(6.12.47).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
'--target=arm-buildroot-linux-gnueabihf' '--host=arm-buildroot-linux-gnueabihf' '--build=x86_64-pc-linux-gnu' '--prefix=/usr' '--exec-prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--program-prefix=' '--disable-gtk-doc' '--disable-gtk-doc-html' '--disable-doc' '--disable-docs' '--disable-documentation' '--with-xmlto=no' '--with-fop=no' '--disable-dependency-tracking' '--enable-ipv6' '--disable-nls' '--disable-static' '--enable-shared' '--with-clippy=/workspace/build/output/host/bin/clippy' '--with-moduledir=/usr/lib/frr/modules' '--enable-configfile-mask=0640' '--enable-logfile-mask=0640' '--enable-multipath=256' '--disable-ospfclient' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-fpm' '--disable-bgp-bmp' '--disable-nhrpd' '--enable-capabilities' '--enable-config-rollbacks' '--disable-zeromq' '--disable-bfdd' 'build_alias=x86_64-pc-linux-gnu' 'host_alias=arm-buildroot-linux-gnueabihf' 'target_alias=arm-buildroot-linux-gnueabihf' 'AR=/workspace/build/output/host/bin/arm-linux-gcc-ar' 'LD=/workspace/build/output/host/bin/arm-linux-ld' 'OBJCOPY=/workspace/build/output/host/bin/arm-linux-objcopy' 'OBJDUMP=/workspace/build/output/host/bin/arm-linux-objdump' 'RANLIB=/workspace/build/output/host/bin/arm-linux-gcc-ranlib' 'STRIP=/workspace/build/output/host/bin/arm-linux-strip' 'PKG_CONFIG=/workspace/build/output/host/bin/pkg-config' 'CC=/workspace/build/output/host/bin/arm-linux-gcc' 'LDFLAGS=' 'LIBS=-latomic' 'CPP=/workspace/build/output/host/bin/arm-linux-cpp' 'CXX=/workspace/build/output/host/bin/arm-linux-g++'
How to reproduce
- clone Buildroot: git clone https://gitlab.com/buildroot.org/buildroot.git
- configure it for an ARM32 target with at least BR2_PACKAGE_FRR enabled
- deploy your image
- manually install the frr.service file onto your system (currently not auto-deployed by Buildroot, in progress)
- run frr: systemctl start frr.service
Expected behavior
The kernel should not raise any alignment trap exception for mgmtd.
There should be no start-stop-restart sequence of the FRR services on first start: only a single start, with all services running afterwards.
Actual behavior
See description above, especially the system logs.
Additional context
- I am using the Buildroot project on branch 2025.02 to build and install FRR
- so the FRR project version is 10.3
Checklist
- [x] I have searched the open issues for this bug.
- [x] I have not included sensitive information in this report.
I made some progress on the issue and identified an older version that does not suffer from the crash (frr-10.1). Based on this, I ran a bisect, which pointed to commit 597d79a89e6628645e3648bdc03db0d7bdc7e0f4. I then confirmed that:
- running frr-10.3 exposes the crash
- running frr-10.3 with 597d79a89e6628645e3648bdc03db0d7bdc7e0f4 reverted does not expose the crash