Kernel panic after upgrade to 24.1 with HA state synchronization enabled
Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
- [X] I have read the contributing guidelines at https://github.com/opnsense/src/blob/master/CONTRIBUTING.md
- [X] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/src/issues?q=is%3Aissue
Describe the bug
After upgrading the passive/backup primary node of the FW cluster from 23.7 to 24.1 (the secondary had been upgraded beforehand), it panics while bringing up the interfaces (it appears to happen on the Cluster interface specifically) with the following stack trace (cropped due to serial terminal limits, but a full crash report was submitted via the WebUI after working around the issue):
lo0: link state changed to UP [317/15483]
[fib_algo] inet.0 (bsearch4#32) rebuild_fd_flm: switching algo to radix4_lockless
Sleeping thread (tid 100538, pid 95063) owns a non-sleepable lock
KDB: stack backtrace of thread 100538:
sched_switch() at sched_switch+0x818/frame 0xfffffe0247de3a10
mi_switch() at mi_switch+0xc2/frame 0xfffffe0247de3a30
_sx_xlock_hard() at _sx_xlock_hard+0x3e4/frame 0xfffffe0247de3ae0
in_leavegroup() at in_leavegroup+0x80/frame 0xfffffe0247de3b10
pfsync_multicast_cleanup() at pfsync_multicast_cleanup+0x2b/frame 0xfffffe0247de3b40
pfsyncioctl() at pfsyncioctl+0x6fd/frame 0xfffffe0247de3bc0
ifioctl() at ifioctl+0x7bc/frame 0xfffffe0247de3cc0
kern_ioctl() at kern_ioctl+0x26d/frame 0xfffffe0247de3d30
sys_ioctl() at sys_ioctl+0x100/frame 0xfffffe0247de3e00
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe0247de3f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0247de3f30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x17204d3321ca, rsp = 0x17204a309e78, rbp = 0x17204a309ec0 ---
panic: sleeping thread
cpuid = 6
time = 1714383055
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe02137c7980
vpanic() at vpanic+0x151/frame 0xfffffe02137c79d0
panic() at panic+0x43/frame 0xfffffe02137c7a30
propagate_priority() at propagate_priority+0x296/frame 0xfffffe02137c7a70
turnstile_wait() at turnstile_wait+0x323/frame 0xfffffe02137c7ab0
__mtx_lock_sleep() at __mtx_lock_sleep+0x180/frame 0xfffffe02137c7b40
pfsyncioctl() at pfsyncioctl+0x91b/frame 0xfffffe02137c7bc0
ifioctl() at ifioctl+0x803/frame 0xfffffe02137c7cc0
kern_ioctl() at kern_ioctl+0x26d/frame 0xfffffe02137c7d30
sys_ioctl() at sys_ioctl+0x100/frame 0xfffffe02137c7e00
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe02137c7f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe02137c7f30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x2e515b92d1ca, rsp = 0x2e5157537fc8, rbp = 0x2e5157538000 ---
KDB: enter: panic
[ thread pid 98568 tid 100537 ]
Stopped at kdb_enter+0x37: movq $0,0x1217e0e(%rip)
db:0:kdb.enter.default> textdump set
textdump set
db:0:kdb.enter.default> capture on
db:0:kdb.enter.default> run lockinfo
db:1:lockinfo> show locks
No such command; use "help" to list available commands
db:1:lockinfo> show alllocks
No such command; use "help" to list available commands
db:1:lockinfo> show lockedvnods
Locked vnodes
db:0:kdb.enter.default> show pcpu
cpuid = 6
dynamic pcpu = 0xfffffe0154d6e300
curthread = 0xfffffe0214869740: pid 98568 tid 100537 critnest 1 "ifconfig"
curpcb = 0xfffffe0214869c50
fpcurthread = 0xfffffe0214869740: pid 98568 "ifconfig"
idlethread = 0xfffffe017e889c80: tid 100009 "idle: cpu6"
self = 0xffffffff82e16000
curpmap = 0xfffffe026506ab20
tssp = 0xffffffff82e16384
rsp0 = 0xfffffe02137c8000
kcr3 = 0x241bd8000
ucr3 = 0x241a2b000
scr3 = 0x241a2b000
gs32p = 0xffffffff82e16404
ldt = 0xffffffff82e16444
tss = 0xffffffff82e16434
curvnet = 0xfffff80101648c40
db:0:kdb.enter.default> bt
Tracing pid 98568 tid 100537 td 0xfffffe0214869740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe02137c7980
vpanic() at vpanic+0x182/frame 0xfffffe02137c79d0
panic() at panic+0x43/frame 0xfffffe02137c7a30
propagate_priority() at propagate_priority+0x296/frame 0xfffffe02137c7a70
turnstile_wait() at turnstile_wait+0x323/frame 0xfffffe02137c7ab0
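For readers not familiar with the message: the trace shows one thread going to sleep on a sleepable sx lock (in_leavegroup() reached via pfsync_multicast_cleanup()) while still holding a non-sleepable lock, and a second thread (the ifconfig one) then contesting that lock from pfsyncioctl(), at which point propagate_priority() panics. The snippet below is only a minimal sketch of that pattern with made-up lock names, not the actual pfsync code:

```c
/*
 * Minimal sketch (NOT the actual pfsync code) of the pattern the panic
 * reports.  The lock names are placeholders for illustration only;
 * presumably the real ones are the pfsync softc mutex and the sx lock
 * taken inside in_leavegroup().
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/sx.h>

static struct mtx example_mtx;	/* non-sleepable lock, held across the cleanup */
static struct sx  example_sx;	/* sleepable lock, acquired while example_mtx is held */

MTX_SYSINIT(example_mtx, &example_mtx, "example mtx", MTX_DEF);
SX_SYSINIT(example_sx, &example_sx, "example sx");

static void
example_cleanup(void)
{
	mtx_lock(&example_mtx);
	/*
	 * sx_xlock() may sleep.  If it does, this thread is now a
	 * "sleeping thread that owns a non-sleepable lock"; any other
	 * thread that then contests example_mtx goes through
	 * turnstile_wait()/propagate_priority() and the kernel panics,
	 * which matches the two backtraces above.
	 */
	sx_xlock(&example_sx);
	/* ... leave the multicast group ... */
	sx_xunlock(&example_sx);
	mtx_unlock(&example_mtx);
}
```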
To Reproduce
Steps to reproduce the behavior:
- Upgrade secondary node from 23.7 to 24.1
- Switch over active/master to secondary node
- Upgrade primary node from 23.7 to 24.1 and let it reboot
Expected behavior
The primary node should upgrade and reboot without issues with HA state synchronization enabled.
Describe alternatives you considered
After disabling HA state synchronization on the secondary, the primary node boots properly without problems. Failover is not smooth because the states get lost, but it works for now.
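For what it's worth, a quick way to confirm that states were indeed not carried over after a failover is to look at the state table on the node that just became active. Nothing OPNsense-specific here, just standard pf tooling:

```
# On the newly active node right after failover; a near-empty state
# table suggests pfsync did not carry the states over.
pfctl -si | grep -i entries   # state table counters
pfctl -ss | wc -l             # rough number of firewall states
```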
Relevant log files
See stack trace above. A full crash report was submitted via Firmware/Reporter after the boot succeeded.
Environment
Software version used and hardware type if relevant, e.g.:
OPNsense 24.1.6-amd64 FreeBSD 13.2-RELEASE-p11 OpenSSL 3.0.13
Running directly (bare-metal) on a Dell PowerEdge R6515 with 4x Broadcom Adv. Dual 25Gb Ethernet (everything on the latest available firmware)
This also happens when upgrading to kernel 24.1.8, but I managed to work around it by setting the respective other firewall node as the unicast sync target IP via the UI, after which the boot loop stopped.
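For reference, that UI change corresponds roughly to putting a unicast syncpeer on the pfsync interface instead of the multicast default. The interface name and address below are placeholders, and OPNsense normally manages pfsync0 itself, so this is only meant to show what the setting translates to:

```
# Placeholders: igb1 = sync interface, 192.0.2.2 = the other node.
ifconfig pfsync0                                  # no syncpeer shown -> multicast (224.0.0.240) on the syncdev
ifconfig pfsync0 syncdev igb1 syncpeer 192.0.2.2  # unicast sync to the peer, as set via the UI
```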
@Kishi85 does this still happen? 24.7 and 25.1 have vastly different code bases on FreeBSD 14.1 and 14.2 respectively.
I've not been able to schedule an upgrade of our production cluster to 25.1 so far (it will probably happen sometime in March due to scheduling issues, unless critical security issues arise), and that is the only place we've seen this issue, since OPNsense runs bare-metal on that cluster for maximum performance. So until then I cannot tell whether it still happens (and for us it only happens when using a multicast pfsync IP for synchronization; unicast works as intended).
No problem. I'll keep this open for a while longer, but I half-expect that these issues do not occur in this way anymore.
@Kishi85 how's it going on this front? We have one more patch in the pipeline for 25.1.x that may help with that if it still occurs....
I've finally managed to upgrade my backup node to 25.1.4 and promote it to primary today (the usual primary node will stay on backup duty on 24.7.12 until next Monday, should issues arise). So far I have not seen any issues, but as they say, every little helps, so if you have another patch that can make this more solid, I'd say go for it unless it is prone to cause potential problems.
The next fix will be in 25.1.5, but I somewhat doubt it will be relevant to your case... it looks like 25.1.x is already fixed in this regard, but I don't mind leaving this open until it becomes clear.
Cheers, Franco
No problems on 25.1.5_4 either. I upgraded my other node today, switched over, and upgraded the first node from 25.1.4_1 to 25.1.5_4 as well.
Ok nice. Let's close then?
Cheers, Franco