
Kernel panic after upgrade to 24.1 with HA state synchronization enabled

Open Kishi85 opened this issue 1 year ago • 1 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [X] I have read the contributing guidelines at https://github.com/opnsense/src/blob/master/CONTRIBUTING.md
  • [X] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/src/issues?q=is%3Aissue

Describe the bug

After upgrading the passive/backup primary node of the FW cluster from 23.7 to 24.1 (the secondary had been upgraded beforehand), it panics while starting the interfaces (apparently right at the cluster interface) with the following stack trace. The trace is cropped due to serial terminal limits, but a full crash report was submitted via the WebUI after working around the issue:

lo0: link state changed to UP
[fib_algo] inet.0 (bsearch4#32) rebuild_fd_flm: switching algo to radix4_lockless                                                                                                                                                                                               
Sleeping thread (tid 100538, pid 95063) owns a non-sleepable lock                                                                                                                                                                                                               
KDB: stack backtrace of thread 100538:                                                                                                                                                                                                                                          
sched_switch() at sched_switch+0x818/frame 0xfffffe0247de3a10                                                                                                                                                                                                                   
mi_switch() at mi_switch+0xc2/frame 0xfffffe0247de3a30                                                                                                                                                                                                                          
_sx_xlock_hard() at _sx_xlock_hard+0x3e4/frame 0xfffffe0247de3ae0                                                                                                                                                                                                               
in_leavegroup() at in_leavegroup+0x80/frame 0xfffffe0247de3b10                                                                                                                                                                                                                  
pfsync_multicast_cleanup() at pfsync_multicast_cleanup+0x2b/frame 0xfffffe0247de3b40                                                                                                                                                                                            
pfsyncioctl() at pfsyncioctl+0x6fd/frame 0xfffffe0247de3bc0                                                                                                                                                                                                                     
ifioctl() at ifioctl+0x7bc/frame 0xfffffe0247de3cc0                                                                                                                                                                                                                             
kern_ioctl() at kern_ioctl+0x26d/frame 0xfffffe0247de3d30                                                                                                                                                                                                                       
sys_ioctl() at sys_ioctl+0x100/frame 0xfffffe0247de3e00                                                                                                                                                                                                                         
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe0247de3f30                                                                                                                                                                                                                 
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0247de3f30                                                                                                                                                                                                      
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x17204d3321ca, rsp = 0x17204a309e78, rbp = 0x17204a309ec0 ---                                                                                                                                                                    
panic: sleeping thread                                                                                                                                                                                                                                                          
cpuid = 6                                                                                                                                                                                                                                                                       
time = 1714383055                                                                                                                                                                                                                                                               
KDB: stack backtrace:                                                                                                                                                                                                                                                           
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe02137c7980                                                                                                                                                                                                  
vpanic() at vpanic+0x151/frame 0xfffffe02137c79d0                                                                                                                                                                                                                               
panic() at panic+0x43/frame 0xfffffe02137c7a30                                                                                                                                                                                                                                  
propagate_priority() at propagate_priority+0x296/frame 0xfffffe02137c7a70                                                                                                                                                                                                       
turnstile_wait() at turnstile_wait+0x323/frame 0xfffffe02137c7ab0                                                                                                                                                                                                               
__mtx_lock_sleep() at __mtx_lock_sleep+0x180/frame 0xfffffe02137c7b40                                                                                                                                                                                                           
pfsyncioctl() at pfsyncioctl+0x91b/frame 0xfffffe02137c7bc0                                                                                                                                                                                                                     
ifioctl() at ifioctl+0x803/frame 0xfffffe02137c7cc0                                                                                                                                                                                                                             
kern_ioctl() at kern_ioctl+0x26d/frame 0xfffffe02137c7d30                                                                                                                                                                                                                       
sys_ioctl() at sys_ioctl+0x100/frame 0xfffffe02137c7e00                                                                                                                                                                                                                         
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe02137c7f30                                                                                                                                                                                                                 
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe02137c7f30                                                                                                                                                                                                      
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x2e515b92d1ca, rsp = 0x2e5157537fc8, rbp = 0x2e5157538000 ---                                                                                                                                                                    
KDB: enter: panic                                                                                                                                                                                                                                                               
[ thread pid 98568 tid 100537 ]                                                                                                                                                                                                                                                 
Stopped at      kdb_enter+0x37: movq    $0,0x1217e0e(%rip)                                                                                                                                                                                                                      
db:0:kdb.enter.default> textdump set                                                                                                                                                                                                                                            
textdump set                                                                                                                                                                                                                                                                    
db:0:kdb.enter.default>  capture on                                                                                                                                                                                                                                             
db:0:kdb.enter.default>  run lockinfo                                                                                                                                                                                                                                           
db:1:lockinfo> show locks                                                                                                                                                                                                                                                       
No such command; use "help" to list available commands                                                                                                                                                                                                                          
db:1:lockinfo>  show alllocks                                                                                                                                                                                                                                                   
No such command; use "help" to list available commands                                                                                                                                                                                                                          
db:1:lockinfo>  show lockedvnods                                                                                                                                                                                                                                                
Locked vnodes                                                                                                                                                                                                                                                                   
db:0:kdb.enter.default>  show pcpu                                                                                                                                                                                                                                              
cpuid        = 6                                                                                                                                                                                                                                                                
dynamic pcpu = 0xfffffe0154d6e300                                                                                                                                                                                                                                               
curthread    = 0xfffffe0214869740: pid 98568 tid 100537 critnest 1 "ifconfig"                                                                                                                                                                                                   
curpcb       = 0xfffffe0214869c50                                                                                                                                                                                                                                               
fpcurthread  = 0xfffffe0214869740: pid 98568 "ifconfig"                                                                                                                                                                                                                         
idlethread   = 0xfffffe017e889c80: tid 100009 "idle: cpu6"                                                                                                                                                                                                                      
self         = 0xffffffff82e16000                                                                                                                                                                                                                                               
curpmap      = 0xfffffe026506ab20                                                                                                                                                                                                                                               
tssp         = 0xffffffff82e16384                                                                                                                                                                                                                                               
rsp0         = 0xfffffe02137c8000                                                                                                                                                                                                                                               
kcr3         = 0x241bd8000                                                                                                                                                                                                                                                      
ucr3         = 0x241a2b000                                                                                                                                                                                                                                                      
scr3         = 0x241a2b000                                                                                                                                                                                                                                                      
gs32p        = 0xffffffff82e16404                                                                                                                                                                                                                                               
ldt          = 0xffffffff82e16444                                                                                                                                                                                                                                               
tss          = 0xffffffff82e16434                                                                                                                                                                                                                                               
curvnet      = 0xfffff80101648c40                                                                                                                                                                                                                                               
db:0:kdb.enter.default>  bt                                                                                                                                                                                                                                                     
Tracing pid 98568 tid 100537 td 0xfffffe0214869740                                                                                                                                                                                                                              
kdb_enter() at kdb_enter+0x37/frame 0xfffffe02137c7980                                                                                                                                                                                                                          
vpanic() at vpanic+0x182/frame 0xfffffe02137c79d0                                                                                                                                                                                                                               
panic() at panic+0x43/frame 0xfffffe02137c7a30                                                                                                                                                                                                                                  
propagate_priority() at propagate_priority+0x296/frame 0xfffffe02137c7a70                                                                                                                                                                                                       
turnstile_wait() at turnstile_wait+0x323/frame 0xfffffe02137c7ab0 
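
Reading the first backtrace from the bottom up shows how thread 100538 ended up asleep: `pfsyncioctl()` calls `pfsync_multicast_cleanup()`, which calls `in_leavegroup()`, which blocks in `_sx_xlock_hard()`. The second backtrace shows thread 100537 waiting in `__mtx_lock_sleep()` inside `pfsyncioctl()`, which suggests (this is an interpretation of the trace, not a confirmed diagnosis) that the first thread went to sleep while still holding a mutex the second one needed. The following is a minimal Python sketch for pulling that call chain out of a pasted KDB backtrace; the helper name `call_chain` and the regular expression are illustrative assumptions, not part of any OPNsense or FreeBSD tooling.

```python
import re

# Matches KDB backtrace frames such as
# "in_leavegroup() at in_leavegroup+0x80/frame 0xfffffe0247de3b10".
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+")


def call_chain(trace: str) -> list:
    """Return the function names of a KDB backtrace, outermost caller first."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(1))
    # KDB prints the innermost frame first, so reverse to get caller order.
    return frames[::-1]


# Frames copied from the first backtrace above (thread 100538).
SLEEPING_THREAD = """\
sched_switch() at sched_switch+0x818/frame 0xfffffe0247de3a10
mi_switch() at mi_switch+0xc2/frame 0xfffffe0247de3a30
_sx_xlock_hard() at _sx_xlock_hard+0x3e4/frame 0xfffffe0247de3ae0
in_leavegroup() at in_leavegroup+0x80/frame 0xfffffe0247de3b10
pfsync_multicast_cleanup() at pfsync_multicast_cleanup+0x2b/frame 0xfffffe0247de3b40
pfsyncioctl() at pfsyncioctl+0x6fd/frame 0xfffffe0247de3bc0
ifioctl() at ifioctl+0x7bc/frame 0xfffffe0247de3cc0
"""

chain = call_chain(SLEEPING_THREAD)
print(" -> ".join(chain))
```

The multicast cleanup path is the only place `in_leavegroup()` appears in the trace, which fits the observation that the panic is tied to the pfsync multicast setup.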

To Reproduce

Steps to reproduce the behavior:

  1. Upgrade secondary node from 23.7 to 24.1
  2. Switch over active/master to secondary node
  3. Upgrade primary node from 23.7 to 24.1 and let it reboot

Expected behavior

The primary node should upgrade and reboot without issues while HA state synchronization is enabled.

Describe alternatives you considered

After disabling HA state synchronization on the secondary, the primary node boots properly without problems. Failover is not smooth because states get lost, but it works for now.

Relevant log files

See stack trace above. Full crash report was submitted after boot succeeded using Firmware/Reporter.

Environment

Software version used and hardware type if relevant, e.g.:

OPNsense 24.1.6-amd64 FreeBSD 13.2-RELEASE-p11 OpenSSL 3.0.13

directly on Dell PowerEdge R6515 with 4x Broadcom Adv. Dual 25Gb Ethernet (everything on latest available firmware)

Kishi85 avatar Apr 29 '24 10:04 Kishi85

This also happens when upgrading to kernel 24.1.8, but I managed to work around it by setting the respective other firewall node as the unicast sync target IP in the UI, after which the boot loop stopped.

Kishi85 avatar Aug 05 '24 09:08 Kishi85

@Kishi85 does this still happen? 24.7 and 25.1 have vastly different code bases on FreeBSD 14.1 and 14.2 respectively.

fichtner avatar Feb 05 '25 11:02 fichtner

I have not been able to schedule the 25.1 update on our production cluster so far. It will probably happen sometime in March due to scheduling issues, unless critical security issues arise. That cluster is the only place we have seen this issue, because OPNsense runs on bare metal there for maximum performance. So I cannot tell whether it still happens until then. For us it only happens when using a multicast pfsync IP to synchronize; unicast works as intended.

Kishi85 avatar Feb 05 '25 12:02 Kishi85

No problem. I'll keep this open for a while longer, but I half-expect that these issues do not occur in this way anymore.

fichtner avatar Feb 05 '25 13:02 fichtner

@Kishi85 how's it going on this front? We have one more patch in the pipeline for 25.1.x that may help with that if it still occurs....

fichtner avatar Mar 21 '25 07:03 fichtner

I finally managed to upgrade my backup node to 25.1.4 and promoted it to primary today (the usual primary node will stay on backup duty on 24.7.12 until next Monday in case issues arise). So far I have not seen any issues, but as they say, every little helps, so if you have another patch that can make this more solid, I'd say go for it unless it is likely to cause problems.

Kishi85 avatar Apr 07 '25 09:04 Kishi85

The next fix will be in 25.1.5, but I doubt a bit that it will be relevant to your case... looks like 25.1.x is already fixed in this regard, but I don't mind leaving this open until it becomes clear.

Cheers, Franco

fichtner avatar Apr 07 '25 17:04 fichtner

No problems on 25.1.5_4 either. I upgraded my other node today, switched over, and upgraded the first from 25.1.4_1 to 25.1.5_4 as well.

Kishi85 avatar Apr 14 '25 07:04 Kishi85

Ok nice. Let's close then?

Cheers, Franco

fichtner avatar Apr 14 '25 07:04 fichtner