axgbe: Jumbo frames only work up to 4082 bytes for MTU
Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
- [x] I have read the contributing guidelines at https://github.com/opnsense/src/blob/master/CONTRIBUTING.md
- [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/src/issues?q=is%3Aissue
Describe the bug
I own a DEC2752 firewall from Deciso running OPNsense. I am trying to enable 9000-byte jumbo frames on the two 10Gbit "axgbe" ports. While the UI accepts the setting, and "ifconfig" on OPNsense also confirms it, only jumbo frames of up to 4082 bytes actually work.
To Reproduce
Steps to reproduce the behavior:
- Enable an MTU of 9000 on one of the axgbe interfaces (ax0 or ax1)
- Connect a Linux client, also with an MTU of 9000, to the corresponding port of the OPNsense firewall
- Perform a "ping -M do -s 8972 opnsense" from the Linux client to the OPNsense firewall
- See that the pings are not returned
- Perform a "ping -M do -s 4054 opnsense" from the Linux client to the OPNsense firewall
- See that the pings are returned
- Perform a "ping -M do -s 4055 opnsense" from the Linux client to the OPNsense firewall
- See that the pings are NOT returned
Expected behavior
If both the OPNsense and the client have set an MTU of 9000, then a ping with a packet size of 8972 bytes should work.
Additional context
I did some investigation into the FreeBSD driver code for axgbe. It seems that the receive buffer size is limited to 4k:
```c
int
xgbe_calc_rx_buf_size(struct ifnet *netdev, unsigned int mtu)
{
	unsigned int rx_buf_size;

	if (mtu > XGMAC_JUMBO_PACKET_MTU)
		return (-EINVAL);

	rx_buf_size = mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
	rx_buf_size = min(max(rx_buf_size, XGBE_RX_MIN_BUF_SIZE), PAGE_SIZE);
	rx_buf_size = (rx_buf_size + XGBE_RX_BUF_ALIGN - 1) &
	    ~(XGBE_RX_BUF_ALIGN - 1);

	return (rx_buf_size);
}
```
The line `rx_buf_size = min(max(rx_buf_size, XGBE_RX_MIN_BUF_SIZE), PAGE_SIZE);` effectively limits the buffer size to the page size. I suspect that either multiple (possibly contiguous) pages would need to be allocated, or maybe the hardware also supports chaining multiple memory pages.
I did some research: the AMD Ryzen V1500B is used in a couple of NAS devices from Synology and QNAP that offer 10Gbit interfaces with jumbo frames of at least 9000 bytes. I could not find an explicit test of the jumbo frame functionality (i.e. a ping with appropriate packet sizes), but some people report using 9000-byte jumbo frames. At least with Synology, I trust that if 9000-byte jumbo frames can be configured, they also work.
Therefore I assume that this is a driver limitation and not a hardware limitation.
Also see the discussion thread at https://forum.opnsense.org/index.php?topic=29359.0
Environment
OPNsense 25.4 business edition (amd64) DEC2752 (AMD Ryzen V1500B)
@kupferk thanks for the writeup, jumbo frame support on axgbe is indeed limited at the moment. In most support cases we have seen so far, customers who wanted to use larger frames often misunderstood when they are useful.
As in most cases traffic is going to a smaller pipe at some point, offloading fragmentation to the firewall is often not a great idea. I'll ask my colleague @swhite2 to take a look at whether we can patch this easily on our end when he has some time available.
@AdSchellevis thanks for your comments. You are completely right about the questionable usefulness of jumbo frames and fragmentation on the firewall. My use case would be a firewall between internal VLANs, where I would hope to get better throughput or lower CPU load on the OPNsense itself if jumbo frames were used.
Actually, it would already be helpful if the UI did not allow jumbo frames larger than 4k on axgbe interfaces, and if that limitation were mentioned in the specs of the affected Deciso products. That would at least clarify the situation and save users the time of discovering the limitation on their own :)
It would be practical to be able to offer validation of MTU sizes, but unfortunately there's no standard way to query the maximum a driver supports. The current options originate from https://github.com/opnsense/core/issues/4359, at which time there was also a lot of discussion about the validation.
I'll ask my colleague to check what it would take to support 9000 bytes, there is a chance the boundary check is just a leftover from the older driver of an ancient platform:
https://github.com/opnsense/src/blob/44b781cfe0b561909686778153915ec2b0ba5a21/sys/dev/axgbe/xgbe-drv.c#L266
@kupferk Thanks for providing the initial details, the code path you mentioned indeed clamps the buffer size, explaining the behavior you're seeing. Fixing this, however, only addresses the RX path.
Since iflib handles the allocations and buffer-alignment for DMA, https://github.com/opnsense/src/commit/f2e51f2174229edd6376099a7cc35c1bd60b68d6 should be all that's needed. I verified that ping is working on my end, but perhaps you can do a more extensive field test.
You can test the change with:

```
# opnsense-update -zkr 25.1.6-axgbe
```

(requires a reboot)
This is wonderful news @swhite2 , thanks for your work! I will try to find a good time window to give your patch a try and provide feedback.
@kupferk Great, thanks in advance for testing :)
FWIW, this kernel is also suitable for the BE (25.4), I'll ask Franco to add it to the business mirror as well.
Update: It's on the business mirror, same command
@swhite2 I did a test, with mixed results. Initially, after configuring an MTU of 9000, I could run a "ping -M do -s 8972 opnsense" from my Linux client to the OPNsense firewall. But after a couple of minutes of repeated ping experiments, the OPNsense crashed and rebooted.
I did find a crash file in /var/crash, but the crash location is surprising (at least for me):
```
Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 03
instruction pointer = 0x20:0xffffffff80cc2b83
stack pointer       = 0x28:0xfffffe00b187ec70
frame pointer       = 0x28:0xfffffe00b187ecc0
code segment        = base 0x0, limit 0xfffff, type 0x1b
                    = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process     = 3757 (python3.11)
rdi: fffffe00b187ec78 rsi: fffffe00b187ed18 rdx: fffff80002898700
rcx: fffffe00b187ecd8  r8: fffff800084b69b0  r9: fffffe00b3004430
rax: adacabaaa9a8a7a6 rbx: fffff801bdd6e8c0 rbp: fffffe00b187ecc0
r10: 00000000000000c0 r11: 000000000000001a r12: fffff800084b69b0
r13: 0000000000000000 r14: fffff80002898700 r15: fffffe00b187ed18
trap number         = 9
panic: general protection fault
cpuid = 3
time = 1747337375
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00b187e9b0
vpanic() at vpanic+0x131/frame 0xfffffe00b187eae0
panic() at panic+0x43/frame 0xfffffe00b187eb40
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe00b187eba0
calltrap() at calltrap+0x8/frame 0xfffffe00b187eba0
--- trap 0x9, rip = 0xffffffff80cc2b83, rsp = 0xfffffe00b187ec70, rbp = 0xfffffe00b187ecc0 ---
vn_statfile() at vn_statfile+0x53/frame 0xfffffe00b187ecc0
kern_fstat() at kern_fstat+0x6b/frame 0xfffffe00b187ed00
sys_fstat() at sys_fstat+0x1d/frame 0xfffffe00b187ee00
amd64_syscall() at amd64_syscall+0x10e/frame 0xfffffe00b187ef30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00b187ef30
--- syscall (551, FreeBSD ELF64, fstat), rip = 0x8257fb21a, rsp = 0x8209a7b68, rbp = 0x8209a7ca0 ---
```
Maybe unrelated? On the other hand, OPNsense was running rock solid before the experiment.
For the time being, I kept the kernel, but set back the MTU to 1500.
@kupferk The panic in a seemingly random location is due to general memory corruption. Looking at the code more closely, it seems that no matter what value we set for rx_buf_size, it always gets clamped to PAGE_SIZE in iflib anyway: https://github.com/opnsense/src/blob/stable/25.1/sys/net/iflib.c#L2454-L2482. For jumbo-sized packets it simply expects multiple segments to form a complete frame (igb and igc do the same thing: https://github.com/opnsense/src/blob/stable/25.1/sys/dev/e1000/if_em.c#L1590).
It's likely that when we increase rx_buf_size, we tell the NIC the kernel allocated 9024 bytes for each packet; it then DMAs the whole frame into memory and overwrites adjacent buffers in the ring (which are allocated at PAGE_SIZE). The current assumption is that there are no boundary checks on packet length versus allocated buffer size.
By the way, the stock kernel also receives the jumbo frames (check tcpdump), it's just corrupted data, as the NIC has been instructed there are only 4096 bytes available to write to (probably the last 4096 bytes of the packet DMA'd into memory).
The working theory is that 4096 bytes for a buffer should be perfectly fine, provided the NIC can split larger packets up into multiple segments (multiple descriptors), I just don't see how yet.
Thanks for your investigation @swhite2 . I also tried to do a comparison with the current Linux code, and saw that the clamping is also still in place. Unfortunately, the code diverged from the FreeBSD code a lot over time, so it's not easy (for me) to spot the bits that make the driver work on Linux.
Maybe it's simply not worth the effort (at least not now) if it's not low-hanging fruit :) I suggest adding some documentation for OPNsense/Deciso stating that jumbo frames are currently limited.
@kupferk Thanks for looking into it. For what it's worth, I don't think it works on Linux either. I believe the products you mentioned do not use axgbe for 10Gbit connectivity. If we can confirm that they do it might be worth looking into the difference in default hardware parameters.
I'll leave this ticket open for now as it's unclear to me how much time it will take to fix this. I'll discuss mentioning this in the docs later today :)
Just FYI @swhite2 according to https://forum.opnsense.org/index.php?topic=29359.msg237244#msg237244 jumbo frames are working fine on Linux on a DEC740. I cannot test myself, because I only have a single Deciso device.
@kupferk Fair enough, though looking at the driver code on the Linux side there are quite a few differences in packet handling - most notably the default use of the split-header feature (which significantly changes how the hardware marks packets), which we must keep disabled for netmap compatibility. If I have some spare time I will test whether this makes the NIC behave differently.
I did the test with VyOS (Debian-based) that is mentioned above.
Today I did a more complex test with a storage device on one 10G port (STORAGE_LAN) and a client on the other 10G port (CLIENT_LAN), with OPNsense in between. Mounting the share was fine, but writing data with dd crashed it right away.
2025-05-20_opnsense-mtu-9000-testing_crash-log.txt
~~Tomorrow I'll do the same test with VyOS to see how that behaves.~~ Done: I ran the same setup as with OPNsense but with VyOS, and there were no issues with MTU 9000. Data transfers worked in both directions.