axgbe: Jumbo frames only work up to 4082 bytes for MTU

Open kupferk opened this issue 9 months ago • 13 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [x] I have read the contributing guide lines at https://github.com/opnsense/src/blob/master/CONTRIBUTING.md
  • [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/src/issues?q=is%3Aissue

Describe the bug

I own a DEC2752 firewall from Deciso running OPNsense. I am trying to enable 9000-byte jumbo frames on the two 10GBit "axgbe" ports. While the setting can be applied from the UI, and "ifconfig" on OPNsense confirms it, only jumbo frames up to 4082 bytes actually work.

To Reproduce

Steps to reproduce the behavior:

  1. Enable an MTU of 9000 for one of the axgbe interfaces (ax0 or ax1)
  2. Connect a Linux client, also with an MTU of 9000, to the corresponding port of the OPNsense firewall
  3. Perform a "ping -M do -s 8972 opnsense" from the Linux client to the OPNsense firewall
  4. See that pings won't be returned
  5. Perform a "ping -M do -s 4054 opnsense" from the Linux client to the OPNsense firewall
  6. See that pings will be returned
  7. Perform a "ping -M do -s 4055 opnsense" from the Linux client to the OPNsense firewall
  8. See that pings will NOT be returned

Expected behavior

If both the OPNsense and the client have set an MTU of 9000, then a ping with a packet size of 8972 bytes should work.
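
The packet sizes used in the steps above follow from the MTU by subtracting the IPv4 and ICMP headers. A minimal sketch of that arithmetic in C, assuming an IPv4 header without options (20 bytes) and an 8-byte ICMP echo header. The function and constant names are made up for illustration:

```c
/* Header sizes assumed for a plain IPv4 ICMP echo request (no IP options). */
enum {
	DEMO_IPV4_HDR_LEN = 20,
	DEMO_ICMP_HDR_LEN = 8,
	DEMO_ETHER_HDR_LEN = 14	/* dst MAC + src MAC + EtherType, no VLAN tag */
};

/* Largest "ping -s" payload that still fits in one unfragmented IP packet. */
static unsigned int
max_ping_payload(unsigned int mtu)
{
	return (mtu - DEMO_IPV4_HDR_LEN - DEMO_ICMP_HDR_LEN);
}

/* Size of the IP packet produced by "ping -s payload". */
static unsigned int
ip_len_for_payload(unsigned int payload)
{
	return (payload + DEMO_ICMP_HDR_LEN + DEMO_IPV4_HDR_LEN);
}

/* Ethernet frame length without FCS and without a VLAN tag. */
static unsigned int
frame_len_for_payload(unsigned int payload)
{
	return (ip_len_for_payload(payload) + DEMO_ETHER_HDR_LEN);
}
```

With these helpers, max_ping_payload(9000) gives 8972, ip_len_for_payload(4054) gives 4082, and frame_len_for_payload(4054) gives 4096. The last value would be consistent with a receive buffer of exactly one 4096-byte page, although that interpretation is an assumption and not something the ping test itself proves.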

Additional context

I did some investigation into the FreeBSD driver code for axgbe. It seems that the receive buffer size is limited to 4k:

int
xgbe_calc_rx_buf_size(struct ifnet *netdev, unsigned int mtu)
{
	unsigned int rx_buf_size;

	if (mtu > XGMAC_JUMBO_PACKET_MTU)
		return (-EINVAL);

	rx_buf_size = mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
	rx_buf_size = min(max(rx_buf_size, XGBE_RX_MIN_BUF_SIZE), PAGE_SIZE);
	rx_buf_size = (rx_buf_size + XGBE_RX_BUF_ALIGN - 1) &
	    ~(XGBE_RX_BUF_ALIGN - 1);

	return (rx_buf_size);
}

The line rx_buf_size = min(max(rx_buf_size, XGBE_RX_MIN_BUF_SIZE), PAGE_SIZE); effectively limits the buffer size to the page size. I suspect that either multiple (possibly contiguous) pages would need to be allocated, or that the hardware supports chaining multiple memory pages.
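
The effect of the clamp can be reproduced in userland. The sketch below copies the calculation quoted above into a standalone function. All constant values (4096-byte pages, 64-byte alignment, a 1522-byte minimum buffer) are assumptions for illustration and carry a DEMO_ prefix to make clear they are not taken from the driver headers:

```c
#include <errno.h>

/* Assumed values; check the axgbe headers for the real ones. */
#define DEMO_PAGE_SIZE		4096u	/* amd64 base page size */
#define DEMO_ETH_HLEN		14u
#define DEMO_ETH_FCS_LEN	4u
#define DEMO_VLAN_HLEN		4u
#define DEMO_JUMBO_PACKET_MTU	9000u
#define DEMO_RX_MIN_BUF_SIZE	(1500u + DEMO_ETH_HLEN + DEMO_ETH_FCS_LEN + \
				    DEMO_VLAN_HLEN)
#define DEMO_RX_BUF_ALIGN	64u

static unsigned int
demo_umin(unsigned int a, unsigned int b)
{
	return (a < b ? a : b);
}

static unsigned int
demo_umax(unsigned int a, unsigned int b)
{
	return (a > b ? a : b);
}

/* Userland copy of the RX buffer size calculation. */
static int
demo_calc_rx_buf_size(unsigned int mtu)
{
	unsigned int rx_buf_size;

	if (mtu > DEMO_JUMBO_PACKET_MTU)
		return (-EINVAL);

	rx_buf_size = mtu + DEMO_ETH_HLEN + DEMO_ETH_FCS_LEN + DEMO_VLAN_HLEN;
	/* Clamp to [DEMO_RX_MIN_BUF_SIZE, DEMO_PAGE_SIZE]. */
	rx_buf_size = demo_umin(demo_umax(rx_buf_size, DEMO_RX_MIN_BUF_SIZE),
	    DEMO_PAGE_SIZE);
	/* Round up to the next multiple of DEMO_RX_BUF_ALIGN. */
	rx_buf_size = (rx_buf_size + DEMO_RX_BUF_ALIGN - 1) &
	    ~(DEMO_RX_BUF_ALIGN - 1);

	return ((int)rx_buf_size);
}
```

Under these assumptions, any MTU of 4074 or more yields 4096, so the buffer size computed for a 9000-byte MTU is the same as the one computed for an MTU of roughly 4 KB.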

I did some research, and it seems that the AMD Ryzen V1500B is used in a couple of NAS devices from Synology and QNAP, which support 10GBit interfaces with jumbo frames of at least 9000 bytes. I could not find a review that explicitly tested the jumbo frame functionality (i.e. using a ping with appropriate packet sizes), but some people were using jumbo frames with 9000 bytes. At least with Synology, I trust that if one can configure 9000-byte jumbo frames, then they will also work.

Therefore I assume that this is a driver limitation and not a hardware limitation.

Also see the discussion thread at https://forum.opnsense.org/index.php?topic=29359.0

Environment

OPNsense 25.4 business edition (amd64) DEC2752 (AMD Ryzen V1500B)

kupferk avatar May 10 '25 12:05 kupferk

@kupferk thanks for the writeup, jumbo frames indeed have only limited support on axgbe at the moment. In most support cases we have seen so far, customers who wanted to use larger frames often misunderstood when they are useful.

As in most cases traffic is going to a smaller pipe at some point, offloading fragmentation to the firewall is often not a great idea. I'll ask my colleague @swhite2 to take a small look if we can patch this easily on our end when he has some time available.

AdSchellevis avatar May 11 '25 12:05 AdSchellevis

@AdSchellevis thanks for your comments. You are completely right about the questionable usefulness of jumbo frames and fragmentation on the firewall. My use case would be a firewall between internal VLANs, where I would hope to get better throughput or lower CPU load on the OPNsense itself if jumbo frames were used.

Actually, it would already be helpful if the UI did not allow jumbo frames larger than 4k on axgbe interfaces and if that limitation were mentioned in the specs of the affected Deciso products. That would at least clarify the situation and save users the time of discovering the limitation on their own :)

kupferk avatar May 11 '25 15:05 kupferk

It would be practical to be able to offer validations on MTU sizes, but unfortunately there's no standard way to find out the maximum a driver supports. The current options originate from https://github.com/opnsense/core/issues/4359, at which time there was also a lot of discussion about the validation.

I'll ask my colleague to check what it would take to support 9000 bytes, there is a chance the boundary check is just a leftover from the older driver of an ancient platform:

https://github.com/opnsense/src/blob/44b781cfe0b561909686778153915ec2b0ba5a21/sys/dev/axgbe/xgbe-drv.c#L266

AdSchellevis avatar May 11 '25 16:05 AdSchellevis

@kupferk Thanks for providing the initial details. The code path you mentioned indeed clamps the buffer size, which explains the behavior you're seeing. Fixing this, however, only fixes the RX path.

Since iflib handles the allocations and buffer-alignment for DMA, https://github.com/opnsense/src/commit/f2e51f2174229edd6376099a7cc35c1bd60b68d6 should be all that's needed. I verified that ping is working on my end, but perhaps you can do a more extensive field test.

You can test the change with

# opnsense-update -zkr 25.1.6-axgbe

(requires a reboot)

swhite2 avatar May 14 '25 15:05 swhite2

This is wonderful news @swhite2 , thanks for your work! I will try to find a good time window to give your patch a try and provide feedback.

kupferk avatar May 15 '25 06:05 kupferk

@kupferk Great, thanks in advance for testing :)

FWIW, this kernel is also suitable for the BE (25.4), I'll ask Franco to add it to the business mirror as well.

Update: It's on the business mirror, same command

swhite2 avatar May 15 '25 06:05 swhite2

@swhite2 I did a test, with mixed results. Initially, after configuring a MTU of 9000, I could run a "ping -M do -s opnsense" from my Linux client to the OPNsense firewall. But after a couple of minutes with repeated experiments with ping, the OPNsense crashed and rebooted.

I did find a crash file in /var/crash, but the crash location is surprising (at least for me):

Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 03
instruction pointer	= 0x20:0xffffffff80cc2b83
stack pointer	       = 0x28:0xfffffe00b187ec70
frame pointer	       = 0x28:0xfffffe00b187ecc0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 3757 (python3.11)
rdi: fffffe00b187ec78 rsi: fffffe00b187ed18 rdx: fffff80002898700
rcx: fffffe00b187ecd8  r8: fffff800084b69b0  r9: fffffe00b3004430
rax: adacabaaa9a8a7a6 rbx: fffff801bdd6e8c0 rbp: fffffe00b187ecc0
r10: 00000000000000c0 r11: 000000000000001a r12: fffff800084b69b0
r13: 0000000000000000 r14: fffff80002898700 r15: fffffe00b187ed18
trap number		= 9
panic: general protection fault
cpuid = 3
time = 1747337375
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00b187e9b0
vpanic() at vpanic+0x131/frame 0xfffffe00b187eae0
panic() at panic+0x43/frame 0xfffffe00b187eb40
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe00b187eba0
calltrap() at calltrap+0x8/frame 0xfffffe00b187eba0
--- trap 0x9, rip = 0xffffffff80cc2b83, rsp = 0xfffffe00b187ec70, rbp = 0xfffffe00b187ecc0 ---
vn_statfile() at vn_statfile+0x53/frame 0xfffffe00b187ecc0
kern_fstat() at kern_fstat+0x6b/frame 0xfffffe00b187ed00
sys_fstat() at sys_fstat+0x1d/frame 0xfffffe00b187ee00
amd64_syscall() at amd64_syscall+0x10e/frame 0xfffffe00b187ef30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00b187ef30
--- syscall (551, FreeBSD ELF64, fstat), rip = 0x8257fb21a, rsp = 0x8209a7b68, rbp = 0x8209a7ca0 ---

Maybe unrelated? On the other hand, OPNsense was running rock solid before the experiment.

For the time being, I kept the kernel, but set back the MTU to 1500.

kupferk avatar May 15 '25 19:05 kupferk

@kupferk The panic in a seemingly random location is due to general memory corruption. Looking at the code more closely it seems that no matter what value we set for the rx_buf_size, it always gets clamped to the PAGE_SIZE anyway in iflib: https://github.com/opnsense/src/blob/stable/25.1/sys/net/iflib.c#L2454-L2482. For jumbo-sized packets it simply expects multiple segments to form a complete frame (igb and igc do the same thing https://github.com/opnsense/src/blob/stable/25.1/sys/dev/e1000/if_em.c#L1590).

It's likely that when we increase rx_buf_size, we tell the NIC that the kernel allocated 9024 bytes for each packet, so it DMAs the whole frame to memory and starts overwriting other buffers in the ring (which are sized and aligned to PAGE_SIZE). The current assumption is that there are no boundary checks on packet length vs allocated buffer size.

By the way, the stock kernel also receives the jumbo frames (check tcpdump), but the data is corrupted, as the NIC has been told there are only 4096 bytes available to write to (probably the last 4096 bytes of the packet are what ends up DMA'd into memory).

The working theory is that 4096 bytes for a buffer should be perfectly fine, provided the NIC can split larger packets up into multiple segments (multiple descriptors), I just don't see how yet.
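
To put numbers on the multi-segment idea, here is a small sketch, assuming each RX descriptor points at one fixed-size buffer and the hardware fills buffers back to back. The helper names are hypothetical and not part of iflib or the driver:

```c
/* Number of fixed-size RX buffers (descriptors) needed to hold one frame. */
static unsigned int
rx_segments_needed(unsigned int frame_len, unsigned int buf_size)
{
	return ((frame_len + buf_size - 1) / buf_size);
}

/*
 * Bytes that would land past the end of a buffer if the NIC were told the
 * buffer can hold the whole frame while only buf_size bytes were actually
 * allocated for it.
 */
static unsigned int
rx_overrun_bytes(unsigned int frame_len, unsigned int buf_size)
{
	return (frame_len > buf_size ? frame_len - buf_size : 0);
}
```

For a 9018-byte frame (9000-byte MTU plus 14 bytes of Ethernet header and 4 bytes of FCS) and 4096-byte buffers, rx_segments_needed() returns 3, while rx_overrun_bytes() returns 4922. The second figure is how much a single oversized write could spill into neighbouring memory if the theory above is correct.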

swhite2 avatar May 16 '25 14:05 swhite2

Thanks for your investigation @swhite2 . I also tried to do a comparison with the current Linux code, and saw that the clamping is also still in place. Unfortunately, the code diverged from the FreeBSD code a lot over time, so it's not easy (for me) to spot the bits that make the driver work on Linux.

Maybe it's simply not worth the effort (at least not now) if it's not low-hanging fruit :) I suggest adding a note to the OPNsense/Deciso documentation that jumbo frames are currently limited.

kupferk avatar May 19 '25 06:05 kupferk

@kupferk Thanks for looking into it. For what it's worth, I don't think it works on Linux either. I believe the products you mentioned do not use axgbe for 10Gbit connectivity. If we can confirm that they do it might be worth looking into the difference in default hardware parameters.

I'll leave this ticket open for now as it's unclear to me how much time it will take to fix this. I'll discuss mentioning this in the docs later today :)

swhite2 avatar May 19 '25 06:05 swhite2

Just FYI @swhite2 according to https://forum.opnsense.org/index.php?topic=29359.msg237244#msg237244 jumbo frames are working fine on Linux on a DEC740. I cannot test myself, because I only have a single Deciso device.

kupferk avatar May 19 '25 08:05 kupferk

@kupferk Fair enough, though looking at the driver code on the Linux side there are quite a few differences in packet handling - most notably the default use of the split header function (which significantly changes how the hardware marks packets), which we must keep disabled for netmap compatibility. If I have some spare time I will test whether this would make the NIC behave differently.

swhite2 avatar May 20 '25 09:05 swhite2

I did the test with VyOS (Debian based) that is mentioned above.

Today I did a more complex test with OPNsense, with a storage device on one 10G port (STORAGE_LAN) and a client on the other 10G port (CLIENT_LAN). Mounting the share is fine, but creating data with dd made it crash right away.

2025-05-20_opnsense-mtu-9000-testing_crash-log.txt

~~Tomorrow I'll do the same test with VyOS to see how that behaves.~~ I have now done the tests with the same setup as with OPNsense but with VyOS: no issue with MTU 9000. Data transfers worked in both directions.

boretom avatar May 20 '25 16:05 boretom