packet errors on outgoing traffic

Open fabricemrchl opened this issue 3 years ago • 21 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [x] I have read the contributing guide lines at https://github.com/opnsense/src/blob/master/CONTRIBUTING.md
  • [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/src/issues?q=is%3Aissue

Describe the bug

Under high load, the VLAN interfaces report output errors, while the parent interface reports none. In my specific case Suricata is enabled. At Suricata startup a few errors are counted, fewer than 10 per interface. When I disable Suricata, no errors occur under high load.

I found two similar issues that were closed because a new version of OPNsense had been released: #87 and #74 (Suricata was not in use in those reports).

To Reproduce

Steps to reproduce the behavior:

  1. Configure VLAN on interface igb1
  2. Make some high load traffic on the interface (speed test, torrent download)
  3. Watch the output error counter on the interface increase

Expected behavior

Error count should be at 0

Describe alternatives you considered

I started with OPNsense 22.1.2 in a VM with virtio and e1000 drivers, and I thought the issue was specific to that setup. Then I received a DEC850 appliance and the result is the same: there are output errors on the VLAN interfaces.

Screenshots

If applicable, add screenshots to help explain your problem. Error_count_2

Environment

Current setup :

  • Appliance DEC850
  • OPNsense Business Edition 21.10.3
  • tested igb and ax interfaces

Test setup :

  • VM on proxmox 7
  • OPNSense community 22.1.2
  • tested virtio and e1000 interfaces

fabricemrchl avatar Mar 21 '22 19:03 fabricemrchl

Earlier investigation revealed that the timing during initialization of a VLAN interface was slightly off, so packets were sent out over a VLAN interface that wasn't up yet, producing some packet errors. These errors had no impact and were limited to about 3-10 errors during interface bring-up.

Seeing so many errors suggests problems elsewhere in the stack - but somehow related to the vlan transmit routine. I'll take a closer look.

Let me emphasize though that these errors are negligible and probably have no real impact.

swhite2 avatar Mar 25 '22 15:03 swhite2

@fabricemrchl I'm unable to reproduce the issue on my setup using suricata + high load traffic. Would you be able to share some debug output? The following would be helpful:

  • netstat -s
  • hardware-specific counters using sysctl -a | grep <device name, e.g. igb>

swhite2 avatar Mar 31 '22 06:03 swhite2

Hi @swhite2, please find the debug output attached.

sysctl.txt netstat.txt

fabricemrchl avatar Mar 31 '22 20:03 fabricemrchl

@fabricemrchl Thanks for the output. Do you have an estimate of the number of outbound errors at the time this output was recorded? I'm wondering how the errors scale relative to traffic, since the number of packets is much higher than in the original screenshot.

swhite2 avatar Apr 01 '22 06:04 swhite2

Around 2100 GB in and 550 GB out when I recorded the previous output. The ratio is low; during the last 10 days I did not have much traffic of the kind that generates errors. 10days_uptime_OPNSense

I restarted OPNsense and started a torrent download through a WireGuard VPN (the VPN and the torrent client run on a computer, not on the router). You can find the stats and debug output below. The ratio is much higher. Errors increase quickly with this kind of traffic, so I use it to test and reproduce, but they can occur with other kinds of traffic too. sysctl2.txt netstat2.txt 30min_uptime_OPNSense

fabricemrchl avatar Apr 01 '22 11:04 fabricemrchl

While I can't draw any conclusions from the data here, I noticed the VLAN virtual interface is very sensitive to output errors in its transmit routine. Most notably it reports errors when:

  • The parent interface is not up and running. I have been able to reproduce this by restarting a parent interface while packets are going through.
  • The kernel isn't able to prepend a valid 802.1Q header.
  • The parent interface's transmit routine fails for any reason (e.g. no buffers available).

Especially the last point is something that seems unique to VLAN virtual interfaces. Normally interfaces do not report this as an outbound interface error as far as I can see.
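The three error conditions listed above can be sketched as a toy model. This is a simplified illustration in Python, not the kernel's actual `if_vlan.c` code: the class names, the `header_ok` flag, the queue size, and every return code other than ENOBUFS are assumptions made for the sketch.

```python
from dataclasses import dataclass

# FreeBSD errno numbers, spelled out because Linux uses different values.
ENETDOWN = 50
ENOBUFS = 55


@dataclass
class ParentInterface:
    """Hypothetical stand-in for the physical interface below the VLAN."""
    up_and_running: bool = True
    free_tx_slots: int = 4  # toy transmit queue size

    def transmit(self, frame: bytes) -> int:
        # Fail with ENOBUFS once the toy transmit queue is full.
        if self.free_tx_slots == 0:
            return ENOBUFS
        self.free_tx_slots -= 1
        return 0


@dataclass
class VlanInterface:
    parent: ParentInterface
    header_ok: bool = True  # hypothetical: can the 802.1Q header be prepended?
    opackets: int = 0
    oerrors: int = 0

    def transmit(self, frame: bytes) -> int:
        # Condition 1: the parent interface is not up and running.
        if not self.parent.up_and_running:
            self.oerrors += 1
            return ENETDOWN  # assumed return code for this sketch
        # Condition 2: no valid 802.1Q header could be prepended.
        if not self.header_ok:
            self.oerrors += 1
            return 0  # placeholder return code for this sketch
        # Condition 3: the parent's transmit routine fails for any reason.
        error = self.parent.transmit(frame)
        if error != 0:
            self.oerrors += 1  # counted on the VLAN, not on the parent
        else:
            self.opackets += 1
        return error


parent = ParentInterface(free_tx_slots=2)
vlan = VlanInterface(parent=parent)
results = [vlan.transmit(b"frame") for _ in range(3)]
print(results, vlan.opackets, vlan.oerrors)
```

In this model the third frame finds the parent's queue full, so the error shows up only in the VLAN interface's counter, which matches the symptom of errors on the VLAN but none on the parent.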

Other things you can try:

  • Add an entry under System -> Settings -> Tunables that sets net.link.vlan.soft_pad to 1.
  • Toggle VLAN hardware filtering (either globally or per interface).

If you're able, the output of dmesg would also be very helpful.

swhite2 avatar Apr 01 '22 15:04 swhite2

I set net.link.vlan.soft_pad to 1. No change, the errors are still counting (the router was restarted after adding this setting). VLAN hardware filtering is currently disabled on my setup. I read almost everywhere that Suricata in IPS mode requires this feature to be disabled. Is it safe to enable it with Suricata IPS mode and promiscuous mode enabled?

Please find the dmesg and other associated debug output: debug_systclt3.txt debug_netstat3.txt debug_dmesg.txt

  • In/out packets: 3730493 / 11398747 (1009.10 MB / 14.95 GB)
  • In/out packets (pass): 3730095 / 11398747 (1009.07 MB / 14.95 GB)
  • In/out packets (block): 34711 / 0 (398 bytes / 0 bytes)
  • In/out errors: 0 / 21654
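To put the counters above into proportion, here is a minimal sketch of the error-ratio arithmetic. It assumes the out-packet counter is the right denominator; the function name is made up for illustration.

```python
def error_ratio(errors: int, packets: int) -> float:
    """Fraction of outbound packets that were counted as errors."""
    if packets <= 0:
        raise ValueError("packets must be positive")
    return errors / packets


# Values taken from the interface statistics quoted above.
out_packets = 11_398_747
out_errors = 21_654

ratio = error_ratio(out_errors, out_packets)
print(f"{ratio:.4%} of outbound packets, about {ratio * 1_000_000:.0f} per million")
```

With these numbers the ratio comes out at roughly 0.19% of outbound packets.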

fabricemrchl avatar Apr 02 '22 07:04 fabricemrchl

Which interface is running Suricata (IPS)? As indicated in the GUI and the docs, IPS shouldn't be run directly on VLAN interfaces, only on their parent interface. And yes, in IPS mode VLAN hardware filtering should be disabled. To rule out IPS being the culprit, you could try switching to IDS mode and toggling VLAN hardware filtering to see if this changes anything.

"At suricata startup some errors are counting , less than 10 per interfaces"

These errors are related to the interface startup and can be safely ignored.

I notice in your dmesg output that the netmap setup is ignoring the RX and TX descriptor overrides and is reverting back to 4 CPUs (though it is unclear which interface this relates to). Could you try setting net.inet.rss.enabled to 0 in the tunables section and rebooting the system? You can also force traffic flow to the driver over 1 CPU by setting dev.ax.0.rss_enabled to 0 as a tunable.

Since you're running on ax0, output from sysctl dev.ax.0 can also be helpful to rule out specific TX errors in the hardware.

swhite2 avatar Apr 04 '22 08:04 swhite2

The parent interface, named MANAGEMENT or SERVER in the last debug output, is running Suricata (IPS). I tried Suricata in IDS mode, also on the parent interface, and there are no errors in this mode. I believe IDS mode does not use netmap, so the problem comes either from Suricata or from netmap. I didn't enable VLAN hardware filtering. Switching from IPS to IDS with VLAN hardware filtering disabled is enough to stop the errors.

I tried Suricata IPS mode with net.inet.rss.enabled and dev.ax.0.rss_enabled set to 0. The errors are still counting, neither better nor worse.

sysctl dev.ax.0 result: debug_systclt4.txt

I will look through the Suricata bug reports to check whether there is a known issue. Maybe I need to tune some Suricata settings to fix those errors. If you have ideas on how to tune it, I can test them. Suricata running config dump: suricata_config.txt

fabricemrchl avatar Apr 04 '22 11:04 fabricemrchl

I've been able to reproduce the errors on my end with IPS on the parent interface. Netmap seems to be the culprit here somewhere: I built a custom kernel with https://github.com/opnsense/src/blob/stable/22.1/sys/net/if_vlan.c#L1260 removed and observed that the outbound errors remain at 0.

Why the transmit function fails is still unclear to me, but Suricata isn't the issue here.

swhite2 avatar Apr 06 '22 12:04 swhite2

Hello, do you have any news about this issue? Can I help you in any way?

fabricemrchl avatar Apr 27 '22 20:04 fabricemrchl

Hi,

This issue is still very much on my to-do list and I hope I can get back to you by the end of the week.

swhite2 avatar May 02 '22 19:05 swhite2

Hi @fabricemrchl,

Apologies for the later-than-expected reply, but it took some time to configure a working tracing setup due to regressions in the FreeBSD13-STABLE kernel. In any case, here is a preliminary result:

(running an iperf3 network test for ~5 seconds, OPNsense as a client to generate a lot of outbound traffic)

dtrace -n 'fbt::vlan_transmit:return { @ = lquantize((int32_t)arg1, 0, 100, 1); }' - tracing the return codes of the vlan_transmit function.

            value  ------------- Distribution ------------- count
             < 0 |                                           0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   295498
               1 |                                           0
               2 |                                           0
               3 |                                           0
                 .
                 .
                 .
              55 |                                           5
              56 |                                           0

In an error situation, the return code is 55.

According to https://www.freebsd.org/cgi/man.cgi?query=errno&sektion=2&manpath=freebsd-release-ports:

55 ENOBUFS No buffer space available. An operation on a socket or pipe was not performed because the system lacked sufficient buffer space or because a queue was full.

~~Again, normally when this happens, interfaces don't report these as outbound errors.~~ It seems most virtual interfaces actually do report these as outbound errors.
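The histogram above can also be read programmatically. The following is a minimal sketch that parses `lquantize` text output and labels the non-empty buckets; the regular expression and the small errno table are assumptions written for this example (errno numbers are FreeBSD's and differ on Linux), and the sample is an abbreviated copy of the output above.

```python
import re

# FreeBSD errno numbers relevant to this thread (Linux numbering differs).
FREEBSD_ERRNO = {0: "success", 50: "ENETDOWN", 55: "ENOBUFS"}

# Matches rows such as "   55 |            5"; the "< 0" row is skipped.
ROW = re.compile(r"^\s*(\d+)\s*\|[@ ]*?(\d+)\s*$")


def parse_lquantize(text: str) -> dict:
    """Map return code -> count for every non-empty histogram bucket."""
    buckets = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match and int(match.group(2)) > 0:
            buckets[int(match.group(1))] = int(match.group(2))
    return buckets


sample = """\
           value  ------------- Distribution ------------- count
             < 0 |                                           0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   295498
               1 |                                           0
              55 |                                           5
              56 |                                           0
"""

buckets = parse_lquantize(sample)
total = sum(buckets.values())
for code, count in sorted(buckets.items()):
    name = FREEBSD_ERRNO.get(code, "unknown")
    print(f"{code:>3} {name:<8} {count:>7} ({count / total:.4%})")
```

For this trace, only return codes 0 and 55 have non-zero counts, so every failed call to `vlan_transmit` in the sample window failed with ENOBUFS.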

swhite2 avatar May 11 '22 13:05 swhite2

@fabricemrchl Update:

Since Suricata in its current state uses only one thread to pass packets up to the host stack, it's easy to imagine buffers being exhausted, as Suricata is probably processing packets faster than the host stack can receive/transmit them.

When running the Suricata 6.0.5-devel package (part of the OPNsense development firmware), the opposite is true: multiple threads are used to tackle this issue, increasing throughput by roughly 1.5 Gbit/s while simultaneously eliminating the outbound errors.

I have found no workaround for the outbound errors in the current state. At most they cause some congestion when operating at line speed. It is very unlikely that a system is fully saturated all of the time, so these errors are spurious. Until the Suricata package is in a working stable state, there isn't much I can do (I have experimented with netmap and system tunables). You are of course free to switch to the development version should your setup allow for such a thing :)
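The buffer-exhaustion explanation can be illustrated with a toy bounded-queue model. This is only a sketch under invented assumptions: one queue, fixed per-tick rates, and made-up numbers. It does not model netmap or the real host stack.

```python
def simulate(offered_per_tick: int, drained_per_tick: int,
             queue_slots: int, ticks: int) -> tuple:
    """Return (sent, rejected) for a single bounded transmit queue."""
    queued = sent = rejected = 0
    for _ in range(ticks):
        # Producer side: a packet that finds the queue full is rejected,
        # which is the situation ENOBUFS reports.
        for _ in range(offered_per_tick):
            if queued < queue_slots:
                queued += 1
            else:
                rejected += 1
        # Consumer side: drain as much as the slower side handles per tick.
        done = min(queued, drained_per_tick)
        queued -= done
        sent += done
    return sent, rejected


# Producer faster than consumer: the queue fills up, then rejections start.
overloaded = simulate(offered_per_tick=10, drained_per_tick=8,
                      queue_slots=64, ticks=1000)
# Producer slower than consumer: the queue never fills.
relaxed = simulate(offered_per_tick=5, drained_per_tick=8,
                   queue_slots=64, ticks=1000)
print("overloaded:", overloaded)
print("relaxed:", relaxed)
```

In the model, rejections appear only while the offered rate stays above the drain rate for long enough to fill the queue; once the offered rate drops below the drain rate, the rejection count stays at zero.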

swhite2 avatar May 11 '22 15:05 swhite2

Hi @swhite2, thank you for the debugging and the information. No problem about the workaround; if the issue is fixed in the next Suricata release, that will be perfect. I'm currently running OPNsense 22.4. If there is an easy path to test Suricata 6.0.5-devel and revert to 6.0.4_1, I can test it. I have only one router, so I can't switch to the devel branch. If I can upgrade only the Suricata package, I'm fine testing it on my side. If not, I will wait for a non-devel release.

fabricemrchl avatar May 11 '22 19:05 fabricemrchl

Unfortunately no, the easiest way to switch is to switch to the OPNsense-devel package entirely. This replaces the core package as well. The only other way to isolate it is to build suricata-devel from source, deinstalling the current version (without using pkg) and installing the newly built one. If you'd like to try this I can provide instructions for it, but be aware it's not ready for a production environment.

It is unclear when the Suricata package with the netmap changes is ready for release. There are still known bugs causing potential lockups.

swhite2 avatar May 12 '22 08:05 swhite2

No problem, I will wait for a production-ready package.

fabricemrchl avatar May 12 '22 21:05 fabricemrchl

I have been trying to solve this exact problem for days, ever since we got a 1 Gbit/s internet connection. We see around 60,000 errors per 100 GB, which sometimes causes failed downloads and similar problems.

I solved this by shaping the WAN speed to 500 Mbit/s, which stops the interface errors. Not a great solution, but it works.

geludwig avatar May 13 '22 17:05 geludwig

"On high load, there are errors out on VLAN interfaces. No error on parent interface."

I have this same issue on VLAN interface on my WAN. My ISP uses PPPoE over VLAN. I sometimes randomly get a lot of errors out on VLAN 6 on WAN interface, and that basically drops my internet for a minute, and it needs to be re-negotiated then. Of course, this is very annoying. I am not even using IDS/IDP, just a basic setup. Any way I can troubleshoot this?

Thanks

kalpik avatar Aug 17 '22 08:08 kalpik

Is it the errors dropping your connection, or is there some other form of flapping going on (which in turn would cause outbound errors to accumulate)? Maybe check the dmesg output for recurring linkup/linkdown messages, as well as the general system log.
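One way to check for recurring link flaps is to count link state messages per interface. The sketch below assumes the usual FreeBSD kernel message form "<ifname>: link state changed to UP/DOWN". The sample lines are illustrative only and are not taken from any system in this thread.

```python
import re
from collections import Counter

# Assumed message format: "igb1: link state changed to DOWN"
LINK_EVENT = re.compile(r"^([\w.]+): link state changed to (UP|DOWN)$")


def count_link_events(dmesg_text: str) -> Counter:
    """Count (interface, state) link transitions found in dmesg output."""
    events = Counter()
    for line in dmesg_text.splitlines():
        match = LINK_EVENT.match(line.strip())
        if match:
            events[(match.group(1), match.group(2))] += 1
    return events


# Illustrative sample input, not real output from the reporter's machine.
sample = """\
igb1: link state changed to DOWN
igb1: link state changed to UP
some unrelated kernel message
igb1: link state changed to DOWN
igb1: link state changed to UP
"""

events = count_link_events(sample)
for (ifname, state), count in sorted(events.items()):
    print(f"{ifname} went {state} {count} time(s)")
```

On a live system, the text could come from the output of `dmesg` saved to a file. Repeated DOWN/UP pairs on the parent interface would point to flapping rather than to the error counter itself.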

swhite2 avatar Aug 17 '22 08:08 swhite2

I can only correlate it to errors. I see the PPPoE connection drop because there's no ECHO reply. And this coincides with the sudden accumulation of errors on the WAN VLAN 6.

image

kalpik avatar Aug 17 '22 09:08 kalpik

Suricata 6.0.9 has implemented the new netmap API (https://redmine.openinfosecfoundation.org/versions/184), and this version is built into OPNsense 22.7.9. Does that mean this issue will improve, or do you need to implement something else on the OPNsense side?

I'm currently on business edition so I can't test it now.

fabricemrchl avatar Dec 01 '22 15:12 fabricemrchl

Nothing will change for our Suricata 6. The newer netmap changes have been available for testing for at least half a year, but I doubt they do any magic here.

fichtner avatar Dec 01 '22 15:12 fichtner

I misunderstood something. I thought the Suricata devel package mentioned by swhite2 here was about the new netmap API. So is there any ETA for the Suricata devel package with multi-thread support?

fabricemrchl avatar Dec 01 '22 16:12 fabricemrchl

No ETA, it’s been there for a long time as I said.

fichtner avatar Dec 01 '22 16:12 fichtner

The new netmap api is already being used in 23.7.

fichtner avatar Jan 05 '24 08:01 fichtner