dhcrelay: can get stuck with 100% CPU usage in new implementation

Open alexandredulche opened this issue 1 year ago • 4 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [X] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
  • [X] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

Describe the bug

Since upgrading to 24.1.6, one of the DHCP relay processes drives OPNsense to 100% CPU usage. This happens at random, after a few hours or days of uptime.

To Reproduce

Steps to reproduce the behavior:

  1. Go to 'Service > DHCRelay' *
  2. Create a list of destination DHCP servers (in my case one local, one remote over site-to-site VPN)*
  3. Create a configuration on multiple interfaces (VLAN) including management VLAN *
  4. Tick "Agent Information" *
  5. Save *
  6. Let it run (and observe CPU usage)

*In my case the new relay configuration was created automatically upon upgrading to 24.1.6 (destination list is called "Migrated IPv4 server entry")
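
For step 6, one way to watch for the runaway process from a shell is to filter `ps` output. The sketch below is a minimal example, not part of OPNsense: the function name `busy_dhcrelay_pids` and the 90% threshold are illustrative assumptions, and it assumes the process name reported by `ps` is `dhcrelay`.

```shell
#!/bin/sh
# Sketch: report dhcrelay processes whose CPU usage is at or above a threshold.
# Input on stdin: output in the format of `ps -axo pid,pcpu,comm`.
# $1: threshold in percent (hypothetical example value: 90).
busy_dhcrelay_pids() {
    # Column 1 = PID, column 2 = %CPU, column 3 = command name.
    # The header line is skipped because its third column is not "dhcrelay".
    awk -v limit="$1" '$3 == "dhcrelay" && $2 + 0 >= limit + 0 { print $1 }'
}

# Example invocation (uncomment to run against live processes):
# ps -axo pid,pcpu,comm | busy_dhcrelay_pids 90
```

Fed from `ps -axo pid,pcpu,comm`, the function prints the PID of each matching process, one per line, which could then be logged from a cron job.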

Expected behavior

No heavy CPU usage should come from DHCP relay service.

Describe alternatives you considered

I tried disabling the DHCP relay on my management VLAN, where my destination DHCP servers reside. I tried creating a cron job to restart the DHCP relay service every hour, but that does not work. I now have a cron job rebooting the VM every morning. I upgraded to 24.1.7 today and am waiting to see whether the issue reappears.

Relevant log files

I don't know where to find logs for the new DHCP relay service.

Additional context

Apparently I'm not the only one facing the issue:

  • https://forum.opnsense.org/index.php?topic=40126.0
  • https://forum.opnsense.org/index.php?topic=40284.0

Environment

Software version used and hardware type if relevant, e.g.:

OPNsense 24.1.6-amd64

My setup :

Edge sites (x2) :

  • ESXi 8
  • OPNsense VM as main gateway
  • OPNsense VM as "helper" with DHCP relay for multiple VLANs
  • Multiple VLANs
  • Unifi switches
  • Windows Server VM with DHCP server (as standby)

Central site :

  • ESXi 8
  • OPNsense VM as main gateway
  • OPNsense VM as "helper" with DHCP relay for multiple VLANs
  • Multiple VLANs
  • Unifi switches
  • Windows Server VM with DHCP server (as standby)

Site-to-site WireGuard VPN

No DHCP guarding whatsoever on the UniFi side.

The OPNsense VMs (router and helper) all have an interface in each VLAN. The target DHCP servers on the edge sites are both the local and the central Windows DHCP servers.

This setup worked flawlessly for months (if not years) before 24.1.6.

alexandredulche avatar May 19 '24 20:05 alexandredulche

We are seeing the same issue. We need to restart the dhcrelay service once every 2 or 3 days to get DHCP Relay functionality working again. Even after the OPNsense 24.1.7_4-amd64 update.

Unit764 avatar May 23 '24 15:05 Unit764

We are currently debugging the issue but the problem is elusive. It seems to hit an error condition in the BPF packet capture that the daemon can't recover from. We will publish updates as we encounter them.

fichtner avatar May 24 '24 06:05 fichtner

We are having the same issue since 24.1.6. Restarting the DHCPv4 Relay services solves the issue for a few minutes.

Will the fix only be available in 24.7 or can we hope for a hotfix in any of the 24.1.x releases?

browne-net avatar May 24 '24 07:05 browne-net

A fix will be made available for all supported versions as soon as it is found.

fichtner avatar May 24 '24 07:05 fichtner

To test https://github.com/opnsense/dhcrelay/pull/1, install using the command below and re-apply the config via the gui.

REDACTED (see below)

AdSchellevis avatar May 27 '24 09:05 AdSchellevis

While debugging and writing this, we found that FreeBSD has three fixes in its tree, dating back to 2005/2006, for this particular code. The code is derived from dhclient, which in turn originates from common ISC code, and the fixes fit the problem perfectly.

https://github.com/freebsd/freebsd-src/commit/4eae015 https://github.com/freebsd/freebsd-src/commit/289d89d80 https://github.com/freebsd/freebsd-src/commit/ebe609b4a27

Here is a test package with the FreeBSD changes instead of the previous PR state by @AdSchellevis

# pkg add -f https://pkg.opnsense.org/FreeBSD:13:amd64/snapshots/misc/dhcrelay-0.4_4.pkg

All feedback on both binaries is welcome.

fichtner avatar May 27 '24 09:05 fichtner

Installed dhcrelay-0.4_4 4 hours ago and the issue seems to be gone. DHCRelay is working fine now and no high CPU usage visible.

TheHellSite avatar May 27 '24 14:05 TheHellSite

@TheHellSite woohoo! tentatively at least :)

fichtner avatar May 27 '24 14:05 fichtner

@fichtner We are also successfully running the patch for about 24h now without noticing any issues. DHCRelay is working fine again. I think this can be closed.

browne-net avatar May 28 '24 08:05 browne-net

@browne-net thanks we will ship in 24.1.8 tomorrow

fichtner avatar May 28 '24 08:05 fichtner

dhcrelay seems to be dropping BOOTREPLY messages if the source IP of the REPLY does not match the destination IP specified in the UI.

Previously, I was able to specify the VIP address of my DHCP servers in the DHCP relay config. BOOTREPLY from a different source IP (e.g. physical NIC of the active server) would still be forwarded to the client.

mileyceberus avatar May 29 '24 07:05 mileyceberus

@mileyceberus feel free to open a new ticket, but I don't quite understand what "VIP address of my DHCP servers" means. It just takes an address. It can be any address.

fichtner avatar May 29 '24 07:05 fichtner

when it's about source address, source nat is likely the place to look :)

AdSchellevis avatar May 29 '24 07:05 AdSchellevis

@fichtner, no problem. Happy to open a new ticket as required.

I was referring to the virtual ip (VIP/CARP) of my dhcp servers.

In the past, I could point dhcrelay to a VIP/CARP address. dhcrelay would simply pass the OFFER messages to the clients regardless of the source addresses (as these could change depending on which server is active).
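
One way to check whether replies really arrive from an address other than the configured destination is to filter a packet capture. The following is a rough sketch only: the helper name `unexpected_reply_sources` is hypothetical, and it assumes the usual `tcpdump -n` text format, where the third field is `source-address.port`.

```shell
#!/bin/sh
# Sketch: list BOOTP/DHCP reply source addresses that differ from the
# destination configured in the relay UI.
# Input on stdin: text output of `tcpdump -n` for UDP port 67.
# $1: the configured destination address (for example the VIP).
unexpected_reply_sources() {
    awk -v want="$1" '
        / IP / && /BOOTP\/DHCP, Reply/ {
            src = $3
            sub(/\.[0-9]+$/, "", src)      # drop the trailing ".port"
            if (src != want && !(src in seen)) {
                seen[src] = 1              # report each address once
                print src
            }
        }'
}

# Example invocation (interface and address are placeholders):
# tcpdump -n -l -i <INTERFACE> udp port 67 | unexpected_reply_sources <VIRTUAL_IP>
```

Any address it prints is a server answering from something other than the configured destination, which would be consistent with the behaviour described above.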

However, this behaviour seems to have changed.

mileyceberus avatar May 29 '24 07:05 mileyceberus

@AdSchellevis Thanks for the suggestion. I have made the change on my side and it seems to have resolved the issue.

For the benefit of those who may be experiencing similar issues, this is what I did on my DHCP servers.

iptables -t nat -A POSTROUTING -o <OUTBOUND_INTERFACE> -p udp --sport 67 -j SNAT --to <VIRTUAL_IP>
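
For anyone adapting the rule above, a small wrapper that prints the command for review before it is applied may help avoid typos. This is only a sketch: the function name `snat_rule` and the example values are hypothetical, and it spells the option as `--to-source`, the long form documented for the SNAT target.

```shell
#!/bin/sh
# Sketch: print (do not execute) the SNAT rule for DHCP server replies.
# $1: outbound interface, $2: virtual IP to use as the source address.
snat_rule() {
    printf 'iptables -t nat -A POSTROUTING -o %s -p udp --sport 67 -j SNAT --to-source %s\n' "$1" "$2"
}

# Example invocation with placeholder values:
# snat_rule eth0 192.0.2.10
```

After applying the printed command, `iptables -t nat -S POSTROUTING` lists the active rules so the new entry can be confirmed.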

mileyceberus avatar May 29 '24 08:05 mileyceberus