feat: support ExternalTrafficPolicy Cluster with IPv6 LoadBalancer services
What type of PR is this?
/kind feature
What this PR does / why we need it:
In AKS, nodes act as routers and forward traffic into a container's network namespace. Per RFC 4291, "Routers must not forward any packets with Link-Local source or destination addresses to other links", which means that when a probe packet with a link-local source address arrives on the node to check the health of the service, the node cannot forward that packet into the network namespace of a pod running on that node. This prevents Kubernetes LoadBalancer services from using externalTrafficPolicy: Cluster, because the health probe has to be forwarded to a pod backing the service (versus ETP: Local, where the health probe is answered by kube-proxy itself on the node, so no probe packets need to be forwarded).
By changing the source IP of the probe packet to a globally unique address (described below), nodes are able to forward the probe packet to the correct destination network namespace, enabling the use of externalTrafficPolicy: Cluster for IPv6 services in a dual-stack AKS cluster.
This PR adds a new step in CSE, `configureCNINFTables`, which runs as part of the CNI setup function and creates nftables rules that rewrite the source IP of IPv6 probe packets from the Azure Load Balancer probe's link-local address, `fe80::1234:5678:9abc`, to a globally unique address, `2603:1062:0:1:fe80:1234:5678:9abc`.
The new rules try to be as non-disruptive as possible and are set up as stateless NAT rules, meaning no connection tracking occurs.
These new rules will only be written for clusters that have IPv6 enabled (currently only dual-stack).
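For illustration, a stateless rewrite of this kind in nftables could look roughly like the following (this is a sketch, not necessarily the exact table/chain names or rules written by `configureCNINFTables`):

```bash
# Sketch only: table/chain names are placeholders.
# Rewrite the probe's link-local source to the reserved global address before
# the routing decision (prerouting, raw priority -300), so the node can forward
# the packet into the pod's network namespace.
nft add table ip6 azureSLBProbe
nft add chain ip6 azureSLBProbe prerouting '{ type filter hook prerouting priority -300 ; policy accept ; }'
nft add rule ip6 azureSLBProbe prerouting \
    ip6 saddr fe80::1234:5678:9abc ip6 saddr set 2603:1062:0:1:fe80:1234:5678:9abc

# Because the rewrite is stateless (no conntrack NAT to undo it), replies headed
# back to the global address need the mirror-image rewrite on the way out.
nft add chain ip6 azureSLBProbe postrouting '{ type filter hook postrouting priority 100 ; policy accept ; }'
nft add rule ip6 azureSLBProbe postrouting \
    ip6 daddr 2603:1062:0:1:fe80:1234:5678:9abc ip6 daddr set fe80::1234:5678:9abc
```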
Which issue(s) this PR fixes:
Fixes #
Requirements:
- [ ] uses conventional commit messages
- [ ] includes documentation
- [x] adds unit tests
- [x] tested upgrade from previous version
Special notes for your reviewer:
- Adds `nftables` to the image dependencies. Happy to discuss whether these rules should just be written with iptables instead, but there may be advantages to moving toward nftables in the future, as it offers better performance and finer control over rule priority.
- Updates the KubernetesConfig model to include the recent additions for IPv6 clusters (ipFamilies, clusterSubnets, and serviceCIDRs).
I'm okay with these changes conceptually but have a few clarifying questions (especially since I am not familiar with the RFC)
> By changing the src IP of the probe packet to the globally unique address
Just reiterating...we can't forward the packet because source IP is link local, but we can SNAT it then forward it?
> a globally unique address 2603:1062:0:1:fe80:1234:5678:9abc
what is this? looks like it's publicly routable, this is the ipv6 equivalent of 168.63.129.16 for Azure LB?
How is this different from ipv4? Seems like with ipv4 source IP is already 168.63.129.16, is there something semantically wrong with using a non-link local address with ipv6 from the LB side as well (really a question more for SLB team I guess, but unfortunately picking on you)
and yeah, why nftables 🙂 not really opposed, but would love a thorough sanity check that this doesn't mess with e.g. kube-proxy
> Just reiterating...we can't forward the packet because source IP is link local, but we can SNAT it then forward it?
Yes, if a packet with a source address matching fe80::/10 gets to the routing step, it gets dropped as "beyond scope".
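For reference, one way to confirm this on a node (a sketch; the interface name is an assumption) is to watch for the corresponding ICMPv6 error, "destination unreachable, beyond scope of source address" (type 1, code 2, per RFC 4443), which is where the "beyond scope" wording comes from:

```bash
# Capture ICMPv6 destination-unreachable, code 2 ("beyond scope of source address").
# ip6[40] is the ICMPv6 type, ip6[41] the code (assumes no extension headers).
tcpdump -ni eth0 'icmp6 and ip6[40] == 1 and ip6[41] == 2'
```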
> a globally unique address 2603:1062:0:1:fe80:1234:5678:9abc
An IP address assigned permanently by Dawn Bedard from Microsoft owned space specifically for AKS probe SNAT. It'll never be used for anything else, so we don't have to worry about these rules breaking something else down the line.
> How is this different from ipv4? Seems like with ipv4 source IP is already 168.63.129.16, is there something semantically wrong with using a non-link local address with ipv6 from the LB side as well (really a question more for SLB team I guess, but unfortunately picking on you)
I talked to SLB extensively about this - they agree with the problem of link-local addresses and that it shouldn't have been done this way, but changing it carries a huge amount of risk and technical debt and would take more than a semester. So, workaround.
> and yeah, why nftables 🙂 not really opposed, but would love a thorough sanity check that this doesn't mess with e.g. kube-proxy
You literally can't make this change in iptables: iptables only allows SNAT in the nat table's POSTROUTING chain and won't let you mangle the source IP before the routing decision, and the packet has already been dropped by then. nftables is more flexible and lets you mangle the source address early (in prerouting at priority -300) before the rest of the stack really sees it.
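To make the ordering argument concrete (table/chain names below are placeholders, not the ones in this PR): a chain hooked at prerouting priority -300 runs before conntrack (priority -200) and before any iptables-style mangle/nat prerouting chains, so kube-proxy's rules only ever see the already-rewritten source address.

```bash
# Placeholder names; only the hook and priority matter for the argument.
nft add table ip6 probe_demo
nft add chain ip6 probe_demo pre '{ type filter hook prerouting priority -300 ; policy accept ; }'

# Inspect the resulting ip6 ruleset; lower priority values run earlier in the hook.
nft list ruleset ip6
```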
Isn't this going to fail on nodes running the new CSE but an old node image, because they won't have nftables installed? Do we need a `command -v nft >/dev/null || apt-get install -y nftables` for Ubuntu?
alternatively, we can bake this into the vhd as a systemd unit/oneshot or similar. downside is it keeps it pinned to the node image. benefit is less CSE which we do want. depends on UX
> Mariner
let's see if we can get this working on mariner, cc @hbeberman
> alternatively, we can bake this into the vhd as a systemd unit/oneshot or similar.
latest updates move it to a systemd unit with a 10s delay between restarts if it fails. The script will check if the instance has IPv6 configured with a call to IMDS, looking at network.interface for non-empty IPv6 private IP addresses. The script will also install nftables if it isn't detected.
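Roughly, the unit's script does something like the sketch below (file names, the IMDS api-version, and the jq path are assumptions for illustration, not the exact ones in this PR):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Ask IMDS whether this instance has any IPv6 private addresses configured.
imds="http://169.254.169.254/metadata/instance/network/interface?api-version=2021-02-01"
ipv6_addrs="$(curl -s -H Metadata:true "$imds" | jq -r '.[].ipv6.ipAddress[]?.privateIpAddress // empty')"

if [ -z "$ipv6_addrs" ]; then
    echo "no IPv6 configured on this instance; skipping nftables setup"
    exit 0
fi

# Install nftables if it isn't already present (Ubuntu shown; Mariner uses its own package manager).
command -v nft >/dev/null 2>&1 || apt-get install -y nftables

# Apply the probe SNAT rules; the unit's 10s restart delay covers transient failures.
nft -f /opt/azure/containers/ipv6-probe-snat.rules
```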
couple remaining minor comments, LGTM after that 👍
I think the PR build is busted because we had to force PRs to come from this repo rather than forks (one of our PR checks uses secrets; TODO to refactor that). Would you mind reopening this PR from a branch pushed to Azure/AgentBaker directly? lmk if you need write access
lgtm after that depending on what you want to keep in the unit
> lmk if you need write access
pretty sure I will. Getting a 403 trying to push to Azure/AgentBaker.
@tyler-lloyd you're in agentbakerwrite:

maybe need to SSO approve your ssh key/PAT?
completed in #2393