AgentBaker icon indicating copy to clipboard operation
AgentBaker copied to clipboard

feat: support ExternalTrafficPolicy Cluster with IPv6 LoadBalancer services

Open tyler-lloyd opened this issue 3 years ago • 3 comments

What type of PR is this?

/kind feature

What this PR does / why we need it:

In AKS, nodes act as a router and forward traffic to the container's network namespace. Per RFC 4193, "Routers must not forward any packets with Link-Local source or destination addresses to other links", which means when the link-local source packet arrives on the node to probe for the health of the service, the node cannot forward that packet to a pod network namespace running on that node. This prevents k8s LoadBalancer services from using externalTrafficPolicy: Cluster because Kubernetes is attempting to determine the health of pods backing a service directly (versus ETP: Local, where the health probe is replied to by kube-proxy itself on the node which doesn't need to forward health probe packets).

By changing the src IP of the probe packet to the globally unique address mentioned above, nodes are able to forward the probe packet to the correct destination network namespace thus enabling the use of externalTrafficPolicy: Cluster for IPv6 services in a dual-stack AKS cluster.

This PR includes a new step in CSE, configureCNINFTables, which runs as part of the CNI setup function and will create nftables rules for re-writing the source IP of IPv6 packets from the Azure Load Balancer probe's address link-local address, fe80::1234:5678:9abc, to a globally unique address 2603:1062:0:1:fe80:1234:5678:9abc.

The new rules try to be as non-disruptive as possible and are setup as stateless NAT rules, meaning no connection tracking occurs.

These new rules will only be written for clusters that have IPv6 enabled (currently only dual-stack).

Which issue(s) this PR fixes:

Fixes #

Requirements:

Special notes for your reviewer:

  • Adds nftables to image dependencies. Happy to discuss if we should have these rules just be written with iptables but there might be advantages to trying to move to nftables in the future as it offers better performance and priority.
  • Updates the KubernetesConfig model to include recent updates for IPv6 clusters (ipFamilies, clusterSubnets, and serviceCIDRs)

tyler-lloyd avatar Oct 18 '22 14:10 tyler-lloyd

I'm okay with these changes conceptually but have a few clarifying questions (especially since I am not familiar with the RFC)

By changing the src IP of the probe packet to the globally unique address

Just reiterating...we can't forward the packet because source IP is link local, but we can SNAT it then forward it?

a globally unique address 2603:1062:0:1:fe80:1234:5678:9abc

what is this? looks like it's publicly routable, this is the ipv6 equivalent of 168.63.129.16 for Azure LB?

How is this different from ipv4? Seems like with ipv4 source IP is already 168.63.129.16, is there something semantically wrong with using a non-link local address with ipv6 from the LB side as well (really a question more for SLB team I guess, but unfortunately picking on you)

and yeah, why nftables 🙂 not really opposed, but would love a thorough sanity check that this doesn't mess with e.g. kube-proxy

alexeldeib avatar Oct 18 '22 21:10 alexeldeib

Just reiterating...we can't forward the packet because source IP is link local, but we can SNAT it then forward it?

Yes, if a packet matching fe80::/10 gets to ROUTING with that source address, it gets dropped as "beyond scope"

a globally unique address 2603:1062:0:1:fe80🔢5678:9abc

An IP address assigned permanently by Dawn Bedard from Microsoft owned space specifically for AKS probe SNAT. It'll never be used for anything else, so we don't have to worry about these rules breaking something else down the line.

How is this different from ipv4? Seems like with ipv4 source IP is already 168.63.129.16, is there something semantically wrong with using a non-link local address with ipv6 from the LB side as well (really a question more for SLB team I guess, but unfortunately picking on you)

I talked to SLB extensively about this - they agree with the problem of link-local addresses and that it shouldn't have been done this way, but changing it carries a huge amount of risk and technical debt and would take more than a semester. So, workaround.

and yeah, why nftables 🙂 not really opposed, but would love a thorough sanity check that this doesn't mess with e.g. kube-proxy

You literally can't make this change in iptables - iptables only allows the SNAT action in the NAT table POSTROUTING action and won't let you mangle the source IP before ROUTING, and the packet's already been dropped by then. nftables is more flexible and lets you mangle src early (at priority -300) before the system really sees it.

phealy avatar Oct 18 '22 23:10 phealy

Isn't this going to fail on nodes running new CSE but old node image because they won't have nftables installed? Do we need a command -v nftables >/dev/null || apt-get install -y nftables for Ubuntu

alternatively, we can bake this into the vhd as a systemd unit/oneshot or similar. downside is it keeps it pinned to the node image. benefit is less CSE which we do want. depends on UX

Mariner

let's see if we can get this working on mariner, cc @hbeberman

alexeldeib avatar Oct 18 '22 23:10 alexeldeib

alternatively, we can bake this into the vhd as a systemd unit/oneshot or similar.

latest updates move it to a systemd unit with a 10s delay between restarts if it fails. The script will check if the instance has IPv6 configured with a call to IMDS, looking at network.interface for non-empty IPv6 private IP addresses. The script will also install nftables if it isn't detected.

tyler-lloyd avatar Oct 24 '22 17:10 tyler-lloyd

couple remaining minor comments, LGTM after that 👍

alexeldeib avatar Oct 24 '22 17:10 alexeldeib

I think PR build is bust because we had to force PRs from this repo not forks (one of our PR checks has secrets, TODO to refactor that), would you mind reopening this PR from a branch pushed to Azure/AgentBaker directly? lmk if you need write access

lgtm after that depending on what you want to keep in the unit

alexeldeib avatar Oct 25 '22 20:10 alexeldeib

lmk if you need write access

pretty sure I will. Getting a 403 trying to push to Azure/AgentBaker.

tyler-lloyd avatar Oct 26 '22 12:10 tyler-lloyd

@tyler-lloyd you're in agentbakerwrite: image

maybe need to SSO approve your ssh key/PAT?

alexeldeib avatar Oct 27 '22 23:10 alexeldeib

completed in #2393

tyler-lloyd avatar Nov 03 '22 14:11 tyler-lloyd