kilo Packet loss and configurations are never the same

Hi Squat,

I've recently been debugging some network trouble I've been having and noticed two things:

Packets are being dropped every 30 seconds.

I can only reproduce this on the Kilo WireGuard device. The other WireGuard device I have and the direct IP to the other node do not exhibit any packet loss.

The WireGuard configurations are never the same and get refreshed every 30 seconds:

{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:23:59.877082497Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:24:29.961906586Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:25:00.033606048Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:25:30.07986167Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:26:00.156938971Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:26:30.206563327Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:27:00.274197532Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:27:30.342722177Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:28:00.407631853Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:28:30.480637099Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:29:00.553711949Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:29:30.626646786Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:30:00.726852114Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:30:30.809482011Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:31:00.901018791Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:31:31.010962765Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:32:01.070272956Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:32:31.143815249Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:33:01.208472314Z"}
{"caller":"mesh.go:638","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-04T11:33:31.302365761Z"}

The packet drops seem to line up with whenever the configuration is refreshed in Kilo. Is this a known WireGuard issue? Is it because of the way you change the WireGuard device or iptables?

Nov 04 '20 11:11 SerialVelocity

This is not a normal behavior and is a direct result of the fact that the device is continuously being reconfigured. The configuration should not be changing every 30 seconds; if it does, then there is a bug. In the past, this has occurred as a side effect of setting an endpoint with a DNS name and this name being resolved into an IP by WireGuard, causing the configs to differ.

Can you please share:

The version/hash of Kilo;
Whether or not you are configuring any nodes with an endpoint label containing a DNS name; and
Whether or not you are configuring any Peer resources with a DNS name.

Nov 04 '20 13:11 squat

is a direct result of the fact that the device is continuously being reconfigured

Is there a way to reconfigure the device without dropping any packets?

The version/hash of Kilo;

squat/kilo:latest (I was hoping to switch to a tagged release though)

Whether or not you are configuring any nodes with an endpoint label containing a DNS name; and

All endpoints are IPs

Whether or not you are configuring any Peer resources with a DNS name

My one peer is using an IP

Nov 04 '20 13:11 SerialVelocity

Ok, so I figured out what was causing the configuration difference: two of my nodes had the same kilo.squat.ai/force-internal-ip value. I set them to unused private IPs since the kilo logic picks up the wrong private IP address which causes everything to fail (I have another wireguard device setup on each host). #15 is the feature request related to that.

So, I think, if possible, it would be nice to have:

Kilo to throw an error on bad configurations (like duplicate private IPs)
A way to enable debug logging (or just to have it as part of the info logs) to show the differences between the expected state and the current state.
For the devices to be reconfigured without dropping traffic (is this the wireguard reconfiguration or the iptables reconfiguration?)
~A flag to disable private IPs~ (#15 covers this feature request)

Nov 07 '20 13:11 SerialVelocity

@SerialVelocity great detective work! I really like all of those proposals. For the first one we can definitely log a warning to say "hey! Something looks fishy: i found two nodes claiming to have the same IP!".

The second one would be great!

The third is tricky as this is due to reapplying the WireGuard config, not an IP tables change. It's possible that if we only applied incremental changes vs reapplying the entire config, that the other connections would not be broken, but this needs some research.

Nov 07 '20 14:11 squat

Actually there is more than just (pointless) unused private IPs that causes this. I also have to force kilo to assign unused private IPs but even when they are unique I am having this problem (or at least a very similar one, that in many cases is dropping 16 packets every 29-30 seconds). However, I don't appear to be getting (serious) issues between cluster nodes, only for nodes (my laptop) outside the cluster.

I'm getting

{"caller":"mesh.go:513","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-11-26T01:57:37.48370229Z"}
...

every 30 seconds. Is there any way to get better logs than this to understand what it thinks is not the same?

I am also using only IPs everywhere and am using kilo:latest

Nov 26 '20 01:11 AntonOfTheWoods

Ok, so one of us needs to go-up and lend @squat a hand to implement some extra sanity checking here. I like the idea so much that I have stuck with kilo but I think many wouldn't...

So it turns out that, contrary to normal wireguard setups, the persistentKeepalive: ??? is NOT OPTIONAL when you set up a peer as described at the top of https://kilo.squat.ai/docs/vpn. It doesn't appear there is a minimum - setting it to 200 fixes the issue, so it doesn't need to be less than 30. However, setting it to zero (persistentKeepalive: 0) has the same effect as not putting it. So if you don't put it, for some reason the peer often (though not always...) disappears every 30 seconds or so from the wireguard config on the server, and then magically reappears after 16 seconds. IPtables and routes stay the same, but the wg config on the server disappears without a non-positive persistentKeepalive.

Nov 26 '20 04:11 AntonOfTheWoods

Hi @AntonOfTheWoods, you're definitely on the right track, however it's a little bit more complicated than this. The source of this issue stems from the fact that in Kilo we have two sources of truth for defining WireGuard peers:

Kilo's Peer CRD; and
WireGuard's roaming configuration, which automatically updates the endpoint of a peer as it roams.

Consider the following situation: You define a Peer custom resource for a peer behind NAT. This peer's endpoint is expected to change if persistentKeepalive is disabled or if it is longer than the NAT table entry remains valid. If the endpoint changes, next time Kilo reconciles the WireGuard configuration with the custom resources in the API, it will detect a diff and overwrite the configuration for the peer, interrupting any existing connections.

The naive solution implemented in Kilo today is to put peers into two buckets: those we suspect are behind NAT, and those we do not. We use the following heuristic to detect if a peer is a NAT peer or not: the existence of a non-zero persistentKeepalive value. When a peer is determined to be a NAT peer, then changes to its endpoint are NOT factored into the diff and they are allowed to roam by treating the WireGuard configuration's endpoint value as the source of truth for that peer, see: https://github.com/squat/kilo/blob/master/pkg/mesh/mesh.go#L750-L767.

Instead of using the existence of a persistentKeepalive field, we could instead use the heuristic of the presence of an endpoint field. If a peer has an endpoint defined in the custom resource then it will ALWAYS be considered during diffing, otherwise, it will be ignored and the WireGuard's value will be allowed to override.

WDYT?

Nov 26 '20 09:11 squat

Thanks for the quick response @squat !

If the endpoint changes, next time Kilo reconciles the WireGuard configuration with the custom resources in the API, it will detect a diff and overwrite the configuration for the peer, interrupting any existing connections.

Are you sure it's only if the endpoint changes? I was sitting in my room doing a ping - would the endpoint have changed without changing the IP? So my upstream router is changing the port after less than 30 seconds even though it's getting a ping packet every second?

The naive solution implemented in Kilo today is to put peers into two buckets: those we suspect are behind NAT, and those we do not. We use the following heuristic to detect if a peer is a NAT peer or not: the existence of a non-zero persistentKeepalive value. When a peer is determined to be a NAT peer, then changes to its endpoint are NOT factored into the diff and they are allowed to roam by treating the WireGuard configuration's endpoint value as the source of truth for that peer, see: https://github.com/squat/kilo/blob/master/pkg/mesh/mesh.go#L750-L767.

Instead of using the existence of a persistentKeepalive field, we could instead use the heuristic of the presence of an endpoint field. If a peer has an endpoint defined in the custom resource then it will ALWAYS be considered during diffing, otherwise, it will be ignored and the WireGuard's value will be allowed to override.

WDYT?

Though I do think the presence/absence of an endpoint field makes more sense, I think even keeping it as it is today is Ok, as long as you have a warning in the docs. While doing assuming on the behalf of the user is obviously fine, documenting those assumptions where it makes a difference can save a metric *&%^-ton of time for users like me. I am very new to wireguard, and had just migrated from using a wg-quick up on the server where I didn't need to put any sort of persistent keepalive for it to work (at least in my reasonably brief testing).

Basically, I don't know enough about wireguard (or VPNs) to have an (informed) opinion on whether not having a persistent keepalive and being behind NAT is a really stupid configuration to have. If it is really stupid, then a warning (even at the bottom of the page) and keep it as is. If it isn't really stupid, and just what most people have because most of the examples on stackoverflow currently have that, then move to the presence of an endpoint in the peer conf.

Does that sound reasonable?

EDIT 28/11/2020 Actually, come to think of it, one of the reasons I removed the persistent keepalive is because I am not behind NAT, and I definitely read somewhere that if you don't have NAT (or a temperamental firewall) then it has no benefit. Is that correct from a wg standpoint? If that's the case, then it doesn't seem logical at all to assume that all users will necessarily be in that situation, or consider that they will necessarily have a persistent keepalive, even though it provides them no benefit (and will generate pointless network traffic). So, if these are all accurate, I actually put a strong vote in for the endpoint option.

Nov 27 '20 00:11 AntonOfTheWoods