Repeated attempts to reconcile mesh network
I have an issue where, when I connect an outside peer (e.g. my laptop) to the cluster, Kilo sees that the configurations aren't the same and recreates the mesh to reconcile the differences. However, the config never ends up as expected, so Kilo constantly attempts to reconcile, killing the network every ~30 seconds.
I'm going to keep debugging, but I created this issue just in case you know what's up before I spend time here.
I added some prints to see what was going on:
level.Info(logger).Log("reason", "peer endpoints", "c", c, "b", b)
It turns out my laptop peer, 10.5.0.1, has a configured endpoint in the old conf, b, but a null endpoint in the new conf, c, and that's what's causing Kilo to reconcile the differences.
I think it is because, since #146, your laptop's endpoint is discovered, and now Kilo wants to reapply the spec of your laptop's peer, which has a nil endpoint, because the actual endpoint has been added and spec and reality have diverged. Let me check why I haven't noticed this with my laptop. Maybe this is wrong.
What is the Peer spec of your laptop? Did you set persistent-keep-alive to 0? Because the endpoint is not updated if it is 0: https://github.com/squat/kilo/blob/05e8ded744207571389e208353209016c449ba79/pkg/mesh/topology.go#L275
Brilliant, that's exactly what's happening. I've added a persistentKeepalive and the network stays stable.
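For anyone landing here later, the loop described above can be modelled roughly as follows. This is only a toy sketch of the behaviour as explained in this thread, not Kilo's actual code: the `peer` type, `desired`, `differs`, and the example address are all hypothetical names made up for illustration.

```go
package main

import "fmt"

// peer is a hypothetical, stripped-down stand-in for a WireGuard peer entry.
// It is not Kilo's real type.
type peer struct {
	name      string
	endpoint  string // "" stands in for a nil endpoint
	keepalive int    // persistent keepalive in seconds; 0 disables it
}

// desired builds the peer entry we would want to apply, given the Peer spec
// and the endpoint observed on the running interface. Following the
// explanation in this thread, the discovered endpoint is only carried over
// when the keepalive is non-zero (assumption based on the comment above).
func desired(spec peer, observedEndpoint string) peer {
	d := spec
	if spec.endpoint == "" && spec.keepalive > 0 {
		d.endpoint = observedEndpoint
	}
	return d
}

// differs reports whether the running and desired entries disagree on the
// endpoint, which is the condition that would trigger a reconcile.
func differs(running, want peer) bool {
	return running.endpoint != want.endpoint
}

func main() {
	const observed = "203.0.113.7:51820" // example address from a documentation range
	// The running config has learned the roaming peer's endpoint.
	running := peer{name: "laptop", endpoint: observed}

	for _, ka := range []int{0, 25} {
		spec := peer{name: "laptop", keepalive: ka}
		want := desired(spec, observed)
		fmt.Printf("keepalive=%d reconcile=%v\n", ka, differs(running, want))
	}
}
```

With a keepalive of 0 the desired entry keeps a nil endpoint while the running one has a discovered endpoint, so the two never converge; with a non-zero keepalive the discovered endpoint is carried over and the comparison settles.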
Defining a peer with a persistent keepalive of 0:
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: laptop
spec:
  allowedIPs:
  - 10.5.0.1/32
  publicKey: SzhsHapvJy61urzHTAvx3Iu7ANlO+PGbsPy/mKY8U10=
  persistentKeepalive: 0
Kilo still attempts to reconcile the mesh network; see line 3 below, ~30 seconds after apply:
{"caller":"mesh.go:344","component":"kilo","event":"add","level":"info","peer":{"PublicKey":[75,56,108,29,170,111,39,46,181,186,188,199,76,11,241,220,139,187,0,217,78,248,241,155,176,252,191,152,166,60,83,93],"Remove":false,"UpdateOnly":false,"PresharedKey":null,"PersistentKeepaliveInterval":0,"ReplaceAllowedIPs":false,"AllowedIPs":[{"IP":"10.5.0.1","Mask":"/////w=="}],"Endpoint":null,"Name":"laptop"},"ts":"2022-05-25T00:50:29.118108442Z"}
{"caller":"mesh.go:544","component":"kilo","diff":"number of peers: old=1, new=2","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-25T00:50:29.16908714Z"}
{"caller":"mesh.go:544","component":"kilo","diff":"peer endpoints: nil value","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-25T00:50:59.040795773Z"}
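In case it helps with debugging, here is a small sketch for measuring how far apart the "WireGuard configurations are different" messages are when tailing the logs. It assumes the JSON log shape shown above (`msg`, `diff`, `ts` fields); `logLine`, `reconcileGaps`, and `sampleLog` are names invented for this example, and the sample lines are the two diff messages from the output above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// logLine holds only the fields of a JSON log line that this sketch reads;
// all other fields are ignored by encoding/json.
type logLine struct {
	Msg  string    `json:"msg"`
	Diff string    `json:"diff"`
	TS   time.Time `json:"ts"`
}

const diffMsg = "WireGuard configurations are different"

// sampleLog contains the two diff messages from the log excerpt above.
var sampleLog = []string{
	`{"caller":"mesh.go:544","component":"kilo","diff":"number of peers: old=1, new=2","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-25T00:50:29.16908714Z"}`,
	`{"caller":"mesh.go:544","component":"kilo","diff":"peer endpoints: nil value","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-25T00:50:59.040795773Z"}`,
}

// reconcileGaps returns the time elapsed between consecutive diff messages.
func reconcileGaps(lines []string) ([]time.Duration, error) {
	var gaps []time.Duration
	var prev time.Time
	for _, l := range lines {
		var ll logLine
		if err := json.Unmarshal([]byte(l), &ll); err != nil {
			return nil, err
		}
		if ll.Msg != diffMsg {
			continue // not a reconcile trigger
		}
		if !prev.IsZero() {
			gaps = append(gaps, ll.TS.Sub(prev))
		}
		prev = ll.TS
	}
	return gaps, nil
}

func main() {
	gaps, err := reconcileGaps(sampleLog)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, g := range gaps {
		fmt.Println("time since previous diff:", g.Round(time.Second))
	}
}
```

On the two lines above this reports a gap of roughly 30 seconds, which matches the interval at which the network drops.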
Is the intention of this code-path to prevent mesh reconciliation if pka == nil || pka == 0? Or am I misunderstanding?
https://github.com/squat/kilo/blob/4be792ea543a9c2656574ec060b335c587244a3d/pkg/mesh/topology.go#L291
FWIW, I'm not bothered about keeping otherwise silent connections alive through NAT
Some mysterious behaviour I don't quite understand; I have a peer configuration called phone that is intended for my well, uh, phone, which didn't cause mesh reconciliation—I'm tailing kilo's logs. My phone is connected to the same WiFi network, there's no cellular involved here.
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: laptop
spec:
  allowedIPs:
  - 10.5.0.1/32
  publicKey: SzhsHapvJy61urzHTAvx3Iu7ANlO+PGbsPy/mKY8U10=
  persistentKeepalive: 0
---
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: phone
spec:
  allowedIPs:
  - 10.5.0.2/32
  publicKey: urgVgSoHEwG5/7q0k5NpjWSBpAyxPfhvdT/v0zd561o=
  persistentKeepalive: 0
Taking a stab in the dark that something is up with the laptop peer, I created a third peer, dummy, and connected from my laptop. No good; there's mesh reconciliation there too.
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: dummy
spec:
  allowedIPs:
  - 10.5.0.3/32
  publicKey: AzckRiPfM30PNbyX/kxCv59YlIfaoj/hVU7LPkxuuAw=
  persistentKeepalive: 0
Okay, so now thinking something is up with the clients, I migrate the laptop peer config to my phone and connect from there. No good; reconciliation again. I try dummy from my phone. Also reconciliation.
So now the reverse—export the phone peer and import it on my laptop. Strange—there's no reconciliation at all. For whatever reason the phone peer doesn't cause any undesired behaviour.
I moved the private key from dummy to phone, kept the rest the same; mesh reconciliation.
Reset phone back to the original keypair—no reconciliation.
🤯