Mainnet: lost internal networking (wg) on multi deployment.
We have the following VMs running on mainnet Farm 77. They are all inside the internal subnet 10.42.0.0/16: https://gist.github.com/coesensbert/669f04271f898b1a4a59ea818cd32dd3
A few days after deployment, the tfchainmainpub040X nodes stopped connecting to each other over the internal subnet. This connectivity is needed because one of the nodes is a Caddy load balancer that exposes the other public tfchain nodes over the internal network.
This network is also deployed with a WireGuard config, so our monitoring server connects via this VPN to monitor all the VMs in the internal subnet. All of that communication still works: the monitoring server can reach all VMs, but the VMs can't reach each other.
We did some checks on ZOS nodes 311 and 312, where it seems that WireGuard stopped handshaking for some reason. 311:
~ # ip net
public
n-MxMyNXzpwqJdb (id: 0)
ndmz
~ # ip net exec n-MxMyNXzpwqJdb wg show
interface: w-MxMyNXzpwqJdb
public key: hCDeyU3UaR6apCPZeJFLUL5DBDQNM4nAbqsbhQ5e/XQ=
private key: (hidden)
listening port: 5827
peer: G52pfjX0gPU8+EA7Oi+FqmHeql3ItCA2OW0ySXyHKiI=
endpoint: 146.185.93.124:2759
allowed ips: 10.42.8.0/24, 100.64.42.8/32
latest handshake: 12 seconds ago
transfer: 2.39 MiB received, 1.52 MiB sent
persistent keepalive: every 20 seconds
peer: slJsHiV3fbRu+pq9hHXJEkclfI3dkk2PbgnsTjBHek4=
endpoint: 146.185.93.123:2713
allowed ips: 10.42.9.0/24, 100.64.42.9/32
latest handshake: 49 seconds ago
transfer: 760.66 KiB received, 1.09 MiB sent
persistent keepalive: every 20 seconds
peer: vt29BpiJ47DTLL75w5ZpxBNoK5K9XUkGaQdcOXBpum8=
endpoint: 146.185.93.125:7366
allowed ips: 10.42.10.0/24, 100.64.42.10/32, 10.42.2.0/24, 100.64.42.2/32
latest handshake: 1 minute, 21 seconds ago
transfer: 29.91 MiB received, 377.69 MiB sent
persistent keepalive: every 20 seconds
peer: cxJndByOdEvW6wtweLblPbbOb2Zm+2ETPU2dNkv9d2U=
endpoint: 146.185.93.114:6085
allowed ips: 10.42.4.0/24, 100.64.42.4/32
transfer: 0 B received, 11.50 MiB sent
persistent keepalive: every 20 seconds
peer: yk1mmgm0rY/f07nqUohZWIJyGUb5qZTSRYj7ZKyLmmk=
endpoint: 146.185.93.112:6016
allowed ips: 10.42.5.0/24, 100.64.42.5/32
transfer: 0 B received, 11.44 MiB sent
persistent keepalive: every 20 seconds
peer: 1r5LpRlRKrDME/HaMt4UsKZJkx7TKJctlu1g5shDjws=
endpoint: 146.185.93.119:5267
allowed ips: 10.42.6.0/24, 100.64.42.6/32
transfer: 0 B received, 11.45 MiB sent
persistent keepalive: every 20 seconds
peer: FlpS50NOa2UtI3U9LKBrCrx/8kDA2hhVXyZHTA99izQ=
endpoint: 146.185.93.120:7885
allowed ips: 10.42.7.0/24, 100.64.42.7/32
transfer: 0 B received, 11.45 MiB sent
persistent keepalive: every 20 seconds
~ # ip net exec n-MxMyNXzpwqJdb ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: n-MxMyNXzpwqJdb@if27: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 0e:7b:52:4c:90:d7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.42.3.1/24 brd 10.42.3.255 scope global n-MxMyNXzpwqJdb
valid_lft forever preferred_lft forever
inet6 fd4d:784d:794e:3::1/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::c7b:52ff:fe4c:90d7/64 scope link
valid_lft forever preferred_lft forever
inet6 fe80::1/64 scope link
valid_lft forever preferred_lft forever
3: public@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 82:05:d4:d1:ea:1a brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 100.127.0.6/16 brd 100.127.255.255 scope global public
valid_lft forever preferred_lft forever
inet6 fd00::6/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::8005:d4ff:fed1:ea1a/64 scope link
valid_lft forever preferred_lft forever
28: w-MxMyNXzpwqJdb: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
link/none
inet 100.64.42.3/16 brd 100.64.255.255 scope global w-MxMyNXzpwqJdb
valid_lft forever preferred_lft forever
312:
~ # ip net
n-WeFyBhsM1KxVT (id: 2)
public
n-MxMyNXzpwqJdb (id: 1)
n-JL8VRacJ2aum9 (id: 0)
ndmz
~ # ip net exec n-MxMyNXzpwqJdb wg show
interface: w-MxMyNXzpwqJdb
public key: cxJndByOdEvW6wtweLblPbbOb2Zm+2ETPU2dNkv9d2U=
private key: (hidden)
listening port: 6085
peer: G52pfjX0gPU8+EA7Oi+FqmHeql3ItCA2OW0ySXyHKiI=
endpoint: 146.185.93.124:2759
allowed ips: 10.42.8.0/24, 100.64.42.8/32
latest handshake: 44 seconds ago
transfer: 760.60 KiB received, 1.09 MiB sent
persistent keepalive: every 20 seconds
peer: vt29BpiJ47DTLL75w5ZpxBNoK5K9XUkGaQdcOXBpum8=
endpoint: 146.185.93.125:7366
allowed ips: 10.42.10.0/24, 100.64.42.10/32, 10.42.2.0/24, 100.64.42.2/32
latest handshake: 55 seconds ago
transfer: 29.43 MiB received, 378.47 MiB sent
persistent keepalive: every 20 seconds
peer: slJsHiV3fbRu+pq9hHXJEkclfI3dkk2PbgnsTjBHek4=
endpoint: 146.185.93.123:2713
allowed ips: 10.42.9.0/24, 100.64.42.9/32
latest handshake: 1 minute, 32 seconds ago
transfer: 5.72 GiB received, 10.78 GiB sent
persistent keepalive: every 20 seconds
peer: hCDeyU3UaR6apCPZeJFLUL5DBDQNM4nAbqsbhQ5e/XQ=
endpoint: 146.185.93.113:5827
allowed ips: 10.42.3.0/24, 100.64.42.3/32
transfer: 0 B received, 11.45 MiB sent
persistent keepalive: every 20 seconds
peer: yk1mmgm0rY/f07nqUohZWIJyGUb5qZTSRYj7ZKyLmmk=
endpoint: 146.185.93.112:6016
allowed ips: 10.42.5.0/24, 100.64.42.5/32
transfer: 0 B received, 11.44 MiB sent
persistent keepalive: every 20 seconds
peer: 1r5LpRlRKrDME/HaMt4UsKZJkx7TKJctlu1g5shDjws=
endpoint: 146.185.93.119:5267
allowed ips: 10.42.6.0/24, 100.64.42.6/32
transfer: 0 B received, 11.45 MiB sent
persistent keepalive: every 20 seconds
peer: FlpS50NOa2UtI3U9LKBrCrx/8kDA2hhVXyZHTA99izQ=
endpoint: 146.185.93.120:7885
allowed ips: 10.42.7.0/24, 100.64.42.7/32
transfer: 0 B received, 11.45 MiB sent
persistent keepalive: every 20 seconds
~ # ip net exec n-MxMyNXzpwqJdb ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: n-MxMyNXzpwqJdb@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 36:5b:cc:49:23:23 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.42.4.1/24 brd 10.42.4.255 scope global n-MxMyNXzpwqJdb
valid_lft forever preferred_lft forever
inet6 fd4d:784d:794e:4::1/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::345b:ccff:fe49:2323/64 scope link
valid_lft forever preferred_lft forever
inet6 fe80::1/64 scope link
valid_lft forever preferred_lft forever
3: public@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 96:67:f2:9d:a2:15 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 100.127.0.3/16 brd 100.127.255.255 scope global public
valid_lft forever preferred_lft forever
inet6 fd00::3/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::9467:f2ff:fe9d:a215/64 scope link
valid_lft forever preferred_lft forever
20: w-MxMyNXzpwqJdb: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
link/none
inet 100.64.42.4/16 brd 100.64.255.255 scope global w-MxMyNXzpwqJdb
valid_lft forever preferred_lft forever
- tfchainmainpub0401 -> ssh [email protected] -p 34022 -A -> ssh [email protected]
- tfchainmainpub0402 -> ssh [email protected] -p 34022 -A -> ssh [email protected]
Here is what I found:
- I chose node 312 to check why 311 can't establish a WireGuard handshake to that node.
- First I was trying to find the listening ports inside the `public` namespace. Since node 312 has a public config, all wireguard endpoints must then point to this node's `public-ip:<wg-port>`.
- The first thing I noticed was that the endpoint port `6085` was not listening inside the `public` namespace.
- It crossed my mind that there might be something wrong with the network reservation contract, so I downloaded the workloads state from the node and did some coding to read out the state and history of the workloads. According to this, the workload for this network should be in `OK` state, so from the node's perspective the deployment is fine.
- The second question that came to my mind was: if the wg interface is there and the port is not listening inside the `public` namespace, then where is that port listening?
- A quick `ss -nlup` on the host namespace shows this:
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 0 0 0.0.0.0%zos:68 0.0.0.0:* users:(("udhcpc",pid=1247,fd=5))
UNCONN 0 0 0.0.0.0:4037 0.0.0.0:*
UNCONN 0 0 0.0.0.0:6085 0.0.0.0:*
UNCONN 0 0 [::]:4037 [::]:*
UNCONN 0 0 [::]:6085 [::]:*
So it seems that the port was listening on the host (and not inside the namespace). This is why node 311 can't establish a WireGuard connection to this node: 311 was trying to connect to the node's public IP, while the port was only reachable via the node's zos IP.
The question now is: how did that happen? Why is this port suddenly listening on the host instead of inside the public namespace?
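For reference, the same check can be scripted instead of running `ip net exec ... wg show` and `ss -nlup` by hand. Below is only a hedged diagnostic sketch (not part of zos): it enters a named network namespace and lists the wireguard devices and their listen ports there, so it is immediately visible in which namespace a port is actually bound.

```go
// Hypothetical diagnostic helper, not zos code: show wireguard devices and
// their listen ports inside a given network namespace (e.g. "public" or n-...).
package main

import (
	"fmt"
	"log"
	"os"
	"runtime"

	"github.com/vishvananda/netns"
	"golang.zx2c4.com/wireguard/wgctrl"
)

func main() {
	nsName := os.Args[1] // e.g. "public" or "n-MxMyNXzpwqJdb"

	// Namespace switching applies per OS thread, so pin this goroutine first.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	ns, err := netns.GetFromName(nsName)
	if err != nil {
		log.Fatalf("open namespace %s: %v", nsName, err)
	}
	defer ns.Close()

	if err := netns.Set(ns); err != nil {
		log.Fatalf("enter namespace %s: %v", nsName, err)
	}

	wg, err := wgctrl.New()
	if err != nil {
		log.Fatalf("wgctrl: %v", err)
	}
	defer wg.Close()

	devices, err := wg.Devices()
	if err != nil {
		log.Fatalf("list devices: %v", err)
	}
	for _, d := range devices {
		fmt.Printf("%s: listening on port %d, %d peer(s)\n", d.Name, d.ListenPort, len(d.Peers))
	}
}
```

Running it once against the network namespace and once against `public` (and comparing with `ss -nlup` on the host) should show a leaked port right away.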
Hypotheses
- The public namespace (public config) was created after this network deployment was created. I think we can rule this out since node 311 is already using the public IP, and that was working before (according to the issue).
- The public namespace was deleted and then recreated? Probably not possible; the namespace will not get recreated without a lot of issues.
- I tried this out and the port didn't suddenly appear in the host namespace; it wasn't possible to recreate the namespace without getting issues in networkd.
- dmesg shows no obvious errors.
I will keep looking into it and see if I can find something more.
@LeeSmet and @delandtj what do you think?
Due to some intensive intervention while trying to find the cause, I had to reboot the node!
After the reboot the node is back to normal and the wg port is listening inside the public namespace. The only remaining problem was that br-pub was not wired correctly, but there is another new PR to fix that.
Yeah this is consistent with some things I saw as well.
Public namespace (public config) was created after this network deployment was create. I think we can rule this out since node 311 is already using the public ip, and that was working before (according to the issue)
Public config was created a long time ago, (before node start), so this is for sure not the case.
Possible race condition in setting up the pub namespace?
No, the public namespace is created once the public config is set on the grid. I assume it was set up way before the deployment was created.
Okay, stating the facts we have so far:
- The node public config (namespace) was created on the 16th of June, according to logs and timestamps on config files on the node itself.
- The contracts for this deployment all seem to have been created on the 15th of June, so those contracts clearly happened before the node had its public config applied to the node state.
- The only way the contracts could actually use the node public config is if the `chain` already had the node public config set, while for some reason this config was not actually applied on the node.
Now the million dollar question:
- Has this setup worked at some point, or was it broken from the start?
Note the following:
- Mainnet was updated on the 16th; this forces node services to restart, hence networkd could find out that it needed to apply the public config.
- Unfortunately I can't track on the chain when the config was actually set, hence we can never be sure about this.
Possible causes:
The node for some reason missed the grid events about setting the node public config. Networkd was not restarted, hence it didn't check whether the config had changed until the next mainnet update (on the 16th). This can only be confirmed based on the answer to the million dollar question above: if it never worked, then this is possibly the issue.
Possible fixes:
It seems networkd ALSO needs to check the grid regularly (say every 30 minutes) even if it didn't receive new events, just in case events were missed.
A better solution is to build a mechanism that never misses an event, but so far I can't find a good way to do that without fetching every single block from the chain.
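As a rough illustration of the first idea, here is a minimal sketch of combining the existing event subscription with a periodic resync from the grid. The types and names (`ConfigSource`, `WatchPublicConfig`, etc.) are hypothetical stand-ins, not the actual networkd code:

```go
package sketch

import (
	"context"
	"log"
	"time"
)

// PublicConfig and ConfigSource are illustrative stand-ins for the real
// networkd types; only the "events plus periodic resync" pattern is the point.
type PublicConfig struct {
	IPv4, IPv6, GW4 string
}

type ConfigSource interface {
	Events(ctx context.Context) <-chan PublicConfig  // pushed on chain events
	Fetch(ctx context.Context) (PublicConfig, error) // explicit read from the grid
}

// WatchPublicConfig applies the config whenever an event arrives, and
// additionally polls every `resync` as a safety net in case an event was missed.
func WatchPublicConfig(ctx context.Context, src ConfigSource, apply func(PublicConfig) error, resync time.Duration) {
	events := src.Events(ctx)
	ticker := time.NewTicker(resync)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case cfg := <-events:
			// normal path: an event reported a config change
			if err := apply(cfg); err != nil {
				log.Printf("apply public config (event): %v", err)
			}
		case <-ticker.C:
			// fallback path: poll the grid even without an event
			cfg, err := src.Fetch(ctx)
			if err != nil {
				log.Printf("resync public config: %v", err)
				continue
			}
			if err := apply(cfg); err != nil {
				log.Printf("apply public config (resync): %v", err)
			}
		}
	}
}
```

The apply step would need to be idempotent so that a resync with an unchanged config is a no-op.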
It seems the answer to the million dollar question is that the deployment was working fine, so it's definitely not a missed event. ALTHOUGH it would be a good idea not to depend only on events, and also check the grid for config changes every 30 minutes or so.
Important
We need to test what happens when an NR is updated with a new wireguard port. This will probably trigger undefined behavior. IMHO, once an NR is created, its WG port should not change during the entire life of that NR.
A possible scenario: the NR was working fine, then new peers were added, but a new port was also selected, and then wg was suddenly listening on the host. This needs to be verified; see the sketch below.
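To make the "port should not change" idea concrete, a hedged sketch of how an NR update could stay pinned to the port chosen at creation. The types and helpers here are illustrative only, not the real zos NR code:

```go
package sketch

import "fmt"

// NetResource is an illustrative stand-in for a network reservation (NR).
type NetResource struct {
	Name   string
	WGPort int
}

// currentPort would look up the listen port of the already-running wg device
// for this NR (e.g. via wgctrl); stubbed here for the sketch.
func currentPort(name string) (int, bool) {
	return 0, false
}

// Reconcile keeps the wireguard listen port stable across NR updates: if a
// device already exists, its original port wins over whatever the update asks for.
func Reconcile(updated NetResource) (NetResource, error) {
	if port, ok := currentPort(updated.Name); ok && port != updated.WGPort {
		// Assumption: changing the port mid-life is what we suspect leads to
		// the device re-binding in the wrong namespace, so refuse to follow it.
		updated.WGPort = port
	}
	if updated.WGPort == 0 {
		return NetResource{}, fmt.Errorf("no wireguard port assigned for %s", updated.Name)
	}
	return updated, nil
}
```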
Private networking seems to work again, thx!
Another, different multi-node deployment via Terraform has a broken internal network setup. All nodes can reach each other internally, but the public gateway node (which hosts the WireGuard endpoint your client connects to) can't reach the VMs internally.
Node 422 in Farm 74
~ # ip net exec n-Uz2nombG2Dadi wg show
interface: w-Uz2nombG2Dadi
public key: XZgjeAlYmSmA05AyZFgxGiuWk9gMqy+DrbHsKmHthiQ=
private key: (hidden)
listening port: 7992
peer: 5VuEdqp1Ik+3Jm3R7iHLRJhpyL/YSj+ioIXz9BaM3AQ=
endpoint: 45.156.243.246:7169
allowed ips: 10.41.8.0/24, 100.64.41.8/32
latest handshake: 5 seconds ago
transfer: 1.14 MiB received, 7.96 MiB sent
persistent keepalive: every 20 seconds
peer: OduADrGOwdyThBKsltcpBXF6WzDq2Qj5lOxrVpsKKR0=
endpoint: 45.156.243.87:7997
allowed ips: 10.41.3.0/24, 100.64.41.3/32
transfer: 0 B received, 19.19 MiB sent
persistent keepalive: every 20 seconds
peer: Fzecedyo2G/bizWurNjv/hmoAxD8JcheKXM5m20/y14=
endpoint: 45.156.243.89:4001
allowed ips: 10.41.4.0/24, 100.64.41.4/32
transfer: 0 B received, 19.19 MiB sent
persistent keepalive: every 20 seconds
peer: nNqEUoLgtSzS7prLlz0u0yxKu9/58NUT20Wq7aF/y1k=
endpoint: 45.156.243.86:4394
allowed ips: 10.41.5.0/24, 100.64.41.5/32
transfer: 0 B received, 21.63 MiB sent
persistent keepalive: every 20 seconds
peer: /Kg/lI4RKbaVLt9d++rSj320gG9mOb/MxMTwNV/f5mk=
endpoint: 45.156.243.85:2418
allowed ips: 10.41.6.0/24, 100.64.41.6/32
transfer: 0 B received, 19.19 MiB sent
persistent keepalive: every 20 seconds
peer: z9hFo0M5U5MQwC/pG+HnCFIJ1pcH9sencHTThmsycCk=
endpoint: 45.156.243.244:4372
allowed ips: 10.41.7.0/24, 100.64.41.7/32
transfer: 0 B received, 19.19 MiB sent
persistent keepalive: every 20 seconds
peer: ObjC364c2fL64NMTfei3XWDfEKIW1dHb9yB2na6V/nk=
endpoint: 45.156.243.248:3240
allowed ips: 10.41.9.0/24, 100.64.41.9/32
transfer: 0 B received, 19.19 MiB sent
persistent keepalive: every 20 seconds
peer: nIR8HvZJMp2iXphWdsOtlJvQQtmA0ozAwP1SjKtFAVg=
allowed ips: 10.41.2.0/24, 100.64.41.2/32
persistent keepalive: every 20 seconds
Wireguard client config
[Interface]
Address = 100.64.41.2
PrivateKey = GMR3yWYptqjeRxxxxxxxxxxxxxxxxxxxxxxxxx=
[Peer]
PublicKey = XZgjeAlYmSmAxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=
AllowedIPs = 10.41.0.0/16, 100.64.0.0/16
PersistentKeepalive = 25
Endpoint = 45.156.243.242:7992
@coesensbert that was due to nodes using the wrong NIC for public traffic. This issue was solved in https://github.com/threefoldtech/zos/pull/1756 but is not yet available on mainnet.
Ok thanks @muhamadazmy !