
Mainnet: lost internal networking (wg) on multi deployment.

Open coesensbert opened this issue 3 years ago • 12 comments

We have the following VMs running on mainnet Farm 77. They are all inside the internal subnet 10.42.0.0/16: https://gist.github.com/coesensbert/669f04271f898b1a4a59ea818cd32dd3

A few days after deploying, the tfchainmainpub040X nodes stopped connecting over the internal subnet. This matters because one of the nodes is a Caddy load balancer that exposes the other public tfchain nodes over the internal network.

This network is also deployed with a WireGuard config, so our monitoring server connects via this VPN to monitor all the VMs in the internal subnet. That communication still works: the monitoring server can reach all VMs, but the VMs can't reach each other.

We did some checks on ZOS nodes 311 and 312, where it seems WireGuard stopped sending handshakes for some reason. 311:

~ # ip net
public
n-MxMyNXzpwqJdb (id: 0)
ndmz
~ # ip net exec n-MxMyNXzpwqJdb wg show
interface: w-MxMyNXzpwqJdb
  public key: hCDeyU3UaR6apCPZeJFLUL5DBDQNM4nAbqsbhQ5e/XQ=
  private key: (hidden)
  listening port: 5827

peer: G52pfjX0gPU8+EA7Oi+FqmHeql3ItCA2OW0ySXyHKiI=
  endpoint: 146.185.93.124:2759
  allowed ips: 10.42.8.0/24, 100.64.42.8/32
  latest handshake: 12 seconds ago
  transfer: 2.39 MiB received, 1.52 MiB sent
  persistent keepalive: every 20 seconds

peer: slJsHiV3fbRu+pq9hHXJEkclfI3dkk2PbgnsTjBHek4=
  endpoint: 146.185.93.123:2713
  allowed ips: 10.42.9.0/24, 100.64.42.9/32
  latest handshake: 49 seconds ago
  transfer: 760.66 KiB received, 1.09 MiB sent
  persistent keepalive: every 20 seconds

peer: vt29BpiJ47DTLL75w5ZpxBNoK5K9XUkGaQdcOXBpum8=
  endpoint: 146.185.93.125:7366
  allowed ips: 10.42.10.0/24, 100.64.42.10/32, 10.42.2.0/24, 100.64.42.2/32
  latest handshake: 1 minute, 21 seconds ago
  transfer: 29.91 MiB received, 377.69 MiB sent
  persistent keepalive: every 20 seconds

peer: cxJndByOdEvW6wtweLblPbbOb2Zm+2ETPU2dNkv9d2U=
  endpoint: 146.185.93.114:6085
  allowed ips: 10.42.4.0/24, 100.64.42.4/32
  transfer: 0 B received, 11.50 MiB sent
  persistent keepalive: every 20 seconds

peer: yk1mmgm0rY/f07nqUohZWIJyGUb5qZTSRYj7ZKyLmmk=
  endpoint: 146.185.93.112:6016
  allowed ips: 10.42.5.0/24, 100.64.42.5/32
  transfer: 0 B received, 11.44 MiB sent
  persistent keepalive: every 20 seconds

peer: 1r5LpRlRKrDME/HaMt4UsKZJkx7TKJctlu1g5shDjws=
  endpoint: 146.185.93.119:5267
  allowed ips: 10.42.6.0/24, 100.64.42.6/32
  transfer: 0 B received, 11.45 MiB sent
  persistent keepalive: every 20 seconds

peer: FlpS50NOa2UtI3U9LKBrCrx/8kDA2hhVXyZHTA99izQ=
  endpoint: 146.185.93.120:7885
  allowed ips: 10.42.7.0/24, 100.64.42.7/32
  transfer: 0 B received, 11.45 MiB sent
  persistent keepalive: every 20 seconds

~ # ip net exec n-MxMyNXzpwqJdb ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: n-MxMyNXzpwqJdb@if27: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 0e:7b:52:4c:90:d7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.42.3.1/24 brd 10.42.3.255 scope global n-MxMyNXzpwqJdb
       valid_lft forever preferred_lft forever
    inet6 fd4d:784d:794e:3::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::c7b:52ff:fe4c:90d7/64 scope link 
       valid_lft forever preferred_lft forever
    inet6 fe80::1/64 scope link 
       valid_lft forever preferred_lft forever
3: public@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 82:05:d4:d1:ea:1a brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 100.127.0.6/16 brd 100.127.255.255 scope global public
       valid_lft forever preferred_lft forever
    inet6 fd00::6/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::8005:d4ff:fed1:ea1a/64 scope link 
       valid_lft forever preferred_lft forever
28: w-MxMyNXzpwqJdb: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet 100.64.42.3/16 brd 100.64.255.255 scope global w-MxMyNXzpwqJdb
       valid_lft forever preferred_lft forever

312:

~ # ip net
n-WeFyBhsM1KxVT (id: 2)
public
n-MxMyNXzpwqJdb (id: 1)
n-JL8VRacJ2aum9 (id: 0)
ndmz
~ # ip net exec n-MxMyNXzpwqJdb wg show
interface: w-MxMyNXzpwqJdb
  public key: cxJndByOdEvW6wtweLblPbbOb2Zm+2ETPU2dNkv9d2U=
  private key: (hidden)
  listening port: 6085

peer: G52pfjX0gPU8+EA7Oi+FqmHeql3ItCA2OW0ySXyHKiI=
  endpoint: 146.185.93.124:2759
  allowed ips: 10.42.8.0/24, 100.64.42.8/32
  latest handshake: 44 seconds ago
  transfer: 760.60 KiB received, 1.09 MiB sent
  persistent keepalive: every 20 seconds

peer: vt29BpiJ47DTLL75w5ZpxBNoK5K9XUkGaQdcOXBpum8=
  endpoint: 146.185.93.125:7366
  allowed ips: 10.42.10.0/24, 100.64.42.10/32, 10.42.2.0/24, 100.64.42.2/32
  latest handshake: 55 seconds ago
  transfer: 29.43 MiB received, 378.47 MiB sent
  persistent keepalive: every 20 seconds

peer: slJsHiV3fbRu+pq9hHXJEkclfI3dkk2PbgnsTjBHek4=
  endpoint: 146.185.93.123:2713
  allowed ips: 10.42.9.0/24, 100.64.42.9/32
  latest handshake: 1 minute, 32 seconds ago
  transfer: 5.72 GiB received, 10.78 GiB sent
  persistent keepalive: every 20 seconds

peer: hCDeyU3UaR6apCPZeJFLUL5DBDQNM4nAbqsbhQ5e/XQ=
  endpoint: 146.185.93.113:5827
  allowed ips: 10.42.3.0/24, 100.64.42.3/32
  transfer: 0 B received, 11.45 MiB sent
  persistent keepalive: every 20 seconds

peer: yk1mmgm0rY/f07nqUohZWIJyGUb5qZTSRYj7ZKyLmmk=
  endpoint: 146.185.93.112:6016
  allowed ips: 10.42.5.0/24, 100.64.42.5/32
  transfer: 0 B received, 11.44 MiB sent
  persistent keepalive: every 20 seconds

peer: 1r5LpRlRKrDME/HaMt4UsKZJkx7TKJctlu1g5shDjws=
  endpoint: 146.185.93.119:5267
  allowed ips: 10.42.6.0/24, 100.64.42.6/32
  transfer: 0 B received, 11.45 MiB sent
  persistent keepalive: every 20 seconds

peer: FlpS50NOa2UtI3U9LKBrCrx/8kDA2hhVXyZHTA99izQ=
  endpoint: 146.185.93.120:7885
  allowed ips: 10.42.7.0/24, 100.64.42.7/32
  transfer: 0 B received, 11.45 MiB sent
  persistent keepalive: every 20 seconds
~ # ip net exec n-MxMyNXzpwqJdb ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: n-MxMyNXzpwqJdb@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 36:5b:cc:49:23:23 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.42.4.1/24 brd 10.42.4.255 scope global n-MxMyNXzpwqJdb
       valid_lft forever preferred_lft forever
    inet6 fd4d:784d:794e:4::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::345b:ccff:fe49:2323/64 scope link 
       valid_lft forever preferred_lft forever
    inet6 fe80::1/64 scope link 
       valid_lft forever preferred_lft forever
3: public@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 96:67:f2:9d:a2:15 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 100.127.0.3/16 brd 100.127.255.255 scope global public
       valid_lft forever preferred_lft forever
    inet6 fd00::3/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::9467:f2ff:fe9d:a215/64 scope link 
       valid_lft forever preferred_lft forever
20: w-MxMyNXzpwqJdb: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet 100.64.42.4/16 brd 100.64.255.255 scope global w-MxMyNXzpwqJdb
       valid_lft forever preferred_lft forever
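
A quick way to triage dumps like the above: peers that show `transfer: 0 B received` never completed a handshake. As an illustrative helper (not part of zos; Python just for the sketch), the `wg show` output can be parsed to list those peers:

```python
def broken_peers(wg_show_output: str) -> list[str]:
    """Return public keys of peers that never received any data
    (`transfer: 0 B received`), i.e. the handshake never completed."""
    broken = []
    current = None
    for line in wg_show_output.splitlines():
        line = line.strip()
        if line.startswith("peer:"):
            current = line.split("peer:", 1)[1].strip()
        elif line.startswith("transfer:") and "0 B received" in line and current:
            broken.append(current)
    return broken

# sample taken from the node 312 dump above (abridged)
sample = """\
peer: G52pfjX0gPU8+EA7Oi+FqmHeql3ItCA2OW0ySXyHKiI=
  endpoint: 146.185.93.124:2759
  latest handshake: 44 seconds ago
  transfer: 760.60 KiB received, 1.09 MiB sent

peer: hCDeyU3UaR6apCPZeJFLUL5DBDQNM4nAbqsbhQ5e/XQ=
  endpoint: 146.185.93.113:5827
  transfer: 0 B received, 11.45 MiB sent
"""
print(broken_peers(sample))  # ['hCDeyU3UaR6apCPZeJFLUL5DBDQNM4nAbqsbhQ5e/XQ=']
```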

coesensbert avatar Jun 28 '22 11:06 coesensbert

Here what I found:

  • I chose node 312 to check why 311 can't establish a WireGuard handshake to that node.
  • First I tried to find the listening ports inside the public namespace. Since node 312 has a public config, all WireGuard endpoints must point to this node's public-ip:<wg-port>.
  • The first thing I noticed was that the endpoint port 6085 was not listening inside the public namespace.
  • It crossed my mind that something might be wrong with the network reservation contract, so I downloaded the workloads state from the node and did some coding to read out the state and history of the workloads. According to this, the workload for this network should be in OK state, so from the node's perspective the deployment is fine.
  • The second question that came to mind was: if the wg interface is there and the port is not listening inside the public namespace, then where is that port listening?
  • A quick ss -nlup in the host namespace shows this:
State     Recv-Q    Send-Q       Local Address:Port        Peer Address:Port    Process                              
UNCONN    0         0              0.0.0.0%zos:68               0.0.0.0:*        users:(("udhcpc",pid=1247,fd=5))    
UNCONN    0         0                  0.0.0.0:4037             0.0.0.0:*                                            
UNCONN    0         0                  0.0.0.0:6085             0.0.0.0:*                                            
UNCONN    0         0                     [::]:4037                [::]:*                                            
UNCONN    0         0                     [::]:6085                [::]:*  

So the port was listening on the host (and not inside the namespace). This is why node 311 can't establish a WireGuard connection to this node: 311 was trying to connect to the node's public IP, not the node's zos IP.
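
To reproduce this check, compare where the listen port shows up in `ss -nlup` output from the host namespace versus the public namespace; only the namespace should have it. A hypothetical helper (names and structure are illustrative, not zos code):

```python
def port_listening(ss_output: str, port: int) -> bool:
    """True if `port` appears as a local UDP listening address
    in `ss -nlup` output (4th column of UNCONN rows)."""
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "UNCONN":
            if fields[3].endswith(f":{port}"):
                return True
    return False

# Intended usage on a node (illustrative, needs root):
#   run `ss -nlup` in the host namespace, and
#   `ip netns exec public ss -nlup` in the public namespace;
#   a correctly wired wg port should only show up in the latter.

host_sample = """\
State     Recv-Q    Send-Q       Local Address:Port        Peer Address:Port
UNCONN    0         0                  0.0.0.0:4037             0.0.0.0:*
UNCONN    0         0                  0.0.0.0:6085             0.0.0.0:*
"""
print(port_listening(host_sample, 6085))  # True: the wg port leaked to the host
```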

The question now is how that happened. Why is this port suddenly listening on the host instead of in the public namespace?!

Hypothesis

  • Public namespace (public config) was created after this network deployment was created? I think we can rule this out since node 311 is already using the public IP, and that was working before (according to the issue).
  • Public namespace was deleted then recreated? Probably not possible; the namespace will not get recreated without a lot of issues.
    • I tried this out and the port didn't suddenly appear on the host namespace; it wasn't possible to recreate the namespace without getting issues in networkd.
  • dmesg has no obvious errors.

I will keep looking into it and see if I can find something more.

@LeeSmet and @delandtj what do you think?

muhamadazmy avatar Jun 30 '22 11:06 muhamadazmy

Due to some intensive intervention to try to find the issue, I had to reboot the node!

muhamadazmy avatar Jun 30 '22 11:06 muhamadazmy

After the reboot the node is back to normal, and the wg port is listening inside the public namespace. The only remaining problem was that br-pub was not wired correctly, but there is another new PR to fix that.

muhamadazmy avatar Jun 30 '22 11:06 muhamadazmy

Yeah this is consistent with some things I saw as well.

Public namespace (public config) was created after this network deployment was created. I think we can rule this out since node 311 is already using the public ip, and that was working before (according to the issue)

Public config was created a long time ago, (before node start), so this is for sure not the case.

LeeSmet avatar Jun 30 '22 11:06 LeeSmet

Possible race condition in setting up the pub namespace?

LeeSmet avatar Jun 30 '22 11:06 LeeSmet

No, the public namespace is created once the public config is set on the grid. I assume it was set up way before the deployment was created.

muhamadazmy avatar Jun 30 '22 12:06 muhamadazmy

Okay, stating the facts we have so far:

  • The node public config (namespace) was created on the 16th of Jun, according to logs and timestamps on config files on the node itself.
  • The contracts for this deployment all seem to have been created on the 15th of Jun. So those contracts clearly happened before the node had its public config applied to the node state.
  • The only way the contracts could use the node public config is if the chain had the node public config set while, for some reason, that config was not actually applied on the node.

Now the million dollar question:

  • Has this setup worked at some point, or was it broken from the start?

Note the following:

  • Mainnet was updated on the 16th. This forces node services to restart, which lets networkd find out that it needs to apply the public config.
  • Unfortunately I can't track on the chain when the config was actually set, so we can never be sure about this.

Possible causes:

The node, for some reason, missed the grid events about setting the node public config. Networkd was not restarted, hence didn't check whether the config had changed, until the next mainnet update (on the 16th). This can only be confirmed by the answer to the million dollar question above: if the setup never worked, this is probably the issue.

Possible fixes:

It seems networkd ALSO needs to check the grid regularly (say, every 30 minutes) if it didn't receive new events, just in case events were missed.

A better solution is to build a mechanism that never misses an event, but so far I can't find a good one without fetching every single block from the chain.
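
The polling fallback can be sketched as a reconcile loop: react to chain events when they arrive, but force a full resync from the grid when none were seen within the interval. A minimal illustrative sketch (zos itself is Go; `get_event`, `fetch_config`, and `apply_config` are hypothetical stand-ins):

```python
import time

RESYNC_INTERVAL = 30 * 60  # seconds: poll the grid at least this often

def needs_resync(last_sync: float, now: float,
                 interval: float = RESYNC_INTERVAL) -> bool:
    """True when no sync happened within the interval, so a full
    fetch of the public config from the grid is due."""
    return now - last_sync >= interval

def reconcile(get_event, fetch_config, apply_config, state):
    """One loop iteration: prefer events, but fall back to a periodic
    poll so a missed event can never leave the node stale forever."""
    event = get_event()  # non-blocking; returns None if nothing pending
    if event is not None:
        apply_config(event)
        state["last_sync"] = time.monotonic()
    elif needs_resync(state["last_sync"], time.monotonic()):
        apply_config(fetch_config())  # catch anything that was missed
        state["last_sync"] = time.monotonic()
```

The design point is that the event path and the poll path converge on the same `apply_config`, so a missed event only delays convergence by at most one interval instead of breaking the node until the next service restart.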

muhamadazmy avatar Jun 30 '22 13:06 muhamadazmy

Seems the answer to the million dollar question is that the deployment was working fine, so it's definitely not a missed event. ALTHOUGH it would be a good idea to not depend only on the events, and also check the grid every 30 minutes or so for config changes.

Important

We need to test what happens when an NR is updated with a new WireGuard port. This will probably trigger undefined behavior. IMHO once an NR is created, its WG port should not change during the entire life of that NR.

A possible scenario: the NR was working fine, then new peers were added, but a new port was also selected, and suddenly wg is listening on the host. Needs to be verified.
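
The suspected breakage is easy to model: every other peer pins this NR's endpoint to the original port, so if an update selects a new listen port, the existing peers keep sending to the old one and the handshakes stop. An illustrative model (not zos code):

```python
def handshake_possible(node_listen_port: int, peer_endpoint_port: int) -> bool:
    """A peer can only reach the node if the endpoint port in the
    peer's own config matches the port the node's wg interface
    actually listens on."""
    return node_listen_port == peer_endpoint_port

# NR created: node listens on 6085, peers are configured with endpoint :6085
print(handshake_possible(6085, 6085))  # True

# NR updated, new port selected: peers still point at :6085 and break
print(handshake_possible(7001, 6085))  # False
```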

muhamadazmy avatar Jun 30 '22 14:06 muhamadazmy

Private networking seems to work again, thx!

coesensbert avatar Jul 05 '22 14:07 coesensbert

Another, different multi deployment via Terraform with a broken internal network setup. All nodes can reach each other internally, but the public gateway node (which hosts the WireGuard endpoint your client connects to) can't reach the VMs internally.

Node 422 in Farm 74

~ # ip net exec n-Uz2nombG2Dadi wg show
interface: w-Uz2nombG2Dadi
  public key: XZgjeAlYmSmA05AyZFgxGiuWk9gMqy+DrbHsKmHthiQ=
  private key: (hidden)
  listening port: 7992

peer: 5VuEdqp1Ik+3Jm3R7iHLRJhpyL/YSj+ioIXz9BaM3AQ=
  endpoint: 45.156.243.246:7169
  allowed ips: 10.41.8.0/24, 100.64.41.8/32
  latest handshake: 5 seconds ago
  transfer: 1.14 MiB received, 7.96 MiB sent
  persistent keepalive: every 20 seconds

peer: OduADrGOwdyThBKsltcpBXF6WzDq2Qj5lOxrVpsKKR0=
  endpoint: 45.156.243.87:7997
  allowed ips: 10.41.3.0/24, 100.64.41.3/32
  transfer: 0 B received, 19.19 MiB sent
  persistent keepalive: every 20 seconds

peer: Fzecedyo2G/bizWurNjv/hmoAxD8JcheKXM5m20/y14=
  endpoint: 45.156.243.89:4001
  allowed ips: 10.41.4.0/24, 100.64.41.4/32
  transfer: 0 B received, 19.19 MiB sent
  persistent keepalive: every 20 seconds

peer: nNqEUoLgtSzS7prLlz0u0yxKu9/58NUT20Wq7aF/y1k=
  endpoint: 45.156.243.86:4394
  allowed ips: 10.41.5.0/24, 100.64.41.5/32
  transfer: 0 B received, 21.63 MiB sent
  persistent keepalive: every 20 seconds

peer: /Kg/lI4RKbaVLt9d++rSj320gG9mOb/MxMTwNV/f5mk=
  endpoint: 45.156.243.85:2418
  allowed ips: 10.41.6.0/24, 100.64.41.6/32
  transfer: 0 B received, 19.19 MiB sent
  persistent keepalive: every 20 seconds

peer: z9hFo0M5U5MQwC/pG+HnCFIJ1pcH9sencHTThmsycCk=
  endpoint: 45.156.243.244:4372
  allowed ips: 10.41.7.0/24, 100.64.41.7/32
  transfer: 0 B received, 19.19 MiB sent
  persistent keepalive: every 20 seconds

peer: ObjC364c2fL64NMTfei3XWDfEKIW1dHb9yB2na6V/nk=
  endpoint: 45.156.243.248:3240
  allowed ips: 10.41.9.0/24, 100.64.41.9/32
  transfer: 0 B received, 19.19 MiB sent
  persistent keepalive: every 20 seconds

peer: nIR8HvZJMp2iXphWdsOtlJvQQtmA0ozAwP1SjKtFAVg=
  allowed ips: 10.41.2.0/24, 100.64.41.2/32
  persistent keepalive: every 20 seconds

Wireguard client config

    [Interface]
    Address = 100.64.41.2
    PrivateKey = GMR3yWYptqjeRxxxxxxxxxxxxxxxxxxxxxxxxx=
    [Peer]
    PublicKey = XZgjeAlYmSmAxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=
    AllowedIPs = 10.41.0.0/16, 100.64.0.0/16
    PersistentKeepalive = 25
    Endpoint = 45.156.243.242:7992

coesensbert avatar Jul 07 '22 12:07 coesensbert

@coesensbert that was due to nodes using the wrong NIC for public traffic. This issue was solved in https://github.com/threefoldtech/zos/pull/1756 but is not yet available on mainnet.

muhamadazmy avatar Jul 07 '22 13:07 muhamadazmy

Ok thanks @muhamadazmy !

coesensbert avatar Jul 07 '22 13:07 coesensbert