Dual nic network setup is inconsistent
Sometimes a node booted with 2 nics attached will have its network set up in single nic mode. This causes problems with public configs, and possibly with public IPs set on workloads, as these environments also tend to have separate VLANs for each nic in the switch. An example of this is node 320 on mainnet. There are also at least 2 other nodes in this farm which have this issue. Note that this node actually has 4 nics attached: 2 1gb for private data (i.e. the nic for zos) and 2 10gb for public workload data. All nics are properly detected. There are other nodes in the farm with the same setup (4 nics) which do have 2 active nics (1 for zos, 1 for pub traffic). Since it's extremely hard to identify this, see also #1752
I have a theory about why this might happen. The first time networkd starts after booting, it creates the br-pub bridge (regardless of whether it starts in single nic or dual nic mode), and it then has to decide how this bridge is going to be wired: either to the zos bridge or to the 2nd nic.
The detection is based on some criteria as follows:
- it has to be a physical nic
- it has to be "wired"
- and nic must not be attached to anything else
If NO nic is found, br-pub will be wired to zos via a veth pair. Here is where it can go wrong. This setup is only done on first boot. No validation or re-wiring can happen after the machine is booted or after networkd is restarted.
Hence my theory is that one (or more) of the criteria wasn't met on boot. Cable wasn't plugged in correctly? No SLAAC IPv6 was received in time?
Note, this is just a theory, and I can't confirm it until I either have access to the node or we can reproduce it locally. I will proceed to fix #1752 for now so we can "debug" more easily.
Do we have nodes with a similar issue in freefarm?
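The criteria above can be sketched roughly as follows. This is only an illustration of the theory, not the actual networkd code: the `Nic` type, its fields, and the `isCandidate` and `pickExit` helpers are all hypothetical names invented for this example.

```go
package main

import "fmt"

// Nic is a hypothetical, simplified view of a network interface.
// It does not mirror any real zos type.
type Nic struct {
	Name     string
	Physical bool   // backed by real hardware (not a veth, bridge, ...)
	Carrier  bool   // link is up, i.e. a cable is plugged in ("wired")
	Master   string // bridge the nic is attached to, "" if it is free
}

// isCandidate applies the three criteria listed above:
// physical, wired, and not attached to anything else.
func isCandidate(n Nic) bool {
	return n.Physical && n.Carrier && n.Master == ""
}

// pickExit returns the first nic that passes all criteria.
// ok == false means nothing qualified, in which case br-pub
// would be wired to zos via a veth pair.
func pickExit(nics []Nic) (exit Nic, ok bool) {
	for _, n := range nics {
		if isCandidate(n) {
			return n, true
		}
	}
	return Nic{}, false
}

func main() {
	nics := []Nic{
		{Name: "eth0", Physical: true, Carrier: true, Master: "zos"},
		{Name: "eth1", Physical: true, Carrier: false}, // cable not plugged in
		{Name: "eth2", Physical: true, Carrier: true},
	}
	if n, ok := pickExit(nics); ok {
		fmt.Println("br-pub exit nic:", n.Name)
	} else {
		fmt.Println("no free nic, wiring br-pub to zos via veth pair")
	}
}
```

Because the decision is made once on first boot, a nic that fails any single check at that moment (for example a link that comes up late) would leave the node in single nic mode until the next reboot.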
There are 4 physical, wired nics, 3 of which are not attached. There is no IPv6 in the farm, and other nodes are in a good setup, so SLAAC does not seem to be the problem.
Actually, after reviewing the code, it seems we currently filter ONLY on SLAAC-enabled nics to use as the exit for the br-pub bridge. The thing is, there was another bug in how we process the errors that caused some "invalid" nics (according to the current filtering) to be selected.
The bug is now fixed (in code), which means that ALL nodes in this farm will no longer have a dual nic setup (at least the ones that are rebooted), unless we drop the filter that requires IPv6.
The thing is, I can't remember why we had this IPv6 requirement in this procedure, and I don't want to make the decision to remove this filter without first discussing it with you and @delandtj.
Okay, after discussion with @delandtj, here is what we came up with:
Requirements
- [x] Some nodes may have a public config; if one exists, the node can be used as a wireguard access point. This is a special case; the generic case is that no public config is specified.
- [x] All nodes always have `br-pub`; this is the bridge that is used by VMs that have public IPs.
- [x] All nodes in a farm that has public IPs (available for rent) can host VMs that have public IPs, even if the nodes themselves don't have a public config. (Again, having a public config is a special case.)
- [x] The `br-pub` bridge is always created on the node (in both single and dual setup).
- [x] On boot, the node tries to find the best exit nic to use as a master for the `br-pub` bridge as follows:
  - [x] The first free, physical nic that can get a SLAAC IPv6 address is used immediately.
  - [x] If none is found, br-pub is wired to the `zos` bridge, hence the node is a fully hidden node. It will still accept public IPv4 workloads, but they are not guaranteed to work (that is up to the farmer's network) since traffic will be routed over the private nic (zos).
- [x] The farmer can still force br-pub to rewire to a different nic by using rmb calls; the farmer should do that if the automatic detection of the exit nic wasn't successful (no IPv6).
- [x] The public config still can (and must) be set via the chain, because this information is needed by the grid users and hence has to be public.
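The selection rules agreed on above could be summarized in a sketch like the one below. It is an assumption-laden illustration rather than the code from the PR: the `Link` type, the `selectPubMaster` function, and the way the farmer override is passed in are hypothetical, and the real rewiring goes through rmb calls that are not modeled here.

```go
package main

import "fmt"

// Link is a hypothetical, simplified description of an interface.
type Link struct {
	Name     string
	Physical bool
	Free     bool // not attached to any bridge
	HasSLAAC bool // received a SLAAC IPv6 address while probing
}

// zosBridge is the fallback master for br-pub (single nic / hidden node).
const zosBridge = "zos"

// selectPubMaster returns what br-pub should be wired to and whether
// the node ends up in dual nic mode.
//
// override is the nic name forced by the farmer ("" if none). In this
// sketch an override that does not name a known physical nic is ignored
// and automatic detection is used instead.
func selectPubMaster(links []Link, override string) (master string, dual bool) {
	if override != "" {
		for _, l := range links {
			if l.Name == override && l.Physical {
				return l.Name, true
			}
		}
	}
	// Automatic detection: first free, physical nic with SLAAC IPv6.
	for _, l := range links {
		if l.Physical && l.Free && l.HasSLAAC {
			return l.Name, true
		}
	}
	// Nothing qualified: wire br-pub to zos, the node is hidden.
	return zosBridge, false
}

func main() {
	// Four physical nics, three of them free, no IPv6 anywhere,
	// similar to the farm described in this issue.
	links := []Link{
		{Name: "eth0", Physical: true, Free: false},
		{Name: "eth1", Physical: true, Free: true},
		{Name: "eth2", Physical: true, Free: true},
		{Name: "eth3", Physical: true, Free: true},
	}
	master, dual := selectPubMaster(links, "")
	fmt.Println("automatic:", master, "dual:", dual)

	master, dual = selectPubMaster(links, "eth2")
	fmt.Println("farmer override:", master, "dual:", dual)
}
```

With no IPv6 available, automatic detection falls back to `zos`, and only the explicit override yields a dual nic setup, which matches the behavior described for farms without SLAAC.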
Due to a bug in the detection code, a nic is selected even if there are NO valid nics (because of the lack of IPv6). This is why some nodes in the GE farm have managed to get a dual nic setup while in fact they shouldn't. The current PR should be merged (asap, imho) and deployed to mainnet. This will force all nodes to have no dual setup.
GE farm MUST provide SLAAC IPv6 on the public exit nic for the dual setup to work; otherwise they will all be hidden nodes.
This PR fixes the bug in the detection, plus the required changes to show the state of the dual setup in the UI:
https://github.com/threefoldtech/zos/pull/1756