Completely forgo the usage of macvlans, as they wreak havoc in ARP/IPv6 ND when more than one macvlan is created from a bridge interface
Creating a macvlan from a bridge interface seems to conflict with ARP/ND on the bridge: both the bridge (obviously) and the macvlans carry bridging code, and somewhere along the path things stop fitting together. Creating macvlans from physical interfaces doesn't seem to cause problems, but creating them from bridge interfaces (when there is more than one and both interfaces need to communicate) does weird things, where neighbors get lost, and as such routes get lost, and as such connections get lost.
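For illustration, a minimal sketch of the topology that triggers this, assuming the github.com/vishvananda/netlink package and illustrative names (br-demo, mv1, mv2); this is not the actual zos code:

```go
// Sketch of the problematic setup: two macvlans in bridge mode whose parent
// device is itself a bridge. Requires root privileges to run.
package main

import "github.com/vishvananda/netlink"

func main() {
	// The parent bridge, analogous to e.g. br-pub on a node.
	if err := netlink.LinkAdd(&netlink.Bridge{
		LinkAttrs: netlink.LinkAttrs{Name: "br-demo"},
	}); err != nil {
		panic(err)
	}
	br, err := netlink.LinkByName("br-demo")
	if err != nil {
		panic(err)
	}

	// Two macvlans created on top of the bridge. Each macvlan carries its own
	// bridging logic, which is what appears to clash with the parent bridge's
	// ARP/ND handling once the two need to talk to each other.
	for _, name := range []string{"mv1", "mv2"} {
		mv := &netlink.Macvlan{
			LinkAttrs: netlink.LinkAttrs{Name: name, ParentIndex: br.Attrs().Index},
			Mode:      netlink.MACVLAN_MODE_BRIDGE,
		}
		if err := netlink.LinkAdd(mv); err != nil {
			panic(err)
		}
	}
}
```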
Some comments needed here:
- if a macvlan or macvtap is created on a bridge for a workload that is still alive, we will have to see if we want to leave it like that or recreate it. Or just tell the user to recreate his workload.
- zdbs are using the old container code, so I guess we can do that 'quite' easily?
- all other workloads are taps?
There might be some info about the sources of conflict here (just a note for reference).
As far as I understand, this affects the Mycelium interface for Zdbs. Are VMs also affected? If I read correctly here, it seems so:
https://github.com/threefoldtech/zos/blob/9cd81f3ec8049a224b3c819e3b329a28835ae92c/docs/internals/network/topology/readme.md?plain=1#L84
- we will have to see if we want to leave it like that or recreate it. Or just tell the user to recreate his workload.
I think it's fine to leave existing VMs as is and let the users recreate if they want. If the user is experiencing a problem due to this, they probably have found another communication route and are happy enough with their VM as is.
For Zdb, ideally the endpoints could retain all the same IP addresses and nothing changes for the users. If that's not possible, oh well. There's not much usage yet anyway.
VMs always use regular TAP interfaces (NOT macvtaps), and therefore these should be fine. The problem exists primarily for zdbs because (for mycelium) a packet traverses 2 macvlans in bridge mode which are themselves enslaved by a bridge. It appears to be specifically the second macvlan, acting as router, which causes issues: "regular" public ipv6 has the same setup (a macvlan from br-pub to the zdb ns and a macvlan from br-pub to the public ns), but there the packet "exits" on the bridge, since the bridge enslaves a physical interface which is the path to the router (so for regular IPv6 the public namespace is not involved).
This is already handled for zos4; we need to do it in zos3.
Any status on zos3 for this? Keep in mind that once that is done (in zos4 and zos3) we can remove the dummy devices, as that problem is gone then too.
Hi everyone,
I had a talk with @despiegk and we can't wait until 3.16 to fix this macvlan bug, as it is currently a blocking bug, i.e. it blocks Mycelium utilization. Thus it is now set as a critical priority.
We can't wait for a forklift upgrade. We just need to do a patch.
@ashraffouda @Omarabdul3ziz can you guys check this? @xmonader tagging you here so you can coordinate the steps if needed.
Thanks everyone.
changing out the macvlans will not fix the problem until every single node is rebooted, because nodes will not drop their macvlans until rebooted
So basically all macvlan devices attached to a bridge (which should be all of them at this point, iirc) need to be replaced by a veth pair with one end plugged into the bridge. While it is true that the network won't adapt automatically just because new code is uploaded until the node is rebooted, this can be solved by writing a bit of migration code which looks for all macvlans and switches them for veth pairs (see the sketch below). Alternatively, the migration code can look for a single known macvlan device; if that is found, it knows the other macvlans need to be switched out, and finally the known one should be changed to a veth pair as well. This will allow the devices to be changed with minimal downtime for workloads and without the need to reboot the node. Finally, the dummy interfaces we use to force the bridges to be up can be cleaned up as well after this.
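For reference, a minimal sketch of what such migration code could look like, assuming the github.com/vishvananda/netlink package; the function name, peer naming scheme and error handling are illustrative and not taken from the actual zos code, and a real migration would also have to carry over addresses, routes and namespace membership:

```go
package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// migrateMacvlans replaces every macvlan whose parent is a bridge with a veth
// pair: one end takes over the macvlan's name, the other end is plugged into
// the same bridge. Sketch only; IPs/routes/namespaces are not handled here.
func migrateMacvlans() error {
	links, err := netlink.LinkList()
	if err != nil {
		return err
	}

	for _, link := range links {
		mv, ok := link.(*netlink.Macvlan)
		if !ok {
			continue
		}

		// Only migrate macvlans that were created on top of a bridge.
		parent, err := netlink.LinkByIndex(mv.Attrs().ParentIndex)
		if err != nil {
			return err
		}
		br, ok := parent.(*netlink.Bridge)
		if !ok {
			continue
		}

		name := mv.Attrs().Name
		if err := netlink.LinkDel(mv); err != nil {
			return err
		}

		// Create the veth pair; the peer gets plugged into the bridge.
		veth := &netlink.Veth{
			LinkAttrs: netlink.LinkAttrs{Name: name},
			PeerName:  fmt.Sprintf("%s-peer", name),
		}
		if err := netlink.LinkAdd(veth); err != nil {
			return err
		}
		peer, err := netlink.LinkByName(veth.PeerName)
		if err != nil {
			return err
		}
		if err := netlink.LinkSetMaster(peer, br); err != nil {
			return err
		}
		for _, l := range []netlink.Link{veth, peer} {
			if err := netlink.LinkSetUp(l); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	if err := migrateMacvlans(); err != nil {
		panic(err)
	}
}
```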
As I understand it, this means that @LeeSmet has a working solution for this.
@ashraffouda @Omarabdul3ziz Can you implement the changes needed? If something is missing, what would be needed to fix this ASAP? Thanks.
@xmonader, I added back priority_major as @delandtj set it initially, if we don't want to set it as critical.
Please let's discuss and act as needed to move this forward. I'll be glad to help.
We started by changing the zdb part today, but it's not finished yet; there are some issues we are still debugging.
Work is done on the zdb workload: it now uses a veth pair instead of the macvlan for all its links (ygg/myc/pub).
Will continue working on the other macvlans used for the network resource and the wiring to the ndmz namespace.
Since the patch to implement this was applied to mainnet, I don't have any more issues connecting to Zdbs over Mycelium. Looks good.
Right now the macvlan removal was only applied on devnet, so that may mean something else could have been going on.
Ah, my mistake. I should refine my statement to be that, while the gratuitous packet loss did not appear in my recent tests, there are still big latency spikes on testnet.
However, my testing on devnet today shows that there is still large amounts of packet loss happening when connecting to Zdbs over Mycelium.
Here are some examples from ping tests. These were performed from a VM running on node 128 on devnet. The IPs are the Mycelium IPs of Zdbs running on the indicated node ids, as returned by Zos after creating Zdb deployments. Check the gaps in sequence number.
Devnet node 12
64 bytes from 5b2:39e5:4286:5af0:8301:6cb5:43f0:5b77: icmp_seq=128 ttl=58 time=64.6 ms
64 bytes from 5b2:39e5:4286:5af0:8301:6cb5:43f0:5b77: icmp_seq=153 ttl=59 time=48.1 ms
Devnet node 28
64 bytes from 53c:6017:e9c1:9aee:e514:bea6:1a64:d38: icmp_seq=253 ttl=59 time=44.6 ms
64 bytes from 53c:6017:e9c1:9aee:e514:bea6:1a64:d38: icmp_seq=310 ttl=59 time=132 ms
Devnet node 31
64 bytes from 4b6:c798:a:34e1:6a16:7067:d299:11b: icmp_seq=186 ttl=58 time=59.1 ms
64 bytes from 4b6:c798:a:34e1:6a16:7067:d299:11b: icmp_seq=269 ttl=58 time=46.3 ms
Looks like we need some further investigation to be sure of the root cause and why the proposed fix did not address it.
I concur, since it seems that all listed nodes still exhibit the old behavior of using macvlans
@scottyeager can you try after that reboot? The migration code to auto-migrate without the need for a reboot is still in progress.
Looking much better now. The three nodes that were rebooted have 0% packet loss now. Node 14 which was not rebooted is still showing high packet loss (lower right):
When creating the zdb namespace, disable forwarding IN the namespace. When creating a namespace it inherits the settings of namespace 0, which by default enables forwarding; SLAAC does not accept RAs when forwarding is enabled, hence no ipv6 on that interface.
One could set accept_ra to 2 IN the namespace, but that is not a correct solution; see the sketch below for disabling forwarding instead.
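A minimal sketch of the fix suggested above (disabling forwarding inside the namespace rather than touching accept_ra), assuming it runs from a process that has already entered the zdb network namespace; the helper name is illustrative and not the actual zos code:

```go
package main

import "os"

// disableIPv6Forwarding turns IPv6 forwarding off so SLAAC starts accepting
// RAs again. This must run from a process that has already switched into the
// zdb network namespace (e.g. via setns), otherwise it changes the host's
// setting instead.
func disableIPv6Forwarding() error {
	return os.WriteFile("/proc/sys/net/ipv6/conf/all/forwarding", []byte("0"), 0644)
}

func main() {
	if err := disableIPv6Forwarding(); err != nil {
		panic(err)
	}
}
```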
Also, when defining a 0-db namespace for the daemon to run in, the interface is called eth0 instead of following the standard naming schemes where name collisions are avoided.
I think that p-eth0 and eth0 in the namespace will have name collisions when one runs more than one 0-db, if it's supposed that every 0-db runs in its own namespace (I myself don't know if that is the case).