libnetwork icon indicating copy to clipboard operation
libnetwork copied to clipboard

Predictive interface order / default gateway

Open cziebuhr opened this issue 7 years ago • 8 comments

I tried to setup a container with several networks, but didn't succeed in successfully configuring them. Especially when using ipv6 and trying to get a stable interface name of a macvlan network, I got mad. After looking at the docker sources, I stumbled across the following items:

a) I wonder about the usage of container/heap for sb.endpoints. IMHO heap is useful when using heap.Pop() for removing the respective minimum element. Instead, sb.Endpoints() and sb.getConnectedEndpoints() iterate directly over the sb.endpoints slice. In this case, only the first element is guaranteed to be the minimum element of the slice. All other elements have random order. In some setups, this leads to interface names inside the container having unpredictive order (https://github.com/moby/moby/issues/25181) and an unpredictive default-gateway. I created a patch to change from heap to an ordered array instead.

b) The eh.Less() function also produces unstable results, when e.g. several networks are internal. I'm unsure if checking for endpointInGWNetwork(), getNetwork().Internal() and joinInfo.gw is essential or if it was just convenience for most setups. I think we should skip processing these checks in eh.Less() and instead move them to sb.getGatewayEndpoint(). Also see #1609, #1443 and #1141.

c) There is already some code to set epPriority, unfortunately it's not yet exported through the api and it's still subject to additional auto-magic. I'd favor a solution where I can change the interface order as a user instead of having different results in each docker version.

cziebuhr avatar Feb 27 '18 17:02 cziebuhr

In my usecase, I need these networks:

  • default bridge with ipv4 and default gw
  • custom macvlan with ipv6 (for use with multicast, so I need to know the interface name)
  • optional additional networks

Result: When only having two networks, the macvlan becomes eth0 and default gw. When having more networks, it's totally unpredictable.

cziebuhr avatar Feb 28 '18 10:02 cziebuhr

Brilliant solution @cziebuhr , your patch worked for me. Do you think it's likely to get merged soon (or at all)?

networkop avatar Mar 02 '18 12:03 networkop

Created PR #2103 which at least solves point a)

cziebuhr avatar Mar 08 '18 10:03 cziebuhr

Hey. Sorry for the delayed response. Needed to do some background research as I wasn't familiar with this section of the code.

Overall, as stated, I think that the change in #2103 takes us in the right direction. From my query of those who know the history of this part of the code a bit better, the intent of the priority was definitely to allow the user to order the interfaces. Agree that keeping them in a heap works against that goal and so putting them in an ordered array as in the PR will fix that.

Now regarding part b) my reading of the code is that the sorting was stable, but perhaps not what we might want. The endpoints were effectively sorted into categories. Non-internal endpoints in the gateway networks come last preceeded by internal endpoints in gateway networks. Internal endpoints in non-gateway networks come precede those and finally the first category is non-internal endpoints in non-gateway networks. Within those categories, if both endpoints have join info, then interfaces with v6 info always come before those without. Finally, they are sorted by priority with ties being resolved by network name lexicographic order. So the question becomes whether this is the correct ordering.

Pictorally, this looks something like:

First: +----
       |   non Gateway, non Internal sorted by prio w/ more specific (i.e. v6) join info first
       |...
       +----
       |   Internal, non-Gateway sorted by prio w/ more specific (i.e. v6) join info first
       |...
       +----
       |   Gateway, non-Internal sorted by prio w/ more specific (i.e. v6) join info first
       |...
       +----
       |   Gateway, internal sorted by prio w/ more specific (i.e. v6) join info first
       |...
Last:  +----

Now we could make it so that priority is considered first so instead what we have is a set of lists like the following ordered by priority:

      +----
      |  non-GW, non-Internal, specific joinInfo
      |  non-GW, non-internal, less specific joinInfo
High  |  non-GW, Internal, specific joinInfo
Prio  |  non-GW, Internal, less specific joinInfo
      |  GW, non-Internal, specific joinInfo
      |....
      +----
Lower |...
Prio  |...

This would allow the user (when we get to part c)) to specify ordering but still have a set of criteria for breaking ties when priority is the same. Also, if priority isn't specified, the behavior should match what we have today.

W.r.t. c) Yes, we need to expose the priority in some way for the endpoints in order to allow the users to specify the ordering. This would require more investigation and obviously would require a separate PR in https://github.com/moby/moby. Suggestions / patches are by all means welcome.

ctelfer avatar Mar 25 '18 00:03 ctelfer

Am I right that there can only be one gateway network? If yes, then if epX.endpointInGWNetwork() { return X } is fine. But I assume there can be multiple internal networks. In that case, comparing two internal endpoints currently isn't stable, because one of them will return true or false before the next level of comparison is done. IMHO we need to change the fallthrough mechanism, to advance to the next level when both endpoints are considered equal on one level. So I suggest this algorithm:

# <=> Returns true if a < b, false if a > b and advances to next level if a == b
epi.prio <=> epj.prio           # 1 < 2
epi.gw <=> epj.gw               # non-gw < gw
epi.internal <=> epj.internal   # non-internal < internal
epi.joininfo <=> epj.joininfi   # ipv6 < ipv4
epi.name <=> epj.name           # bar < foo

cziebuhr avatar Mar 27 '18 15:03 cziebuhr

Yep, that's what I was trying to get at. Sorry if unclear. There is one difference though: priority comparison currently puts higher priority as lower in the comparison so higher priority interfaces come first:

func (eh epHeap) Less(i, j int) bool {
...
        return cip > cjp      // endpoint i < endpoint j if prio i > prio j
}

(annotation mine) I'd think we'd want to keep this. Although it's less numerically intuitive in the code, having higher priority endpoint interfaces come earlier in the listing would match my intuition.

ctelfer avatar Mar 27 '18 19:03 ctelfer

Regarding c) The PRs docker/cli#2529 and moby/moby#40974 seem related to this one, the latter requiring #2550 to compile.

I do wonder, whether this implementation would expose users to "auto-magic" of epPriority, mentioned in @cziebuhr's initial post.

aradmidias avatar Sep 01 '20 12:09 aradmidias

Any updates on this issue?

Vlad1mir-D avatar Apr 28 '23 12:04 Vlad1mir-D