Add BGP cumulative link-bandwidth

Open dplore opened this issue 1 year ago • 8 comments

Change Scope

  • Extend configuration of BGP global and peer-groups to permit cumulative link bandwidth and transitive behavior as per draft-ietf-bess-ebgp-dmz.

  • Also add link-bandwidth parameters to BGP neighbor configuration and state. (Previously these were only defined at the global and peer-group levels)

  • This change only adds leafs and is backwards compatible.

  • Example flattened paths:

Prefix: /network-instances/network-instance/protocols/protocol/bgp/

bgp/global/afi-safis/afi-safi/link-bw/config/send-cumulative=true
bgp/global/afi-safis/afi-safi/link-bw/config/non-transitive-ebgp=true
bgp/global/afi-safis/afi-safi/link-bw/config/divide=false
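
As a rough illustration, the flattened paths above can be expanded into absolute paths with typed values in a few lines of Python. The helper name `expand_paths` is hypothetical and not part of any OpenConfig tooling. The sketch assumes that the leading `bgp/` on each line repeats the last element of the prefix, so it is dropped before joining.

```python
PREFIX = "/network-instances/network-instance/protocols/protocol/bgp/"

FLATTENED = """\
bgp/global/afi-safis/afi-safi/link-bw/config/send-cumulative=true
bgp/global/afi-safis/afi-safi/link-bw/config/non-transitive-ebgp=true
bgp/global/afi-safis/afi-safi/link-bw/config/divide=false
"""


def expand_paths(prefix: str, flattened: str) -> dict[str, bool]:
    """Map each absolute path to its boolean leaf value (hypothetical helper)."""
    result: dict[str, bool] = {}
    for line in flattened.splitlines():
        line = line.strip()
        if not line:
            continue
        rel_path, _, raw_value = line.partition("=")
        # Assumption: the leading "bgp/" repeats the prefix's last element.
        rel_path = rel_path.removeprefix("bgp/")
        result[prefix + rel_path] = raw_value == "true"
    return result


paths = expand_paths(PREFIX, FLATTENED)
for path, value in paths.items():
    print(path, value)
```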

Platform Implementations

On Cisco IOS XR, no special configuration is needed other than link-bandwidth-ext-community/config/enabled. The OS implicitly sums the link-bandwidth of ECMP BGP paths.

Arista EOS expects an 'aggregate' configuration option, in addition to enabling link-bandwidth, to be added on a per-neighbor basis. It further supports 'divide' and 'equal' options.

JunOS expects configuration of a BGP policy statement to enable 'aggregate-bandwidth', with additional options for transitive/non-transitive and a 'divide-equal' option. There is also a configuration item, set per BGP neighbor or peer, to allow the link-bandwidth community to be sent via eBGP (i.e., allow sending BGP link-bandwidth, which was defined as non-transitive in draft-ietf-idr-link-bandwidth but then updated by draft-ietf-bess-ebgp-dmz to be transitive).
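
To make the aggregation concrete, here is a minimal Python sketch of the summation described for IOS XR: the cumulative link-bandwidth is taken as the sum of the link-bandwidth values of the multipaths for a prefix. The `Path` type and the `cumulative_link_bw` function are hypothetical names used only for illustration. Skipping paths that carry no link-bandwidth community is an assumption, not behavior documented for any of the platforms above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Path:
    next_hop: str
    # Link-bandwidth in bytes per second; None if the path carries no community.
    link_bw: Optional[float]


def cumulative_link_bw(multipaths: list[Path]) -> Optional[float]:
    """Sum the link-bandwidth over all multipaths that carry the community."""
    values = [p.link_bw for p in multipaths if p.link_bw is not None]
    if not values:
        return None  # nothing to advertise
    return sum(values)


multipaths = [
    Path("192.0.2.1", 1.25e9),  # 10 Gb/s expressed in bytes per second
    Path("192.0.2.2", 1.25e9),
    Path("192.0.2.3", 5.0e9),   # 40 Gb/s
]
total = cumulative_link_bw(multipaths)
print(total)
```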

Tree view

This common stanza of changes is added at each level of BGP configuration:

  • bgp/global
  • bgp/peer-group
  • bgp/neighbor
  • bgp/global/afi-safis
  • bgp/peer-group/afi-safis
  • bgp/neighbor/afi-safis

dloher$ diff -U 20 ~/master-tree.txt ~/bess-tree.txt
--- /Users/dloher/master-tree.txt       2024-08-16 18:03:37
+++ /Users/dloher/bess-tree.txt 2024-08-19 18:38:34
@@ -3733,40 +3733,49 @@
         |     |  |  |     |     +--rw link-bandwidth-ext-community
         |     |  |  |     |     |  +--rw config
         |     |  |  |     |     |  |  +--rw enabled?   boolean
         |     |  |  |     |     |  +--ro state
         |     |  |  |     |     |     +--ro enabled?   boolean
         |     |  |  |     |     +--rw config
         |     |  |  |     |     |  +--rw maximum-paths?   uint32
         |     |  |  |     |     +--ro state
         |     |  |  |     |        +--ro maximum-paths?   uint32
         |     |  |  |     +--rw add-paths
         |     |  |  |     |  +--rw config
         |     |  |  |     |  |  +--rw receive?                  boolean
         |     |  |  |     |  |  +--rw send?                     boolean
         |     |  |  |     |  |  +--rw send-max?                 uint8
         |     |  |  |     |  |  +--rw eligible-prefix-policy?   -> /oc-rpol:routing-policy/policy-definitions/policy-definition/name
         |     |  |  |     |  +--ro state
         |     |  |  |     |     +--ro receive?                  boolean
         |     |  |  |     |     +--ro send?                     boolean
         |     |  |  |     |     +--ro send-max?                 uint8
         |     |  |  |     |     +--ro eligible-prefix-policy?   -> /oc-rpol:routing-policy/policy-definitions/policy-definition/name
+        |     |  |  |     +--rw link-bw
+        |     |  |  |     |  +--rw config
+        |     |  |  |     |  |  +--rw send-cumulative?       boolean
+        |     |  |  |     |  |  +--rw non-transitive-ebgp?   boolean
+        |     |  |  |     |  |  +--rw divide?                boolean
+        |     |  |  |     |  +--ro state
+        |     |  |  |     |     +--ro send-cumulative?       boolean
+        |     |  |  |     |     +--ro non-transitive-ebgp?   boolean
+        |     |  |  |     |     +--ro divide?                boolean
         |     |  |  |     +--rw ipv4-unicast
         |     |  |  |     |  +--rw prefix-limit
         |     |  |  |     |  |  +--rw config
         |     |  |  |     |  |  |  +--rw max-prefixes?            uint32
         |     |  |  |     |  |  |  +--rw prevent-teardown?        boolean
         |     |  |  |     |  |  |  +--rw warning-threshold-pct?   oc-types:percentage
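
Because the same stanza appears at the global, peer-group, and neighbor levels, a consumer of the model needs some rule for picking the effective value of each leaf. The Python sketch below assumes that the most specific level that sets a leaf wins (neighbor, then peer-group, then global). That precedence is an assumption made for illustration and is not stated in this change. `LinkBwConfig` and `effective` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LinkBwConfig:
    # Mirrors the three optional leaves of the link-bw/config container.
    send_cumulative: Optional[bool] = None
    non_transitive_ebgp: Optional[bool] = None
    divide: Optional[bool] = None


def effective(global_cfg: LinkBwConfig,
              peer_group_cfg: LinkBwConfig,
              neighbor_cfg: LinkBwConfig) -> LinkBwConfig:
    """Pick each leaf from the most specific level that sets it (assumed rule)."""
    merged = LinkBwConfig()
    for f in fields(LinkBwConfig):
        for level in (neighbor_cfg, peer_group_cfg, global_cfg):
            value = getattr(level, f.name)
            if value is not None:
                setattr(merged, f.name, value)
                break
    return merged


result = effective(
    LinkBwConfig(send_cumulative=True, divide=False),  # global
    LinkBwConfig(non_transitive_ebgp=True),            # peer-group
    LinkBwConfig(send_cumulative=False),               # neighbor
)
print(result)
```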

dplore avatar Jun 14 '24 01:06 dplore

No major YANG version changes in commit 09dadb3712378853d254d998aa911f81a8755c8e

OpenConfigBot avatar Jun 14 '24 01:06 OpenConfigBot

draft-ietf-bess-ebgp-dmz does two things:

  1. It allows propagation of the link-bandwidth ext-community attribute from the BGP Local-RIB to an eBGP session, even though the link-bandwidth ext-community is non-transitive. This is orthogonal to the BGP multipath configuration on a given system.
  2. It also allows different aggregation algorithms to be applied to link-bandwidth when multiple paths with the link-bandwidth ext-community exist in the Local-RIB (that is, when BGP multipath is enabled).

Note: The original I-D, draft-ietf-idr-link-bandwidth-07 (page 3, last sentence), explicitly allows initialization and transmission of the link-bandwidth community on an eBGP session (similarly to other non-transitive BGP attributes such as MED). It just does not allow propagation of the link-bandwidth community from the Local-RIB to an eBGP session.

The global/afi-safis/afi-safi/use-multiple-paths/... hierarchy is the wrong place to control whether and how link-bandwidth is sent over eBGP because:

  • draft-bess does not require multipathing to be enabled (see 1 above).
  • global/afi-safis/afi-safi/use-multiple-paths/... is meant to provide constraints on the process of populating the Local-RIB and the FIB's NHG. The use-multiple-paths/ebgp/link-bandwidth-ext-community/config/enabled leaf controls whether FIB ECMP weights should be derived from link-bandwidth values or not. In the latter case, basic ECMP will be programmed in the dataplane, consuming fewer TCAM/SRAM resources, but link-bandwidth will remain in the Local-RIB and will be propagated to iBGP peers (and to eBGP peers under draft-bess, if enabled).
  • Suppose .../afi-safis/afi-safi/use-multiple-paths/config/enabled is set to "FALSE" and the new leaf non-transitive is set to "TRUE": what should the system behavior be?
  • IMHO the knobs controlling draft-ietf-bess-ebgp-dmz should be direct attributes of {neighbor|peer-group|global}/afi-safis/afi-safi, similar to send-community-type (which is unfortunately an enum). My suggestion is:
    • define 2 new containers dedicated to the link-bandwidth community (it is a special case anyway) under bgp/global/afi-safis/afi-safi/config/: tx-link-bandwidth with leafs enabled, ebgp (default FALSE), cumulative, and average/equal; AND rx-link-bandwidth with leaf enabled (default TRUE). If tx-link-bandwidth/enabled is not specified: TRUE on iBGP and FALSE on eBGP.
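
The default suggested in the last bullet can be sketched in Python as follows. This only illustrates the proposal made in this comment. `tx_link_bandwidth_enabled` is a hypothetical helper, not a leaf or function in any OpenConfig model.

```python
from typing import Optional


def tx_link_bandwidth_enabled(configured: Optional[bool], is_ebgp: bool) -> bool:
    """Resolve the proposed tx-link-bandwidth/enabled leaf for one session."""
    if configured is not None:
        # An explicit setting always wins.
        return configured
    # Proposed default when unset: TRUE on iBGP, FALSE on eBGP.
    return not is_ebgp


print(tx_link_bandwidth_enabled(None, is_ebgp=False))
print(tx_link_bandwidth_enabled(None, is_ebgp=True))
print(tx_link_bandwidth_enabled(True, is_ebgp=True))
```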

rszarecki avatar Jun 18 '24 15:06 rszarecki

JunOS expects configuration of a BGP policy statement to enable 'aggregate-bandwidth', with additional options for transitive/non-transitive and a 'divide-equal' option. There is also a configuration item, set per BGP neighbor or peer, to allow the link-bandwidth community to be sent via eBGP (i.e., allow sending BGP link-bandwidth, which was defined as non-transitive in draft-ietf-idr-link-bandwidth but then updated by draft-ietf-bess-ebgp-dmz to be transitive).

The details here are slightly off. Juniper's implementation of link-bw uses a transitive extended community code point. (Embarrassing, because Juniper is the one that published the base spec that uses non-transitive in the document.) This causes interop issues between Juniper and implementations that don't understand the non-draft-compliant transitive format.

The link-bw draft is pending an upcoming update in IETF that will address both the transitive and non-transitive cases. It is intended to contain interop procedures. Indirectly, it also addresses some points covered in the DMZ draft.
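
For readers unfamiliar with the distinction, the following Python sketch encodes the 8-octet link-bandwidth extended community in both forms. It assumes the layout from draft-ietf-idr-link-bandwidth (sub-type 0x04, a two-octet AS number, and the bandwidth as a 4-octet IEEE float in bytes per second) and the RFC 4360 convention that bit 0x40 of the type octet marks a community as non-transitive. `encode_link_bw` and `is_transitive` are hypothetical helper names.

```python
import struct

SUBTYPE_LINK_BW = 0x04
TYPE_TRANSITIVE_2OCTET_AS = 0x00      # transitive two-octet AS-specific
TYPE_NON_TRANSITIVE_2OCTET_AS = 0x40  # same type with the non-transitive bit set


def encode_link_bw(asn: int, bytes_per_sec: float, transitive: bool) -> bytes:
    """Pack type, sub-type, 2-octet AS, and a 4-octet float in network order."""
    type_octet = (TYPE_TRANSITIVE_2OCTET_AS if transitive
                  else TYPE_NON_TRANSITIVE_2OCTET_AS)
    return struct.pack("!BBHf", type_octet, SUBTYPE_LINK_BW, asn, bytes_per_sec)


def is_transitive(community: bytes) -> bool:
    """A cleared 0x40 bit in the type octet means the community is transitive."""
    return (community[0] & 0x40) == 0


non_transitive = encode_link_bw(65000, 1.25e9, transitive=False)
transitive = encode_link_bw(65000, 1.25e9, transitive=True)
print(non_transitive.hex(), transitive.hex())
```

The two encodings differ only in the first octet, which is why an implementation that matches on the non-transitive type does not recognize the transitive form.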

jhaas-pfrc avatar Jun 18 '24 20:06 jhaas-pfrc

I will reiterate:

The 'global/afi-safis/afi-safi/use-multiple-paths/...' hierarchy is the wrong place to control whether and how link-bandwidth is sent over eBGP because:

  • draft-bess does not require multipathing to be enabled (see 1 above).
  • global/afi-safis/afi-safi/use-multiple-paths/... is meant to provide constraints on the process of populating the Local-RIB and the FIB's NHG. The use-multiple-paths/ebgp/link-bandwidth-ext-community/config/enabled leaf controls whether FIB ECMP weights should be derived from link-bandwidth values or not. In the latter case, basic ECMP will be programmed in the dataplane, consuming fewer TCAM/SRAM resources, but link-bandwidth will remain in the Local-RIB and will be propagated to iBGP peers (and to eBGP peers under draft-bess, if enabled).
  • Suppose .../afi-safis/afi-safi/use-multiple-paths/config/enabled is set to "FALSE" and the new leaf non-transitive is set to "TRUE": what should the system behavior be?
  • IMHO the knobs controlling draft-ietf-bess-ebgp-dmz should be direct attributes of {neighbor|peer-group|global}/afi-safis/afi-safi, similar to send-community-type (which is unfortunately an enum). My suggestion is: define 2 new containers dedicated to the link-bandwidth community (it is a special case anyway) under bgp/global/afi-safis/afi-safi/config/: tx-link-bandwidth with leafs enabled, ebgp (default FALSE), cumulative, and average/equal; AND rx-link-bandwidth with leaf enabled (default TRUE). If tx-link-bandwidth/enabled is not specified: TRUE on iBGP and FALSE on eBGP.
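
The distinction drawn above between weighted and basic ECMP can be sketched as follows. This is a simplified Python illustration, not any vendor's FIB programming logic. `ecmp_weights` is a hypothetical function, and reducing the bandwidths to small integer weights via their greatest common divisor is an assumption made for readability.

```python
from functools import reduce
from math import gcd


def ecmp_weights(link_bw_by_next_hop: dict[str, int], use_link_bw: bool) -> dict[str, int]:
    """Return per-next-hop weights: bandwidth-proportional or plain equal-cost."""
    equal = {nh: 1 for nh in link_bw_by_next_hop}
    if not use_link_bw:
        # Basic ECMP: every next hop gets the same weight.
        return equal
    divisor = reduce(gcd, link_bw_by_next_hop.values(), 0)
    if divisor == 0:
        # No next hops, or all bandwidths are zero: fall back to equal weights.
        return equal
    return {nh: bw // divisor for nh, bw in link_bw_by_next_hop.items()}


# Bandwidths in bits per second, chosen only for the example.
bw = {"192.0.2.1": 10_000_000_000, "192.0.2.2": 40_000_000_000}
weighted = ecmp_weights(bw, use_link_bw=True)
basic = ecmp_weights(bw, use_link_bw=False)
print(weighted, basic)
```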

rszarecki avatar Aug 20 '24 01:08 rszarecki

The 'global/afi-safis/afi-safi/use-multiple-paths/...' hierarchy is the wrong place to control whether and how link-bandwidth is sent over eBGP because:

Too fast on the comments! I agree and have moved it. Pushing commit here now. :)

Prefix: /network-instances/network-instance/protocols/protocol/bgp/

bgp/global/afi-safis/afi-safi/link-bw/config/send-cumulative=true
bgp/global/afi-safis/afi-safi/link-bw/config/non-transitive-ebgp=true
bgp/global/afi-safis/afi-safi/link-bw/config/divide=false

dplore avatar Aug 20 '24 01:08 dplore

Reviewed by OC operators on Aug 20, 2024 without objection.

dplore avatar Aug 20 '24 16:08 dplore

@rszarecki and @jhaas-pfrc thank you for the earlier review, this is ready for your review again.

dplore avatar Aug 22 '24 04:08 dplore

@rszarecki and @jhaas-pfrc thank you for the earlier review, this is ready for your review again.

I think that, in terms of the bess dmz and send non-transitive external being dealt with, the current diff is fine. I suggest paying attention to upcoming last calls for draft-ietf-idr-link-bandwidth, likely happening in the next month or so. I'll also be coordinating with the IETF bess chairs for their drafts for last call considerations.

With regard to the completeness of the work, the link-bandwidth extended community pattern probably needs some work. Right now I think vendor implementations are only starting to rendezvous in terms of consistency with regard to what link-bw we use. For example, as part of this OC work and the related IETF work, Juniper added the transitive version of the community. But similarly, other vendors have added non-transitive support.

This means there's now one note of ambiguity for the type. If it's the transitive form, document that. There's probably related work to create a type for the non-transitive form as well. That probably could happen in a separate pull request when the agenda for the OC implementers for link-bw has alignment.

jhaas-pfrc avatar Feb 19 '25 15:02 jhaas-pfrc