================================
ECMP Support for BGP based L3VPN
================================

https://git.opendaylight.org/gerrit/#/q/topic:l3vpn_ecmp
This feature is needed for load balancing of traffic in a cloud, and for
redundancy of paths providing resiliency in the cloud.
The current L3VPN implementation for BGP VPN doesn't support load balancing
behavior for external routes reachable through multiple DC-GWs, or for static
routes behind Nova VMs reachable through multiple compute nodes.
This spec provides implementation details about providing traffic load
balancing using ECMP for L3 routing and forwarding. The load balancing of
traffic can be across virtual machines, each connected to a different
compute node, and across DC-Gateways. ECMP also enables fast failover of
traffic.

The ECMP forwarding is required for both inter-DC and intra-DC data traffic
types. For inter-DC traffic, spraying from DC-GW to compute nodes & VMs for
the traffic entering the DC, and spraying from compute nodes to DC-GWs for
the traffic exiting the DC, is needed. For intra-DC traffic, spraying of
traffic within the DC across multiple compute nodes & VMs is needed. Tunnel
monitoring (e.g. GRE-KA or BFD) logic should be implemented to monitor the
DC-GW/compute node GRE tunnels, which helps to determine the available ECMP
paths to forward the traffic.
UC1: ECMP forwarding of traffic entering a DC (i.e. spraying of
DC-GW -> OVS traffic across multiple Compute Nodes & VMs).
In this case, the DC-GW can load balance the traffic if a static route is
reachable through multiple Nova VMs (say VM1 and VM2, connected to different
compute nodes) running some networking application (example: vRouter).
UC2: ECMP forwarding of traffic exiting a DC (i.e. spraying of
OVS -> DC-GW traffic across multiple DC Gateways).
In this case, a Compute Node can load balance the traffic if an external
route is reachable through multiple DC-GWs.
UC3: ECMP forwarding of intra-DC traffic (i.e. spraying of traffic within a DC
across multiple Compute Nodes & VMs).
This is similar to UC1, but the load balancing behavior is applied on the
remote Compute Node for intra-DC communication.
UC4: OVS -> DC-GW tunnel status based ECMP for inter and intra-DC traffic.
Tunnel status based on monitoring (BFD) is considered in ECMP path set
determination.
High-Level Components:
======================

The following components of the Openstack - ODL solution need to be enhanced
to provide ECMP support:

* Openstack Neutron BGPVPN Driver (for supporting multiple RDs)
* OpenDaylight Controller (NetVirt VpnService)

We will review the enhancements that will be made to each of the above
components in the following sections.
The following components within the OpenDaylight Controller need to be
enhanced:

* VPN Engine (VPN Manager and VPN Interface Manager)
Local FIB entry/Nexthop Group programming:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A static route (example: 100.0.0.0/24) is reachable through two VMs connected
to the same compute node.
cookie=0x8000003, duration=46.020s, table=21, n_packets=0, n_bytes=0, priority=34,ip,metadata=0x222e4/0xfffffffe, nw_dst=100.0.0.0/24 actions=write_actions(group:150002)
group_id=150002,type=select,bucket=weight:50,actions=group:150001,bucket=weight:50,actions=group:150000
group_id=150001,type=all,bucket=actions=set_field:fa:16:3e:34:ff:58->eth_dst,load:0x200->NXM_NX_REG6[],resubmit(,220)
group_id=150000,type=all,bucket=actions=set_field:fa:16:3e:eb:61:39->eth_dst,load:0x100->NXM_NX_REG6[],resubmit(,220)

Table 0=>Table 17=>Table 19=>Table 21=>LB Group=>Local VM Group=>Table 220
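The select group above can also be sketched programmatically. This is a
minimal illustration only; the helper name and the even-weight policy are
assumptions, not taken from the ODL code:

```python
def make_select_group(group_id, member_group_ids):
    """Compose an OVS select-type group spec that sprays traffic evenly
    across the given local VM next hop groups (hypothetical helper)."""
    weight = 100 // len(member_group_ids)  # equal spraying across buckets
    buckets = ",".join(
        "bucket=weight:%d,actions=group:%d" % (weight, gid)
        for gid in member_group_ids
    )
    return "group_id=%d,type=select,%s" % (group_id, buckets)

# Reproduces the group_id=150002 entry shown above.
print(make_select_group(150002, [150001, 150000]))
```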
Remote FIB entry/Nexthop Group programming:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1) A static route (example: 10.0.0.1/32) is reachable through two VMs
connected to different compute nodes.

On the remote compute node,

cookie=0x8000003, duration=46.020s, table=21, n_packets=0, n_bytes=0, priority=34,ip,metadata=0x222e4/0xfffffffe, nw_dst=10.0.0.1 actions=set_field:0xEF->tun_id, group:150003
group_id=150003,type=select,bucket=weight:50,actions=output:1,bucket=weight:50,actions=output:2

Table 0=>Table 17=>Table 19=>Table 21=>LB Group=>VxLAN port
On the local compute node,

cookie=0x8000003, duration=46.020s, table=21, n_packets=0, n_bytes=0, priority=34,ip,metadata=0x222e4/0xfffffffe, nw_dst=10.0.0.1 actions=group:150003
group_id=150003,type=select,bucket=weight:50,actions=group:150001,bucket=weight:50,actions=set_field:0xEF->tun_id, output:2
group_id=150001,type=all,bucket=actions=set_field:fa:16:3e:34:ff:58->eth_dst,load:0x200->NXM_NX_REG6[],resubmit(,220)

Table 0=>Table 17=>Table 19=>Table 21=>LB Group=>Local VM Group=>Table 220
2) An external route (example: 20.0.0.1/32) is reachable through two DC-GWs.

cookie=0x8000003, duration=13.044s, table=21, n_packets=0, n_bytes=0,priority=42,ip,metadata=0x222ec/0xfffffffe,nw_dst=20.0.0.1 actions=load:0x64->NXM_NX_REG0[0..19],load:0xc8->NXM_NX_REG1[0..19],group:150111
group_id=150111,type=select,bucket=weight:50,actions=push_mpls:0x8847,move:NXM_NX_REG0[0..19]->OXM_OF_MPLS_LABEL[],output:3,bucket=weight:50,actions=push_mpls:0x8847,move:NXM_NX_REG1[0..19]->OXM_OF_MPLS_LABEL[],output:4

Table 0=>Table 17=>Table 19=>Table 21=>LB Group=>GRE port
Changes will be needed in ``l3vpn.yang``, ``odl-l3vpn.yang`` and ``odl-fib.yang``
to support ECMP functionality.
The route-distinguisher type is changed from leaf to leaf-list in the
vpn-af-config grouping in l3vpn.yang.
grouping vpn-af-config {
    description
      "A set of configuration parameters that is applicable to both IPv4 and
       IPv6 address family for a VPN instance.";

    leaf-list route-distinguisher {
        description
          "The route-distinguisher command configures a route distinguisher (RD)
           for the IPv4 or IPv6 address family of a VPN instance.
           Format is ASN:nn or IP-address:nn.";
        type string;
    }
}
ODL-L3VPN YANG changes
^^^^^^^^^^^^^^^^^^^^^^

Add vrf-id (RD) to the adjacency list in odl-l3vpn.yang.
grouping adjacency-list {
    list adjacency {
        key "ip_address";
        leaf-list next-hop-ip-list { type string; }
        leaf ip_address { type string; }
        leaf primary-adjacency {
            type boolean;
            default "false";
            description
              "Value of True indicates this is a primary adjacency";
        }
        leaf label { type uint32; config "false"; } /* optional */
        leaf mac_address { type string; } /* optional */
        leaf vrf-id { type string; }
    }
}
vpn-to-extraroutes has to be updated with multiple RDs (vrf-id) when the extra
route is reachable through VMs connected to different compute nodes; when the
VMs are connected to the same compute node, just reuse the same RD and update
nexthop-ip-list with the new VM IP address, like below.
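The update rule above can be sketched as follows; the data layout and helper
are hypothetical, a minimal stand-in for the vpn-to-extraroutes update logic:

```python
def add_extra_route(extraroutes, prefix, nexthop, dpn_id, rd_pool):
    """If the new next hop lives on a compute node (dpn_id) that already
    hosts a path for this prefix, reuse that entry's RD and extend its
    nexthop-ip-list; otherwise take a fresh RD so the prefix can be
    advertised with a distinct NLRI."""
    for route in extraroutes:
        if route["prefix"] == prefix and route["dpn-id"] == dpn_id:
            route["nexthop-ip-list"].append(nexthop)  # co-located VM
            return route
    route = {"prefix": prefix, "dpn-id": dpn_id,
             "vrf-id": rd_pool.pop(0),  # new compute node: new RD
             "nexthop-ip-list": [nexthop]}
    extraroutes.append(route)
    return route

routes = []
add_extra_route(routes, "10.0.0.1/32", "192.168.0.2", 1, ["100:2"])
add_extra_route(routes, "10.0.0.1/32", "192.168.0.3", 1, [])   # same DPN
add_extra_route(routes, "10.0.0.1/32", "192.168.1.2", 2, ["100:3"])
```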
container vpn-to-extraroutes {
    config false;
    list vpn-extraroutes {
        key "vpn-name";
        leaf vpn-name { type string; }
        list extra-routes {
            key "vrf-id";
            leaf vrf-id {
                description
                  "The vrf-id command configures a route distinguisher (RD) for the IPv4
                   or IPv6 address family of a VPN instance or vpn instance name for
                   internal VPNs.";
                type string;
            }
            list routes {
                key "prefix";
                leaf prefix { type string; }
                leaf-list nexthop-ip-list { type string; }
            }
        }
    }
}
To manage RDs for extra routes with multiple next hops, the following YANG
model is required to advertise (or) withdraw the extra routes with a unique
NLRI accordingly.
container extraroute-routedistinguishers-map {
    config true;
    list extraroute-routedistingueshers {
        key "vpnid";
        leaf vpnid { type uint32; }
        list dest-prefixes {
            key "dest-prefix";
            leaf dest-prefix { type string; }
            leaf-list route-distinguishers { type string; }
        }
    }
}
When Quagga BGP announces a route with multiple paths, it is ODL's
responsibility to program FIB entries in all compute nodes where the VPN
instance blueprint is present, so that traffic can be load balanced between
these two DC gateways. It requires changes in the existing odl-fib.yang model
(like below) to support multiple routes for the same destination IP prefix.
grouping vrfEntries {
    list vrfEntry {
        key "destPrefix";
        leaf destPrefix { type string; }
        list route-paths {
            key "nexthop-address";
            leaf nexthop-address { type string; }
            leaf label { type uint32; }
        }
    }
}
A new YANG model is introduced to update the load balancing next hop group
buckets according to the VxLAN/GRE tunnel status. [Note that these changes
are required only if watch_port in the group bucket does not work based on
tunnel port liveness monitoring driven by the BFD status.] When one of the
VxLAN/GRE tunnels goes down, retrieve the nexthop-key from
dpid-l3vpn-lb-nexthops by providing the tep-device-ids from src-info and
dst-info of StateTunnelList while handling its update DCN. After retrieving
the next hop key, fetch the target-device-id list from l3vpn-lb-nexthops and
reprogram the VxLAN/GRE load balancing group in each remote Compute Node
based on the tunnel state between the source and destination Compute Nodes.
Similarly, when a tunnel comes up, the same logic has to be rerun to add its
bucket back into the load balancing group.
container l3vpn-lb-nexthops {
    config false;
    list nexthops {
        key "nexthop-key";
        leaf group-id { type string; }
        leaf nexthop-key { type string; }
        leaf-list target-device-id { type string; } //dpId or ip-address
    }
}

container dpid-l3vpn-lb-nexthops {
    config false;
    list dpn-lb-nexthops {
        key "src-dp-id dst-device-id";
        leaf src-dp-id { type uint64; }
        leaf dst-device-id { type string; } //dpId or ip-address
        leaf-list nexthop-keys { type string; }
    }
}
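The tunnel-state handling around these two containers could look roughly
like this; plain dicts stand in for the datastores, and all ids below are
illustrative:

```python
# (src-dp-id, dst-device-id) -> nexthop-keys, as in dpid-l3vpn-lb-nexthops
dpid_lb_nexthops = {(1, "2"): ["10.0.0.1/32"]}
# nexthop-key -> group id + target devices, as in l3vpn-lb-nexthops
lb_nexthops = {"10.0.0.1/32": {"group-id": "150003",
                               "target-device-id": ["1", "2"]}}

def on_tunnel_state_change(src_dp_id, dst_device_id, is_up):
    """On a StateTunnelList update DCN, look up every LB next hop that
    uses this tunnel and emit the group reprogramming jobs."""
    jobs = []
    for key in dpid_lb_nexthops.get((src_dp_id, dst_device_id), []):
        nh = lb_nexthops[key]
        action = "add-bucket" if is_up else "remove-bucket"
        jobs.append((nh["group-id"], tuple(nh["target-device-id"]), action))
    return jobs
```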
ECMP forwarding through multiple Compute Nodes and VMs
------------------------------------------------------

In some cases, an extra route can be added which is reachable through
multiple Nova VMs. These VMs can be connected either to the same compute node
or to different Compute Nodes. When the VMs are on different compute nodes,
the DC-GW should learn all the route paths so that ECMP behavior can be
applied to these multipath routes. When the VMs are co-located on the same
compute node, the DC-GW will not perform ECMP and the compute node performs
the traffic splitting instead.
ECMP forwarding for dispersed VMs
---------------------------------

When a configured extra route is reachable through Nova VMs connected to
different compute nodes, it is ODL's responsibility to advertise these
multiple route paths (but with the same MPLS label) to Quagga BGP, which in
turn sends these routes to the DC-GW. But the DC-GW replaces the existing
route with a new route received from a peer if the NLRI (prefix) is the same
in the two routes. This is true even when multipath is enabled on the DC-GW,
as per the standard BGP behavior of RFC 4271, Section 9 (UPDATE Message
Handling). Hence the route is lost in the DC-GW even before path computation
for multipath is applied. This scenario is solved by adding multiple route
distinguishers (RDs) for the VPN instance and letting ODL use the list of RDs
to advertise the same prefix with different BGP next hops. Multiple RDs
will be supported only for BGP VPNs.
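The multiple-RD advertisement can be sketched as below; the helper and the
RD values are hypothetical. Each route path is paired with its own RD, so
every advertisement carries a unique NLRI and is not replaced at the DC-GW:

```python
def build_advertisements(prefix, label, nexthops, rds):
    """Pair each next hop with a distinct RD; the MPLS label stays the
    same across all paths, as described above."""
    assert len(rds) >= len(nexthops), "one RD is needed per path"
    return [{"rd": rd, "prefix": prefix, "nexthop": nh, "label": label}
            for rd, nh in zip(rds, nexthops)]

adverts = build_advertisements("10.0.0.1/32", 34001,
                               ["192.168.0.2", "192.168.1.2"],
                               ["100:1", "100:2"])
```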
ECMP forwarding for co-located VMs
-----------------------------------

When the VM interfaces for an extra route are connected to the same compute
node, LFIB/FIB and Terminating Service table flow entries should be
programmed so that traffic can be load balanced between the local VMs. This
can be done by creating a load balancing next hop group for each
vpn-to-extraroute (if the nexthop-ip-list size is greater than 1) with
buckets pointing to the actual VM next hop groups on the source Compute Node.
Even for the co-located VMs, the VPN interface manager should assign separate
RDs for each adjacency of the same dest IP prefix, so that the route can be
advertised again to Quagga BGP with the same next hop (TEP IP address). This
enables the DC-Gateway to realize ECMP behavior when an IP prefix is
reachable through multiple co-located VMs on one Compute Node and another VM
connected to a different Compute Node.
To create the load balancing next hop group, the dest IP prefix is used as
the key to generate the group id. When any of the next hops is removed, the
load balancing next hop group is adjusted so that traffic is sent only
through the active next hops.
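A toy stand-in for this id allocation, assuming an IdManager-like service
keyed by the destination prefix (class and key handling are assumptions):
the same prefix always resolves to the same group id, so the group can be
located again when the next hop set changes.

```python
class IdPool:
    """Minimal deterministic id allocator keyed by a string."""
    def __init__(self, start=150000):
        self._next = start
        self._cache = {}

    def allocate(self, key):
        if key not in self._cache:   # first request for this key: new id
            self._cache[key] = self._next
            self._next += 1
        return self._cache[key]      # repeated requests: same id

pool = IdPool()
gid = pool.allocate("10.0.0.1/32")   # dest IP prefix is the group key
```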
ECMP forwarding through two DC-Gateways
---------------------------------------

The current ITM implementation provides support for creating multiple GRE
tunnels from a compute node for the provided list of DC-GW IP addresses. This
helps in creating the corresponding load balancing group whenever Quagga BGP
advertises two routes for the same IP prefix pointing to multiple DC-GWs. The
group id of this load balancing group is derived from the sorted order of the
DC-GW TEP IP addresses, with the format
dc_gw_tep_ip_address_1:dc_gw_tep_ip_address_2. This is useful when multiple
external IP prefixes share the same next hops. The load balancing next hop
group buckets are programmed according to the sorted remote end point
DC-Gateway IP addresses. The action move:NXM_NX_REG0(1) -> MPLS Label is not
currently supported in ODL openflowplugin; it has to be implemented. Since
there are two DC gateways present for the data center, it is possible that
multiple equal cost routes are supplied to ODL by Quagga BGP, as in Fig 2.
The current Quagga BGP doesn't have multipath support; that support has to be
added. When Quagga BGP announces a route with multiple paths, it is ODL's
responsibility to program FIB entries in all compute nodes where the VPN
instance blueprint is present, so that traffic can be load balanced between
these two DC gateways. This requires the changes to the existing odl-fib.yang
model shown earlier, supporting multiple routes for the same destination IP
prefix.
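The group key derivation can be sketched as below; the plain string sort is
an assumption (it matches dotted IPs of equal length):

```python
def dcgw_lb_group_key(tep_ips):
    """Join the DC-GW TEP IPs in sorted order, so every external prefix
    sharing the same pair of gateways maps to the same LB group."""
    return ":".join(sorted(tep_ips))
```

Because the key depends only on the set of gateway addresses, all external
prefixes behind the same gateway pair reuse one group.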
BGPManager should be able to create a vrf entry for the advertised IP prefix
with multiple route paths. VrfEntryListener listens to DCNs on these vrf
entries and programs FIB entries (table 21) based on the number of route
paths available for a given IP prefix. For a given (external) destination IP
prefix, if only one route path exists, the existing approach is used to
program the FIB table flow entry, matching on (vpnid, ipv4_dst) with actions
to push the MPLS label and output to the GRE tunnel port. If two route paths
exist, the next hop IP addresses are retrieved from the routes list in the
same sorted order (i.e. using the same logic which is used to create the
buckets of the DC-Gateway load balancing next hop group), and the FIB table
flow entry is programmed with an instruction like Fig 3. It should have two
set-field actions, where the first action sets the MPLS label into NX_REG0
for the first sorted DC-GW IP address and the second action sets the MPLS
label into NX_REG1 for the second sorted DC-GW IP address. When more than two
DC Gateways are used, more NXM registers have to be used to push the
appropriate MPLS label before sending the packet to the next hop group. This
needs an operational DS container mapping between DC Gateway IP address and
NXM_REG. When one of the routes for the IP prefix is withdrawn, the FIB table
flow entry is modified to push the MPLS label and output directly to the
available DC-GW tunnel port.
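The register assignment could be sketched like this; it assumes exactly two
gateways and reuses the register names from the flow example earlier (the
helper name is hypothetical):

```python
def reg_load_actions(route_paths):
    """Sort the paths by DC-GW next hop (the same order used for the LB
    group buckets) and load each path's MPLS label into the register the
    matching bucket moves into the MPLS header."""
    regs = ["NXM_NX_REG0", "NXM_NX_REG1"]    # one register per DC-GW
    ordered = sorted(route_paths, key=lambda p: p["nexthop"])
    return ["load:%#x->%s[0..19]" % (p["label"], regs[i])
            for i, p in enumerate(ordered)]

# Labels 100 (0x64) and 200 (0xc8), as in the flow entry shown earlier.
actions = reg_load_actions([{"nexthop": "10.1.1.3", "label": 200},
                            {"nexthop": "10.1.1.2", "label": 100}])
```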
ECMP for Intra-DC L3VPN communication
-------------------------------------

ECMP within the data center is required to load balance the data traffic when
an extra route can be reached through multiple next hops (i.e. Nova VMs)
connected to different compute nodes. It mainly deals with how Compute Nodes
can spray the traffic when a dest IP prefix can be reached through two or
more VMs (next hops) connected to multiple compute nodes.

Multiple RDs (if the VPN is of type BGP VPN) are assigned to the VPN instance
so that the VPN engine can advertise an IP route with different RDs, to
achieve ECMP behavior in the DC-GW as mentioned before. But for intra-DC
traffic this doesn't matter, since it's all about programming the remote FIB
entries on compute nodes to achieve data traffic spraying.
Irrespective of RDs, when multiple next hops (on different Compute Nodes) are
present for the extra-route adjacency, the FIB Manager has to create a load
balancing next hop group in the remote compute node, with buckets pointing to
the VxLAN tunnel ports towards the targeted Compute Nodes.
To allocate the group id for this load balancing next hop, the same
destination IP prefix is used as the group key. The remote FIB table flow
should point to this next hop group after writing the prefix label into
tunnel_id. The bucket weight of each remote next hop is adjusted according to
the number of VMs associated with the given extra route and the compute nodes
the VMs are connected to. For example, with two compute nodes having one VM
each, the bucket weights are 50 each. With one compute node having two VMs
and another compute node having one VM, the bucket weights are 66 and 34
respectively. The hop-count property in the vrfEntry data store helps to
decide the bucket weight for each bucket.
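The weight computation from the example can be sketched as follows; integer
division plus a remainder correction reproduces the 66/34 split (the helper
itself is illustrative):

```python
def bucket_weights(vm_counts):
    """Split 100 units of weight in proportion to the number of VM next
    hops behind each compute node."""
    total = sum(vm_counts)
    weights = [100 * count // total for count in vm_counts]
    weights[-1] += 100 - sum(weights)   # give rounding leftovers to last
    return weights
```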
ECMP Path decision based on Internal/External Tunnel Monitoring
---------------------------------------------------------------

ODL will use the GRE-KA or BFD protocol to implement monitoring of the GRE
external tunnels. The implementation details are out of scope for this
document. Based on the tunnel state, the GRE Load Balancing Group is adjusted
as described below.
GRE tunnel state handling
-------------------------

As soon as a GRE tunnel interface is created in ODL, the interface manager
uses alivenessmonitor to monitor the GRE tunnels for liveness using the GRE
Keep-alive protocol. When the tunnel state changes, it has to be handled
accordingly to adjust the above load balancing group, so that data traffic is
sent only through active DC-GW tunnels. This can be done by listening to
update DCNs on the tunnel state.

* When one GRE tunnel goes operationally down, retrieve the corresponding
  bucket from the load balancing group and delete it.
* When the GRE tunnel comes up again, add its bucket back into the load
  balancing group.
* When both GRE tunnels go down, recreate the load balancing group with
  empty buckets and withdraw the routes from that particular DC-GW.

With the above implementation, there is no need to modify FIB entries on GRE
tunnel state changes. But when Quagga BGP withdraws one of the routes for an
external IP prefix, the FIB flow entry (table 21) is reprogrammed to point
directly to output=<gre_port> after pushing the MPLS label.
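The resulting FIB fallback decision can be sketched as follows (the field
names and return values are illustrative, not the actual flow programming):

```python
def fib_action(route_paths, tunnel_up):
    """Choose the table 21 behavior for an external prefix from the
    number of route paths whose DC-GW tunnel is alive."""
    live = [p for p in route_paths if tunnel_up.get(p["nexthop"])]
    if len(live) >= 2:
        return "group"                      # spray via the DC-GW LB group
    if len(live) == 1:                      # single path: bypass the group
        return "push_mpls:%d,output:%s" % (live[0]["label"], live[0]["port"])
    return "withdraw"                       # no live path left

paths = [{"nexthop": "10.1.1.2", "label": 100, "port": "gre-1"},
         {"nexthop": "10.1.1.3", "label": 200, "port": "gre-2"}]
```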
VxLAN tunnel state handling
---------------------------

Similarly, when the VxLAN tunnel state changes, the Load Balancing Groups in
the Compute Nodes have to be updated accordingly, so that traffic flows only
through active VxLAN tunnels. This can be done by having a config mapping
between the target data-path-id and the next hop group ids.
For both GRE and VxLAN tunnel monitoring, L3VPN has to implement the
l3vpn-lb-nexthops and dpid-l3vpn-lb-nexthops YANG models described earlier to
update the load balancing next hop group buckets according to tunnel status:
when a tunnel goes down, the nexthop-keys for the affected endpoints are
retrieved from dpid-l3vpn-lb-nexthops while handling the StateTunnelList
update DCN, and the load balancing group is reprogrammed in each Compute Node
listed in l3vpn-lb-nexthops; when the tunnel comes up, the bucket is added
back.
The support for the action move:NXM_NX_REG0(1) -> MPLS Label is already
available.
This feature supports all the following reboot scenarios:

* Entire Cluster Reboot
* Candidate PL reboot
* OVS Datapath reboots
* Multiple PL reboots
* Multiple Cluster reboots
* Multiple reboots of the same OVS Datapath
* Openstack Controller reboots
Clustering considerations
-------------------------

The feature should operate reliably in an ODL clustered environment.

Other Infra considerations
--------------------------

Security considerations
-----------------------

Scale and Performance Impact
----------------------------

Not covered by this Design Document.
Alternatives considered and why they were not selected.
This feature doesn't add any new karaf feature.
Manu B <manu.b@ericsson.com>
Kency Kurian <kency.kurian@ericsson.com>
Gobinath <gobinath@ericsson.com>
P Govinda Rajulu <p.govinda.rajulu@ericsson.com>

Periyasamy Palanisamy <periyasamy.palanisamy@ericsson.com>
* Quagga BGP multipath support and APIs. This is needed when two DC-GWs
  advertise routes for the same external prefix with different route labels.
* GRE tunnel monitoring. This is needed to implement ECMP forwarding based on
  MPLSoGRE tunnel state.
* Support for the action move:NXM_NX_REG0(1) -> MPLS Label in ODL
  openflowplugin.
Capture details of testing that will need to be added.

Appropriate UTs will be added for the new code coming in, once the framework
is in place.

There won't be any Integration tests provided for this feature.

CSIT will be enhanced to cover this feature by providing new CSIT tests.

This will require changes to the User Guide and Developer Guide.

[1] https://docs.google.com/document/d/1KRxrIGCLCBuz2D8f8IhU2I84VrM5EMa1Y7Scjb6qEKw/edit#