07-18-2021 03:17 AM
Hello,
We have a very simple setup:
The setup:
In an MPLS environment, the PE1 router (ASR9001, IOS-XR 6.6.3) peers with an upstream Internet provider, ISP1, over a standard single-hop eBGP session. It receives the full Internet table plus a default route from ISP1 into the default routing table (not in a VRF).
The same PE1 router forms MP-iBGP sessions with a few dedicated BGP route reflectors (RRs). For the IPv4 AFI it sends the default route (learnt from ISP1) with next-hop-self to the RRs, along with a few other customers' routes not related to ISP1. In the opposite direction, all RRs advertise 0.0.0.0/0 to PE1, originated by different Internet-facing PEs and different ISPs.
There is no manipulation of BGP attributes for the 0.0.0.0/0 prefix (except next-hop-self on the PEs), so PE1 selects the externally learnt 0.0.0.0/0 route from its eBGP session to ISP1. All RRs are sending "Additional Paths" to all PEs. BGP PIC Edge is enabled on the PEs (including PE1).
The problem:
All is good until there is an ISP DIA circuit failure (there's a working BFD session between PE1 and ISP1). I simulate the failure by manually disabling the physical interface between PE1 and ISP1 (on the PE1 end).
After that few things happen:
I haven't found much about RIB quarantining; it looks like some kind of protection mechanism against route oscillations. I checked and confirmed that the IGP is completely stable and that no BGP updates were going to the RRs (as I wrote, only the default route and a few internal routes are sent to the RRs, NOT the full Internet BGP table). ISP1 is NOT advertising the P2P /30 subnet into BGP, nor is PE1; however, ISP1 advertises a larger /9 block (and 0/0). I tried to disable the default RIB dampening mechanism with:
router rib address-family ipv4 (hidden command) next-hop dampening disable
but nothing changed. 0/0 was still marked as quarantined for 2.5 minutes.
This problem has been temporarily solved by permitting only 0/0 from ISP1 and filtering everything else from ISP1 on PE1.
The question is: what might be the reasons for this behavior? Could it be the size of the global Internet table and the way PE1 (ASR9001) processes it? My expectation is that once the physical interface is down and the eBGP session is down, the router should immediately withdraw all routes whose next hop is ISP1 (now unreachable). Could it be because of that /9 route, which includes the eBGP peer address although it's coming from the same neighbor? It took 2:30 minutes to release the quarantined 0.0.0.0/0 route.
I tried to simulate the setup (again with a larger prefix that includes the P2P subnet), but it worked as expected. However, the simulated ISP was sending only a few routes, not 800K+ as the real one does.
Any suggestions/thoughts are highly appreciated.
Regards,
Plamen
07-18-2021 02:32 PM
Can you draw the topology?
07-19-2021 02:02 AM
07-18-2021 06:07 PM
Quarantined routes happen for a few reasons: either the route is flapping in and out of the RIB very often, or the next-hop is flapping often.
A few commands need to be gathered immediately when this happens:
show rib next-hop
show rib next-hop damped
show rib history
show route <prefix>
show route resolving-next-hop <prefix>
show bgp <AFI> <prefix>
show bgp <AFI> nexthops
show bgp <AFI> dampened-paths
Also a show tech rib and routing bgp for any traces.
Given your symptoms it sounds like the next-hop is flapping.
Sam
07-19-2021 02:26 AM
Thanks for your input here Sam, I'll include all these for the next available window. Positive thing is that this behavior is easily reproducible (not in the LAB, though).
That's my understanding as well regarding a flapping next-hop or route; however, I don't see anything flapping. Moreover, "show route 0.0.0.0/0" shows the primary eBGP path via ISP1 (going to the directly connected interface) and a backup route (the next best path, recursively using iBGP -> IGP). When I manually shut down the ISP-facing interface, there are no IGP changes (there's no redistribute connected for it and the ISP-facing interface is not included in the IGP process). I'm also worried about the message logged immediately after the physical interface is disabled:
RP/0/RSP0/CPU0: ipv4_rib[1197]: %ROUTING-RIB-7-SERVER_ROUTING_DEPTH : Recursion loop looking up prefix [ISP1 IP address] in Vrf: "default" Tbl: "default" Safi: "Unicast" added by bgp
When the directly connected network goes down, the only other path to [ISP1 IP address] is via a summary /9 eBGP route coming from the same eBGP peer over the same administratively disabled interface, which shouldn't be in the RIB anymore (last time I wasn't able to check "show route [ISP1 IP address]" while the interface was down), or via 0.0.0.0/0 (again originated by ISP1, or by a different ISP and learnt over iBGP from the RRs).
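To make the suspicion above concrete, here is a minimal Python sketch of recursive next-hop resolution with longest-prefix match. It is only a toy model and not how the IOS-XR RIB is implemented. The 203.0.0.0/9 prefix and the "connected:" marker are assumptions for illustration (the real /9 is obscured); the point is just that once the connected /30 disappears, the peer address is covered only by routes whose next hop is that same peer address.

```python
import ipaddress


def longest_match(rib, address):
    """Return (network, next_hop) of the most specific route covering address."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, next_hop in rib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best


def resolve(rib, address, max_depth=8):
    """Follow next hops until a connected route is reached.

    Returns (status, chain), where status is "resolved", "unresolved",
    "loop" or "too deep". Connected routes use the hypothetical
    "connected:<interface>" marker as their next hop.
    """
    visited = []
    current = address
    while len(visited) < max_depth:
        visited.append(current)
        match = longest_match(rib, current)
        if match is None:
            return "unresolved", visited
        _, next_hop = match
        if next_hop.startswith("connected:"):
            return "resolved", visited + [next_hop]
        if next_hop in visited:
            return "loop", visited + [next_hop]
        current = next_hop
    return "too deep", visited


# Hypothetical tables: interface up, then interface down (connected /30 removed).
rib_up = {
    "203.0.113.0/30": "connected:TenGigE0/0/2/3",
    "203.0.0.0/9": "203.0.113.1",  # assumed covering block learnt from ISP1
    "0.0.0.0/0": "203.0.113.1",
}
rib_down = {k: v for k, v in rib_up.items() if k != "203.0.113.0/30"}

print(resolve(rib_up, "203.0.113.1"))
print(resolve(rib_down, "203.0.113.1"))
```

With these hypothetical tables the first lookup resolves via the connected /30, while the second one reports a loop because the /9 points back at the address being resolved. Whether this is what actually triggers the SERVER_ROUTING_DEPTH message on the router is not confirmed here.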
Regards,
Plamen
07-26-2021 10:16 AM
Hello Sam,
We have exactly the same behavior on PE2 (ASR9001 platform again, but with much older code - IOS-XR 5.3.2). So there's something terribly wrong here, either with a configuration or a bug related to a specific setting (I'll try to simplify the config as much as possible during next maintenance window, disabling BFD and some extra BGP config).
It's a really annoying problem, dropping traffic for almost 3 minutes, although there's already a backup path pre-calculated for 0/0 and inserted into the FIB.
Again if I only allow 0/0 and block everything else coming from the directly connected ISP, there is NO issue.
Some of the requested outputs taken from PE1 (ASR9001 with IOS-XR 6.6.3) are below. I've obscured some of the sensitive information (IPs, ASN, etc.).
ISP1 circuit is physically & logically terminated on PE1 interface Te0/0/2/3. ISP1 is using 203.0.113.1/30, PE1 is using 203.0.113.2/30 on Te0/0/2/3.
All of the outputs (except the last one) are during the issue (~2.5 minutes)
show rib next-hop
RP/0/RSP0/CPU0:PE1#show rib next-hop
Sun Jul 25 22:11:14.199 CDT
Registered nexthop notifications:
A - Active route, B - First backup route.
(A) 0.0.0.0/0 via 203.0.113.1 - TenGigE0/0/2/3, ospf/node0_RSP0_CPU0
(A) 203.0.113.1/32 via 0.0.0.0 - None, bgp/node0_RSP0_CPU0
It is showing ospf here, which is strange to me, considering that 0.0.0.0/0 is not in the OSPF database.
There is "redistribute connected" under the OSPF process with a route-policy matching only downlink interfaces and loopback0 (so the ISP-facing interface is not included in the OSPF process, not redistributed, and not in the OSPF database either). Of course there is no OSPF-to-BGP or BGP-to-OSPF redistribution.
RP/0/RSP0/CPU0:PE1#show rib next-hop 0.0.0.0/0
RP/0/RSP0/CPU0:PE1#show rib next-hop 0.0.0.0/0
Sun Jul 25 22:11:17.257 CDT
Firsthop prefix: 0.0.0.0/0
Flags: exact match, allow default, recurse
Last event occurred Jul 18 01:07:59.525, 1w0d ago; version 155
Registered clients:
 ospf/node0_RSP0_CPU0
  created Apr 9 14:43:26.385, 1y15w ago
  read last notification at Jul 18 01:07:59.528, 1w0d ago
  reference count is 1
Destination paths:
 203.0.113.1 - TenGigE0/0/2/3
Resolving route: 0.0.0.0/0
 known via "bgp OUR-PUBLIC-ASN#"
Metric computed: 0
RP/0/RSP0/CPU0:PE1#show rib next-hop 203.0.113.1/32
Sun Jul 25 22:11:21.019 CDT
Firsthop prefix: 203.0.113.1/32
Flags: recurse
Last event occurred Jul 25 22:10:39.844, 00:00:41 ago; version 26
Registered clients:
 bgp/node0_RSP0_CPU0
  created Jun 4 15:22:06.021, 1y07w ago
  read last notification at Jul 25 22:10:39.847, 00:00:41 ago
  reference count is 1
Firsthop is unresolved
RP/0/RSP0/CPU0:PE1#show rib next-hop damped
Sun Jul 25 22:11:24.587 CDT
Damped nexthop notifications:
A - Active route, B - First backup route.
No damped routes
RP/0/RSP0/CPU0:PE1#show rib history
Sun Jul 25 22:11:27.669 CDT
JID Client (CID) Location
0 bcdl_ug (1) node0_RSP0_CPU0
JID Client (CID) Location
1029 ospf (15) node0_RSP0_CPU0
Table ID: 0xe0000000
C 203.0.113.0/30 deleted, 3 00:00:48
L 203.0.113.2/32 deleted, 3 00:00:48
C 203.0.113.0/30[0/0] update, 1 paths, 0x1082 4 1w0d
L 203.0.113.2/32[0/0] update, 1 paths, 0x1081 3 1w0d
JID Client (CID) Location
1083 bgp (17) node0_RSP0_CPU0
JID Client (CID) Location
0 bcdl_ug (18) node0_RSP0_CPU0
Table ID: 0xe0000000
B 0.0.0.0/0 [20/0] update, 1 paths, 0x0004 10 00:00:48
L 203.0.113.2/32 deleted, 3 00:00:48
C 203.0.113.0/30 deleted, 3 00:00:48
B 201.220.154.0/24 deleted, 12 00:01:00
B 200.198.192.0/18 deleted, 12 00:01:08
B 138.207.67.0/24 deleted, 12 00:01:09
B 138.207.66.0/24 deleted, 12 00:01:09
B 197.186.0.0/15 deleted, 12 00:01:10
B 222.54.224.0/19[20/24811] update, 1 paths, 0x0200 12 00:01:12
B 220.158.174.0/23[20/20120] update, 1 paths, 0x0200 12 00:01:12
B 220.158.172.0/23[20/20120] update, 1 paths, 0x0200 12 00:01:12
B 216.171.184.0/21[20/22573] update, 1 paths, 0x0200 12 00:01:12
JID Client (CID) Location
0 bcdl_ug (19) node0_RSP0_CPU0
Table ID: 0xe0000029
B 0.0.0.0/0 [200/0] update, 1 paths, 0x0004 12 00:00:50
Table ID: 0xe0000027
B 0.0.0.0/0 [200/0] update, 1 paths, 0x0004 12 00:00:50
Table ID: 0xe0000012
JID Client (CID) Location
1224 mpls_ldp (20) node0_RSP0_CPU0
Table ID: 0xe0000000
C 203.0.113.0/30 deleted, 3 00:00:52
L 203.0.113.2/32 deleted, 3 00:00:52
show route 0.0.0.0/0
RP/0/RSP0/CPU0:PE1#show route 0.0.0.0/0
Sun Jul 25 22:11:35.548 CDT
Routing entry for 0.0.0.0/0
  Known via "bgp OUR-PUBLIC-ASN#", distance 20, metric 0, candidate default path
  Tag ISP-PUBLIC-ASN#
  Number of pic paths 1 , type internal and external
  Installed Jul 23 00:51:15.216 for 2d21h
  Routing Descriptor Blocks
    PE2-Loopback-IP, from RR1, BGP backup path
      Route metric is 0
    203.0.113.1, from 203.0.113.1 (quarantined), BGP external
      Route metric is 0
  No advertising protos.
203.0.113.1 shouldn't be here anymore, since the BGP session is in the idle state because the physical interface is admin down.
RP/0/RSP0/CPU0:PE1#show route 203.0.113.1
Sun Jul 25 22:11:38.938 CDT
Routing entry for 0.0.0.0/0
  Known via "bgp OUR-PUBLIC-ASN#", distance 20, metric 0, candidate default path
  Tag ISP-PUBLIC-ASN#
  Number of pic paths 1 , type internal and external
  Installed Jul 23 00:51:15.216 for 2d21h
  Routing Descriptor Blocks
    PE2-Loopback-IP, from RR1, BGP backup path
      Route metric is 0
    203.0.113.1, from 203.0.113.1 (quarantined), BGP external
      Route metric is 0
  No advertising protos.
RP/0/RSP0/CPU0:PE1#show route resolving-next-hop 0.0.0.0
Sun Jul 25 22:11:42.207 CDT
% Network not in table
show route resolving-next-hop 203.0.113.1
RP/0/RSP0/CPU0:PE1#show route resolving-next-hop 203.0.113.1
Sun Jul 25 22:11:45.485 CDT
% Network not in table
show bgp 0.0.0.0/0
RP/0/RSP0/CPU0:PE1#show bgp 0.0.0.0/0
Sun Jul 25 22:11:48.479 CDT
BGP routing table entry for 0.0.0.0/0
Versions:
  Process bRIB/RIB SendTblVer
  Speaker 322484399 322484399
Last Modified: Jul 25 22:11:04.759 for 00:00:43
Paths: (27 available, best #3)
  Advertised IPv4 Unicast paths to update-groups (with more than one peer):
    0.3 0.4
  Advertised IPv4 Unicast paths to peers (in unique update groups):
    Customer
  Path #1: Received by speaker 0
  Not advertised to any peer
  ISP21_ASN#
    PE21 (metric 2001) from RR1 (PE21)
      Origin IGP, metric 0, localpref 100, valid, internal, backup, add-path, import-candidate
      Received Path ID 0, Local Path ID 34, version 322484399
      Originator: PE21, Cluster list: 1
  ....
  Path #3: Received by speaker 0
  Advertised IPv4 Unicast paths to update-groups (with more than one peer):
    0.3 0.4
  Advertised IPv4 Unicast paths to peers (in unique update groups):
    Customer
  ISP2_ASN#
    PE2 (metric 1000) from RR1 (PE2)
      Origin IGP, localpref 100, valid, internal, best, group-best, import-candidate
      Received Path ID 3, Local Path ID 1, version 322484399
      Originator: PE2, Cluster list: 1
There are 27 paths for 0/0; the proper one is selected as best and another one is the backup route. Keep in mind that all RRs are reflecting only the default route originated by the ISPs and a few local routes (not the full Internet BGP table, which is local to the PEs).
RP/0/RSP0/CPU0:PE1#show bgp dampened-paths
Sun Jul 25 22:11:54.826 CDT
show route 0.0.0.0/0
RP/0/RSP0/CPU0:PE1#show route 0.0.0.0/0
Sun Jul 25 22:11:59.729 CDT
Routing entry for 0.0.0.0/0
  Known via "bgp OUR-PUBLIC-ASN#", distance 20, metric 0, candidate default path
  Tag ISP-PUBLIC-ASN#
  Number of pic paths 1 , type internal and external
  Installed Jul 23 00:51:15.217 for 2d21h
  Routing Descriptor Blocks
    PE2, from RR1, BGP backup path
      Route metric is 0
    203.0.113.1, from 203.0.113.1 (quarantined), BGP external
      Route metric is 0
  No advertising protos.
PE2 should become primary and 203.0.113.1 shouldn't be there anymore; however, it's shown as "quarantined".
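Since the "(quarantined)" marker only shows up for a couple of minutes, it may help to scan saved captures for it afterwards. Below is a minimal Python sketch that assumes the text format seen in the "show route" outputs in this post; the function name and the sample string are made up for illustration, and the pattern may need adjusting for other releases.

```python
import re

# Matches "<next-hop>, from <peer> (quarantined)" as seen in the captures above.
QUARANTINED = re.compile(r"(\S+), from (\S+) \(quarantined\)")


def quarantined_paths(show_route_text):
    """Return (next_hop, learned_from) pairs flagged as quarantined."""
    return QUARANTINED.findall(show_route_text)


# Sample text copied from the routing descriptor blocks of the capture above.
sample = (
    "Routing Descriptor Blocks "
    "PE2, from RR1, BGP backup path Route metric is 0 "
    "203.0.113.1, from 203.0.113.1 (quarantined), BGP external Route metric is 0"
)

for next_hop, peer in quarantined_paths(sample):
    print(f"quarantined path via {next_hop} learnt from {peer}")
```

The pattern does not depend on line breaks, so it should work on both raw terminal logs and flattened copies of the output.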
RP/0/RSP0/CPU0:PE1#show cef 0.0.0.0/0
Sun Jul 25 22:12:05.498 CDT
0.0.0.0/0, version 573914642, proxy default, internal 0x1000011 0x0 (ptr 0x9dfb7068) [1], 0x0 (0x0), 0x0 (0x0)
 Updated Jul 25 22:10:39.852
 Prefix Len 0, traffic index 0, precedence n/a, priority 4
   via PE2-Loopback/32, 8 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa57fa3a0 0x0]
    next hop PE2-Loopback/32 via PE2-Loopback/32
Although CEF is showing the correct entry, the router is still blocking traffic and the RIB is incorrect:
RP/0/RSP0/CPU0:PE1#show route 203.0.113.1
Sun Jul 25 22:12:18.053 CDT
Routing entry for 0.0.0.0/0
  Known via "bgp OUR-PUBLIC-ASN#", distance 20, metric 0, candidate default path
  Tag ISP1-PUBLIC-ASN#
  Number of pic paths 1 , type internal and external
  Installed Jul 23 00:51:15.217 for 2d21h
  Routing Descriptor Blocks
    PE2, from RR1, BGP backup path
      Route metric is 0
    203.0.113.1, from 203.0.113.1 (quarantined), BGP external
      Route metric is 0
  No advertising protos.
RP/0/RSP0/CPU0:PE1#
Fixed after a few minutes:
RP/0/RSP0/CPU0:PE1#show route 0.0.0.0/0
Sun Jul 25 22:13:02.102 CDT
Routing entry for 0.0.0.0/0
  Known via "bgp OUR-PUBLIC-ASN#", distance 200, metric 0, candidate default path
  Tag ISP2-BGP-ASN#
  Number of pic paths 1 , type internal
  Installed Jul 25 22:12:41.442 for 00:00:20
  Routing Descriptor Blocks
    PE21, from RR1, BGP backup path
      Route metric is 0
    PE2, from RR1
      Route metric is 0
  No advertising protos.
RP/0/RSP0/CPU0:PE1#
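For reference, the duration of the bad state can be estimated from two timestamps already present in the outputs: the last event on next-hop 203.0.113.1/32 (Jul 25 22:10:39.844) and the time the new 0.0.0.0/0 entry was installed (Jul 25 22:12:41.442). The Python sketch below assumes those two timestamps mark the start and end of the problem, which is an interpretation and not something the router states explicitly. The year 2021 is taken from the post dates.

```python
from datetime import datetime

STAMP_FORMAT = "%Y %b %d %H:%M:%S.%f"


def seconds_between(start_stamp, end_stamp, year=2021):
    """Seconds between two IOS-XR style stamps such as 'Jul 25 22:10:39.844'."""
    start = datetime.strptime(f"{year} {start_stamp}", STAMP_FORMAT)
    end = datetime.strptime(f"{year} {end_stamp}", STAMP_FORMAT)
    return (end - start).total_seconds()


# Next-hop last event vs. installation of the replacement default route.
gap = seconds_between("Jul 25 22:10:39.844", "Jul 25 22:12:41.442")
print(f"{gap:.3f} seconds")
```

Under that assumption the gap in this particular capture comes out at roughly two minutes, in the same range as the 2.5 minutes estimated earlier.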
Regards,
Plamen