01-27-2021 05:30 AM - edited 01-27-2021 05:36 AM
In simple terms, 'route churn' is the rate of change of prefixes. XR versions from 4.x to 7.x differ in their behavior and support for BGP churn handling, and some enhancements made from 6.5.3 onwards (listed in the appendix) make BGP churn handling much more graceful and less impacting than in earlier releases. BGP features like 'add-path all' add further delays because churn handling is sub-optimal in older releases (those without the enhancements listed in the appendix), and should be used carefully.
It's not uncommon to be affected by BGP churn and to notice its symptoms, which commonly take one of these forms:
- high BGP process utilization, < show process cpu >
- high InQ/OutQ numbers, < sh bgp summary all all | ex 0 0 >
- unusually rapid increase of BGP table version < sh bgp summary all all >
In one large service provider's network, the issue surfaced when a few instances of stale routing information were observed, especially when route reflectors did not cascade withdraws for an extended period, causing multiple outages and impacting situations. The actual symptoms may differ case by case, but the methodology here can help in identifying and investigating the churn.
Identifying the cause of churn, that is, which prefixes are actually causing the excessive churn, is a non-trivial task, and that is what this article attempts to depict. There is no official churn handling guide as such; this work is based on several iterations of BGP churn troubleshooting over several months, in which we were able to identify the churn and mitigate it (via a version upgrade).
Please note that what's shared here is a process, not just a command. The cause and form of churn vary, so it is not practical to list every possible scenario and command; rather, the intent is to give the user a starting point and then things to look out for, to close in on the specific cause.
Ways to identify BGP churn on Cisco ASR9000
The steps below assume that some symptoms of BGP process churn (as described above) have been observed and that the next step is to identify the reason behind them.
Summarizing the procedure for automated logic (can be used as guidelines for NOCs):
1. Collect this data over a few consecutive iterations:
sh bgp all all summ | i main
BGP main routing table version 236735
BGP main routing table version 54398
BGP main routing table version 1686672
If any of the values shows an increase of 200 or more in any iteration, go to step #2; otherwise abort (ignore).
2. Collect this data 4 times at 10-second intervals:
sh bgp all all summ | ex 0 0
Wed Oct 9 22:44:33.349 GMT
BGP router identifier 10.169.2.200, local AS number AS#
BGP main routing table version 54406
BGP is operating in STANDALONE mode.
Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 54406 54398 54398 54398 54398 0
Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
10.169.2.201 0 AS# 111939 178317 54398 658 0 16:57:55 13127
10.169.2.202 0 AS# 215093 179624 54398 707 0 16:57:53 10926
10.169.2.203 0 AS# 237436 178593 54398 726 0 16:57:55 11950
10.177.6.13 0 AS# 42564 179273 54398 749 0 16:57:53 1752
10.177.6.14 0 AS# 176405 164693 54398 662 0 16:42:38 1996
10.177.6.21 0 AS# 138414 178988 54398 696 0 16:57:53 1409
10.177.6.22 0 AS# 126114 180066 54398 708 0 16:57:53 1191
If any of the InQ/OutQ values in the output above exceeds 500, save the main routing table version from two consecutive iterations of the command: the iteration with the lower version (call it X) and the next iteration with the higher version (call it Y). Then go to the next step AND raise an SR with Cisco support for further triage. If all values are below 500, ignore/abort.
3. Run the version-range command with X (the smaller value) and Y (the higher value):
Sh bgp all all version X Y
This yields a result like the one below; save the prefixes and go to step #4.
sh bgp all all ver 95636407 95636517
Thu Oct 10 00:06:28.696 GMT
VRF: default
------------
BGP router identifier 10.169.2.200, local AS number AS#
BGP main routing table version 1830676
Status codes: s suppressed, d damped, h history, * valid, > best
i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Version Path
*>i10.192.32.2/31 10.169.7.137 0 100 AS# AS# ?
* i 10.169.7.137 0 100 AS# AS# ?
* i 10.169.7.136 100 AS# AS# ?
* i104.x.x.0/24 10.188.62.55 0 100 AS# AS# AS# AS# i
* i 10.169.6.197 100 AS# AS# AS# AS# i
* i 10.169.6.196 100 AS# AS# AS# AS# i
* i 10.188.62.63 0 100 AS# AS# AS# AS# i
* i 10.188.62.81 0 100 AS# AS# AS# AS# i
* i 10.169.6.98 0 100 AS# AS# AS# AS# i
* i 10.169.6.196 100 AS# AS# AS# AS# i
* i 10.188.62.61 0 100 AS# AS# AS# AS# i
* i 10.188.62.59 0 100 AS# AS# AS# AS# i
* i 10.169.6.140 0 100 AS# AS# AS# AS# i
* i 10.188.62.73 0 100 AS# AS# AS# AS# i
* i 10.188.62.69 0 100 AS# AS# AS# AS# i
* i 10.188.62.67 0 100 AS# AS# AS# AS# i
* i 10.188.62.65 0 100 AS# AS# AS# AS# i
* i 10.188.62.49 0 100 AS# AS# AS# AS# i
* i 10.188.62.51 0 100 AS# AS# AS# AS# i
* i 10.188.62.53 0 100 AS# AS# AS# AS# i
* i 10.188.62.55 0 100 AS# AS# AS# AS# i
*>i 10.188.62.57 0 100 AS# AS# AS# AS# i
* i 10.188.62.75 0 100 AS# AS# AS# AS# i
* i 10.188.62.77 0 100 AS# AS# AS# AS# i
* i 10.188.62.79 0 100 AS# AS# AS# AS# i
Processed 2 prefixes, 25 paths
4. Collect these outputs (the first three for each saved prefix):
Sh bgp afi/safi <prefix> detail
Sh route <prefix> detail
Sh bgp afi/safi <prefix> path-elem
Sh bgp scale
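The step 1 check in the summarized procedure above lends itself to automation. The following is a minimal Python sketch, not an official tool: it assumes the raw text of each `sh bgp all all summ | i main` iteration has already been collected (how it is collected is left open), and the function names are hypothetical. The threshold of 200 comes from the procedure above.

```python
import re

VERSION_RE = re.compile(r"BGP main routing table version (\d+)")


def main_versions(output):
    """Main table versions in order of appearance (one per address family)."""
    return [int(v) for v in VERSION_RE.findall(output)]


def max_version_jump(snapshots):
    """Largest per-address-family increase between consecutive snapshots."""
    parsed = [main_versions(s) for s in snapshots]
    jumps = [
        later - earlier
        for prev, cur in zip(parsed, parsed[1:])
        for earlier, later in zip(prev, cur)
    ]
    return max(jumps, default=0)


def step1_continue(snapshots, threshold=200):
    """True when step 1 says to move on to step 2 (threshold from the text)."""
    return max_version_jump(snapshots) >= threshold
```

The sketch compares versions position by position, so it assumes the address families appear in the same order in every iteration.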
sh bgp all all summ | i main
Tue Oct 15 21:03:47.642 GMT
BGP main routing table version 512162
BGP main routing table version 87899
BGP main routing table version 26492532
sh bgp all all summ | i main
Tue Oct 15 21:03:50.361 GMT
BGP main routing table version 512162
BGP main routing table version 87899
BGP main routing table version 26492534 <<<< Increase by 2 in 3s
sh bgp all all summ | i main
Tue Oct 15 21:03:52.801 GMT
BGP main routing table version 512162
BGP main routing table version 87899
BGP main routing table version 26492537 <<<< Increase by 3 in 2s
sh bgp all all summ | i main
Wed Oct 9 22:43:22.490 GMT
BGP main routing table version 236734
BGP main routing table version 54398
BGP main routing table version 1686474
sh bgp all all summ | i main
Wed Oct 9 22:43:25.878 GMT
BGP main routing table version 236735
BGP main routing table version 54398
BGP main routing table version 1686563 <<<< Increase by ~90 in 3s
sh bgp all all summ | i main
Wed Oct 9 22:43:30.268 GMT
BGP main routing table version 236735
BGP main routing table version 54398
BGP main routing table version 1686672 <<<< Increase by ~110 in 5s
Clearly a much HIGHER CHURN there !!
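To compare the two cases above on equal terms, the version increase can be normalized to versions per second using the timestamp the router prints with each command. This is a rough Python sketch under stated assumptions: both snapshots are taken on the same day (no midnight wrap), each snapshot text starts with the timestamp line, and the function names are made up for illustration.

```python
import re

TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}\.\d+)")
VERSION_RE = re.compile(r"BGP main routing table version (\d+)")


def seconds_of_day(snapshot):
    """Seconds since midnight, taken from the first HH:MM:SS.mmm in the text."""
    hours, minutes, seconds = TIME_RE.search(snapshot).groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def churn_rates(earlier, later):
    """Version increase per second for each address family, in output order."""
    elapsed = seconds_of_day(later) - seconds_of_day(earlier)
    before = [int(v) for v in VERSION_RE.findall(earlier)]
    after = [int(v) for v in VERSION_RE.findall(later)]
    return [(b - a) / elapsed for a, b in zip(before, after)]
```

Feeding it the first two snapshots of each case gives a per-second figure for every address family, which is easier to compare than raw deltas taken over slightly different intervals.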
sh bgp vpnv6 uni summ | ex 0 0
Wed Oct 9 22:44:33.349 GMT
BGP router identifier 10.169.2.200, local AS number AS#
BGP main routing table version 54406
BGP is operating in STANDALONE mode.
Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 54406 54398 54398 54398 54398 0
Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
10.169.2.201 0 AS# 111939 178317 54398 658 0 16:57:55 13127
10.169.2.202 0 AS# 215093 179624 54398 707 0 16:57:53 10926
10.169.2.203 0 AS# 237436 178593 54398 726 0 16:57:55 11950
10.177.6.13 0 AS# 42564 179273 54398 749 0 16:57:53 1752
10.177.6.14 0 AS# 176405 164693 54398 662 0 16:42:38 1996
10.177.6.21 0 AS# 138414 178988 54398 696 0 16:57:53 1409
10.177.6.22 0 AS# 126114 180066 54398 708 0 16:57:53 1191
sh bgp vpnv4 uni summ | ex 0 0
Wed Oct 9 22:44:50.415 GMT
BGP router identifier 10.169.2.200, local AS number AS#
BGP main routing table version 236743 >>> increase of 8 since 22:43:30 (236735), ~80s
BGP is operating in STANDALONE mode.
Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 236743 236740 236740 236740 236740 0
Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
10.169.2.199 0 AS# 109683 179440 236740 5 0 16:58:12 60353
10.169.2.201 0 AS# 111955 178317 236740 642 0 16:58:12 47605
10.169.2.202 0 AS# 215143 179624 236740 657 0 16:58:10 35427
10.169.2.203 0 AS# 237486 178593 236740 676 0 16:58:12 12798
10.177.6.13 0 AS# 42614 179284 236740 699 0 16:58:10 11105
10.177.6.14 0 AS# 176455 164704 236740 612 0 16:42:56 5115
10.177.6.21 0 AS# 138444 178999 236740 666 0 16:58:10 19651
10.177.6.22 0 AS# 126164 180066 236740 658 0 16:58:10 14604
sh bgp ipv4 uni summ | ex 0 0
Wed Oct 9 22:44:58.769 GMT
BGP router identifier 10.169.2.200, local AS number AS#
BGP main routing table version 1688157 >>> increase of 1,485 since 22:43:30 (1686672), ~88s
BGP is operating in STANDALONE mode.
Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 1688158 1688143 1688143 1688143 1688143 0
Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
10.169.2.199 0 AS# 109683 179440 1688143 5 0 16:58:20 23784
10.169.2.201 0 AS# 111962 178317 1686993 635 0 16:58:20 18754
10.169.2.202 0 AS# 215143 179635 1688143 657 0 16:58:19 11275
10.169.2.203 0 AS# 237486 178604 1688143 676 0 16:58:20 6417
10.177.6.3 0 AS# 16487 180440 1688143 1 0 16:58:19 2189
10.177.6.13 0 AS# 42614 179284 1688143 699 0 16:58:18 5276
10.177.6.14 0 AS# 176455 164704 1688143 612 0 16:43:04 955
10.177.6.21 0 AS# 138444 178999 1688143 666 0 16:58:19 1450
10.177.6.22 0 AS# 126164 180077 1688143 658 0 16:58:19 2046
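The InQ/OutQ check on summary outputs like the ones above can be sketched in Python as follows. This is only an illustration: it assumes each neighbor row fits on one line with the ten columns shown in the header (Neighbor through St/PfxRcd), which may not hold for long neighbor addresses that wrap, and the function name is hypothetical. The default threshold of 500 is the one used in the summarized procedure.

```python
def busy_neighbors(summary, threshold=500):
    """Map neighbor address -> (InQ, OutQ) for rows above the threshold."""
    flagged = {}
    for line in summary.splitlines():
        fields = line.split()
        # Neighbor rows have exactly 10 columns; InQ and OutQ are columns 7 and 8.
        if len(fields) != 10:
            continue
        in_q, out_q = fields[6], fields[7]
        # The header row has the same column count but non-numeric InQ/OutQ.
        if not (in_q.isdigit() and out_q.isdigit()):
            continue
        if int(in_q) > threshold or int(out_q) > threshold:
            flagged[fields[0]] = (int(in_q), int(out_q))
    return flagged
```

Running it on each of the per-address-family outputs separately keeps the result tied to the AFI/SAFI whose table version is being tracked.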
Checking the first command again during churn:
sh bgp all all summ | i main
Wed Oct 9 22:45:12.786 GMT
BGP main routing table version 236747
BGP main routing table version 54406
BGP main routing table version 1688477
sh bgp all all summ | i main
Wed Oct 9 22:45:16.773 GMT
BGP main routing table version 236757
BGP main routing table version 54414
BGP main routing table version 1688580 <<< increase by ~100 in 4s
Collect this command's output for Cisco BGP triage purposes (ideally 3 times at 5-second intervals during high churn):
sh bgp scale
Wed Oct 9 22:45:25.683 GMT
VRF: default
Neighbors Configured: 116 Established: 116
Address-Family Prefixes Paths PathElem Prefix Path PathElem
Memory Memory Memory
VPNv4 Unicast 178923 357921 178923 25.25MB 30.04MB 18.60MB
VPNv6 Unicast 47614 95161 47614 7.27MB 7.99MB 4.95MB
IPv4 Unicast 36695 142468 145857 5.18MB 11.96MB 15.16MB
------------------------------------------------------------------------------
Total 263232 595550 372394 37.70MB 49.98MB 38.71MB
Total VRFs Configured: 0
The above commands just validate that there’s an ongoing high rate of BGP churn in the network.
To find the prefixes that are participating in the churn, here’s the summarized methodology:
Example:
Step 1:
sh bgp ipv4 uni summary | excl 0 0
Thu Oct 10 00:05:48.964 GMT
BGP router identifier 10.169.2.200, local AS number AS#
BGP main routing table version 1829576
BGP is operating in STANDALONE mode.
Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 1829577 1827602 1827602 1827602 1827602 0
Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
10.169.2.201 0 AS# 134006 178838 1827602 641 0 18:19:11 18763
<< peering which is churning at present, InQ stuck
sh bgp ipv4 uni summary
Thu Oct 10 00:05:59.755 GMT
BGP router identifier 10.169.2.200, local AS number AS#
BGP main routing table version 1830097
BGP is operating in STANDALONE mode.
Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 1830098 1829928 1829928 1829928 1829928 0
Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
10.169.2.201 0 AS# 134096 178838 1827602 734 0 18:19:21 18761
<< peering which is churning at present, InQ stuck
Step 2: Note the “main table version” from a snapshot and invoke “show bgp <afi> <safi> version $version-5 $version (or a higher number)”.
We’re basically running “sh bgp <afi> <safi> summary” twice and comparing the two version values.
sh bgp ipv4 unicast version 1827602 << small value 1829928 << higher value
Thu Oct 10 00:06:28.696 GMT
VRF: default
------------
BGP router identifier 10.169.2.200, local AS number AS#
BGP main routing table version 1830676
Status codes: s suppressed, d damped, h history, * valid, > best
i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Version Path
*>i10.192.32.2/31 10.169.7.137 0 100 AS# AS# ?
* i 10.169.7.137 0 100 AS# AS# ?
* i 10.169.7.136 100 AS# AS# ?
* i104.x.x.0/24 10.188.62.55 0 100 AS# AS# AS# AS# i
* i 10.169.6.197 100 AS# AS# AS# AS# i
* i 10.169.6.196 100 AS# AS# AS# AS# i
* i 10.188.62.63 0 100 AS# AS# AS# AS# i
* i 10.188.62.81 0 100 AS# AS# AS# AS# i
* i 10.169.6.98 0 100 AS# AS# AS# AS# i
* i 10.169.6.196 0 100 AS# AS# AS# AS# i
* i 10.188.62.61 0 100 AS# AS# AS# AS# i
* i 10.188.62.59 0 100 AS# AS# AS# AS# i
* i 10.169.6.140 0 100 AS# AS# AS# AS# i
* i 10.188.62.73 0 100 AS# AS# AS# AS# i
* i 10.188.62.69 0 100 AS# AS# AS# AS# i
* i 10.188.62.67 0 100 AS# AS# AS# AS# i
* i 10.188.62.65 0 100 AS# AS# AS# AS# i
* i 10.188.62.49 0 100 AS# AS# AS# AS# i
* i 10.188.62.51 0 100 AS# AS# AS# AS# i
* i 10.188.62.53 0 100 AS# AS# AS# AS# i
* i 10.188.62.55 0 100 AS# AS# AS# AS# i
*>i 10.188.62.57 0 100 AS# AS# AS# AS# i
* i 10.188.62.75 0 100 AS# AS# AS# AS# i
* i 10.188.62.77 0 100 AS# AS# AS# AS# i
* i 10.188.62.79 0 100 AS# AS# AS# AS# i
Processed 2 prefixes, 25 paths
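Pulling the prefixes out of a version-range output like the one above, and turning them into the follow-up commands, can be sketched in Python as below. This is a minimal sketch with assumptions: it only handles IPv4-style prefixes that start with a digit, it treats a line as a new prefix only when its first token after the status codes contains a "/length", and the function names are hypothetical. The generated commands follow the "Sh bgp afi/safi <prefix> detail" and "Sh route <prefix> detail" forms listed in the summarized procedure.

```python
import re

# Status codes, an optional "i" (internal), then a token ending in "/<length>".
PREFIX_RE = re.compile(r"^[*>sdhrSN ]*i?\s*(\d\S*/\d{1,2})\s")


def churning_prefixes(version_output):
    """Unique prefixes in the order they appear in the version-range output."""
    prefixes = []
    for line in version_output.splitlines():
        match = PREFIX_RE.match(line)
        if match and match.group(1) not in prefixes:
            prefixes.append(match.group(1))
    return prefixes


def follow_up_commands(prefixes, afi_safi="ipv4 unicast"):
    """Per-prefix commands to collect next (command forms taken from the text)."""
    commands = []
    for prefix in prefixes:
        commands.append(f"show bgp {afi_safi} {prefix} detail")
        commands.append(f"show route {prefix} detail")
    return commands
```

Continuation lines that list only an additional next hop carry no "/length" token, so they are skipped and each prefix is reported once.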
Step 3 & 4: Now dump the prefix using “show bgp <afi> <safi> $prefix” (or “show bgp <afi> <safi> rd $rd $prefix” where an RD applies) and note the “Last Modified” time.
Keep checking the “Last Modified” time to see how fast it updates.
P.S. The output below is from the current state without churn, but it serves as an example:
sh bgp ipv4 unicast 10.192.32.2/31
Wed Oct 16 00:13:17.246 GMT
BGP routing table entry for 10.192.32.2/31
Versions:
Process bRIB/RIB SendTblVer
Speaker 13933445 13933445
Last Modified: Oct 11 04:21:00.797 for 4d19h << how fast this updates is a sign of churn
Paths: (3 available, best #1) << this will show the flapping best path
Advertised to update-groups (with more than one peer):
0.5 0.14
Path #1: Received by speaker 0
Advertised to update-groups (with more than one peer):
0.5 0.14
AS#
10.169.7.137 (metric 98166) from 10.169.2.202 (10.169.7.137)
Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best
Received Path ID 0, Local Path ID 0, version 13933058
Community: AS#:140 AS#:1034
Originator: 10.169.7.137, Cluster list: 10.169.2.202
Path #2: Received by speaker 0
Not advertised to any peer
AS#
10.169.7.137 (metric 98166) from 10.169.2.203 (10.169.7.137)
Origin incomplete, metric 0, localpref 100, valid, internal
Received Path ID 0, Local Path ID 0, version 0
Community: AS#:140 AS#:1034
Originator: 10.169.7.137, Cluster list: 10.169.2.203
Path #3: Received by speaker 0
Advertised to update-groups (with more than one peer):
0.5
AS#
10.169.7.136 (metric 98168) from 10.169.9.154 (10.169.7.136)
Origin incomplete, localpref 100, valid, internal, add-path
Received Path ID 1285, Local Path ID 24, version 13933445
Community: AS#:140 AS#:1034
Originator: 10.169.7.136, Cluster list: 10.169.9.154
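Watching "Last Modified" and the best path number by eye gets tedious when several prefixes are involved, so repeated polls of the prefix output can be compared programmatically. The Python sketch below is an illustration only: it assumes the two lines look like the "Last Modified: ..." and "Paths: (N available, best #M)" lines in the output above, and the function names are hypothetical.

```python
import re

MODIFIED_RE = re.compile(r"Last Modified: (\w+ +\d+ [\d:.]+)")
BEST_RE = re.compile(r"Paths: \((\d+) available, best #(\d+)")


def prefix_state(detail_output):
    """Return (last-modified timestamp string, best path number) or None values."""
    modified = MODIFIED_RE.search(detail_output)
    best = BEST_RE.search(detail_output)
    return (
        modified.group(1) if modified else None,
        int(best.group(2)) if best else None,
    )


def count_changes(polls):
    """Count how often the timestamp and the best path differ between polls."""
    states = [prefix_state(p) for p in polls]
    pairs = list(zip(states, states[1:]))
    modified_changes = sum(a[0] != b[0] for a, b in pairs)
    best_changes = sum(a[1] != b[1] for a, b in pairs)
    return modified_changes, best_changes
```

A prefix whose timestamp changes on nearly every poll is a strong churn candidate; a changing best path number points at best-path flapping in particular.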
The next step is to check "show route <prefix> detail" and "show bgp <prefix> detail" for the prefixes identified above to try to find the cause:
- Are they advertised and then withdrawn?
- Do they move from one next hop to another?
- Is the next hop unreachable?
- Does the egress interface flap?
- Are some attributes changing?
You may need to trace the prefix back to the router originating it and check the stability and attributes of the prefix there.
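Some of the questions above (withdrawn and re-advertised paths, next-hop moves) can be narrowed down by diffing two captures of the same prefix. The following Python sketch is a hedged illustration, not a complete classifier: it only recognizes path lines of the form "<next hop> (metric N) from <neighbor> (...)" as seen in the sample output earlier, ignores every other attribute, and uses hypothetical function names.

```python
import re

PATH_RE = re.compile(r"^\s*(\S+) \(metric \d+\) from (\S+) \(")


def paths(detail_output):
    """Set of (next hop, advertising neighbor) pairs found in the output."""
    found = set()
    for line in detail_output.splitlines():
        match = PATH_RE.match(line)
        if match:
            found.add((match.group(1), match.group(2)))
    return found


def path_diff(before, after):
    """Paths that disappeared and paths that appeared between two captures."""
    old, new = paths(before), paths(after)
    return {"withdrawn": sorted(old - new), "added": sorted(new - old)}
```

A neighbor that shows up in both "withdrawn" and "added" with different next hops suggests the prefix is moving between next hops, while a path that alternates between present and absent across captures suggests advertise/withdraw cycles.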
Stale bgp addpaths on neighbors after prefix removal from router
BGP advertisement issue with update-gen throttling/recovery
Fixed 6.1.4 onwards
Multiple addpaths with same nexthop are selected
Fixed 6.5.1 onwards
Reuse path IDs during add-paths change
Fixed 6.5.1 onwards
[ also needs CSCvj14223 as collateral]
Add-path: bgp crash in bgp_tblattr_pelem_walk (orphaned pelem post FO)
Fixed 6.5.1 onwards
Additional BU Recommendation