
noemi.berry · ‎05-20-2015

We'd like to filter BGP routes based on prefix length, using a route-map, and the first attempt to do this brought up some questions.

The config excerpt (shortened, edited), looks like this:

router bgp 66666

neighbor 2.2.2.2 route-map ISP-to-ME in

route-map ISP-to-ME permit 10

match ip address prefix-list bgp-inbound

ip prefix-list bgp-inbound seq 10 deny 0.0.0.0/0 ge 23 <----- deny /23 or longer

ip prefix-list bgp-inbound seq 20 permit 0.0.0.0/0 le 22 <---- allow /22 and shorter

Before applying this, "sh ip bgp sum" looked like we're carrying full tables (532727 network entries):

Router#show ip bgp sum

...

BGP table version is 435670814, main routing table version 435670814

532727 network entries using 70319964 bytes of memory

1003897 path entries using 56218232 bytes of memory

After applying the route-map config, "show ip bgp sum" still had 532K network entries, and "show ip route" still showed /24s in the routing table. Later we realized this was probably because the peer was not cleared properly (we were link-flapping for other testing), and there was a typo in the neighbor IP at one point.

But in the meantime the route-map itself showed the prefix-list filter working: "show ip bgp route-map ISP-to-ME" didn't contain any /24 routes, but "show ip bgp" and "show ip route" did, and "show ip bgp sum" still showed 532K routes. So it's like the prefix filter wasn't being applied to the routing table. Could this have been because the peer was not cleared?

The really strange thing is that during this limbo-time, we started to receive alerts from some sites (pingdom probes) who couldn't reach us. Later digging showed that those sites were among the little /23s and /24s that had been filtered out -- and we'd omitted a default route to return to them, oops! But meantime the routing table still showed all those /23s and /24s. How is it possible that the routing acted like the filters were being applied, but only the route-map had the truncated set of routes (minus the default route, oops). The routing table ("sh ip route") is the final source of truth on routing, right? There is no route-map post-processing on the routing table, is there?

Later after the typo was corrected and peer was cleared, the routing table did reflect the filters (so, no /23s and /24s) -- but "show ip bgp sum" still showed 532K network entries. Is this expected? (unfortunately I didn't capture "sh ip route sum").

In sum, questions are:

- Why would "sh ip bgp route-map" show that the filters were working (no /23s /24s), but not be applied to the routing table ? Could this be because the peer wasn't cleared correctly?

- Why would "sh ip bgp sum" still show 532K 'network entries' when the routing table has clearly been reduced by the prefix-length filter?

- Is it at all possible that our filters really were being applied (and without the default route, oops), causing us to have an incomplete routing table and hence 'outages' reaching certain sites -- even though our routing table still carried /23s and /24s (full tables) ?

- Finally, is the config above correct for filtering /23s and /24s?

I'm not counting or remembering something right here, or am missing something about the processing order. We need to make sure these route-maps with the prefix-length filters are correct before we re-apply (no lab unfortunately).

Thanks for any guidance.

Peter Paluch · ‎05-20-2015

Hi,

You have an interesting issue at hand.

The fact that you kept seeing the routes in the show ip bgp output even after filtering out the prefixes slightly suggests that you may be running BGP Soft Reconfiguration. Do you happen to have a similar line somewhere in your BGP config?

neighbor 2.2.2.2 soft-reconfiguration inbound

This command causes the router to always keep a full, unfiltered database of prefixes received from the neighbor so that whenever the inbound policy is changed, instead of dropping and reestablishing the peering, the unfiltered database is simply "distilled" through the new policy to obtain the new resulting set of routes. The presence of these unfiltered routes could have been the cause of the fact that despite your route-map, your BGP table still held over 500K routes. Of course, the filtered routes are actually the ones that will be installed into the routing table and advertised further; the unfiltered database is kept locally only for the sole purpose of filtering it anew and anew as the inbound policy is changed.
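The behavior described above can be sketched as a toy model. This is only an illustration in Python, not how IOS is implemented: the names `received`, `refilter`, and the two policy functions are hypothetical, and the four prefixes are made-up examples. It shows why the stored, unfiltered set keeps its full size while the filtered result shrinks.

```python
import ipaddress

# Unfiltered copy of what the neighbor sent. With soft reconfiguration
# inbound, the router keeps this set regardless of the inbound policy.
received = [
    ipaddress.ip_network(p)
    for p in ("8.0.0.0/8", "172.16.0.0/22", "192.0.2.0/24", "198.51.100.0/23")
]


def refilter(adj_rib_in, policy):
    """Re-apply an inbound policy to the stored, unfiltered routes."""
    return [route for route in adj_rib_in if policy(route)]


def permit_all(route):
    """Old policy: accept everything."""
    return True


def permit_22_or_shorter(route):
    """New policy: accept only prefixes of length /22 or shorter."""
    return route.prefixlen <= 22


before = refilter(received, permit_all)
after = refilter(received, permit_22_or_shorter)

print("stored (unfiltered):", len(received))
print("usable before change:", len(before))
print("usable after change: ", len(after))
```

In this model the stored set never changes size when the policy changes; only the filtered result does, which is consistent with a large entry count persisting alongside a working filter.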

Please note that the Soft Reconfiguration is obsolete and should not be used anymore. It was necessary back in times when BGP did not have a way of requesting a re-sending of all routes from a particular peer. However, since RFC 2918 which is 15 years old by now, BGP has a feature called Route Refresh that allows a BGP router to ask its peer to send again all routes of a particular address family. The use of this Route Refresh capability is dynamically negotiated and does not need to be configured in any way. In fact, configuring the Soft Reconfiguration will effectively disable the Route Refresh, losing all its advantages.

The fact that you did not see those /23s, /24s, etc. in show ip bgp route-map name is fairly simply explained: This show command shows you exactly those routes that match the route-map, but the routes that are not displayed may be missing because of two reasons: Either they're not in the BGP RIB at all, or they are there but they do not pass the route-map test.

The really strange thing is that during this limbo-time, we started to receive alerts from some sites (pingdom probes) who couldn't reach us. Later digging showed that those sites were among the little /23s and /24s that had been filtered out -- and we'd omitted a default route to return to them, oops! But meantime the routing table still showed all those /23s and /24s. How is it possible that the routing acted like the filters were being applied, but only the route-map had the truncated set of routes (minus the default route, oops). The routing table ("sh ip route") is the final source of truth on routing, right? There is no route-map post-processing on the routing table, is there?

This is extremely difficult to explain without any further details. Routes filtered by a route-map (even though visible in the show ip bgp if Soft Reconfiguration is activated) should not make it into the routing table. However, what I am thinking about is the fact that the 500K routes is an awful lot of entries. It takes considerable time to update the routing table that contains so many entries, and in addition, the routing table is actually only a master copy of routing information that is, on Cisco routers, further compiled into the CEF FIB. All of this takes time - having BGP re-filter its RIB, installing the changed routes into the routing table (add the missing ones, remove the surplus ones), compiling the changed routing table into CEF FIB. While this may be just me making up an explanation, it is possible that this "limbo-time" overlapped with the time the router was busy updating the BGP RIB, router RIB and router FIB, and these temporarily got out of sync so what you saw in show ip route wasn't necessarily what was installed in show ip cef. And to a Cisco router, the ultimate routing knowledge is the CEF FIB, not the router RIB.

- Why would "sh ip bgp route-map" show that the filters were working (no /23s /24s), but not be applied to the routing table ? Could this be because the peer wasn't cleared correctly?

Yes, possibly. The clear ip bgp 2.2.2.2 in should have been issued to force the router to re-apply the inbound policy to the routing information received from that particular neighbor. If Soft Reconfig is configured for that neighbor, the unfiltered database will be processed again. Otherwise, if the neighbors have negotiated the Route Refresh capability, the neighbor will be asked to resend its routes.

Also, this could have been the time during which the changed inbound policy was being applied to the unfiltered database if Soft Reconfig was configured, and the router did not get to update its routing table yet.

- Why would "sh ip bgp sum" still show 532K 'network entries' when the routing table has clearly been reduced by the prefix-length filter?

This would be a typical symptom of Soft Reconfiguration configured. If no such thing is configured then we may be hitting a bug.

- Is it at all possible that our filters really were being applied (and without the default route, oops), causing us to have an incomplete routing table and hence 'outages' reaching certain sites -- even though our routing table still carried /23s and /24s (full tables) ?

You would need to prove beyond any doubt that all the /23s and /24s - of which there must have been thousands! - were still there. The router could have been in process of purging those networks - some could have still been in the routing table, some others may already have been removed.

But in principle - no, such a thing is not possible. The data in the routing table determine which locations are reachable and which are not. It is not possible to have a route in the routing table and be unable to route packets toward it. However, what we're dealing with here is a transitory state in which hundreds of thousands of entries in RIB and FIB needed to be updated, and during that time, RIB and FIB may not have been synchronized (even though, of course, this process should not take hours!)

- Finally, is the config above correct for filtering /23s and /24s?

Yes, in fact, it will filter out all /23s and above.
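As a way to sanity-check the two prefix-list lines without a lab, here is a minimal Python sketch of the usual ge/le matching rules (first match wins, implicit deny at the end). The function names and the `BGP_INBOUND` structure are hypothetical and only model the semantics; this is not a Cisco tool.

```python
import ipaddress


def entry_matches(route, base, ge=None, le=None):
    """Return True if `route` matches one prefix-list entry."""
    route = ipaddress.ip_network(route)
    base = ipaddress.ip_network(base)
    # Lower bound: ge if given, otherwise the entry's own length.
    low = ge if ge is not None else base.prefixlen
    # Upper bound: le if given; with only ge it is /32;
    # with neither keyword the length must match exactly.
    if le is not None:
        high = le
    elif ge is not None:
        high = 32
    else:
        high = base.prefixlen
    return route.subnet_of(base) and low <= route.prefixlen <= high


# Model of the two entries from the original post.
BGP_INBOUND = [
    ("deny", "0.0.0.0/0", {"ge": 23}),    # seq 10
    ("permit", "0.0.0.0/0", {"le": 22}),  # seq 20
]


def prefix_list_permits(route, entries=BGP_INBOUND):
    """Walk the entries in order; the first match decides."""
    for action, base, bounds in entries:
        if entry_matches(route, base, **bounds):
            return action == "permit"
    return False  # implicit deny when nothing matches


for candidate in ("8.0.0.0/8", "10.0.0.0/22", "198.51.100.0/23", "192.0.2.0/24"):
    print(candidate, "permit" if prefix_list_permits(candidate) else "deny")
```

Under these rules, seq 20 alone would already give the same result because of the implicit deny, so the explicit deny in seq 10 mainly documents the intent. Note also that "0.0.0.0/0 le 22" covers length 0, so a default route received from the ISP would be permitted by this list.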

Best regards,
Peter

noemi.berry · ‎05-21-2015

By golly, there is soft-reconfiguration inbound defined, a leftover. That explains it! I'd seen that it was deprecated but hadn't tackled it. That would completely explain the apparently disparate "network entries" in the BGP tables even though the route-map was clearly applying the filters.

The limbo-state when the routing table clearly was carrying /23s and /24s (obvious from just the first few entries with show ip route, but no way to tell if it was carrying all of them) -- I think could be explained by a combination of:

- Not resetting peers properly (we were hard-downing ISP uplinks for failover testing at the same time, not the best but that's how it went)

- A typo in the neighbor 2.2.2.2 route-map command on Router2 meant that the route-map wasn't even configured, but Router1 did have the route-map (they exchange eBGP routes via iBGP). So when the ISP uplink on Router1 was shut down, its routing table did show the route-map filters applied, I guess because an iBGP peer doesn't need to be re-set.....? Yet the outages started, even though Router2 was still carrying full routes (thanks to the typo), and only its ISP uplink was up.

- Not remembering everything that happened and way too many people on the conference call to do methodical troubleshooting.

>>

The really strange thing is that during this limbo-time, we started to receive alerts from some sites (pingdom probes) who couldn't reach us. ...The routing table ("sh ip route") is the final source of truth on routing, right? There is no route-map post-processing on the routing table, is there?

This is extremely difficult to explain without any further details.

>>

Waitaminnit..... I have it! This is why: This is one of those so subtle yet SO OBVIOUS things, I could kick myself. Networking 101: What was the outbound path?

Router1 and Router2 each have their own ISPs, and have an iBGP link between them. Both have southward-facing interfaces that are HSRP'd with a single VIP, with Router1 the owner.

The routers send their inbound traffic downstream to a firewall for NAT, zoning, etc. All the testing and pinging we were doing was to static NAT addresses on the firewall, and one hard IP on its outside interface. From the firewall, return traffic to the Internet always goes out via Router1 because it owns the VIP.

When Router1's ISP uplink was shut down, its routes all pointed to the iBGP link to Router2 -- but because of Router1's route-map's prefix filter, and the missing default route (oops), Router1 had an incomplete table. So traffic back to certain /23 /24 destinations from our firewall never made it past Router1 to the iBGP link to Router2. The fact that Router2 was still carrying full routes (thanks to the typo) was a red herring -- the return traffic never made it that far. About an hour into this, I found the typo on Router2 and fixed it, but it didn't matter.
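The effect of the missing default route on Router1 can be illustrated with a small longest-prefix-match sketch in Python. The tables and the `lookup` helper are hypothetical and tiny compared with a real table; they only show that a destination inside a filtered /24 has nothing to match unless a default route exists.

```python
import ipaddress


def lookup(table, destination):
    """Return the longest matching prefix for `destination`, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [
        net
        for net in (ipaddress.ip_network(p) for p in table)
        if dest in net
    ]
    if not matches:
        return None  # no route: the packet is dropped
    return str(max(matches, key=lambda net: net.prefixlen))


# Router1 after filtering to /22 and shorter, with no default route.
router1_filtered = ["8.0.0.0/8", "172.16.0.0/22"]

# Same table with a default route added to cover the filtered prefixes.
router1_with_default = router1_filtered + ["0.0.0.0/0"]

# A destination that lives inside a filtered-out /24.
probe = "192.0.2.10"

print("without default:", lookup(router1_filtered, probe))
print("with default:   ", lookup(router1_with_default, probe))
```

With the default route present, destinations covered by a remaining prefix still follow the more specific entry, and only the filtered ones fall through to the default.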

So I believe the answer to:

How is it possible that the routing acted like the filters were being applied, but only the route-map had the truncated set of routes (minus the default route, oops).

-- Because I wasn't looking at the first router in the path for outbound traffic, which wasn't carrying full routes.

And -- wow -- the routers don't have HSRP tracking to the ISP interfaces. If Router1's ISP link goes down, HSRP should fail over and give the VIP to Router2 for outbound traffic, but with the current configuration, which lacks HSRP tracking, outbound traffic always goes via Router1, regardless of the state of its ISP uplink.

>>

- Finally, is the config above correct for filtering /23s and /24s?

Yes, in fact, it will filter out all /23s and above.

>>

good, thanks. Is this still viewed as best practice, or should we shorten it more? It looks like about 18K routes, which seems manageable (and should reduce convergence time, yes?). These routers are only 3945s; they really don't need full tables.

Between soft reconfig, no HSRP tracking, a route-map typo, hard peer resets, and missing a default route, this was quite the perfect storm!

I can't thank you enough for unlocking this mystery, it was really a hard problem! Or it seemed like it until understanding what happened :) . I really appreciate the detailed answer -- and our next maintenance to restore all this will go much more smoothly!

thanks,

noemi

Peter Paluch · ‎05-21-2015

Hi Noemi,

I am glad I could help. Just by the way, your original post suggested a different picture of your network, much simpler than what you have described right now. I suppose that has confused many of us while trying to unravel the mystery.

Regarding the Soft Reconfig - I hope I haven't been too obnoxious in explaining it. Being over 5 years on these forums, I am seeing the soft-reconfig inbound configured over and over again. It is amazing how a former best-practice or white-paperized configuration snippet made it into an ineradicable and almost mindlessly configured fossil. So whenever I come across this thing, I go to great lengths to explain why it shouldn't be used in the first place, hoping that more and more people will take the advice as they come across these posts.

good, thanks. Is this still viewed as best practice, or should we shorten it more?

Quite honestly, I do not know what the current best practice here is. My question, though, is: Do you actually need any more specific routes than a default route toward your upstream routers? Why do you actually need to know any specifics if the filtered-out prefixes can still be reached via a default route equivalently?

If just some sort of load splitting is called for, you could still deaggregate the default route on your individual border routers, e.g. into /1 (2 prefixes), /2 (4 prefixes), /3 (8 prefixes), or /4 (16 prefixes) and make sure each border router handles some subset of them.
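To make the deaggregation idea concrete, here is a short Python sketch that splits 0.0.0.0/0 into equal-length pieces and divides them between two routers. The helper name and the alternating assignment are assumptions made for illustration only; how the pieces are actually distributed is a design choice not specified in this thread.

```python
import ipaddress


def deaggregate_default(new_prefix):
    """Split 0.0.0.0/0 into all prefixes of length `new_prefix`."""
    default = ipaddress.ip_network("0.0.0.0/0")
    return [str(net) for net in default.subnets(new_prefix=new_prefix)]


# Number of pieces for each split length mentioned above.
for length in (1, 2, 3, 4):
    print(f"/{length}: {len(deaggregate_default(length))} prefixes")

# Hypothetical assignment: alternate the /2 pieces between two routers.
pieces = deaggregate_default(2)
router1_share = pieces[0::2]
router2_share = pieces[1::2]
print("Router1:", router1_share)
print("Router2:", router2_share)
```

Each router would still need a plain default route (or the other router's pieces at lower preference) so that losing one uplink does not leave part of the address space unreachable.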

Best regards,
Peter

noemi.berry · ‎05-26-2015

Actually, as recently as a few weeks ago, one of our ISP's support engineers recommended using soft-reconfiguration inbound, so it seems this legacy lives on. I am completely sold on removing it!

Sorry for omitting the detail, it's always hard to know how much to put in forum-posts -- I try to keep it simple and focused, but then things get missed. And I hadn't realized yet how the downstream architecture would factor in.

Our setup is basic and standard, nothing unusual: 2 ISPs, iBGP between the two; each router has a downstream interface, HSRP'd with a VIP, facing a firewall with an outside interface, all numbered out of public space. The firewall has a default route pointing up to the routers' VIP for ingress, then the routers do their thing (with full tables right now!) for egress into the great big ether-cloud of the Internet.

The BGP setup was bare-bones and hadn't been touched since initial setup. The only reason we touched it was to add another /24 to our announcements, and that part went fine. While we were at it, we decided to update the BGP config to match another data center, which prefix-filters to /22s and uses route-policy (HP routers), and a few other standard best practices. So goes the history.

Do you actually need any more specific routes than a default route toward your upstream routers? Why do you actually need to know any specifics if the filtered-out prefixes can still be reached via a default route equivalently?

So, right, excellent questions. It's clear we need to revisit the overall objectives. How do we want our traffic flow to look? Should it behave like "active/passive"? (I hesitate to use that terminology because then people think "HA'd firewalls" but that was the reigning belief.) Should we make an attempt to influence inbound traffic or let the Internet do its thing? Are both our ISPs created equal?

As much fun as it is to play BGP router-jock, the needs of the business and applications are at the forefront. For us, availability is the #1 priority -- short convergence time, no outages during ISP maintenance. Our maintenance windows are few and far between, and we need minimal tweaks/touches to the BGP config. Bandwidth needs are low, but availability needs are super-high. So we need to translate our business / operational needs into BGP configs, as simply and robustly and safely as possible, and you're asking the very questions I'm going to be digging into. Ultimately the answer will be simple, but it will all have been carefully considered.

So far I'm not seeing any compelling reason not to carry /22 and shorter prefixes (with a static default to cover the filtered ones this time!), unless failover testing reveals a problem like longer convergence time.

I can't thank you enough for the guidance, I never thought I'd figure this out!

best regards,
noemi

Peter Paluch · ‎05-26-2015

Noemi,

It has been a pleasure and an honor! I hope we will see you posting on these forums again!

Best regards,
Peter

Using route-map for BGP inbound prefix filtering by length