BGP takes too much CPU

Konstantin Dunaev · ‎07-29-2011

Hello community,

For some time now my edge router (Cisco 7200 with NPE-G2) has had a massive performance problem caused by the BGP process.

rc-e100-49-69#sh processes cpu sorted | i BGP

204 1224991632 222733885 5499 14.07% 10.89% 10.71% 0 BGP Router

192 28822184 202875594 142 0.23% 0.23% 0.23% 0 BGP I/O

123 107552 13767727 7 0.00% 0.00% 0.00% 0 BGP Scheduler

205 341805924 2176809 157024 0.00% 1.30% 1.71% 0 BGP Scanner

206 3824380 15703 243544 0.00% 0.00% 0.00% 0 BGP Event

It never drops below 10% and usually stays at 15-20%, sometimes going up to 40%. The BGP CPU usage isn't affected by the traffic load; traffic peaks are only about 250 Mbit/s.
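A minimal sketch of how the per-process figures above could be added up, assuming the columns follow the usual IOS header order (PID, Runtime(ms), Invoked, uSecs, 5Sec, 1Min, 5Min, TTY, Process). The function names are hypothetical and the sample text is copied from the output in this post.

```python
# Column order assumed from the standard IOS header:
# PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
SAMPLE = """\
204 1224991632 222733885 5499 14.07% 10.89% 10.71% 0 BGP Router
192 28822184 202875594 142 0.23% 0.23% 0.23% 0 BGP I/O
123 107552 13767727 7 0.00% 0.00% 0.00% 0 BGP Scheduler
205 341805924 2176809 157024 0.00% 1.30% 1.71% 0 BGP Scanner
206 3824380 15703 243544 0.00% 0.00% 0.00% 0 BGP Event
"""


def parse_cpu_lines(text):
    """Turn 'show processes cpu' rows into dicts; non-data lines are skipped."""
    rows = []
    for line in text.splitlines():
        # Split into at most 9 fields so a process name with spaces stays whole.
        parts = line.split(None, 8)
        if len(parts) < 9 or not parts[0].isdigit():
            continue
        rows.append({
            "pid": int(parts[0]),
            "five_sec": float(parts[4].rstrip("%")),
            "one_min": float(parts[5].rstrip("%")),
            "five_min": float(parts[6].rstrip("%")),
            "process": parts[8].strip(),
        })
    return rows


def bgp_totals(rows):
    """Sum the 5 s / 1 min / 5 min percentages over all BGP processes."""
    bgp = [r for r in rows if r["process"].startswith("BGP")]
    return tuple(
        round(sum(r[key] for r in bgp), 2)
        for key in ("five_sec", "one_min", "five_min")
    )


print(bgp_totals(parse_cpu_lines(SAMPLE)))
```

Summing all BGP processes rather than reading "BGP Router" alone makes it easier to compare this router with the ones at other locations.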

This router has 6 iBGP "full-BGP" peering sessions sitting in the same update group, a single eBGP session, and about 10 iBGP peerings that exchange a small number of internal prefixes (from 100 up to 1000 per session).

We have other locations with a similar topology, where the routers have pretty much the same number of BGP sessions, but there the BGP process usually takes less than 5% at most and normally stays under 1%.

The affected router has the following statistics:

c7200-G2#sh ip bgp summary | i BGP activity

BGP activity 2428861/2067338 prefixes, 310050497/306814805 paths, scan interval 60 secs

c7200-G2#sh ip bgp replication

Index  Members  Leader  MsgFmt    MsgRepl    Csize   Current/Next Version
1      1        <IP>    0         0          0/100   0/0
3      1        <IP>    1443      0          0/100   211083495/0
4      4        <IP>    1064      2040       0/1000  211083495/0
7      3        <IP>    95559     0          0/1000  211083062/0
9      1        <IP>    0         0          0/100   211083495/0
11     8        <IP>    56065421  286329429  0/1000  211083495/0
12     1        <IP>    51130169  0          0/100   211083495/0
13     7        <IP>    1544      5250       0/1000  211083495/0
14     1        <IP>    300       0          0/100   211083062/0
16     1        <IP>    368       0          0/100   211083495/0
17     1        <IP>    0         0          0/100   211083495/0

The problem I can see is in groups 11 and 12: the router has formatted 56065421 messages for group 11 and 51130169 for group 12!

Routers in other locations generate about a third as many messages, and "BGP activity" on them shows about a third as many paths as well.
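To make the comparison between update groups less of an eyeball exercise, here is a rough sketch that ranks the groups by formatted messages. It assumes each data row has exactly seven whitespace-separated fields, as in the output pasted above; the function names are hypothetical.

```python
# Rows copied from the "sh ip bgp replication" output in this post.
# Assumed field order: Index Members Leader MsgFmt MsgRepl Csize Current/Next
SAMPLE = """\
1 1 <IP> 0 0 0/100 0/0
3 1 <IP> 1443 0 0/100 211083495/0
4 4 <IP> 1064 2040 0/1000 211083495/0
7 3 <IP> 95559 0 0/1000 211083062/0
9 1 <IP> 0 0 0/100 211083495/0
11 8 <IP> 56065421 286329429 0/1000 211083495/0
12 1 <IP> 51130169 0 0/100 211083495/0
13 7 <IP> 1544 5250 0/1000 211083495/0
14 1 <IP> 300 0 0/100 211083062/0
16 1 <IP> 368 0 0/100 211083495/0
17 1 <IP> 0 0 0/100 211083495/0
"""


def parse_replication(text):
    """Parse update-group rows; header and blank lines are skipped."""
    groups = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7 or not parts[0].isdigit():
            continue
        groups.append({
            "index": int(parts[0]),
            "members": int(parts[1]),
            "msg_fmt": int(parts[3]),
            "msg_repl": int(parts[4]),
        })
    return groups


def busiest(groups, n=2):
    """Return the n update groups with the most formatted messages."""
    return sorted(groups, key=lambda g: g["msg_fmt"], reverse=True)[:n]


groups = parse_replication(SAMPLE)
top = busiest(groups)
total = sum(g["msg_fmt"] for g in groups)
share = sum(g["msg_fmt"] for g in top) / total
print([g["index"] for g in top], f"{share:.1%} of all formatted messages")
```

Running the same parser against the output of a healthy router at another location would give a like-for-like ratio per update group.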

I'm running out of ideas. I've already consolidated the peer-groups and deleted all unnecessary peerings, but without any noticeable effect.

andrew.prince · ‎07-29-2011

do you have "soft-reconfiguration inbound" on any peer statements?

Konstantin Dunaev · ‎07-29-2011

hello Andrew,

Yes, we use "soft-reconfiguration inbound" for our external peering (with full BGP) and for a couple of customer peerings without full BGP.

andrew.prince · ‎07-29-2011

That feature is known to use a lot of memory and overwork CPUs, because it keeps a copy of the received BGP updates for each peer it is configured against, so that there is no need to clear the BGP peer to apply policy updates. So if you have 10 peers, and they are sending the full BGP routing table... well, you get the idea.
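A back-of-the-envelope sketch of the "you get the idea" arithmetic, assuming the extra cost is simply one stored copy of every received path per peer with the feature enabled. The full-table size and the bytes-per-path figure below are illustrative placeholders, not values measured on any router in this thread.

```python
def soft_reconfig_overhead(prefix_counts, bytes_per_path):
    """Estimate extra stored paths and bytes for soft-reconfiguration inbound.

    prefix_counts: prefixes received from each peer that has the feature on.
    bytes_per_path: assumed memory cost of one stored path (hypothetical).
    """
    extra_paths = sum(prefix_counts)
    return extra_paths, extra_paths * bytes_per_path


# Both constants are assumptions chosen only to make the arithmetic concrete.
ASSUMED_FULL_TABLE = 360_000      # prefixes in a full table (illustrative)
ASSUMED_BYTES_PER_PATH = 100      # placeholder, not a Cisco-documented value

# Ten peers, each sending a full table, as in the example above.
paths, nbytes = soft_reconfig_overhead(
    [ASSUMED_FULL_TABLE] * 10, ASSUMED_BYTES_PER_PATH
)
print(f"{paths} extra paths, about {nbytes / 2**20:.0f} MiB")
```

The point of the sketch is only that the overhead grows linearly with both the number of peers and the size of the tables they send.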

I believe a newer feature superseded that one, I think it's "Soft Reset", which does the same thing... have a dig around.

HTH

Konstantin Dunaev · ‎07-29-2011

Yes, you're right, but the problem is that other routers with pretty much the same number of peers have no such problem.

I've just deleted all "soft-reconfiguration inbound" statements on the iBGP and customer BGP sessions. The only peering that still has it is the external one; I can't delete it there because I would definitely lose the connection for a couple of seconds, and that is not good.

But I don't see any changes - CPU load is the same, and BGP router process takes the same 15-25% constantly.

UPDATE: Hmm, it seems that CPU usage by the "BGP Router" process really did go down. It now stays constantly under 10%, but the overall CPU load stays the same at 30-40%, and management traffic (pinging the loopback) has jitter, sometimes up to 100 ms from the local LAN.

But now I can't see which process is using up all the CPU.

andrew.prince · ‎07-29-2011

Firstly, making changes during production hours is never a good idea, as configuration disruptions are not good.

Secondly, you need to have a baseline to work from. Since you are in production hours, perhaps the CPU usage is normal for this time period. Have you monitored the CPU/memory during a quiet period? Do you have anything to compare it to? Do you have historical records?

Konstantin Dunaev · ‎07-29-2011

About the configuration changes: you're right! But in this case it was approved and acceptable.

We have statistics covering more than a year. Some months ago we got a slight traffic increase and the CPU usage increased as well, but there was no problem. For the past week, however, our monitoring tool has been reporting that:

first - the ping takes too long, sometimes about 100 ms!

second - the graphs get "holes" (as if the SNMP requests time out), but only for this router; we're monitoring over 100 routers and they don't have such a problem.

Another bad thing: we'd like to bring this router into our MPLS ring, but as soon as I activate "mpls ip" on the MPLS-ring interface, the router gets much more CPU load and the transit traffic (not only the management traffic) starts to show jitter and packet loss.

I don't think that 200 Mbit of traffic brings the NPE-G2 to its limit; something else must be wrong there.

andrew.prince · ‎07-29-2011

OK the snmp sounds like a timeout issue, you could configure the monitoring system to wait longer for the snmp get reply - but this does not fix the issue.

What has "changed" on the device since a week ago? Do you have configuration management, and can you track any changes in that specific time period?

How much memory does the router have?

When was the last device reboot?

Do all your devices run the same version of IOS (it could be a bug causing a memory leak or CPU issues)?

Konstantin Dunaev · ‎07-29-2011

There have been no configuration changes for a month or more; there was a kind of "configuration stop" for this router.

But I know that a couple of BGP peers came back online. They had been completely offline for some time because the CPEs had HW failures, but in the last 2 weeks they came online again.

I'll check what else I can see.

The monitoring tool wasn't changed either, and it uses the same parameters for all routers, but I'll try to reinitialize the monitoring for this router.

Thank you for your answers, I really appreciate it!

Konstantin Dunaev · ‎08-01-2011

Hello,

The problem with "holes" in the graphs was solved by increasing the SNMP timeout. But the question remains why the default value suddenly became insufficient.

The CPU used by the "BGP Router" process went down a little, but sometimes it uses up to 80%, and together with "BGP Scanner" it pushes the CPU up to 100%; then the ping rises to 100-200 ms, and this still confuses me.

This router has an NPE-G2 and 1 GB of memory, Version 12.2(31)SB13 (it was actually updated, but 6 months ago).

The other routers are NPE-G1 with 1 GB, Version 12.2(31)SB18.

I can't see any memory leak; the routers have been online for more than 5-6 months.

Kishore Chennupati · ‎01-19-2012

Andrew, you are right. It's called the "route refresh" capability.

http://www.cisco.com/en/US/products/ps6599/products_data_sheet09186a0080087b3a.html#19969

HTH

Kishore

Konstantin Dunaev · ‎01-19-2012

Hello,

If somebody is still interested: we've found the problem. It was BGP-related "bugs" in c7200p-a3jk91s-mz.122-31.SB13.bin.

After upgrading it to the latest version in the SB train, c7200p-a3jk91s-mz.122-31.SB21.bin,

all CPU-related problems are gone.
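Since the fix turned out to be a rebuild difference within the same train, a small sketch like the following could help spot routers that lag behind. The filename layout in the regular expression is an assumption inferred only from the two image names in this thread, and the function names are hypothetical.

```python
import re

# Assumed layout, inferred from the image names quoted in this thread:
#   <platform>-<featureset>-mz.<MMm>-<maint>.<TRAIN><rebuild>.bin
# e.g. "122-31.SB13" is read as 12.2(31)SB13.
IMAGE_RE = re.compile(r"\.(\d{2})(\d)-(\d+)\.([A-Z]+)(\d+)\.bin$")


def parse_image(filename):
    """Return (base version with train, rebuild number) for an image name."""
    match = IMAGE_RE.search(filename)
    if match is None:
        raise ValueError(f"unrecognized image name: {filename}")
    major, minor, maint, train, rebuild = match.groups()
    return f"{int(major)}.{minor}({maint}){train}", int(rebuild)


def is_older(image_a, image_b):
    """True if image_a is an earlier rebuild of the same train as image_b."""
    base_a, rebuild_a = parse_image(image_a)
    base_b, rebuild_b = parse_image(image_b)
    if base_a != base_b:
        raise ValueError("images are from different trains; not comparable")
    return rebuild_a < rebuild_b


OLD = "c7200p-a3jk91s-mz.122-31.SB13.bin"
NEW = "c7200p-a3jk91s-mz.122-31.SB21.bin"
print(parse_image(OLD), parse_image(NEW), is_older(OLD, NEW))
```

Comparing only within one train is deliberate: rebuild numbers from different trains are not ordered relative to each other.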