
BGP takes too much CPU

Hello community,

For some time now, my edge router (Cisco 7200 with NPE-G2) has had a massive performance problem caused by the BGP process.

rc-e100-49-69#sh processes cpu sorted | i BGP
204  1224991632 222733885       5499 14.07% 10.89% 10.71%   0 BGP Router
192    28822184 202875594        142  0.23%  0.23%  0.23%   0 BGP I/O
123      107552  13767727          7  0.00%  0.00%  0.00%   0 BGP Scheduler
205   341805924   2176809     157024  0.00%  1.30%  1.71%   0 BGP Scanner
206     3824380     15703     243544  0.00%  0.00%  0.00%   0 BGP Event
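For comparing this router against the other locations, it can help to total the CPU share of all BGP processes rather than reading the rows one by one. The following is a minimal Python sketch, assuming the columns follow the usual `show processes cpu` layout (PID, Runtime(ms), Invoked, uSecs, 5Sec, 1Min, 5Min, TTY, Process); the function names `parse_cpu_lines` and `bgp_totals` are hypothetical helpers, not part of any Cisco tool.

```python
import re

# Sample rows copied from the "sh processes cpu sorted | i BGP" output above.
SAMPLE = """\
204  1224991632 222733885       5499 14.07% 10.89% 10.71%   0 BGP Router
192    28822184 202875594        142  0.23%  0.23%  0.23%   0 BGP I/O
123      107552  13767727          7  0.00%  0.00%  0.00%   0 BGP Scheduler
205   341805924   2176809     157024  0.00%  1.30%  1.71%   0 BGP Scanner
206     3824380     15703     243544  0.00%  0.00%  0.00%   0 BGP Event
"""

# Assumed column order: PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
LINE_RE = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<runtime_ms>\d+)\s+(?P<invoked>\d+)\s+(?P<usecs>\d+)\s+"
    r"(?P<five_sec>[\d.]+)%\s+(?P<one_min>[\d.]+)%\s+(?P<five_min>[\d.]+)%\s+"
    r"(?P<tty>\d+)\s+(?P<name>.+?)\s*$"
)


def parse_cpu_lines(text):
    """Return one dict per process row; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue
        rows.append({
            "pid": int(m.group("pid")),
            "name": m.group("name"),
            "five_sec": float(m.group("five_sec")),
            "one_min": float(m.group("one_min")),
            "five_min": float(m.group("five_min")),
        })
    return rows


def bgp_totals(rows):
    """Sum the 5 s / 1 min / 5 min CPU percentages over all parsed rows."""
    return tuple(
        round(sum(r[key] for r in rows), 2)
        for key in ("five_sec", "one_min", "five_min")
    )


if __name__ == "__main__":
    print(bgp_totals(parse_cpu_lines(SAMPLE)))
```

Running the same script against output captured from the healthy routers would give directly comparable totals.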

It never drops below 10%, usually stays at 15-20%, and sometimes goes up to 40%. The BGP CPU usage isn't affected by the traffic load; traffic peaks are only about 250 Mbit/s.

This router has 6 iBGP "full-BGP" peering sessions sitting in the same update group, a single eBGP session, and about 10 iBGP peerings that exchange a small number of internal prefixes (from 100 up to 1000 per session).

We have other locations with a similar topology, where the routers have pretty much the same number of BGP sessions, but there the BGP process usually takes less than 5% at most and normally stays under 1%.

The affected router shows the following statistics:

c7200-G2#sh ip bgp summary  | i BGP activity    

BGP activity 2428861/2067338 prefixes, 310050497/306814805 paths, scan interval 60 secs

c7200-G2#sh ip bgp replication
                                                           Current    Next
Index  Members  Leader      MsgFmt     MsgRepl    Csize    Version Version
    1        1  <IP>             0           0    0/100            0/0
    3        1  <IP>          1443           0    0/100    211083495/0
    4        4  <IP>          1064        2040    0/1000   211083495/0
    7        3  <IP>         95559           0    0/1000   211083062/0
    9        1  <IP>             0           0    0/100    211083495/0
   11        8  <IP>      56065421   286329429    0/1000   211083495/0
   12        1  <IP>      51130169           0    0/100    211083495/0
   13        7  <IP>          1544        5250    0/1000   211083495/0
   14        1  <IP>           300           0    0/100    211083062/0
   16        1  <IP>           368           0    0/100    211083495/0
   17        1  <IP>             0           0    0/100    211083495/0

The problem I can see is groups 11 and 12: the router has generated more than 50 million messages for each of them (MsgFmt 56065421 and 51130169)!

Routers in other locations generate about a third as many messages, and "BGP activity" on them shows about a third as many paths as well.
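To put the "BGP activity" line into perspective, the counters can be turned into the number of prefixes and paths currently held. This is a small Python sketch under one assumption that should be checked against the IOS command reference: that each pair in the line is "allocated/released", so the difference is what is in the table right now. The helper name `current_counts` is hypothetical.

```python
import re

# Line copied from the "sh ip bgp summary | i BGP activity" output above.
ACTIVITY = ("BGP activity 2428861/2067338 prefixes, "
            "310050497/306814805 paths, scan interval 60 secs")


def current_counts(line):
    """Return (prefixes, paths) currently held.

    Assumption: each pair is allocated/released, so current = first - second.
    """
    m = re.search(r"BGP activity (\d+)/(\d+) prefixes, (\d+)/(\d+) paths", line)
    if m is None:
        raise ValueError("not a 'BGP activity' line")
    pfx_alloc, pfx_freed, path_alloc, path_freed = map(int, m.groups())
    return pfx_alloc - pfx_freed, path_alloc - path_freed


if __name__ == "__main__":
    prefixes, paths = current_counts(ACTIVITY)
    # Prints: 361523 3235692 8.95
    print(prefixes, paths, round(paths / prefixes, 2))
```

Under that assumption the router holds roughly 9 paths per prefix; running the same calculation on a healthy router would show whether the path count really is the main difference.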

I'm running out of ideas. I've already consolidated the peer groups and deleted all unnecessary peerings, but without any noticeable effect.


andrew.prince
Level 10

Do you have "soft-reconfiguration inbound" on any peer statements?

Hello Andrew,

yes, we use "soft-reconfiguration inbound" for our external peering (with full BGP) and for a couple of customer peerings without full BGP.

That feature is known to use a lot of memory and to overwork CPUs, because it stores a copy of the BGP updates for each peer it is configured against, so that there is no need to clear the BGP peer to apply policy updates. So if you have 10 peers and they all send the full BGP routing table... well, you get the idea.

I believe a newer feature superseded that one. I think it's "Soft Reset", which does the same thing... have a dig around.

HTH
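As a rough illustration of the point above, the number of extra route copies kept by "soft-reconfiguration inbound" grows with the number of prefixes each such peer sends. The Python sketch below only counts stored copies; the peer count and prefix count are hypothetical example values, not measurements from this router, and real memory use per stored path depends on the IOS version.

```python
def extra_stored_paths(prefixes_per_peer):
    """Count the unmodified route copies kept for inbound soft reconfiguration.

    prefixes_per_peer: one entry per peer that has "soft-reconfiguration
    inbound" configured, giving the number of prefixes that peer sends.
    """
    return sum(prefixes_per_peer)


if __name__ == "__main__":
    # Hypothetical example: 10 peers, each sending a full table of 350,000 prefixes.
    print(extra_stored_paths([350_000] * 10))  # 3500000
```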

Yes, you're right, but the problem is that other routers with pretty much the same number of peers don't have this problem.

I've just removed "soft-reconfiguration inbound" from all iBGP and customer BGP peerings; the only peering that still has it is the external one. I can't remove it there because I would definitely lose the connection for a couple of seconds, and that is not good.

But I don't see any change: the CPU load is the same, and the "BGP Router" process constantly takes the same 15-25%.

UPDATE: Hmm, it seems that the CPU usage of the "BGP Router" process really has gone down. It now stays constantly under 10%, but the overall CPU load is unchanged at 30-40%, and management traffic (pinging the loopback) shows jitter, sometimes up to 100 ms from the local LAN.

But now I can't see which process is taking all the performance.

Firstly, making changes during production hours is never a good idea, as configuration disruptions are not good.

Secondly, you need a baseline to work from. Since you are in production hours, perhaps the CPU usage is normal for this time period. Have you monitored the CPU and memory during a quiet period? Do you have anything to compare it to? Do you have historical records?

About the configuration changes: you're right! But in this case the change was approved and acceptable.

We have more than a year of statistics. Some months ago we had a slight traffic increase and the CPU usage increased as well, but there was no problem. For the past week, however, our monitoring tool has been reporting the following:

First, the ping takes too long, sometimes about 100 ms!

Second, the graphs have "holes" (as if the SNMP requests were timing out), but only for this router. We are monitoring over 100 routers and the others don't have this problem.

Another bad thing: we'd like to bring this router into our MPLS ring, but as soon as I activate "mpls ip" on the MPLS ring interface, the router's CPU load gets much higher and the transit traffic (not only the management traffic) starts to suffer from jitter and packet loss.

I don't think that 200 Mbit of traffic pushes an NPE-G2 to its limit; something else must be wrong there.

OK, the SNMP problem sounds like a timeout issue. You could configure the monitoring system to wait longer for the SNMP get reply, but this does not fix the underlying issue.

What has "changed" on the device since a week ago? Do you have configuration management? Can you track any changes in that specific time period?

How much memory does the router have?

When was the last device reboot?

Do all your devices run the same version of IOS (it could be a bug specific to a memory leak or CPU issues)?

There have been no configuration changes for a month or more; there was a kind of "configuration stop" for this router.

But I know that a couple of BGP peers came back online. They had been completely offline for some time because the CPEs had hardware failures, but in the last 2 weeks they came online again.

I'll check what else I can see.

The monitoring tool wasn't changed either, and it uses the same parameters for all routers, but I'll try to reinitialize the monitoring for this router.

Thank you for your answers, I really appreciate it!

Hello,

The problem with the "holes" in the graphs was solved by increasing the SNMP timeout. But the question remains why the default value suddenly became insufficient.

The CPU used by the "BGP Router" process went down a little, but sometimes it uses up to 80%, and together with "BGP Scanner" it pushes the CPU up to 100%; then the ping rises to 100-200 ms. This still confuses me.

This router has an NPE-G2 and 1 GB of memory, and runs Version 12.2(31)SB13 (it was actually updated, but 6 months ago).

The other routers are NPE-G1 with 1 GB, Version 12.2(31)SB18.

I can't see any memory leak; the routers have been online for more than 5-6 months.

Andrew, you are right. It's called the "route refresh" capability:

http://www.cisco.com/en/US/products/ps6599/products_data_sheet09186a0080087b3a.html#19969

HTH

Kishore

Hello,

If anybody is still interested: we've found the problem. It was BGP-related bugs in c7200p-a3jk91s-mz.122-31.SB13.bin.

After upgrading to the latest version in the SB train, c7200p-a3jk91s-mz.122-31.SB21.bin, all CPU-related problems are gone.