01-27-2010 11:18 PM - edited 03-04-2019 07:19 AM
Hello everybody,
On the router we have about 170 BGP peers.
One day we hit max-prefix on a number of peers (about 50).
These sessions were in the Active/Idle state, and we were asking the peers to reset them.
While this was going on, we had abnormal CPU load on the router:
#sh process cpu sorted
CPU utilization for five seconds: 100%/4%; one minute: 85%; five minutes: 78%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
468 471065972 92202393 5109 94.74% 72.16% 64.72% 0 BGP Router
Before that, it had been something like this:
#sh process cpu sort
CPU utilization for five seconds: 21%/8%; one minute: 23%; five minutes: 22%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
145 695300536 675156784 1029 6.55% 2.54% 2.51% 0 BGP Router
We put the sessions into the Administrative Down state, and this helped to reduce the CPU load.
We have had the same situation even when just a few peers were in the Active/Idle state.
As you can see, our router is not as weak as you might suppose:
#sh ver
Cisco IOS Software, c7600s72033_rp Software (c7600s72033_rp-ADVIPSERVICES-M), Version 12.2(33)SRD1, RELEASE SOFTWARE (fc4)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2009 by Cisco Systems, Inc.
Compiled Tue 24-Feb-09 23:34 by prod_rel_team
ROM: System Bootstrap, Version 12.2(17r)S4, RELEASE SOFTWARE (fc1)
BOOTLDR: Cisco IOS Software, c7600s72033_rp Software (c7600s72033_rp-ADVIPSERVICES-M), Version 12.2(33)SRD1, RELEASE SOFTWARE (fc4)
Router01 uptime is 26 weeks, 1 day, 15 hours, 34 minutes
Uptime for this control processor is 26 weeks, 1 day, 15 hours, 47 minutes
System returned to ROM by reload at 15:34:25 UTC Tue Feb 24 2009 (SP by reload)
System restarted at 17:05:40 UTC Tue Jul 28 2009
System image file is "bootdisk:c7600s72033-advipservices-mz.122-33.SRD1.bin"
Last reload type: Normal Reload
cisco CISCO7609 (R7000) processor (revision 1.2) with 983008K/65536K bytes of memory.
Processor board ID FOX103107NH
SR71000 CPU at 600Mhz, Implementation 0x504, Rev 1.2, 512KB L2 Cache
Last reset from s/w reset
5 Virtual Ethernet interfaces
100 Gigabit Ethernet interfaces
12 Ten Gigabit Ethernet interfaces
1917K bytes of non-volatile configuration memory.
8192K bytes of packet buffer memory.
65536K bytes of Flash internal SIMM (Sector size 512K).
Configuration register is 0x2102
Why did we have such a high CPU load in the situation I described above?
What should we do to avoid it?
Thank you in advance!
--
Have a nice day,
Dmitry
01-28-2010 12:22 AM
Hello Dmitry,
I should have looked at this thread before answering in the other one.
If max-prefix causes 50 BGP sessions to be torn down, the router then has to load all the prefixes again on each session, and this causes high CPU usage.
You can change the reaction to warning-only to avoid having the sessions terminated.
170 BGP sessions is quite a number. You may consider splitting these BGP sessions over multiple devices or deploying route reflector servers.
Hope to help
Giuseppe
01-28-2010 01:57 AM
Hello Giuseppe,
Thank you for your reply. It looks like you misunderstood me. Our AS leaked a number of prefixes to the peers, and the sessions were blocked on the remote side, not on our side, by the max-prefix trigger.
--
Thank you anyway,
Dmitry
01-28-2010 02:34 AM
Hello Dmitry,
Sorry for this little misunderstanding.
Regardless of which side terminated the sessions, all of them are restarted, and the loading phase starts with the high CPU usage you have seen.
On your side, you can only configure a route filter that explicitly permits only the expected IP subnets.
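As an illustration of such a filter, here is a minimal sketch using an outbound prefix list. The AS number 64500, the neighbor address 192.0.2.1, the prefixes, and the list name OUR-PREFIXES are all placeholders for illustration and are not taken from Dmitry's configuration; the neighbor is assumed to be defined already.

```
! Hypothetical list of the only prefixes we intend to announce.
! Anything not matched here is dropped by the implicit deny at the end.
ip prefix-list OUR-PREFIXES seq 10 permit 198.51.100.0/24
ip prefix-list OUR-PREFIXES seq 20 permit 203.0.113.0/24
!
router bgp 64500
 ! Apply the list to updates sent to the (placeholder) peer.
 neighbor 192.0.2.1 prefix-list OUR-PREFIXES out
```

With a filter like this on every eBGP session, a leak of extra prefixes should not reach the peers, so their max-prefix limits would not be triggered by it.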
Hope to help
Giuseppe
01-28-2010 03:17 AM
Hello Giuseppe,
This situation (high CPU load) can last a whole day while 15 direct eBGP peers are still in the Active/Idle state.
The peer may only reset the session after a few days, so in the meantime the sessions stay in the Active/Idle state and no Update messages are sent between our ASes.
Why would just a few down sessions be the source of such high CPU load for the whole time they are Active or Idle?
--
Have a nice day,
Dmitry
01-28-2010 08:03 PM
Hi Dmitry,
Let's ignore the fact that your AS leaked the prefixes, since you also say this problem occurs anyway during a normal day of your network. I have an idea, but haven't actually used it in practice to be honest. My understanding is that your side continuously and actively tries to establish sessions and this might be causing this problem. There exists a BGP command "neighbor
http://www.cisco.com/en/US/docs/ios/iproute/command/reference/irp_bgp4.html#wp1043787
Kind Regards,
Maria
Edit: Ok, I just found this thread in Cisco NSP: http://www.gossamer-threads.com/lists/cisco/nsp/122263#122263
01-29-2010 02:15 AM
Maria,
thank you very much!
It is a well-known bug: CSCsy40775
--
Have a nice day,
Dmitry
01-29-2010 01:27 AM
Hello Dmitry,
Maria has provided a possible remedy and a link where other people with a similar device (Sup720) in a similar scenario (many eBGP sessions at an Internet exchange point like DE-CIX) saw high CPU usage when two or more eBGP sessions were stuck in the Idle state.
However, the final objective is to set up the eBGP sessions when the other peer is willing to accept/start the session.
If the command
neighbor x.x.x.x transport connection-mode passive
is given on both sides, the session never comes up.
So if a manual change is needed, it becomes similar to using neighbor ... shutdown.
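For reference, the two options being compared could look roughly like this. This is only a sketch: the AS number and neighbor addresses are placeholders, the neighbors are assumed to be defined already, and whether the transport connection-mode keyword is available depends on the IOS release.

```
router bgp 64500
 ! Placeholder peer 192.0.2.1: only accept incoming TCP connections
 ! from this peer, never initiate them from our side.
 neighbor 192.0.2.1 transport connection-mode passive
 !
 ! Placeholder peer 192.0.2.2: administrative shutdown, for comparison.
 ! The session stays down until "no neighbor 192.0.2.2 shutdown" is entered.
 neighbor 192.0.2.2 shutdown
```

The difference is that a passive neighbor still comes up on its own as soon as the remote side opens the connection, while a shut-down neighbor needs a manual change. As noted above, passive mode must not be configured on both ends.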
This is clearly an IOS software bug, even if it is not possible to find an exact match, because the bug mentioned in the Gossamer forums is related to an increase in memory usage by BGP over time caused by idle BGP sessions, not to a large CPU increase over such an extended time window.
I think a bug exists for this, but it is not visible outside Cisco (for example, it is not visible to a Cisco partner account like mine).
I would suggest opening a Cisco service request, where you can get feedback on this.
Generally speaking, I don't like the use/abuse of max-prefix that is done on border routers at Internet exchange points: putting limits very close to the current number of prefixes requires constant work to adjust these thresholds.
I think it should be used to avoid accepting a full routing table from someone who should send only 10-20 prefixes. But if we use a limit of 13 when the current number of prefixes is 10, and the peer then gets new customers and tries to advertise 14 prefixes, the session is torn down.
Note that the command also allows a warning-only reaction.
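To make the two reactions concrete, here is a small sketch. The AS number, neighbor addresses, and the limit of 100 are illustrative values only, not a recommendation, and the neighbors are assumed to be defined already.

```
router bgp 64500
 ! Placeholder peer 192.0.2.1: log a warning at 80% of the limit and
 ! tear the session down once more than 100 prefixes are received.
 neighbor 192.0.2.1 maximum-prefix 100 80
 !
 ! Placeholder peer 192.0.2.2: same limit, but only log a message
 ! when it is exceeded; the session stays up.
 neighbor 192.0.2.2 maximum-prefix 100 warning-only
```

For a peer that normally sends 10-20 prefixes, a limit with this much headroom would still stop a full table while tolerating ordinary growth.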
Just to say, every day at work I see mails from IXP members announcing changes.
Of course, it is easier to limit the number of prefixes learned from a peer than to build an AS path access-list that reflects the current customers of the peer.
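A rough idea of what such an AS path access-list involves, assuming a placeholder peer AS 64501 with a single customer AS 64510 (all numbers and addresses here are hypothetical):

```
! Accept routes originated by the peer itself...
ip as-path access-list 10 permit ^64501$
! ...and routes originated by its one known customer, reached via the peer.
ip as-path access-list 10 permit ^64501_64510$
!
router bgp 64500
 ! Apply the list to updates received from the (placeholder) peer.
 neighbor 192.0.2.1 filter-list 10 in
```

Every new or departed customer of the peer means another line to add or remove, and these simple expressions do not even allow for AS path prepending, which is the maintenance burden described above.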
Some years ago, while studying BGP peering, we tried to use the AS-SET concept: each ISP should maintain an AS-SET object at the RIPE RIR listing the AS numbers that can appear in the ISP's advertisements.
Unfortunately, by comparing these AS-SET objects with the actual advertisements received at different European IXPs, we saw that they were not accurate: nobody removed ex-customers' ASNs, and some new ones did not appear.
Hope to help
Giuseppe
01-29-2010 02:15 AM
Giuseppe,
thank you very much!
It is a well-known bug: CSCsy40775
--
Have a nice day,
Dmitry