10-17-2013 05:49 AM
Hello everyone!
Recently have encountered such problem.
SNMP requests are overloading CPU for more than 50%. Can't figure out what's is the reason of such behaviour. I could blame SNMP manager for bombing the ASR with a lot of requests, but I doubt that, because, the other ASR is requested the same manner, and there are no overloads observed.
Does anyone have been encountered same problem ? What information should be gathered for deeper analisys?
Any help would be appriciated!
Here is some output from show commands.
RP/0/RSP1/CPU0:ASR9k-2#sh proc cpu loc 0/rSP1/CPU0
Thu Oct 17 13:51:02.903 EEDT
CPU utilization for one minute: 53%; five minutes: 22%; fifteen minutes: 12%
-------------------//--------------------
53292 3% 2% 2% eth_server
233558 14% 5% 2% gsp
233569 9% 3% 2% sysdb_mc
286885 8% 2% 1% sysdb_shared_nc
311600 10% 2% 1% hsrp <------- (?)
103694642 0% 0% 0% snmpd
103694659 0% 0% 0% mibd_entity
103694660 0% 0% 0% mibd_infra
103694661 4% 2% 2% mibd_interface
103694662 0% 0% 0% mibd_route
104173848 0% 0% 0% 104173848_unknown
104173920 1% 1% 1% 104173920_unknown
RP/0/RSP1/CPU0:ASR9k-2#sh snmp trace requests
Thu Oct 17 13:50:20.017 EEDT
Entering snmp_ltrace main....
2107 wrapping entries (2112 possible, 0 filtered, 84574100 total)
-------------------//--------------------
Oct 17 13:49:22.830 snmp/requests 0/RSP1/CPU0 84477507# t1 Processing PDU from 10.20.2.65,54654 req_id = 1469213559 (657ms on Q), type GETBULK
Oct 17 13:49:23.174 snmp/requests 0/RSP1/CPU0 84456738# t13 Rx PDU from 10.20.2.65,54654 len = 52 [Q = 1]
Oct 17 13:49:24.175 snmp/requests 0/RSP1/CPU0 84435969# t13 Rx PDU from 10.20.2.65,54654 len = 52 [Q = 2]
Oct 17 13:49:24.686 snmp/requests 0/RSP1/CPU0 84548866# t8 Discarding response tp[5] - peerid[2] serial[2991] val1[11] val2[29306011]
Oct 17 13:49:24.686 snmp/requests 0/RSP1/CPU0 t1 Processing PDU from 10.20.2.65,54654 req_id = 1469213559 (1511ms on Q), type GETBULK
Oct 17 13:49:25.176 snmp/requests 0/RSP1/CPU0 t13 Rx PDU from 10.20.2.65,54654 len = 52 [Q = 2]
Oct 17 13:49:26.177 snmp/requests 0/RSP1/CPU0 t13 Rx PDU from 10.20.2.65,54654 len = 52 [Q = 3]
Oct 17 13:49:26.462 snmp/requests 0/RSP1/CPU0 84475437# t8 Discarding response tp[5] - peerid[2] serial[2992] val1[12] val2[29306012]
Oct 17 13:49:26.462 snmp/requests 0/RSP1/CPU0 84454668# t1 Processing PDU from 10.20.2.65,54654 req_id = 1469213559 (2286ms on Q), type GETBULK
Oct 17 13:49:26.462 snmp/requests 0/RSP1/CPU0 84433899# t1 Recovering timedout handle - serial[2993] mibds[1] pending[1] time[29306013]
Oct 17 13:49:26.462 snmp/requests 0/RSP1/CPU0 84332124# t1 Processing PDU from 10.20.2.65,54654 req_id = 1469213559 (1285ms on Q), type GETBULK
Oct 17 13:49:27.178 snmp/requests 0/RSP1/CPU0 84563446# t13 Rx PDU from 10.20.2.65,54654 len = 52 [Q = 2]
Oct 17 13:49:28.212 snmp/requests 0/RSP1/CPU0 69026775# t13 Rx PDU from 10.20.2.65,36383 len = 52 [Q = 3]
Oct 17 13:49:28.229 snmp/requests 0/RSP1/CPU0 84413130# t8 Discarding response tp[1] - peerid[2] serial[2993] val1[38] val2[15]
Oct 17 13:49:29.213 snmp/requests 0/RSP1/CPU0 84473367# t13 Rx PDU from 10.20.2.65,36383 len = 52 [Q = 4]
Oct 17 13:49:30.126 snmp/requests 0/RSP1/CPU0 84452598# t8 Discarding response tp[5] - peerid[2] serial[2994] val1[13] val2[29306014]
Oct 17 13:49:30.127 snmp/requests 0/RSP1/CPU0 84350823# t1 Processing PDU from 10.20.2.65,54654 req_id = 1469213559 (3949ms on Q), type GETBULK
Oct 17 13:49:30.127 snmp/requests 0/RSP1/CPU0 84431829# t1 Recovering timedout handle - serial[2995] mibds[1] pending[1] time[29306015]
Oct 17 13:49:30.127 snmp/requests 0/RSP1/CPU0 84330054# t1 Processing PDU from 10.20.2.65,54654 req_id = 1469213559 (2948ms on Q), type GETBULK
Oct 17 13:49:30.127 snmp/requests 0/RSP1/CPU0 84557209# t1 Recovering timedout handle - serial[2996] mibds[1] pending[1] time[29306016]
Oct 17 13:49:30.127 snmp/requests 0/RSP1/CPU0 84492066# t1 Processing PDU from 10.20.2.65,36383 req_id = 1461946883 (1914ms on Q), type GETBULK
Oct 17 13:49:30.215 snmp/requests 0/RSP1/CPU0 84471297# t13 Rx PDU from 10.20.2.65,36383 len = 52 [Q = 2]
Oct 17 13:49:31.215 snmp/requests 0/RSP1/CPU0 84450528# t13 Rx PDU from 10.20.2.65,36383 len = 52 [Q = 3]
Oct 17 13:49:31.812 snmp/requests 0/RSP1/CPU0 84348753# t8 Discarding response tp[1] - peerid[2] serial[2995] val1[39] val2[15]
Oct 17 13:49:32.217 snmp/requests 0/RSP1/CPU0 84429759# t13 Rx PDU from 10.20.2.65,36383 len = 52 [Q = 4]
Oct 17 13:49:33.218 snmp/requests 0/RSP1/CPU0 84327984# t13 Rx PDU from 10.20.2.65,36383 len = 52 [Q = 5]
Oct 17 13:49:33.619 snmp/requests 0/RSP1/CPU0 53781621# t8 Discarding response tp[1] - peerid[2] serial[2996] val1[40] val2[15]
Oct 17 13:49:34.252 snmp/requests 0/RSP1/CPU0 84550972# t13 Rx PDU from 10.20.2.65,37437 len = 52 [Q = 6]
Oct 17 13:49:35.000 snmp/requests 0/RSP1/CPU0 84469227# t1 Recovering timedout handle - serial[2997] mibds[1] pending[1] time[29306018]
Oct 17 13:49:35.000 snmp/requests 0/RSP1/CPU0 84448458# t1 Processing PDU from 10.20.2.65,36383 req_id = 1461946883 (5786ms on Q), type GETBULK
Oct 17 13:49:35.000 snmp/requests 0/RSP1/CPU0 84346683# t1 Recovering timedout handle - serial[2998] mibds[1] pending[1] time[29306020]
Oct 17 13:49:35.000 snmp/requests 0/RSP1/CPU0 84427689# t1 Processing PDU from 10.20.2.65,36383 req_id = 1461946883 (4784ms on Q), type GETBULK
Oct 17 13:49:35.253 snmp/requests 0/RSP1/CPU0 84325914# t13 Rx PDU from 10.20.2.65,37437 len = 52 [Q = 5]
Oct 17 13:49:35.357 snmp/requests 0/RSP1/CPU0 68474877# t8 Discarding response tp[1] - peerid[2] serial[2997] val1[41] val2[15]
Oct 17 13:49:36.254 snmp/requests 0/RSP1/CPU0 84487926# t13 Rx PDU from 10.20.2.65,37437 len = 52 [Q = 6]
Oct 17 13:49:37.150 snmp/requests 0/RSP1/CPU0 84544735# t8 Discarding response tp[1] - peerid[2] serial[2998] val1[42] val2[15]
Oct 17 13:49:37.255 snmp/requests 0/RSP1/CPU0 84446388# t13 Rx PDU from 10.20.2.65,37437 len = 52 [Q = 7]
 
					
				
		
10-17-2013 08:04 AM
it could be that the snmp management station is polling aggressively some MIB that is slow (or performing a walk on a slow OID).
check the snmp architecture document we have on the support forums for some detail in identifying the culprit and maybe there is a good suggestion there for the slow OID.
Potentially increasing the timeout on the requests may help
and verification of the smu currency on the released snmp (performance) smu's (depending on release)
regards
xander
10-25-2013 05:38 AM
Thank You Alex for Your reply!
Yes, I have several times checked the document that you are reffering to. I belive it's this one that you are talking about- https://supportforums.cisco.com/docs/DOC-31331 By the way thank you, these explanations that you are writing are very useful!
Ok, back to business.
1) According to snmp management system (observium), the "slow oids" are interfaces oids. Overall SNMP request-response time takes about 110 seconds, where interface oids are being requested for 77 seconds.
2) sh snmp trace slow oids also shows that about 1s is necessary to process some interface oids.
3) We have ver. 4.2.3 running on our ASR. I have checked that there are available several SMU's regarding mib_interfaces and snmpd processes. We had also noticed that both above mentioned processes have crashed few times. So probably the solution would be to install these patches.
But, there is also one but.
4)We have launched tcpdumps on server for ASR-1 and ASR-2, the SNMP manager sends same number of requests for both devices, but for some reason only asr-2 have these problems with cpu overloading.
So I am curious, why two identical devices, having same software and running same config are behaving differently...
 
					
				
		
10-25-2013 05:52 AM
Hi Marks,
thanks for your comment! and for this problem let me see,
for sure in 423 we uncovered a lot of performance issues and were subsequently fixed, along with the crashes referenced.
so for sure 423 usage with the right smu's is the first step.
Next we made subsequent improvemetns in 43. currently 432 is the running version but 434 coming out in Dec has more perf improvements and all that.
We had recognized that snmp fell short to expectations so a lot of effort and focus was put on it with immediate results in 432 and comign in 434.
Why problem seen on one device and not another; ah so many possibilities, interface count, process business, interface or linecard organziation. I dont find it a worry.
cheers!
xander
10-25-2013 06:02 AM
ookay, so I believe we'll stick to sw upgrade solution.
Thank You!
02-04-2014 12:32 AM
Finally we've got installed Rel. 5.1.0
The situation is a little bit better, but still, recently mibd_interface process have crashed couple of times.
 
					
				
		
02-04-2014 05:12 AM
Marks, that is not good. Can you immediately file a tac case please and have these crashes analyzed?
Apologies for that.
regards
xander
02-04-2014 05:38 AM
It's okay. Yes, I have raised TAC case
 
					
				
				
			
		
Discover and save your favorite ideas. Come back to expert answers, step-by-step guides, recent topics, and more.
New here? Get started with these tips. How to use Community New member guide