
HIGH CPU SNMP process - snmpbulkget

marx82
Level 1

This is just for your information, as I spent some time on this because I could not find any info by googling around.

Problem:
High CPU for 12 to 50 seconds (ranging from 50 to 100%) caused by the SNMP ENGINE process when the router was polled/discovered.
This caused SNMP timeouts on our monitoring systems.

Affected MIB:
1.3.6.1.4.1.9.9.640   //  SNMPv2-SMI::enterprises.9.9.640
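In case it helps anyone else map that subtree to a MIB module, net-snmp's snmptranslate can resolve it locally, assuming the relevant Cisco MIB files are installed on the management host:

snmptranslate -m ALL -Td .1.3.6.1.4.1.9.9.640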

Problem observed on:
ISR4331 and 800 series running 16.9.5, 17.06.04 and 15.5(3)M6a.
Platform not affected:
ASR1001-X ... probably because they have more processing power.

Trigger example:

snmpbulkget  -v3 -l authpriv -u USER -a SHA -A "xxx"  -x AES -X "yyy"   routername  1.3.6.1.4.1.9.9.590.1.5.1.1.3

The above command will probably result in a timeout. The SNMP request should be answered if you increase the timeout to 10 sec (-t 10).
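For reference, the same poll with the longer timeout and a single retry (the standard net-snmp -t and -r options) would look something like this; USER, the passphrases and routername are the same placeholders as above:

snmpbulkget -v3 -l authpriv -u USER -a SHA -A "xxx" -x AES -X "yyy" -t 10 -r 1 routername 1.3.6.1.4.1.9.9.590.1.5.1.1.3

Cutting the retries down also keeps the management server from queueing extra retransmissions while the router is still busy.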

Complete output when snmpbulkget is answered:
SNMPv2-SMI::enterprises.9.9.613.1.1.1.0 = Gauge32: 32768
SNMPv2-SMI::enterprises.9.9.613.1.1.2.0 = Gauge32: 0
SNMPv2-SMI::enterprises.9.9.613.1.5.1.0 = Hex-STRING: 00
SNMPv2-SMI::enterprises.9.9.639.1.2.7.0 = INTEGER: 2
SNMPv2-SMI::enterprises.9.9.640.1.1.1.2.1.3.101.115.103 = STRING: "ipbasek9"
SNMPv2-SMI::enterprises.9.9.640.1.1.1.3.1.3.101.115.103 = ""
SNMPv2-SMI::enterprises.9.9.640.1.1.1.4.1.3.101.115.103 = STRING: "ipbasek9"
SNMPv2-SMI::enterprises.9.9.640.1.1.1.5.1.3.101.115.103 = Gauge32: 1
SNMPv2-SMI::enterprises.9.9.640.1.1.1.6.1.3.101.115.103 = Gauge32: 4
SNMPv2-SMI::enterprises.9.9.640.1.1.1.7.1.3.101.115.103 = Gauge32: 1

I'm not sure yet whether this is a Cisco bug. I have a TAC case open.

6 Replies

Use traps instead.

They are lighter than polling.

MHM

Aren't traps generated by the router itself? Here I am talking about recurrent discovery polls from the management server causing SNMP timeouts because of high CPU.

Joseph W. Doherty
Hall of Fame

Duh!  (I don't mean to be offensive, but it's an old joke about SNMP: using SNMP, we've discovered the cause of the high CPU consumption and bandwidth consumption; it's SNMP.)

Nowadays, SNMP bandwidth consumption is usually a much smaller percentage, and CPUs also generally have much more overall capacity, so such issues aren't as noticeable as they were decades ago. But, if you work hard enough, like polling a lot of data, you can relive the joys of decades past.

Another very old joke that pertains might be: "Patient: doctor, it hurts when I do this. Doctor: don't do that."

So, is there anything that can be done, beyond replacing hardware with more powerful devices? Yup, much as we did decades ago: think carefully about how SNMP should be used. For example, as suggested by @MHM Cisco World, traps do on-device monitoring and notify the SNMP manager of a threshold crossing rather than constantly polling the data (see the sketch just below). Or, think carefully about what data you really must see, and how often.
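A rough sketch of what the trap approach can look like on IOS/IOS-XE; the host address, user and thresholds below are placeholders picked for illustration, not something from this thread:

snmp-server enable traps cpu threshold
snmp-server host 192.0.2.10 version 3 priv USER
process cpu threshold type total rising 80 interval 60

With something along those lines the router watches its own CPU and only notifies the manager when the threshold is crossed, instead of the manager pulling the value on every poll cycle.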

Thanks, I understand your point. It has been a long time since I last dealt with an SNMP issue. From a user's point of view it is not possible to discriminate between a bug and expected behaviour. The system uses a lot of CPU to retrieve a few pieces of information, and that seemed an anomaly to me regardless of what Cisco may say about it. There are a lot of untold things going on behind the scenes. So, the reason for my post was just to report it.
As a side note, I did a couple of additional tests increasing the SNMP timeout. It reduced the CPU usage to 12-15 seconds on some platforms (instead of 40-50 seconds). I suppose the router was buffering the retransmissions triggered on the server side, causing the longer CPU usage. In production we will exclude the OIDs (one router-side option is sketched below).
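For what it's worth, a minimal sketch of one way to do the exclusion on the router side, using an SNMP view on IOS/IOS-XE (it can of course also be excluded on the poller instead); the view and group names are invented for the example, and the group has to be the one your SNMPv3 user actually belongs to:

snmp-server view NO-LIC-MIB iso included
snmp-server view NO-LIC-MIB 1.3.6.1.4.1.9.9.640 excluded
snmp-server group MONITOR v3 priv read NO-LIC-MIB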

I'm glad to read you found some mitigation possible.

Something else that might help, if the management host is on a different subnet: if it is not already enabled, enable PMTUD on the Cisco network device.
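If I understand the suggestion correctly, on IOS/IOS-XE that would be the global command below; note it enables PMTUD for TCP sessions originated by the device, so whether it helps a UDP-based SNMP poller is an assumption on my part rather than a confirmed fix:

ip tcp path-mtu-discovery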

 

Most likely hitting CSCvz48434 (Bug Search Tool), as per Cisco TAC info.

"After my discussions with the developers, they informed me that the old licensing mechanism had several high-CPU related bugs in the MIBs and this is mentioned in internal notes of this bug, and the developers removed the old licensing mechanism completely in 17.11.1, so once the MIB is disabled or excluded (as we did), the high CPU issue in it will not occur, because the code will not execute at all, and by upgrading the device to 17.11.1 or onwards, the issue will be resolved."