This blog aims to explain what to do if you see SNMP-3-INPUT_QFULL_ERR or SNMP-3-RESPONSE_DELAYED errors. There is a TL;DR toward the bottom.
Working in TAC for over 10 years now, I have grown tired of seeing so many cases with this error message:
%SNMP-3-INPUT_QFULL_ERR: Packet dropped due to input
Very often, I saw TAC engineers giving the wrong info about what these errors are and how to deal with them. For example, the common recommendations:
1) Remove and then reconfigure SNMP configs
- Problem: While this is likely to get rid of the problem in the short term, it does nothing to solve the problem for good. That's not really what Cisco is aiming for.
2) Use the "snmp queue-length" command
- Problem: This is actually guaranteed NOT to work, since the queue being increased is the one used for SNMP traps leaving the box, not the one used for SNMP requests received by the box.
3) Don't poll the device so often or with as many SNMP servers
- Problem: It's natural to think that a queue filling up (as the error describes) could be due to congestion from too much polling. Maybe this was the root cause when CPUs were slower, but in my lab testing, modern Cisco platforms (those 5 years old or newer) did NOT suffer with up to 3 servers (the most I tested) polling continuously.
4) Find the guilty OID (which only TAC can do) then block it in an snmp-view
- Problem: First, this requires customers to open a case with TAC, so there's no hope of solving the issue with a simple search in this or other support communities. As a user myself, I love it when a problem I'm seeing is already documented online, with the final fix given to me. Second, blocking the OID will have an impact on network management stations that may have good reason to fetch the OID.
I took a closer look at as many cases as I could find to figure out how to make this class of problem easier to solve for all of us users. My research showed that there is a fairly small set of software bugs that address these issues, usually when an OID takes multiple seconds to process rather than the milliseconds of the routine case. Generally each bug is uniquely identified by platform and OID. For example, if you see INPUT_QFULL_ERR messages on a 3850 running 3.2 code, you are probably hitting CSCuo12316. That's still not very user-friendly. I wanted something easier and more reliable. I reasoned that if the error itself could tell us what the slow OID was, then a simple string-matching search could identify the bug that fixes the issue (I don't think I found any cases where the root cause was hardware failure).
The new feature to do this, added by CSCuz93302, is called SNMP monitoring, and will print an error similar to the following in cases where an SNMP request takes an IOS/IOS-XE device more than 2 seconds (by default) to process:
%SNMP-3-RESPONSE_DELAYED: processing GetNext of ciscoFlashFileEntry.184.108.40.206
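Since the whole point of this message is to hand you a search term, here is a minimal Python sketch that pulls the operation and the OID name out of a line like the one above. The regular expression is based only on the single sample line in this post, so the exact wording on your release is an assumption, and the names `parse_delayed` and `slow_objects` are made up for illustration.

```python
import re
from collections import Counter

# Pattern based on the single sample line in this post; the exact wording on
# any given release is an assumption.
DELAYED_RE = re.compile(
    r"%SNMP-3-RESPONSE_DELAYED:\s+processing\s+(?P<op>\w+)\s+of\s+(?P<oid>\S+)"
)


def parse_delayed(line):
    """Return the operation, full OID and object name, or None if no match."""
    match = DELAYED_RE.search(line)
    if match is None:
        return None
    oid = match.group("oid")
    # The object name is everything before the first dot. The rest is the
    # instance index, which is not useful as a search term.
    return {
        "operation": match.group("op"),
        "oid": oid,
        "object": oid.split(".", 1)[0],
    }


def slow_objects(lines):
    """Count delayed responses per object name across many log lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_delayed(line)
        if parsed is not None:
            counts[parsed["object"]] += 1
    return counts
```

Feeding your saved syslog lines to `slow_objects` gives a count per object name. The name at the top of that count is the string to search for alongside your platform.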
We tested this monitoring feature in a VERY high scale lab without seeing any of the above, which gives us confidence that the errors should not be seen in a standard setup. If you see an error like this, and can't find an existing bug from your own searching, then please open a TAC case so that Cisco can investigate the root cause and fix. Some additional useful information would be:
- Is the problem seen if you do a manual "snmpwalk" on the same OID from a server with snmpwalk installed?
- Is there high CPU at the same time the logs are seen? If the CPU is high for the SNMP ENGINE process, then that will be great for TAC to know. But if the CPU is high without the SNMP ENGINE process being high, then the slow SNMP response is probably just a symptom of the system being oversubscribed (in the case of interrupt CPU) or non-optimized (in the case of another process showing high CPU). If CPU isn't high at all, that's still probably a software bug, perhaps on a linecard. If you see a lot of different OIDs in your messages, then the root cause is likely not with any specific OID, but rather with the device being busy doing other, non-SNMP work.
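The CPU reasoning above can be written down as a small decision function. This is only a sketch of the logic in this post, not a Cisco tool: the function name `triage`, its parameters, the `many_oids_threshold` cut-off of 5, and the order in which the checks run are all my own assumptions.

```python
def triage(total_cpu_high, snmp_engine_high, interrupt_cpu_high,
           distinct_oids, many_oids_threshold=5):
    """Map the observations described above to a likely explanation.

    many_oids_threshold is a hypothetical cut-off for "a lot of different
    OIDs"; the post does not give a number.
    """
    # Many different slow OIDs point away from any single OID.
    if distinct_oids >= many_oids_threshold:
        return "device busy with non-SNMP work; no single OID to blame"
    # No CPU pressure at all still suggests a bug, maybe on a linecard.
    if not total_cpu_high:
        return "probably a software bug, perhaps on a linecard"
    # High CPU in the SNMP ENGINE process is what TAC wants to hear about.
    if snmp_engine_high:
        return "SNMP ENGINE is busy; include this in the TAC case"
    # High CPU elsewhere means slow SNMP is only a symptom.
    if interrupt_cpu_high:
        return "system oversubscribed; slow SNMP is a symptom"
    return "another process is non-optimized; slow SNMP is a symptom"
```

For example, `triage(True, False, True, 1)` lands on the oversubscribed case, while `triage(False, False, False, 1)` lands on the probable software bug.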
This is my first post in a series I'm planning about IOS/IOS-XE serviceability improvements that I or others in TAC have been working on to make common problems easier to solve. Please feel free to leave feedback about this post, and ideas for improvements to Cisco Software that would save you lots of time. If anyone can help push those fixes into reality, it's your friendly, helpful TAC engineers.
TL;DR: if you see an error starting with "SNMP-3-RESPONSE_DELAYED", search the internet for Cisco bugs where the name of the OID in the bug matches the one in your error message. If you find none, or the bug you found is unfixed, open a TAC case and have them investigate. These problems are usually software bugs that can be fixed permanently.