on 03-25-2013 09:43 AM
This document discusses the SNMP architecture as it is implemented in IOS-XR. The IOS to XR migration guide (A starting point) already highlights some of the high-level differences between IOS and XR.
Because IOS-XR is a highly distributed operating system that uses hardware forwarding, the way SNMP retrieves counters and responds to requests is a bit different than what you might be used to. In this article we deep dive into the architecture of stats collection, how it operates, and which show commands you can use to verify SNMP performance on IOS-XR, specifically on the ASR9000 (though this article also applies to CRS and GSR running IOS-XR).
XR routers are highly distributed. Increasing capacity by distribution and replication does come at a cost. In any scaled design that replicates or multiplies the processing devices, a critical additional component of the design is the inter-process communication path between the processing components.
This article originated from the fact that some of our customers have seen SNMP timeouts in XR 4.2.3, which has raised a lot of questions about caching, stats collection and the way SNMP operates. Hopefully this technote clears up some of the confusion.
This section describes the symptoms of the problem and the main issue the document resolves.
Depending on your configuration, SNMP packets can be received in band or out of band (as per MPP definitions, see article on LPTS and MPP for more info). After initial reception and punting to the control plane (RSP), they are handed over to NETIO. NETIO is comparable to the IP Input process in IOS, which deals with process-level switching.
If the SNMP requests are "for me", they are handed over to the SNMP-D process, which evaluates the request and dispatches it to the next layer of processing.
ENTITY-MIB
IF-MIB
IP MIBs support
Routing MIBs support (BGP, OSPF, ISIS, etc)
MPLS, Pseudowire, VPLS MIBs support
IEEE 802x (LAG, CFM, OAM)
SNMP request processing blocked during critical event periods (i.e. OSPF convergence)
Debuggability:
Caching is an integral part of IOS XR SNMP processing. It lets SNMP respond quickly while keeping the statistics as accurate as possible.
There are various levels of caching; some are configurable and some are not. We also cache to relieve the hardware of the burden of continuous requests, especially in WALK scenarios that generate many requests, for example for interface stats counters.
There is a process called STATS-D running on the linecard that periodically scrapes statistics from the linecard's hardware and updates the interface counters and MIB stats.
This means that if you poll twice within the stats-D update interval, you will likely see the same counter value returned both times.
Show interface commands (depending on release) force a direct read from hardware to get the most accurate reading, but the IF-MIB stats are cached.
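The effect of polling faster than the cache is refreshed can be sketched with a toy model. This is only an illustration: the class name, the 30-second interval and the fake hardware counter are all assumptions, and the real stats-D process pushes updates on its own timer rather than refreshing on demand as this sketch does.

```python
class PeriodicStatsCache:
    """Toy model of a counter cache that is refreshed on a fixed interval."""

    def __init__(self, read_hardware, refresh_interval_s):
        self._read_hardware = read_hardware            # callable returning the live counter
        self._refresh_interval_s = refresh_interval_s  # assumed value, not an XR default
        self._cached_value = None
        self._last_refresh_s = None

    def poll(self, now_s):
        """Return the cached counter, re-reading hardware only when the interval has elapsed."""
        stale = (self._last_refresh_s is None
                 or now_s - self._last_refresh_s >= self._refresh_interval_s)
        if stale:
            self._cached_value = self._read_hardware(now_s)
            self._last_refresh_s = now_s
        return self._cached_value


def fake_hardware_counter(now_s):
    # Hypothetical interface receiving a steady 1000 octets per second.
    return 1000 * now_s


cache = PeriodicStatsCache(fake_hardware_counter, refresh_interval_s=30)
first = cache.poll(now_s=0)
second = cache.poll(now_s=10)   # inside the refresh window: cached value is reused
third = cache.poll(now_s=30)    # window elapsed: value is read from "hardware" again
print(first, second, third)
```

A poller that computes rates from two such samples would see a zero delta for the first pair, which is why the polling interval should not be shorter than the cache update interval.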
Visualizing caching differently:
Two caching mechanisms:
1: Statsd caching:
Used for interface-related statistics (IF-MIB, IF-EXTENSION-MIB, etc.)
Statsd caching is configurable (via CLI).
2: Lookahead caching:
Conceptually a varbind cache.
Not configurable.
Not all MIBs leverage/use this cache.
Statsd cache:
System maintains look ahead cache:
Data for up to 500 interfaces kept in cache
The following picture tries to explain what the serialized processing means:
SNMP requests are handled sequentially as they are received. If the request currently in progress is "slow", subsequent requests wait to be handled and may time out.
The NMS station may then resend its SNMP request, building up the request queue and potentially causing more trouble.
The good news is that XR 4.3.1 can detect duplicate requests and throw them out of the queue, making sure we deal only with "NEW" requests.
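The queue build-up and the benefit of duplicate detection can be illustrated with a small sketch. It is not the XR implementation: how snmpd actually recognizes a duplicate is not described here, so keying on source address plus request ID, the class name and the addresses are assumptions made for the example.

```python
from collections import deque


class SerialRequestQueue:
    """Toy model of a receive queue that is drained one request at a time."""

    def __init__(self, drop_duplicates):
        self.drop_duplicates = drop_duplicates
        self.pending = deque()

    def enqueue(self, source_ip, request_id):
        """Queue a request; return False if it was discarded as a duplicate."""
        key = (source_ip, request_id)  # assumed duplicate key, for illustration only
        if self.drop_duplicates and key in self.pending:
            return False
        self.pending.append(key)
        return True


def simulate_retries(drop_duplicates, retries=3):
    """Return the queue depth after one poller retransmits while a slow request is running."""
    queue = SerialRequestQueue(drop_duplicates)
    # Another poller's request is already waiting behind the slow request.
    queue.enqueue("192.0.2.20", 7)
    # Hypothetical NMS: the original request plus retransmissions with the same ID.
    for _ in range(1 + retries):
        queue.enqueue("192.0.2.10", 42)
    return len(queue.pending)


print(simulate_retries(drop_duplicates=False))  # every retry occupies a queue slot
print(simulate_retries(drop_duplicates=True))   # retries collapse into a single entry
```

Without duplicate detection the queue depth grows with every retry, so an aggressive retry setting on the NMS makes a slow request worse for everyone behind it.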
StatsD is a process that collects statistics from various places (e.g. hardware) and updates tables in the LC shared memory.
IPC is inter-process communication (or an inter-process call), used by processes to talk to each other to request data or send commands.
GSP is the group services protocol, a process in IOS-XR that allows one process to communicate with multiple "nodes" at the same time (a sort of multicast that the RSP can use to talk to multiple linecards, for instance to update a FIB route).
The following show and debug commands are very useful for verifying and tracking SNMP.
Command | Description |
show snmp | Global agent counters: incoming, outgoing (request and trap), and error PDUs. Technique: Periodically collect output to determine overall PDU response rate and identify error rate. |
show snmp trace requests | Log of high-level PDU processing tracepoints: Rx, Proc Start, Tx time. Technique: Periodically collect this log. Decode and use the data to determine the following per-PDU data: 1. Source IPs of pollers 2. Queue lengths of per-source IP PDU queues 3. Types of request PDUs being used 4. Timestamp when PDUs are enqueued into the queues for the source IPs 5. Duration of the PDU enqueued and waiting to be processed 6. Processing time of PDUs from pollers |
show snmp mib access | Per-OID counters indicating the number of times an operation was done on that OID, i.e. GET, GETN, SET. Technique: Periodically collecting and diffing will indicate what was polled during the time periods. |
show snmp mib access time | Per-OID timestamp of the last operation on the OID. Technique: Periodically collecting and diffing will indicate if any polling on the OID was done in the time period. |
debug snmp request (careful!!!) | Enable to log every OID being processed by every PDU to syslog. Need to enable “debug snmp packet” as well to identify source of PDUs. NOTE: Disable “logging trap debug” if “snmp trap syslog” is configured!!! |
debug snmp packet (careful!!!) | Enable to log same data as “sh snmp trace requests” to syslog. NOTE: Disable “logging trap debug” if “snmp trap syslog” is configured!!! |
show snmp mib statistics | Per-OID statistics summarizing transaction times within the mibd level: count + min/max/avg. Technique: Collect to determine if specific MIB objects are averaging high processing times and/or large variance (low min, high avg and max). |
show snmp queue rx | Indicates the min/max/avg queue sizes for the PDU receive and pending queues. Real-time and 5min views. |
show snmp queue trap | Indicates the min/max/avg queue sizes for the internal trap PDU queue. |
(config)# snmp logging thresh oid / show snmp trace slow oid | Allows configuring a duration threshold for logging per-OID transactions exceeding the time threshold. This is measured within the mibd process beginning with the call to the MIB specific handler for the OID and ending with the response from the same. |
(config)# snmp logging thresh pdu / show snmp trace slow pdu | Allows configuring a duration threshold for logging per-PDU transactions exceeding the time threshold. When logging, all OIDs within the PDU are also logged to this buffer. This is measured within the snmpd process beginning with the dequeue of the PDU from the receive queue and ending when all the OIDs in the PDU have been processed and the response is ready to be sent. |
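Decoding the "show snmp trace requests" log by hand gets tedious, so a small script can help. The sketch below is a starting point only: the line format is an assumption based on trace excerpts like those quoted in the comments further down and may differ between releases, and the function name, field names and sample addresses are made up for the example.

```python
import re
from collections import defaultdict

# Patterns assumed from sample trace output; adjust them to your release.
RX_RE = re.compile(r"Rx PDU from (?P<src>[^,\s]+),\d+ len = \d+ \[Q = (?P<queue>\d+)\]")
PROC_RE = re.compile(r"Processing PDU from (?P<src>[^,\s]+),\d+ req_id = \d+ "
                     r"\((?P<wait_ms>\d+)ms on Q\)")
TX_RE = re.compile(r"Tx PDU to (?P<src>[^,\s]+),\d+ len = \d+")


def summarize_trace(lines):
    """Summarize Rx/Tx counts, queue depth and queue wait time per poller address."""
    summary = defaultdict(lambda: {"rx": 0, "tx": 0, "max_queue": 0, "max_wait_ms": 0})
    for line in lines:
        match = RX_RE.search(line)
        if match:
            entry = summary[match.group("src")]
            entry["rx"] += 1
            entry["max_queue"] = max(entry["max_queue"], int(match.group("queue")))
            continue
        match = PROC_RE.search(line)
        if match:
            entry = summary[match.group("src")]
            entry["max_wait_ms"] = max(entry["max_wait_ms"], int(match.group("wait_ms")))
            continue
        match = TX_RE.search(line)
        if match:
            summary[match.group("src")]["tx"] += 1
    return dict(summary)


sample_lines = [
    "May 22 17:06:29.981 snmp/requests 0/RSP0/CPU0 t10 Rx PDU from 192.0.2.10,27129 len = 35 [Q = 1]",
    "May 22 17:06:29.981 snmp/requests 0/RSP0/CPU0 t1 Processing PDU from 192.0.2.10,27129 req_id = 1 (0ms on Q), type GETN",
    "May 22 17:06:34.988 snmp/requests 0/RSP0/CPU0 t10 Rx PDU from 192.0.2.10,27129 len = 35 [Q = 1]",
    "May 20 19:10:59.194 snmp/requests 0/RSP0/CPU0 2947497# t8 Tx PDU to 198.51.100.5,34458 len = 49",
    "May 20 19:10:59.196 snmp/requests 0/RSP0/CPU0 3760763# t9 Rx PDU from 198.51.100.5,34885 len = 46 [Q = 1]",
]

report = summarize_trace(sample_lines)
for source, stats in report.items():
    print(source, stats)
```

A poller whose Rx count keeps growing while its Tx count stays flat is either being filtered (for example by an ACL) or is stuck behind a slow request.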
Some MIBs don't have accelerated processing or caching. Because SNMP is processed serially in certain releases, you may see timeouts on OID requests that normally work perfectly fine. An example of a slow MIB is the SONET MIB. Because this MIB needs to talk from the SNMP process all the way down to the SPA of the SIP-700 linecard (on the ASR9000), the response may not be provided in a timely manner. Meanwhile, new requests for other OIDs may be sitting in the holding or pending queue, causing timeouts and retries.
Retries to an already underperforming MIB may exacerbate the overall issue.
The vast majority of PDU performance issues are related to a poller polling a specific MIB which is slow to process its OIDs.
This causes all other pollers to see some of their PDUs slowed due to queueing delays (waiting on the slow MIB).
Use SNMP View Access Control to block access to the slow MIB tables / objects
snmp-server view MyPollView <slow MIB OID> excluded
Use ACLs to permit only “known” NMS devices/applications. In this case “known” refers to the content of the requests issued from the app.
snmpd will time out a mibd process if it has not received a response to a request for one or more OIDs within 10s by default.
Once in timeout state, snmpd will continue processing requests BUT it will mark the mibd as unavailable until it responds to the timed-out request.
(in addition to normal “slow OID” techniques):
For clarification, the following is an example of an SNMP table. The rows represent the instances (entities), and the columns represent the objects. In this case we have 3 instances, 1, 2 and 3, and each instance has 3 objects: ifName, ifInOctets and ifMtu.
ifIndex | ifName | ifInOctets | ifMtu |
1 | Ethernet1/0 | 1234 | 1500 |
2 | POS2/0 | 512 | 500 |
3 | Serial3/0 | 235 | 600 |
The customer's current SNMP design uses snmpwalk. Snmpwalk works by performing a sequence of get-nexts, on a column-by-column basis if the column object is specified as the starting point.
An example of a column walk specifying ifDescr from IF-MIB:
[no-sense-1 68] ~ > snmpwalk -c public 10.66.70.87 IF-MIB::ifDescr
IF-MIB::ifDescr.1 = STRING: Loopback0
IF-MIB::ifDescr.2 = STRING: Bundle-POS1
IF-MIB::ifDescr.3 = STRING: Bundle-Ether1
IF-MIB::ifDescr.4 = STRING: TenGigE1/2/0/0
IF-MIB::ifDescr.5 = STRING: TenGigE1/2/0/1
IF-MIB::ifDescr.6 = STRING: SONET0/2/0/0
IF-MIB::ifDescr.7 = STRING: SONET0/2/0/1
IF-MIB::ifDescr.8 = STRING: SONET0/2/0/2
IF-MIB::ifDescr.9 = STRING: SONET0/2/0/3
IF-MIB::ifDescr.10 = STRING: SONET0/2/0/4
<cut>
Snmpwalk can also be used to get a single object only, for instance the object IF-MIB::ifDescr.9. It does not support specifying more than one object in its request. The example below shows two objects being requested, but only the first is returned.
[no-sense-1 69] ~ > snmpwalk -c public 10.66.70.87 IF-MIB::ifDescr.9
IF-MIB::ifDescr.9 = STRING: SONET0/2/0/3
[12:18 - 0.31]
[no-sense-1 70] ~ > snmpwalk -c public 10.66.70.87 IF-MIB::ifDescr.9 IF-MIB::ifDescr.10
IF-MIB::ifDescr.9 = STRING: SONET0/2/0/3
[12:18 - 0.36]
For efficiency, row traversal is preferred, with multiple objects requested in a single SNMP transaction. This reduces unnecessary overhead on the XR system. For this reason snmpwalk is not recommended.
The customer is currently requesting the following IF-MIB objects via snmpwalk:
ifDescr
ifHCInOctets
ifHCOutOctets
ifHCInUcastPkts
ifHCOutUcastPkts
ifInNUcastPkts
ifOutNUcastPkts
ifInOctets
ifOutOctets
ifInUcastPkts
ifOutUcastPkts
The preferred method is to specify all the objects required from an instance/entity in a single command such as get-next or bulk-get. An example using snmpbulkget follows:
[no-sense-1 115] ~ > snmpbulkget -v 2c -c public 10.66.70.87 IF-MIB::ifDescr IF-MIB::ifHCInOctets IF-MIB::ifHCOutOctets IF-MIB::ifHCInUcastPkts IF-MIB::ifHCOutUcastPkts IF-MIB::ifInNUcastPkts IF-MIB::ifOutNUcastPkts IF-MIB::ifInOctets IF-MIB::ifOutOctets IF-MIB::ifInUcastPkts IF-MIB::ifOutUcastPkts
IF-MIB::ifDescr.1 = STRING: Loopback0
IF-MIB::ifHCInOctets.2 = Counter64: 0
IF-MIB::ifHCOutOctets.2 = Counter64: 7116596
IF-MIB::ifHCInUcastPkts.2 = Counter64: 0
IF-MIB::ifHCOutUcastPkts.2 = Counter64: 99611
IF-MIB::ifInDiscards.2 = Counter32: 0
IF-MIB::ifOutDiscards.2 = Counter32: 0
IF-MIB::ifInOctets.2 = Counter32: 0
IF-MIB::ifOutOctets.2 = Counter32: 7116596
IF-MIB::ifInUcastPkts.2 = Counter32: 0
IF-MIB::ifOutUcastPkts.2 = Counter32: 99611
IF-MIB::ifDescr.2 = STRING: Bundle-POS1
IF-MIB::ifHCInOctets.3 = Counter64: 38796828
IF-MIB::ifHCOutOctets.3 = Counter64: 66076323
IF-MIB::ifHCInUcastPkts.3 = Counter64: 331833
IF-MIB::ifHCOutUcastPkts.3 = Counter64: 402546
IF-MIB::ifInDiscards.3 = Counter32: 0
IF-MIB::ifOutDiscards.3 = Counter32: 0
IF-MIB::ifInOctets.3 = Counter32: 38796828
IF-MIB::ifOutOctets.3 = Counter32: 66076323
IF-MIB::ifInUcastPkts.3 = Counter32: 331833
IF-MIB::ifOutUcastPkts.3 = Counter32: 402546
IF-MIB::ifDescr.3 = STRING: Bundle-Ether1
<snip>
Note above that all the objects in a row are obtained for every instance with one command. The same can be done with a get-next; however, this adds the overhead of specifying the instance in each request, once for each instance present.
[no-sense-1 120] ~ > snmpgetnext -v 2c -c public 10.66.70.87 IF-MIB::ifDescr.1 IF-MIB::ifHCInOctets.1 IF-MIB::ifHCOutOctets.1 IF-MIB::ifHCInUcastPkts.1 IF-MIB::ifHCOutUcastPkts.1 IF-MIB::ifInNUcastPkts.1 IF-MIB::ifInOctets.1 IF-MIB::ifOutOctets.1 IF-MIB::ifInUcastPkts.1 IF-MIB::ifOutUcastPkts.1
IF-MIB::ifDescr.2 = STRING: Bundle-POS1
IF-MIB::ifHCInOctets.2 = Counter64: 0
IF-MIB::ifHCOutOctets.2 = Counter64: 7116596
IF-MIB::ifHCInUcastPkts.2 = Counter64: 0
IF-MIB::ifHCOutUcastPkts.2 = Counter64: 99611
IF-MIB::ifInDiscards.2 = Counter32: 0
IF-MIB::ifInOctets.2 = Counter32: 0
IF-MIB::ifOutOctets.2 = Counter32: 7116596
IF-MIB::ifInUcastPkts.2 = Counter32: 0
IF-MIB::ifOutUcastPkts.2 = Counter32: 99611
[13:03 - 0.35]
Although the examples are specific to IF-MIB, the same concept is relevant to all MIBs.
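To put rough numbers on the difference between column and row traversal, the sketch below counts the get-next requests needed against a simulated agent. It is a simplified model, not a real SNMP stack: OIDs are reduced to (column, instance) tuples, the table size of 11 objects by 10 interfaces is an assumption, and details such as bulk-get repetitions are ignored.

```python
import bisect


def build_agent(columns, instances):
    """Return the agent's OIDs as sorted (column, instance) tuples."""
    return sorted((column, instance) for column in columns for instance in instances)


def get_next(agent_oids, oid):
    """Return the first OID lexicographically after `oid`, or None at the end of the MIB."""
    position = bisect.bisect_right(agent_oids, oid)
    return agent_oids[position] if position < len(agent_oids) else None


def walk_column(agent_oids, column):
    """Count get-next requests needed to walk one column (one varbind per request)."""
    requests = 0
    oid = (column,)
    while True:
        oid = get_next(agent_oids, oid)
        requests += 1
        if oid is None or oid[0] != column:
            return requests  # the last request only discovers the end of the column


def walk_rows(agent_oids, columns):
    """Count get-next requests when every request carries one varbind per column."""
    requests = 0
    oids = [(column,) for column in columns]
    while True:
        oids = [get_next(agent_oids, oid) for oid in oids]
        requests += 1
        if oids[0] is None or oids[0][0] != columns[0]:
            return requests


columns = list(range(1, 12))     # 11 polled objects, as in the customer's list
instances = list(range(1, 11))   # assumed: 10 interfaces
agent = build_agent(columns, instances)

column_requests = sum(walk_column(agent, column) for column in columns)
row_requests = walk_rows(agent, columns)
print(column_requests, row_requests)
```

In this model the column-by-column walk needs one request per object per instance (plus one per column to detect its end), while the row-oriented traversal needs only one request per instance plus one, which is the overhead reduction the recommendation above is after.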
Timeout recommendations:
Retry recommendations:
For more details refer to IOS XR SNMP Best Practices.
Special thanks to the XR SNMP dev team for some of the amazing content used in this article, most notably Timothy Swanson and Leon Zachary
Xander Thuijs CCIE #6775
Principal Engineer, ASR9000
Hi Xander,
I have a problem implementing SNMP on the ASR9k.
Is "snmp-server host" mandatory for implementing SNMP on the ASR9k?
I've configured only snmp-server community and snmp-server traps.
But if I type the "sh snmp trace requests" command, it only shows Rx and processing but no Tx.
May 22 17:06:29.981 snmp/requests 0/RSP0/CPU0 t10 Rx PDU from x.x.x.x,27129 len = 35 [Q = 1]
May 22 17:06:29.981 snmp/requests 0/RSP0/CPU0 t1 Processing PDU from x.x.x.x,27129 req_id = 1 (0ms on Q), type GETN
May 22 17:06:34.988 snmp/requests 0/RSP0/CPU0 t10 Rx PDU from x.x.x.x,27129 len = 35 [Q = 1]
Please advise.
Regards,
Purwo
Hi Purwo,
You don't need:
snmp-server host
to implement SNMP. It is only used to specify the recipient (host) of SNMP trap notifications, the version of SNMP to use, and the security level of the notifications.
What you are seeing in show snmp trace is not what we expect to see. You should see both Rx and Tx entries:
RP/0/RSP0/CPU0:PE2#sh snmp trace requests
Sun Jun 2 08:57:13.652 PDT
Entering snmp_ltrace main....
2107 wrapping entries (2112 possible, 0 filtered, 4078884 total)
May 20 19:10:59.194 snmp/requests 0/RSP0/CPU0 2947497# t8 Tx PDU to 1.73.54.10,34458 len = 49
May 20 19:10:59.196 snmp/requests 0/RSP0/CPU0 3760763# t9 Rx PDU from 1.73.54.10,34885 len = 46 [Q = 1]
Are you sure this is still the case and you are seeing the requests going from the NMS?
Regards,
Ahmed
Hi Ahmed,
Thanks for your kind response.
It's working fine now. It seems I had configured the ACL incorrectly.
Regards,
Purwo
Hi Alexander,
Just curious: which RFC is the SNMP architecture of IOS-XR based on?
Regards
K. Lakshmi Ganesh
I don't have a specific RFC that defines which architecture is followed. In general we always follow the specifications, and exceptions are documented.
Is there a particular functionality or (MIB) implementation you are interested in?
We have a good overview of which MIBs are specifically tested against the ASR9000 listed here:
ftp://ftp.cisco.com/pub/mibs/supportlists/asr9000/asr9000-supportlist.html
Some platform-independent MIBs that have no hardware dependency, such as the OSPF MIB, may not be listed but are definitely supported. The MIBs in this list have some sort of HW dependency and hence are specifically tested against the ASR9000.
regards
xander
Hi Alexander,
Thank you for your reply.
Actually I was just going through the related RFCs, and additionally we (HTTS-WW-NMS) are going to support SNMP K/W for XR, so I was reading up on these.
For information, IOS followed RFC 2571 for its implementation of the SNMP architecture, while Nexus uses RFC 3411 (which obsoletes RFC 2571). So I was wondering which RFC is followed in XR.
Any idea who can comment on this definitively?
Thanks and Regards
K. Lakshmi Ganesh
We don't claim support for 3411 natively, but it is fully compliant with 2571.
Which precise changes in 3411 are you interested in? They may already be natively supported without calling out official 3411 compliance.
cheers
xander
Hi Alexander,
Thanks for your kind reply.
As I noted above, in a personal capacity I am studying SNMP K/W and looking at the relevant RFCs to be thorough, so I was interested to know how this applies to IOS-XR as well. For information, we support SNMP K/W for IOS and NX-OS at the moment and are in transition to support XR also.
Regards
K. Lakshmi Ganesh
I am not sure the retries and extended timeouts help.
Our Cacti box has "gaps" in the graphs. To find out what's going on I set up the following bash script:
while [ $? -eq 0 ]; do sleep 1; date; snmpget -v 2c -c public -r 10 -t 5 10.1.2.130 .1.3.6.1.2.1.31.1.1.1.10.264; done
This script bombs out every 30 mins or so, across all three of our four ASR9k's (a combination of 4.2.3 unpatched, 4.3.1, and 4.2.3 with some SMUs).
Increasing the timeout (-t) from 1 to 5 seconds helped a little, but increasing to 10 seconds didn't do much. The retries (-r) didn't do much either.
It doesn't seem to matter how many items we poll for either (for example, one ASR has 28 graphs, so maybe 40 individual OIDs, and runs fine, while another has maybe 10 OIDs and has issues).
For the record, the management network that is being used services a number of other devices (SRX650s and Nexus 7Ks) and none of those show SNMP dropouts.
quite frustrating to have gappy graphs
What version are you running, Davecs? A set of SNMP SMUs has come out recently and over time that fix a lot of these timeout issues.
I can't say what the reason for the timeout is with the info provided. Some of the performance tricks discussed in this article may help find out where the problem is, or you can try the SNMP umbrella SMUs that are out there now.
regards
xander
As mentioned:
4.3.1 some SMU's installed, not relating to SNMP. However CSM says no SNMP SMU's are available.
4.2.3 with CSCuf51534 installed, again CSM says no other SNMP SMU's are available.
Unless there is another SMU i should install?
Hi Dave, if you have an RSP440, you may need CSCug55767.
It could be that the SONET MIB is acting up if you do a walk. You could exclude it from your SNMP view and see if that makes a difference. If you are targeting individual OIDs without a walk, this may not do it for you.
In that case it might be good to get the perf stats from SNMP as per this document and file a TAC case.
I see there is another SMU in the pipeline for SNMP performance also, but in the absence of that logging I can't say if you'll benefit from those changes either.
Another thing: if you have optics that are enabled with service unsupported transceiver/transceiver permit pid all, you may be running into another known issue (if you're pulling IF-MIB stats).
regards
xander
We do use RSP440s, but IOS XR 4.3.1 has the same problem, and I am guessing that 55767 is already included in 4.3.1.
We do use transceiver permit pid all for some non-working Cisco GE-Ts. Is there a fix coming for either the SFPs or the IF-MIB stats for these ports?
Also, regarding the above document, how do I give myself the correct privileges to run those commands? Do I need to log in as cisco-support or create an account in that group or something?
Yeah, the umbrella SMU fixes are already in XR 4.3.
I don't have the DDTS handy for the optic situation I was referencing, but it is going to be a SMU for 4.3.2 and integrated in 4.3.4 (not out yet).
To run the commands, you need cisco-support privileges. In the username <name> config, in either admin or exec config, add the group cisco-support, then log out and back in to get that task group applied.
show user tasks should then show the full list, and show user group will tell you the group membership.
xander
thanks xander.
If you can let me know the bug ID that relates to that optics issue, that would be great!