A customer observes SNMP timeout problems on a Cat6500 with IOS 12.2(33)SXI. As a result, every 2 days (more or less) all interfaces are marked DOWN and a couple of minutes later all interfaces are up again - but in fact there is no interruption, it is just the SNMP request timing out...
The customer does not have problems with IOS 12.2(18)SXF14.
In the bug details, "Known affected versions" lists the following, among others:
Now I am confused: both IOS versions are listed as affected (12.2(33)SXI and 12.2(18)SXF14), but the customer has problems with only one of them.
Is the customer hitting this bug or is it another one?
He upgraded 2 Cat65xx on which he observed the problem to IOS 12.2(33)SXI6 and the problem is gone; is this just a coincidence or is CSCed52841 really fixed in 12.2(33)SXI6?
This version is not listed as affected but on the other hand, "Fixed-In" lists only these 3:
Before upgrading around 50 core / distribution switches, the customer wants to be sure about the IOS version.
Tracing the issue is not that easy because the failure occurs only from time to time.
I would certainly take the result of the 2 upgraded switches to heart.
The updates to the bugid notes are sometimes a month or two late in my experience, so the results of tests on the network are more relevant to me than what a bugid says.
Why were these switches upgraded to 12.2(33)SXI6 and not the latest IOS in the train? I have my doubts about Cisco updating bugid notes but they are pretty good at regression testing :-).
Thanks Michel for your response. Because core switches are affected we have to be sure that an IOS update will fix the issue and that all features used are supported and "bug-free" - but you know these kinds of stories...
And certainly, it would be great if we finally knew the reason for this issue - and not just "avoid" it by using another IOS release without being sure that it will not reappear under certain circumstances.
Currently I cannot say how they decided to use the 12.2(33)SXI6 IOS release.
Of course Martin,
I always try to explain to my customers that there is no such thing as certainty, just increased probability. They usually don't want to hear it, but most of them simply can't afford to wait for certainty anyway.
Your experience will tell you how comparable the switches that were upgraded are to the backbone switches. If the switches are used in an entirely different way then we obviously don't even know if they were suffering from the same defect.
And even if your TAC engineer is 100% sure what went wrong, and that it is certainly fixed in version X.Y.Z, you are still required to have a fallback plan to a previous (not so good but still mostly) working state.
I think the "fix" is a coincidence. The bug you reference is quite old and no specific fix was ever made to the SXI branch. There have, however, been numerous fixes between SXI and SXI6 that could account for timing fixes. Seeing a stack trace of the SNMP ENGINE and IP SNMP processes would help identify potential candidates.
Am I right to assume that the stack trace has to be captured while the issue is occurring? If so, do you have a suggestion how this could be done automatically? According to the customer it would hardly be possible to have a session open to the switch that is currently affected.
If there is a distribution switch which is affected more often, I thought about using EEM to get a stack trace every minute or so and append the output to a file on flash... - but I am not sure if this is a realistic idea...
Yes, the stack traces will need to be obtained at the time the problem is occurring. Since there is not necessarily an EEM-visible trigger that could kick off your policy, the timer event detector (ED) is the way to go (provided you have enough disk space and you can reproduce this fairly easily). You could add another EEM policy to run in the "off time" to email the file on flash, then delete it to prevent flash from filling up.
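A rough, untested sketch of the timer-driven policy described above, written as an EEM applet. Several things here are assumptions made up for illustration: the applet name, the timer name, the file name snmp_stacks.txt, the disk0: file system, and the PIDs 250 and 251, which are placeholders for whatever "show processes cpu" reports for the SNMP ENGINE and IP SNMP processes on the affected switch. It also assumes that the "| append" output redirection is available in the release in use.

```
! Hypothetical applet: every 60 seconds, append a timestamp and the
! stack traces of two processes to a file on flash.
! 250 and 251 are placeholder PIDs; replace them with the real ones.
event manager applet SNMP-STACK-CAPTURE
 event timer watchdog name snmp_stack time 60
 action 1.0 cli command "enable"
 action 2.0 cli command "show clock | append disk0:snmp_stacks.txt"
 action 3.0 cli command "show stacks 250 | append disk0:snmp_stacks.txt"
 action 4.0 cli command "show stacks 251 | append disk0:snmp_stacks.txt"
```

With a 60 second interval this appends 1440 samples per day, so the file needs to be collected and deleted regularly. The PIDs should be checked again after a reload, since they are not guaranteed to stay the same.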