Are we hitting bug ID CSCed52841? Is it fixed in IOS 12.2(33)SXI6?

Martin Ermel · ‎06-30-2011

A customer observes SNMP timeout problems on a Cat6500 running IOS 12.2(33)SXI. As a result, roughly every 2 days all interfaces are marked DOWN, and a couple of minutes later all interfaces are up again. In fact there is no interruption; it is just the SNMP request timing out.

The customer does not have this problem with IOS 12.2(18)SXF14.

In the bug details, "Known affected versions" lists the following, among others:

[...]

12.2(33)SXI

12.2(33)SXI1

12.2(33)SXI2

12.2(33)SXI2a

12.2(33)SXI3

12.2(33)SXI3a

12.2(33)SXI3z

12.2(33)SXI4

12.2(33)SXI4a

12.2(999)SXI

[...]

12.2(18)SXF

12.2(18)SXF1

12.2(18)SXF2

12.2(18)SXF3

12.2(18)SXF4

12.2(18)SXF5

12.2(18)SXF6

12.2(18)SXF7

12.2(18)SXF8

12.2(18)SXF9

12.2(18)SXF10

12.2(18)SXF10a

12.2(18)SXF11

12.2(18)SXF12

12.2(18)SXF12a

12.2(18)SXF13

12.2(18)SXF13a

12.2(18)SXF13b

12.2(18)SXF14

12.2(18)SXF15

12.2(18)SXF15a

12.2(18)SXF16

12.2(18)SXF17

12.2(18)SXF17a

[...]

Now I am confused: both IOS versions are listed as affected (12.2(33)SXI and 12.2(18)SXF14), but the customer has problems with only one of them.

Is the customer hitting this bug, or is it another one?

He upgraded 2 Cat65xx switches on which he observed the problem to IOS 12.2(33)SXI6 and the problem is gone. Is this just a coincidence, or is CSCed52841 really fixed in 12.2(33)SXI6?

This version is not listed as affected but on the other hand, "Fixed-In" lists only these 3:

12.1(22.3)E1

12.2(17d)SXB5

12.2(18)SXD

Before upgrading around 50 core / distribution switches, the customer wants to be sure about the IOS version.

Tracing the issue is not that easy because the failure occurs only from time to time.

Michel Hegeraat · ‎06-30-2011

Hi Martin,

I would certainly take the result of the 2 upgraded switches to heart.

In my experience the bug ID notes are sometimes updated a month or two late, so the results of tests on the network are more relevant to me than what a bug ID says.

Why were these switches upgraded to 12.2(33)SXI6 and not to the latest IOS in the train? I have my doubts about Cisco updating bug ID notes, but they are pretty good at regression testing :-).

Cheers,

Michel

Martin Ermel · ‎07-11-2011

Thanks, Michel, for your response. Because core switches are affected, we have to be sure that an IOS update will fix the issue and that all features in use are supported and "bug-free" - but you know these kinds of stories...

And certainly, it would be great if we finally knew the reason for this issue, and did not just "avoid" it by using another IOS release without being sure that it will not reappear under certain circumstances.

Currently I cannot say how they decided to use the 12.2(33)SXI6 IOS release.

Michel Hegeraat · ‎07-11-2011

Of course Martin,

I always try to explain to my customers that there is no such thing as certainty, just increased probability. They don't want to hear that, of course. Most of them, however, simply can't afford to wait for certainty.

Your experience will tell you how comparable the upgraded switches are to the backbone switches. If the switches are used in an entirely different way, then we obviously don't even know whether they were suffering from the same defect.

And even if your TAC engineer is 100% sure what went wrong, and that it is certainly fixed in version X.Y.Z, you are still required to have a fallback plan to a previous (not so good, but still mostly) working state.

Good luck,

Michel

Joe Clarke · ‎07-01-2011

I think the "fix" is a coincidence. The bug you reference is quite old and no specific fix was ever made to the SXI branch. There have, however, been numerous fixes between SXI and SXI6 that could account for timing fixes. Seeing a stack trace of the SNMP ENGINE and IP SNMP processes would help identify potential candidates.

Martin Ermel · ‎07-11-2011

Am I right in assuming that the stack trace has to be captured while the issue is occurring? If so, do you have a suggestion for how this could be done automatically? According to the customer, it would hardly be possible to have a session open to the affected switch at that moment.

If there is a distribution switch that is affected more often, I thought about using EEM to get a stack trace every minute or so and append the output to a file on flash... but I am not sure if this is a realistic idea.

Joe Clarke · ‎07-16-2011

Yes, the stack traces will need to be obtained at the time the problem is occurring. Since there is not necessarily an EEM-visible trigger that could kick off your policy, the timer event detector (ED) is a way to go (provided you have enough disk space and you can reproduce this fairly easily). You could add another EEM policy to run in the "off time" to email the file on flash, then delete it to prevent flash from filling up.
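
A minimal sketch of what such a timer-driven applet could look like, not a tested policy. The applet name, the timer name, the 60-second interval, the file name and the disk0: file system are all assumptions, and the process IDs 123 and 456 are placeholders that would have to be replaced with the actual PIDs of the SNMP ENGINE and IP SNMP processes on the switch in question. Whether "| append" works depends on the file system, so that is worth checking by hand first.

```
! Sketch only. 123 and 456 are placeholder PIDs; look up the real ones with
!   show processes cpu | include SNMP
! disk0: and snmp_stacks.txt are assumed names, adjust to the local setup.
event manager applet snmp-stack-capture
 ! Fire every 60 seconds via the timer event detector
 event timer watchdog name snmpstack time 60
 action 1.0 cli command "enable"
 ! Timestamp each sample so it can be matched to the SNMP timeouts later
 action 2.0 cli command "show clock | append disk0:snmp_stacks.txt"
 ! Stack trace of the SNMP ENGINE process (placeholder PID)
 action 3.0 cli command "show stacks 123 | append disk0:snmp_stacks.txt"
 ! Stack trace of the IP SNMP process (placeholder PID)
 action 4.0 cli command "show stacks 456 | append disk0:snmp_stacks.txt"
```

With one sample per minute the file grows continuously, which is why the second "off time" policy to ship and delete the file matters.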