6500 Crashed: RP is being reset by the SP

kthned · ‎02-18-2013

Hi,

One of our edge routers (a 6500 running IOS 12.2(33)SXJ) crashed with the following error. I found that some RP-SP ping GOLD tests were skipped due to high SP CPU utilization, but I don't know if that's the reason. The crashinfo is attached. I hope to get your expert opinion on this. Should we go for an upgrade?

Feb 17 15:19:10: %C6K_PLATFORM-2-PEER_RESET: RP is being reset by the SP

%Software-forced reload

15:19:10 met Sun Feb 17 2013: Breakpoint exception, CPU signal 23, PC = 0x42E24578

Thanks in advance.

Regards,

Umair

Douglas Holmes · ‎02-18-2013

I looked over your crash log. I am not an expert, but I can say that your code is quite out of date. You can open a ticket with TAC to have your crash log reviewed; they will also require a show tech. Any idea what caused the high CPU utilization at the time of the crash?

Nicholas Oliver · ‎02-25-2013

Umair,

The crashinfo file that you included came from the RP; it would have been found in either bootflash: or bootdisk: (or the slave equivalents). The following line indicates that we need to look at the SP:

Feb 17 15:19:10: %C6K_PLATFORM-2-PEER_RESET: RP is being reset by the SP

This line tells us that the SP reset, and as a result the RP went down. The SP should provide more context into why the issue occurred. Crashinfo files for the SP will be stored in either sup-bootflash: or sup-bootdisk: (or the slave equivalents). This file may have the same name and date/timestamp as the one you retrieved, but it is from the SP perspective. Could you provide that file for further review?
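For anyone who wants to pick this kind of message out of a longer log, here is a minimal Python sketch. It assumes the basic IOS syslog layout of `%FACILITY-SEVERITY-MNEMONIC: text`, as seen in the line above. The function names `parse_ios_syslog` and `is_peer_reset` are made up for illustration and are not part of any Cisco tool.

```python
import re

# Basic IOS syslog tag: %FACILITY-SEVERITY-MNEMONIC: free text
# (assumption: no sub-facility between facility and severity)
SYSLOG_RE = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+):\s*(?P<text>.*)"
)


def parse_ios_syslog(line):
    """Return the message fields as a dict, or None if the line has no such tag."""
    match = SYSLOG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["severity"] = int(fields["severity"])  # 0 = emergency ... 7 = debug
    return fields


def is_peer_reset(line):
    """True if the line is a C6K_PLATFORM PEER_RESET message (hypothetical helper)."""
    fields = parse_ios_syslog(line)
    return (
        fields is not None
        and fields["facility"] == "C6K_PLATFORM"
        and fields["mnemonic"] == "PEER_RESET"
    )


if __name__ == "__main__":
    sample = "Feb 17 15:19:10: %C6K_PLATFORM-2-PEER_RESET: RP is being reset by the SP"
    print(parse_ios_syslog(sample))
    print(is_peer_reset(sample))
```

A line such as `%Software-forced reload` does not match the pattern, so the parser returns `None` for it rather than guessing at fields.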

Here is a page that shows the process of retrieving both crashinfo files in a situation like yours:

https://supportforums.cisco.com/docs/DOC-19727

If you have any questions, let me know.

-Nick

kthned · ‎02-25-2013

Thanks, Nick, for your help. Yes, I know there should be an SP crash file similar to the RP one, but I could not find this file, or even a debuginfo file as mentioned in the caveat section of the document.

I am just wondering if the bug ID below applies to this situation:

(ping sp rp)

CSCsc33990

If you scroll down to the last part of the file, where some GOLD tests were skipped or failed, you can see that many "TestSPRPInbandPing" tests failed or were skipped due to high inband traffic (100k transmit rate). On top of that, if you count the TestSPRPInbandPing tests, there are exactly 10. So I am wondering whether this is the right bug, triggered by the high traffic rate.

I appreciate your comments and time!

Thanks !

//Umair

Nicholas Oliver · ‎02-25-2013

Umair,

The span between the first and the last failed/skipped tests that you mention is almost 45 minutes, so I would not look at all of these as connected. That is not to say that this reset could not have been caused by high inband traffic; it *could* have been, but with just the RP crashinfo file I do not see sufficient information to know for certain. The bug you mention, CSCsc33990, is already integrated in the SXJ code that you are running.

Do you have any outputs that would show the level of the CPU for the SP and RP leading up to the time this event occurred? That may offer a view of what was taking place at that moment.

-Nick

kthned · ‎02-25-2013

Thanks, Nick, for your help. I see 12.2(33)SXJ listed under the known affected versions, so this very bug does exist in this version, but you are right that it is not trivial to connect the bug description with the crashinfo. We are missing information here. Had the switch created the SP crashinfo, it would be much easier.

Anyway, thanks for the help.

-umair

Nicholas Oliver · ‎02-26-2013

Umair,

If you see this in an "affected versions" field somewhere, it is a flaw in the way the script determines which releases are impacted by which bugs. I can confirm through code inspection that CSCsc33990 is already fixed in 12.2(33)SXJ. The Bug Toolkit and other tools use scripts that paint with broad strokes to catch all versions impacted by a particular bug, and every once in a while they throw a false positive, indicating that a particular version is impacted when it is actually already fixed. This appears to be one of those cases.

If you have additional questions, let me know.

-Nick

kthned · ‎02-26-2013

Thanks, Nick! It means that more weight should be given to the "1st Found-In" section than to Known Affected Versions, as there is a chance of false positives.

-umair