C3750 crash - every few days

g.leonard
Level 1

Hi

The 2nd switch in a 2-switch stack keeps crashing every few days or so. Please see the latest crash file below:

Cisco IOS Software, C3750 Software (C3750-IPBASE-M), Version 12.2(25)SEE2, RELEASE SOFTWARE (fc1)
Copyright (c) 1986-2006 by Cisco Systems, Inc.
Compiled Fri 28-Jul-06 08:46 by yenanh

Debug Exception (Could be NULL pointer dereference) Exception (0x2000)!

SRR0 = 0x0056B958  SRR1 = 0x00029210  SRR2 = 0x0056A1E8  SRR3 = 0x00021000
ESR = 0x00000000  DEAR = 0x00000000  TSR = 0x8C000000  DBSR = 0x10000000

CPU Register Context:
Vector = 0x00002000  PC = 0x00906338  MSR = 0x00029210  CR = 0x30000005
LR = 0x00906260  CTR = 0x00000000  XER = 0x80000047
R0 = 0x00000000  R1 = 0x026EB788  R2 = 0x00000000  R3 = 0x00000000
R4 = 0xFFFFFFFE  R5 = 0x00000000  R6 = 0x026EB760  R7 = 0x00000000
R8 = 0x00029210  R9 = 0x01660000  R10 = 0x0197F080  R11 = 0x00000000
R12 = 0xA0000000  R13 = 0x00110000  R14 = 0x00578E94  R15 = 0x00000000
R16 = 0x00000000  R17 = 0x00000000  R18 = 0x00000000  R19 = 0x00000000
R20 = 0x00000000  R21 = 0x00000000  R22 = 0x00000000  R23 = 0x026E9F60
R24 = 0x00000000  R25 = 0x00000001  R26 = 0x004DD7D8  R27 = 0x00000000
R28 = 0x02635FB0  R29 = 0x00000030  R30 = 0x00000000  R31 = 0x0000000F

Stack trace:
PC = 0x00906338, SP = 0x026EB788
Frame 00: SP = 0x026EB798    PC = 0x00906238
Frame 01: SP = 0x026EB7A0    PC = 0x005747F4
Frame 02: SP = 0x026EB7D8    PC = 0x00578C98
Frame 03: SP = 0x026EB7F8    PC = 0x00578F48
Frame 04: SP = 0x026EB800    PC = 0x00908064
Frame 05: SP = 0x00000000    PC = 0x008FE62C

Can anybody tell me what is causing this?
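
For anyone landing here with a similar crash, these are the sort of commands generally used on a 3750 to pull the stored crash details and the last reload reason. The crashinfo file name below is only a placeholder, and the crashinfo: filesystem names are from memory, so treat them as assumptions and check what actually exists on your switch:

show version | include returned      ! last reload reason, e.g. "error - a Software forced crash"
dir crashinfo:                       ! crash files on the master (crashinfo2: for stack member 2)
more crashinfo:crashinfo_1           ! view a stored crash file - the name here is only an example
show stacks                          ! register and stack-trace summary from the last crash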

21 Replies

"Besides, the OP did mention that he can't reboot the switch due to the  importance of the clients so what are the chances of the user upgrading  the IOS?"

Well, if you are going to replace the switch, you would still have to remove it from the stack. How about removing the switch and upgrading the IOS? Takes much less time than having to wait for a replacement.

I agree with this recommendation.

Thanks guys for your input.

The switch crashed again on Saturday.

I have been monitoring it since then and have noticed that the Hulc LED Process is constantly being allocated memory without freeing any:

PID   TTY   Allocated    Freed   Holding
110   0     769811656    192     6904

The amount of freed memory has remained at 192 since I started monitoring the switch after its crash on Saturday.

I'm guessing this is a memory leak, as I would expect memory to be freed if the process is being allocated memory without increasing the amount it is holding.

Just checked again and it has increased to:

PID   TTY   Allocated    Freed   Holding
110   0     772866152    192     6904
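
For reference, this sort of per-process tracking can be done with the standard memory commands; the filter string below is just an example:

show processes memory sorted           ! top memory users, largest Holding first
show processes memory | include Hulc   ! watch the Hulc LED Process counters over time
show memory statistics                 ! overall processor-pool used/free memory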

Well,

This looks like CSCsj29588

http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCsj29588&from=summary

Memory leak on Hulc LED Process and RedEarth Rx Mana


A Catalyst 3560 or 3750 might show a memory leak where memory is being consumed by the Hulc LED Process and RedEarth Rx Manager processes.

The memory leak has been resolved via an internally found bug.

An upgrade to version 12.2(35)SE1 or later will resolve the memory leak.

Now, 12.2(25)SEE2 is a release known to be affected by this. It is stated that 12.2(25)SEE3 does not see this issue, so there is a good chance that 12.2(25)SEE4 is not affected either, even though it is listed as affected. I would still look into the possibility of upgrading to a newer release.
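
If you do go down the upgrade route, the usual way to move a whole 3750 stack to a new release is the archive download-sw method with the .tar bundle. Something along these lines should work, but the TFTP server address and file name here are only placeholders:

show switch                           ! confirm stack members and which one is the master
archive download-sw /overwrite /reload tftp://192.0.2.10/c3750-ipbase-tar.122-35.SE1.tar
! /overwrite replaces the image already on each stack member's flash
! /reload reloads the stack once the new image has been copied and verified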

Looks like we have found the culprit! :-D

OK, the amount of memory the process is holding has now started to increase as well.

PID   TTY   Allocated    Freed        Holding    Process
0     0     48433856     13958000     30762100   *Init*
46    0     911048       1952         782888     Stack Mgr Notifi
110   0     786444352    192          667928     Hulc LED Process
41    0     136772520    241294648    587960     RedEarth Rx Mana

These are the top 4 memory users from the "sh proc mem sort" output. RedEarth Rx is in there too. I've seen this before but assumed it wasn't relevant, as I had previously read it was fixed in 12.2(25)SEE2.

I fear the switch is going to crash again at some point today, causing an outage for users. I'm hoping to attempt an upgrade to 12.2(25)SEE4, as I should be able to copy the image over from the master switch.
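
In case it helps anyone else, the rough plan is below. The image file name is my guess at the SEE4 naming, and the reload slot step is from memory, so verify both before relying on them:

dir flash2:                                           ! member 2's flash - check free space first
copy flash:c3750-ipbase-mz.122-25.SEE4.bin flash2:    ! push the image from the master to member 2
configure terminal
 boot system switch all flash:/c3750-ipbase-mz.122-25.SEE4.bin
 end
write memory
reload slot 2                                         ! reload just the problem member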

I've noticed we are running 12.2(25)SEB4 on another floor which has the same hardware (C3750G-48PS) in a two-switch stack. Funnily enough, this switch has never crashed, and this version does not appear in the affected releases for the CSCsj29588 bug. (The switch was deployed at the same time as the one in question.)

We do run 12.2(35)SE5 on some newer C3750E switches in another building.

I have been given an outage period tonight, so I will move both members of the problem switch stack to SEB4, with a view to taking the rest of the building to 12.2(35) or later once I have time to test in a lab.

OK, changed the image to 12.2(25)SEB4 to be in line with the other floor switches.

This didn't work, so SEB4 should probably also appear in the affected releases for the CSCsj29588 bug.

Went to 12.2(35)SE5, as this has run on two switches in one of our newer buildings without problems for a couple of years.

This did sort the memory issue and the switch ran fine for a week with no evidence of memory leakage.

However, one morning the switch decided that it didn't like some of its ports and they just stopped working. I ran various diagnostics on the switch and checked that the ports hadn't been administratively disabled through port security, spanning tree or anything else. I ruled out a cabling issue by plugging a laptop directly into the switch. A restart of the switch solved the problem; however, my confidence in the switch had diminished, so it was swapped out last week.
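
For completeness, the checks mentioned above were roughly along these lines; the interface used here is just an example:

show interfaces status err-disabled      ! anything shut down by errdisable (port security, BPDU guard, etc.)
show port-security interface gi2/0/12    ! per-port security state
show spanning-tree interface gi2/0/12    ! STP state for the port
show interfaces gi2/0/12                 ! line/protocol state and error counters
show logging                             ! look for link-flap or errdisable messages for the ports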

I'm guessing there was perhaps always an underlying hardware/interface issue which was tripping up the Hulc LED Process and that newer versions of the IOS just deal with the issue better. So I think several people are correct in different ways in this thread. Thanks to all who have helped.
