Cat 4500 Supervisor 7-E is rebooting

Hi. I'm having trouble with a 4510R+E switch. The WS-X45-SUP7-E supervisors are constantly rebooting, and the reload reason shown is 'Critical software exception'. We have checked the logs but haven't found anything that helps us troubleshoot the issue.

______ uptime is 5 weeks, 1 day, 19 hours, 3 minutes

Uptime for this control processor is 37 minutes

System returned to ROM by reload

System restarted at 18:12:00 GMT-5 Thu Apr 25 2013

Running default software

Last reload reason: Critical software exception

The supervisors are rebooting roughly every 30 minutes. I found this log message, but I guess it relates to a non-Cisco transceiver that is in use:

000016: Apr 25 18:13:00.032 GMT-5: %GBIC_SECURITY-4-SECURITY_DISABLED: STANDBY:Unsupported transceiver support enabled

** The IOS Version is 03.02.00.SG

What can we do to solve this incident? Everything was working fine, and suddenly the problem arose. Could upgrading the IOS version help? Or reseating the supervisors?

Thanks a lot.

Fabio.


1 Accepted Solution

Hi Fabio,

A crashinfo file should have been generated on the device; if you can retrieve it, that would be helpful.

Your switch crashed with the following errors in the crash log:

8/000044: Apr 25 16:37:12.824 GMT-5: %SYS-2-CHUNKBADMAGIC: Bad magic number in chunk header, chunk 88A0AC70  data 88A130A8  chunkmagic 5DA3C78B  chunk_freemagic 0 -Process= "Check heaps", ipl= 0, pid= 7

-Traceback= 1#d1c16a2e6f8ceec1892aa2e47e2a2618  :10000000+FCC0E4 :10000000+1BEE4E4 :10000000+1BEE794 :10000000+1C168F00
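The fields of the %SYS-2-CHUNKBADMAGIC line above can be pulled apart for comparison across crashes. The following is a minimal Python sketch, not a Cisco tool: the regular expression is written only against the message as quoted in this thread, the function name is hypothetical, and reading the four address/magic fields as hexadecimal is an assumption.

```python
import re

# The message exactly as quoted in this thread.
SAMPLE = (
    '8/000044: Apr 25 16:37:12.824 GMT-5: %SYS-2-CHUNKBADMAGIC: Bad magic number '
    'in chunk header, chunk 88A0AC70  data 88A130A8  chunkmagic 5DA3C78B  '
    'chunk_freemagic 0 -Process= "Check heaps", ipl= 0, pid= 7'
)

# Pattern derived from the quoted line only; other IOS releases may format it differently.
CHUNKBADMAGIC_RE = re.compile(
    r"%SYS-2-CHUNKBADMAGIC: Bad magic number in chunk header, "
    r"chunk\s+(?P<chunk>[0-9A-Fa-f]+)\s+"
    r"data\s+(?P<data>[0-9A-Fa-f]+)\s+"
    r"chunkmagic\s+(?P<chunkmagic>[0-9A-Fa-f]+)\s+"
    r"chunk_freemagic\s+(?P<chunk_freemagic>[0-9A-Fa-f]+)\s+"
    r'-Process=\s*"(?P<process>[^"]+)",\s*ipl=\s*(?P<ipl>\d+),\s*pid=\s*(?P<pid>\d+)'
)


def parse_chunkbadmagic(line):
    """Return the message fields as a dict, or None if the line does not match."""
    match = CHUNKBADMAGIC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Assumption: these four fields are hexadecimal numbers.
    for key in ("chunk", "data", "chunkmagic", "chunk_freemagic"):
        fields[key] = int(fields[key], 16)
    for key in ("ipl", "pid"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    for name, value in parse_chunkbadmagic(SAMPLE).items():
        print(name, hex(value) if name.startswith(("chunk", "data")) else value)
```

Collecting these fields from several crashinfo files makes it easier to see whether the same process and the same header field are involved each time.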

This is a memory corruption issue. Within IOS, each chunk of memory has a "header" followed by the data itself. The header describes various attributes of the data, including what type of data is held within the chunk. Any chunk of memory that is freed is given what is known as a "CHUNKFREEMAGIC" code within the header. In your case, we have determined that this field had been corrupted, varying by only a few bits from the expected value.

After contacting internal groups that support the Supervisor 7-E, I was referred to bug CSCtr35883, which is a similar sort of memory corruption on your version of IOS, identical except that the individual who originally hit the bug saw corruption in a different field. From your crash logs, it appears that you are running into this defect. Apparently, several bits are being written into the corrupted chunk of memory from the chunk that comes just before it, a phenomenon known as "overflow."
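To make the mechanism concrete, here is a toy Python model of the idea described above: a magic value stored in each chunk header, an unchecked write that spills out of one chunk into the header of the next, and a periodic check that notices the damage. It is only a sketch under stated assumptions. The magic constant, the two-field header layout, and all function names are hypothetical and are not the real IOS values or data structures.

```python
import struct

# Hypothetical constant and layout, for illustration only (not the real IOS values).
CHUNK_MAGIC = 0xC0FFEE01
HEADER = struct.Struct(">II")  # header = magic value, data length


def build_heap(sizes):
    """Lay out consecutive chunks (header followed by zeroed data) in one buffer."""
    heap = bytearray()
    offsets = []
    for size in sizes:
        offsets.append(len(heap))
        heap += HEADER.pack(CHUNK_MAGIC, size) + bytes(size)
    return heap, offsets


def write_data(heap, offset, payload):
    """Copy payload into the chunk at offset with NO bounds check (the modelled bug)."""
    start = offset + HEADER.size
    heap[start:start + len(payload)] = payload


def check_heap(heap, offsets):
    """Return the offsets of chunks whose header magic no longer matches."""
    bad = []
    for offset in offsets:
        magic, _length = HEADER.unpack_from(heap, offset)
        if magic != CHUNK_MAGIC:
            bad.append(offset)
    return bad


def bits_changed(heap, offset):
    """Count how many bits of the stored magic differ from the expected value."""
    magic, _length = HEADER.unpack_from(heap, offset)
    return bin(magic ^ CHUNK_MAGIC).count("1")


if __name__ == "__main__":
    heap, offsets = build_heap([16, 16])
    print("before:", check_heap(heap, offsets))
    # 18 bytes into a 16-byte chunk: the last 2 bytes land in the next chunk's header.
    write_data(heap, offsets[0], b"A" * 18)
    print("after:", check_heap(heap, offsets))
    print("bits changed in second header:", bits_changed(heap, offsets[1]))
```

In this model the chunk that gets flagged is the victim, not the culprit: the write that caused the damage belongs to the chunk just before it, which matches the overflow description above.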

HTH

Regards

Inayath

8 Replies

InayathUlla Sharieff
Cisco Employee

Hi Fabio,

First things first: what software or hardware changes were made before this issue started?

- Can you remove the unsupported SFP module that is plugged into this switch and let me know the result?

- Get me the output of show module, show platform crashdump, show logging, and show process cpu sorted | ex 0.00

- If the device is not in production, kindly reseat the module once. If you have a redundant module, try a failover and let us know the result; if there is no redundancy, try inserting the supervisor in a different slot.

Kindly furnish this information and I will try to find the root cause.

HTH

Regards

Inayath

*Please rate all useful posts.

Hi Inayath. First of all, thank you so much for your help.

This is a client's switch that we are currently monitoring remotely. We asked them about recent changes, but they said nothing has been done (neither hardware nor software). As I mentioned, the log doesn't show much. The problem is that I do not have the switch here: I'm monitoring it remotely, and it is the client's property. So I need to troubleshoot the issue and send them a solution (or instructions that could help solve the problem). That's why I was wondering if I should ask them to reseat the module or try something with the transceiver.

#show module

Chassis Type : WS-C4510R+E

Power consumed by backplane : 40 Watts

Mod Ports Card Type                              Model              Serial No.
---+-----+--------------------------------------+------------------+-----------
1    48  10/100BaseTX (RJ45)V, Cisco/IEEE       WS-X4248-RJ45V     JAE1631073V
2    48  10/100BaseTX (RJ45)                    WS-X4148-RJ        JAE162708IS
3    48  10/100BaseTX (RJ45)                    WS-X4148-RJ        JAE162708KK
4    48  10/100/1000BaseT Premium POE E Series  WS-X4748-RJ45V+E   CAT1529L40G
5     4  Sup 7-E 10GE (SFP+), 1000BaseX (SFP)   WS-X45-SUP7-E      CAT1634L27U
6     4  Sup 7-E 10GE (SFP+), 1000BaseX (SFP)   WS-X45-SUP7-E      CAT1634L26S
7    48  10/100BaseTX (RJ45)                    WS-X4148-RJ        JAE162708LJ
8    48  10/100BaseTX (RJ45)                    WS-X4148-RJ        JAE162708IO
9    48  10/100/1000BaseT UPOE E Series         WS-X4748-UPOE+E    CAT1615L0QB

M MAC addresses                    Hw  Fw           Sw               Status
--+--------------------------------+---+------------+----------------+---------
1 a44c.117d.6c40 to a44c.117d.6c6f 4.2                               Ok
2 a493.4c3e.50a0 to a493.4c3e.50cf 3.4                               Ok
3 a493.4c3e.4aa0 to a493.4c3e.4acf 3.4                               Ok
4 44d3.ca96.6fc0 to 44d3.ca96.6fef 1.2                               Ok
5 fc99.471f.5240 to fc99.471f.5243 2.1 15.0(1r)SG5  03.02.00.SG      Ok
6 fc99.471f.5244 to fc99.471f.5247 2.1 15.0(1r)SG5  03.02.00.SG      Ok
7 a493.4c3e.5010 to a493.4c3e.503f 3.4                               Ok
8 a493.4c3e.50d0 to a493.4c3e.50ff 3.4                               Ok
9 a44c.1107.f73c to a44c.1107.f76b 1.1                               Ok

Mod  Redundancy role     Operating mode      Redundancy status
----+-------------------+-------------------+----------------------------------
5   Active Supervisor   SSO                 Active
6   Standby Supervisor  SSO                 Standby hot

show proc cpu sort | e 0.0

Core 0: CPU utilization for five seconds: 3%; one minute: 3%; five minutes: 3%

Core 1: CPU utilization for five seconds: 16%; one minute: 16%; five minutes: 16%

PID    Runtime(ms) Invoked  uSecs  5Sec     1Min     5Min     TTY   Process

10189  506319      1703208  2235   11.16796 11.71777 12.10644 0     iosd 

show logg

Syslog logging: enabled (0 messages dropped, 1 messages rate-limited, 0 flushes, 0 overruns, xml disabled, filtering disabled)

Log Buffer (128000 bytes):

*Apr 25 23:41:37.652: %C4K_IOSSYS-6-IMAGELEVEL: Supervisor booting in image level 'entservices'

*Apr 25 23:41:37.716: %C4K_REDUNDANCY-6-INIT: STANDBY:Initializing as STANDBY supervisor

*Apr 25 23:41:42.764: %C4K_REDUNDANCY-6-DUPLEX_MODE: STANDBY:The peer Supervisor has been detected

*Apr 25 23:41:42.785: %C4K_REDUNDANCY-3-COMMUNICATION: STANDBY:Communication with the peer Supervisor has been established

Apr 25 23:42:08.855: %C4K_REDUNDANCY-6-MODE: STANDBY:STANDBY supervisor initializing for sso mode

Apr 25 23:42:12.502: %SPANTREE-5-EXTENDED_SYSID: STANDBY:Extended SysId enabled for type vlan

Apr 25 23:42:13.621: %C4K_IOSMODPORTMAN-6-MODULEONLINE: STANDBY:Module 1 (WS-X4248-RJ45V S/N: JAE1631073V Hw: 4.2) is online

Apr 25 23:42:13.621: %C4K_IOSMODPORTMAN-6-MODULEONLINE: STANDBY:Module 2 (WS-X4148-RJ S/N: JAE162708IS Hw: 3.4) is online

Apr 25 23:42:13.621: %C4K_IOSMODPORTMAN-6-MODULEONLINE: STANDBY:Module 3 (WS-X4148-RJ S/N: JAE162708KK Hw: 3.4) is online

Apr 25 23:42:13.621: %C4K_IOSMODPORTMAN-6-MODULEONLINE: STANDBY:Module 4 (WS-X4748-RJ45V+E S/N: CAT1529L40G Hw: 1.2) is online

Apr 25 23:42:13.621: %C4K_IOSMODPORTMAN-6-MODULEONLINE: STANDBY:Module 5 (WS-X45-SUP7-E S/N: CAT1634L27U Hw: 2.1) is online

Apr 25 23:42:13.621: %C4K_IOSMODPORTMAN-6-MODULEONLINE: STANDBY:Module 6 (WS-X45-SUP7-E S/N: CAT1634L26S Hw: 2.1) is online

Apr 25 23:42:13.621: %C4K_IOSMODPORTMAN-6-MODULEONLINE: STANDBY:Module 7 (WS-X4148-RJ S/N: JAE162708LJ Hw: 3.4) is online

Apr 25 23:42:13.621: %C4K_IOSMODPORTMAN-6-MODULEONLINE: STANDBY:Module 8 (WS-X4148-RJ S/N: JAE162708IO Hw: 3.4) is online

Apr 25 23:42:13.621: %C4K_IOSMODPORTMAN-6-MODULEONLINE: STANDBY:Module 9 (WS-X4748-UPOE+E S/N: CAT1615L0QB Hw: 1.1) is online

000016: Apr 25 18:42:19.566 GMT-5: %GBIC_SECURITY-4-SECURITY_DISABLED: STANDBY:Unsupported transceiver support enabled

000017: Apr 25 18:42:19.618 GMT-5: %SYS-6-CLOCKUPDATE: STANDBY:System clock has been updated from 18:42:19 GMT-5 Thu Apr 25 2013 to 18:42:19 GMT-5 Thu Apr 25 2013, configured from console by console.

000018: Apr 25 18:42:24.993 GMT-5: %SSH-5-DISABLED: STANDBY:SSH 2.0 has been disabled

000019: Apr 25 18:42:29.150 GMT-5: %SYS-5-RESTART: STANDBY:System restarted --

Cisco IOS Software, IOS-XE Software, Catalyst 4500 L3 Switch Software (cat4500e-UNIVERSALK9-M), Version 03.02.00.SG RELEASE SOFTWARE (fc4)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 1986-2011 by Cisco Systems, Inc.

Compiled Tue 26-Apr-11 18:55 by prod_rel_team

000020: Apr 25 18:42:29.180 GMT-5: %SSH-5-ENABLED: STANDBY:SSH 2.0 has been enabled

000021: .Apr 25 18:42:30.227 GMT-5: %SSH-5-DISABLED: STANDBY:SSH 2.0 has been disabled

000022: .Apr 25 18:42:30.228 GMT-5: %SSH-5-ENABLED: STANDBY:SSH 2.0 has been enabled

000023: .Apr 25 19:07:30.111 GMT-5: %C4K_REDUNDANCY-6-INIT: Initializing as ACTIVE supervisor

000024: .Apr 25 19:07:30.164 GMT-5: %C4K_REDUNDANCY-3-COMMUNICATION: Communication with the peer Supervisor has been lost

000025: .Apr 25 19:07:30.176 GMT-5: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 172.22.164.10 Port 514 started - CLI initiated

000026: .Apr 25 19:07:30.177 GMT-5: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 172.22.127.197 Port 514 started - CLI initiated

000027: .Apr 25 19:07:30.178 GMT-5: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 172.22.45.163 Port 514 started - CLI initiated

000028: .Apr 25 19:07:30.188 GMT-5: %C4K_REDUNDANCY-3-SIMPLEX_MODE: The peer Supervisor has been lost

000029: Apr 25 19:11:01.617 GMT-5: %C4K_REDUNDANCY-6-DUPLEX_MODE: The peer Supervisor has been detected

000030: Apr 25 19:11:40.271 GMT-5: %C4K_IOSMODPORTMAN-6-MODULEONLINE: Module 6 (WS-X45-SUP7-E S/N: CAT1634L26S Hw: 2.1) is online

000031: Apr 25 19:11:40.279 GMT-5: %C4K_REDUNDANCY-6-MODE: ACTIVE supervisor initializing for sso mode

000032: Apr 25 19:11:40.794 GMT-5: %C4K_REDUNDANCY-3-COMMUNICATION: Communication with the peer Supervisor has been established

000033: Apr 25 19:11:52.438 GMT-5: %C4K_REDUNDANCY-5-CONFIGSYNC: The bootvar has been successfully synchronized to the standby supervisor

000034: Apr 25 19:11:52.439 GMT-5: %C4K_REDUNDANCY-5-CONFIGSYNC: The config-reg has been successfully synchronized to the standby supervisor

000035: Apr 25 19:11:52.440 GMT-5: %C4K_REDUNDANCY-5-CALENDAR: The calendar has been successfully synchronized to the standby supervisor for the first time

000036: Apr 25 19:11:52.440 GMT-5: %C4K_REDUNDANCY-5-CONFIGSYNC: The startup-config has been successfully synchronized to the standby supervisor

000037: Apr 25 19:11:53.013 GMT-5: %C4K_REDUNDANCY-5-CONFIGSYNC: The private-config has been successfully synchronized to the standby supervisor

000038: Apr 25 19:11:54.906 GMT-5: %C4K_REDUNDANCY-5-CONFIGSYNC_RATELIMIT: The vlan database has been successfully synchronized to the standby supervisor

000039: Apr 25 19:12:47.441 GMT-5: %HA_CONFIG_SYNC-6-BULK_CFGSYNC_SUCCEED: Bulk Sync succeeded

000040: Apr 25 19:12:47.442 GMT-5: %RF-5-RF_TERMINAL_STATE: Terminal state reached for (SSO)

000041: 000021: .Apr 25 19:12:28.837 GMT-5: %SSH-5-DISABLED: STANDBY:SSH 2.0 has been disabled

000042: 000022: .Apr 25 19:12:28.838 GMT-5: %SSH-5-ENABLED: STANDBY:SSH 2.0 has been enabled

Thanks a lot
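The restart cadence can be estimated from the buffer itself. The sketch below, in Python, is an illustration and not part of any Cisco tooling. It assumes that %SYS-5-RESTART and %C4K_REDUNDANCY-6-INIT lines are reasonable markers of a supervisor (re)starting, which is my assumption and not something confirmed in this thread; it only uses lines that carry the GMT-5 label so that two different clocks are not mixed; and the year has to be supplied because the log lines do not include it. All names are hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
#   000019: Apr 25 18:42:29.150 GMT-5: %SYS-5-RESTART: STANDBY:System restarted --
# Leading sequence numbers and the '.' or '*' clock markers are skipped.
LINE_RE = re.compile(
    r"^(?:\d+:\s+)*[.*]?"
    r"(?P<mon>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\.\d+"
    r"(?:\s+(?P<tz>\S+))?:\s+%(?P<mnemonic>[A-Z0-9_]+-\d-[A-Z0-9_]+):"
)

# Assumption: these messages are treated as "a supervisor just (re)started".
RESTART_MNEMONICS = {"SYS-5-RESTART", "C4K_REDUNDANCY-6-INIT"}


def restart_times(lines, year, tz="GMT-5"):
    """Return sorted timestamps of restart-marker lines that carry the given tz label."""
    times = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        if match["mnemonic"] not in RESTART_MNEMONICS or match["tz"] != tz:
            continue
        stamp = f"{year} {match['mon']} {match['day']} {match['time']}"
        times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return sorted(times)


def gaps_in_minutes(times):
    """Return the gap, in minutes, between each pair of consecutive timestamps."""
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    excerpt = [
        "000019: Apr 25 18:42:29.150 GMT-5: %SYS-5-RESTART: STANDBY:System restarted --",
        "000023: .Apr 25 19:07:30.111 GMT-5: %C4K_REDUNDANCY-6-INIT: Initializing as ACTIVE supervisor",
    ]
    print(gaps_in_minutes(restart_times(excerpt, year=2013)))
```

For the two lines in the excerpt the gap is a little over 25 minutes, in line with the roughly half-hour cycle reported in the first post.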

I'd check to see if the line card is properly seated in the slot.

If it is, pull it out and call TAC for an RMA.  I suspect the power module could be failing.

Hi Fabio,

I have been reviewing the logs, and it looks like the switchover was triggered when the standby detected a loss of heartbeat messages from the active.

000032: Apr 25 19:11:40.794 GMT-5: %C4K_REDUNDANCY-3-COMMUNICATION: Communication with the peer Supervisor has been established

000033: Apr 25 19:11:52.438 GMT-5: %C4K_REDUNDANCY-5-CONFIGSYNC: The bootvar has been successfully synchronized to the standby supervisor

000034: Apr 25 19:11:52.439 GMT-5: %C4K_REDUNDANCY-5-CONFIGSYNC: The config-reg has been successfully synchronized to the standby supervisor

000035: Apr 25 19:11:52.440 GMT-5: %C4K_REDUNDANCY-5-CALENDAR: The calendar has been successfully synchronized to the standby supervisor for the first time

000036: Apr 25 19:11:52.440 GMT-5: %C4K_REDUNDANCY-5-CONFIGSYNC: The startup-config has been successfully synchronized to the standby supervisor

Before I proceed further, can you please provide the following:

1) If we can get the output of show tech, I can try decoding the traces to find the exact root cause. At the least, get me the output of show platform crashdump.

2) If this is not possible, then go ahead and RMA the module; the ASIC on the module may have gone bad.

HTH

Regards

Inayath

Hi Inayath

I've just uploaded the show tech. Let me know if you need anything else. Thanks a lot.

Fabio.

Hi Inayath.

That was indeed the problem. I found some information about bug CSCtr35883, and it affects Catalyst 4500 switches running IOS version 03.02.00.SG, which is the one our switch has.

Reading another forum, I found that the recommendation was to upgrade the IOS version to solve the issue. So we asked the client to upgrade to IOS version 03.02.05.SG. Once they did that, the supervisors stopped rebooting.
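For anyone checking several switches against this finding, a version comparison can be scripted. The Python sketch below is only an illustration: the function names are hypothetical, the parser handles only the "NN.NN.NN.TRAIN" form seen in this thread, and using 03.02.05.SG as the threshold reflects what worked here, not an official fixed-in release for CSCtr35883.

```python
import re

# Handles only the form seen in this thread, e.g. "03.02.00.SG".
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\.([A-Za-z]+)$")


def parse_version(text):
    """Split a version string into a numeric tuple and a release-train suffix."""
    match = VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised version format: {text!r}")
    numbers = tuple(int(part) for part in match.group(1, 2, 3))
    return numbers, match.group(4).upper()


def older_than(running, threshold):
    """Return True if 'running' is an earlier release than 'threshold' in the same train."""
    run_numbers, run_train = parse_version(running)
    thr_numbers, thr_train = parse_version(threshold)
    if run_train != thr_train:
        # Ordering across trains is not defined by this sketch; compare those by hand.
        raise ValueError("different release trains")
    return run_numbers < thr_numbers


if __name__ == "__main__":
    # Threshold taken from this thread's outcome (assumption, see the note above).
    print(older_than("03.02.00.SG", "03.02.05.SG"))
```

Comparing the numeric parts as integers avoids the pitfalls of comparing version strings character by character.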

Thanks a lot for your help!

Fabio.

Hi,

Our network contains two Cisco 4507+E switches and two SUP 7L-E supervisors. They were running the IOS image cat4500e-universalk9.SPA.03.07.00.E.152-3.E.bin.
MACsec was in use on a port channel to a stack of two Cisco 3850 switches.
Afterwards, one of the switches started rebooting randomly every couple of days, with show version reporting the reload reason "software critical exception" after each restart.

Checking the logs, we saw this error: "%C4K_SWITCHINGENGINEMAN-4-VFEIMINTERRUPT:" followed by unreadable characters.
The supervisors were upgraded to versions 3.8.6 and 3.8.1, but that did not address the problem.
ROMMON was upgraded to cat4500-e-ios-promupgrade-150-1r-SG14, and that did not solve the issue either.
We also tried expanding the VSL links from 4 GB to 20 GB, and the restarts kept happening.

Finally, MACsec was removed from the link to the Cisco 3850 stack, and that did the trick!

I am still wondering whether Cisco has a solution for this problem.

Regards.