
7604 RSP720-3CXL-GE systematic crash on boot

paketuzer
Level 1

Hello,

This router had been running fine for a few years.

We had to perform electrical maintenance yesterday morning, and the router had to be powered down.

When it was powered back on, it started crashing systematically right after booting.

We've tried removing the line cards and keeping only the RSP720 module.

We've also disconnected everything from power for a while.

We've tried two different IOS versions.

But it still crashes the same way (a software-forced crash).

Looking through old syslogs, I found the following error repeating:

SP: FAILED to write to DS1338 RTC device

Has anyone had a similar issue?

Thanks,

These are the only errors/odd messages I found during boot (otherwise, check the attachment):

Booting 15.1.3-S4:

*Aug  7 09:48:19.103: %DIAG-3-CARD_ABSENT:      è4xÁ¬è is not detected

*Aug  7 09:48:19.107: scp assert failure: queue != NULL: ../const/native/scp_const.c: 940

*Aug  7 09:48:19.107: -Traceback= 81BDC04z 86ABD28z 86AC3F0z 8C7BF6Cz 83A9A4Cz 83A9C3Cz 83850A0z 83977A0z 837D8B4z 8C7C000z 83A90A0z 83A32D4z

*Aug  7 09:48:19.107: %SCHED-7-WATCH: Attempt to monitor uninitialized watched queue (address 0). -Process= "slcp process", ipl= 0, pid= 138

-Traceback= 81BC060z 837F07Cz 86AC400z 8C7BF6Cz 83A9A4Cz 83A9C3Cz 83850A0z 83977A0z 837D8B4z 8C7C000z 83A90A0z 83A32D4z

*Aug  7 09:48:20.175: %OIR-6-CONSOLE: Changing console ownership to switch processor

*** System received a Software forced crash ***

signal= 0x17, code= 0x1500, context= 0xd77a148

PC = 0x83a74e0, Vector = 0x1500, SP = 0x146a6bc8

000028: *Aug  7 11:53:26.165: %DIAG-SP-6-RUN_MINIMUM: Module 1: Running Minimal Diagnostics...

%Software-forced reload

11:53:34 CEST Wed Aug 7 2013: Unexpected exception to CPU: vector 1500, PC = 0xB6A5E84 , LR = 0xB6A5E18

-Traceback= 0xB6A5E84z 0xB6A5E18z 0xB341684z 0xB190AFCz 0xB1DD090z 0xB1DD0F4z 0x8D600BCz 0x8D61450z 0x8D61FA4z 0x8D6255Cz 0xB06435Cz 0xB064AFCz 0xB71C0F0z 0xB71C430z 0xB71DE90z 0xB69692Cz

*** System received a Software forced crash ***

signal= 0x17, code= 0x1500, context= 0x12e4c850

PC = 0xb6a5e84, Vector = 0x1500, SP = 0x1ba51fa8

System Bootstrap, Version 12.2(33r)SRD5, RELEASE SOFTWARE (fc1)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 2009 by cisco Systems, Inc.

*Aug  7 11:53:36.013: %SYS-SP-3-LOGGER_FLUSHING: System pausing to ensure console debugging output.

*Aug  7 11:53:35.229: %C7600_ENV-SP-4-FANCOUNTFAILED: Required number of fan trays is not present

*Aug  7 11:53:36.013: %OIR-SP-6-CONSOLE: Changing console ownership to switch processor

*** System received a Software forced crash ***

signal= 0x17, code= 0x1500, context= 0xd77a148

PC = 0x83a74e0, Vector = 0x1500, SP = 0x17ecd450

Booting 12.2.33-SRE1:

000019: *Aug  7 12:06:36.975: %FABRIC-SP-5-FABRIC_MODULE_ACTIVE: The Switch Fabric Module in slot 1 became active.

000020: *Aug  7 12:06:38.987: %DIAG-SP-6-RUN_MINIMUM: Module 1: Running Minimal Diagnostics...

%Software-forced reload

12:06:46 CEST Wed Aug 7 2013: Unexpected exception to CPU: vector 1500, PC = 0xAE94130 , LR = 0xAE940F4

-Traceback= 0xAE94130 0xAE940F4 0xAF27184 0xAE85F2C 0xAF41170 0xAF41170 0xAE85FE8 0xAE90BAC 0xAE90C30 0xAE8D178

*** System received a Software forced crash ***

signal= 0x17, code= 0x1500, context= 0xe5b73c8

PC = 0xae94130, Vector = 0x1500, SP = 0x158a99b8


*Aug  7 12:06:48.059: %SYS-SP-3-LOGGER_FLUSHING: System pausing to ensure console debugging output.

*Aug  7 12:06:46.895: %EARL-SP-2-PATCH_INVOCATION_LIMIT: 10 Recovery patch invocations in the last 30 secs have been attempted. Max limit reached

*Aug  7 12:06:48.059: %OIR-SP-6-CONSOLE: Changing console ownership to switch processor

*** System received a Software forced crash ***

signal= 0x17, code= 0x1500, context= 0xd247660

PC = 0x835948c, Vector = 0x1500, SP = 0x16026890

Vignesh Rajendran Praveen
Cisco Employee

Hello,

I have finished analyzing the outputs you provided. Please find the problem analysis and next action plan below.

Problem Analysis:-

=================

The RSP720 comprises the Switch Processor (SP) and the Route Processor (RP). I believe the SP is set with a different configuration register value than the RP. A value of 0x0 was set on the SP, which means that "break" is enabled. So the SP must have been receiving a signal that it interpreted as a "break", and that was causing the router to reload.
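To make the register values in this thread concrete (0x0 reported on the SP, 0x2102 recommended, 0x2142 tried later), here is a minimal Python sketch that decodes a few commonly documented Cisco configuration-register bits. The helper `decode_confreg` and its field names are hypothetical; the bit positions are the generic Cisco ones (boot field in bits 0-3, ignore NVRAM configuration in bit 6, Break disabled in bit 8, ROM fallback after a failed netboot in bit 13), and it is an assumption that they apply unchanged to both RSP720 processors, so confirm against the platform documentation.

```python
from typing import Dict, Union

# Generic Cisco configuration-register bits (assumed to apply to the RSP720 too).
BOOT_FIELD_MASK = 0x000F      # bits 0-3: boot field
IGNORE_NVRAM = 0x0040         # bit 6: ignore the startup configuration in NVRAM
BREAK_DISABLED = 0x0100       # bit 8: Break key is ignored when this bit is set
ROM_ON_NETBOOT_FAIL = 0x2000  # bit 13: boot default ROM software if netboot fails


def decode_confreg(value: int) -> Dict[str, Union[str, bool]]:
    """Hypothetical helper: summarize the bits discussed in this thread."""
    boot_field = value & BOOT_FIELD_MASK
    if boot_field == 0x0:
        boot = "stay in ROMMON"
    elif boot_field == 0x1:
        boot = "platform-dependent"
    else:
        boot = "use boot system commands"
    return {
        "boot": boot,
        "break_enabled": not (value & BREAK_DISABLED),
        "ignore_startup_config": bool(value & IGNORE_NVRAM),
        "rom_on_netboot_fail": bool(value & ROM_ON_NETBOOT_FAIL),
    }


for reg in (0x0, 0x2102, 0x2142):
    print(hex(reg), decode_confreg(reg))
```

With these bit positions, 0x2102 and 0x2142 differ only in bit 6, which is why 0x2142 skips the saved configuration but otherwise boots the same way, while 0x0 leaves Break enabled and keeps the processor in ROMMON.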

Next Action Plan:-

=================

Change the configuration register on both the Switch Processor and the Route Processor to 0x2102.

The procedure is as follows.

Step 1

Remove the RSP720 from the chassis. Ensure that there are enough power supplies and fan trays to support normal operation of the chassis. Now reinsert the RSP720. While it boots, wait for the following log message to be displayed:

%OIR-6-CONSOLE: Changing console ownership to switch processor

As soon as you see the above message, break into SP ROMMON mode.

Now set the config register to 0x2102 (confreg 0x2102), then run sync, then reset.

Step 2

After that, wait until the following message appears:

%OIR-SP-6-CONSOLE: Changing console ownership to route processor

As soon as you see the above message, break into RP ROMMON mode.

Now set the config register to 0x2102 (confreg 0x2102), then run sync, then reset.

This should resolve the issue.
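The thread later asks how to tell which ROMMON you are in. Going only by the console output quoted in this thread, each ROMMON banner names its processor ("C7600-RSP720/SP platform" vs. "C7600-RSP720/RP pltform"), and the procedure above ties each console-ownership message to one processor. A minimal Python sketch of those two checks follows; the function names are hypothetical, and the string matching assumes the exact message text shown in these logs.

```python
import re
from typing import Optional

# ROMMON banner text as it appears in the console logs in this thread.
BANNER_RE = re.compile(r"C7600-RSP720/(SP|RP)\b")


def rommon_processor(banner_line: str) -> Optional[str]:
    """Return 'SP' or 'RP' if the line is an RSP720 ROMMON banner, else None."""
    match = BANNER_RE.search(banner_line)
    return match.group(1) if match else None


def break_target(console_line: str) -> Optional[str]:
    """Which ROMMON a break should land in, following the two-step procedure."""
    if "Changing console ownership to switch processor" in console_line:
        return "SP"
    if "Changing console ownership to route processor" in console_line:
        return "RP"
    return None


print(rommon_processor("C7600-RSP720/RP pltform with 2097152 Kbytes of main memory"))
```

Matching on the banner is the more reliable check of the two, because it is printed by the ROMMON you actually landed in rather than inferred from timing.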


Thanks & Regards,

Vignesh R P

Hello,

Regarding the "SP: FAILED to write to DS1338 RTC device" error log, here is the explanation.

The DS1338 IC is the real-time clock (RTC) with NVRAM. The DS1338 serial RTC is a low-power, full binary-coded decimal (BCD) clock/calendar plus 56 bytes of NV SRAM. It seems the router failed to write information to the chip; however, judging from 'show clock', the clock itself does not currently seem to be affected.
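As an illustration of what "full BCD" means here, the sketch below packs and unpacks one two-digit clock field per byte, with the tens digit in the high nibble. It is a generic BCD example in Python, not the router's RTC driver, and it does not model the DS1338's actual register map; the function names are hypothetical.

```python
def to_bcd(value: int) -> int:
    """Pack 0-99 into one byte: tens digit in the high nibble, units in the low."""
    if not 0 <= value <= 99:
        raise ValueError("a packed BCD byte holds 0-99")
    return ((value // 10) << 4) | (value % 10)


def from_bcd(byte: int) -> int:
    """Unpack one packed BCD byte, rejecting nibbles above 9."""
    tens, units = byte >> 4, byte & 0x0F
    if tens > 9 or units > 9:
        raise ValueError(f"invalid BCD byte 0x{byte:02X}")
    return tens * 10 + units


# A minute value of 53 is stored as 0x53, not 0x35 (53 in plain binary).
print(hex(to_bcd(53)))
```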

The error is cosmetic and should not have any performance impact on the router.

Also, please let me know which IOS version these error logs were seen on.

Thanks & Regards,

Vignesh R P

Hello Vignesh,

Please note that at some point I configured the following: confreg 0x2142

in the hope that the router would manage to boot cleanly without loading the full configuration.

But that didn't help.

I will now try changing the confreg back on both the SP and the RP.

I'll get back to you when done.

But the fact that we have a software-forced reload linked to an unexpected CPU exception makes me think it is not confreg-related.

Regarding the RTC failure message:

We were seeing it on 12.2(33)SRE1.

I haven't seen those logs since we tried 15.1.3-S4.

Thanks

Hello,

I still believe that the issue you are facing is due to the config-register values. Please follow the action plan I provided; it should help resolve the problem.

Regarding the RTC failure message, it is due to Cisco IOS bug CSCts20541. This bug hits 12.2(33)SRE1, but I believe it should be fixed in the other release you are using.

Thanks & Regards,

Vignesh R P

Hi Vignesh,

I've done a fresh boot session (from electrical power-on).

Short version: after configuring 0x2102, it boots and crashes three times and then gives up.

Below you will find the most interesting steps;

the full log is attached.

The first steps were done before power-cycling:

rommon 4 > BOOT=bootdisk:c7600rsp72043-ipservicesk9-mz.122-33.SRE1.bin

rommon 5 > sync

rommon 6 >

Disconnect power, remove the RSP, insert the RSP, power on:

System Bootstrap, Version 12.2(33r)SRD5, RELEASE SOFTWARE (fc1)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 2009 by cisco Systems, Inc.

C7600-RSP720/SP platform with 1048576 Kbytes of main memory

rommon 1 > set

PS1=rommon ! >

RELOAD_TYPE=1

NT_K=0:0:0:0

SLOTCACHE=cards;

LOG_PREFIX_VERSION=1

RET_2_RTS=06:29:05 CEST Tue Jul 23 2013

RET_2_RCALTS=

RANDOM_NUM=599429381

?=0

BSI=0

PF_REDUN_CRASH_COUNT=0

CRASHINFO=crashinfo_FAILED

BOOT=bootdisk:c7600rsp72043-ipservicesk9-mz.122-33.SRE1.bin


rommon 2 > confreg 0x2102

You must reset or power cycle for new config to take effect

rommon 3 > sync

rommon 4 > reset

Resetting .......

System Bootstrap, Version 12.2(33r)SRD5, RELEASE SOFTWARE (fc1)

<....>

Cisco IOS Software, c7600rsp72043_sp Software (c7600rsp72043_sp-IPSERVICESK9-M), Version 12.2(33)SRE1, RELEASE SOFTWARE (fc2)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 1986-2010 by Cisco Systems, Inc.

Compiled Mon 29-Mar-10 22:56 by prod_rel_team

*Aug  8 10:22:06.327: %SYS-SP-3-LOGGER_FLUSHING: System pausing to ensure console debugging output.

*Aug  8 10:22:03.551: %PFREDUN-6-ACTIVE: Initializing as ACTIVE processor

*Aug  8 10:22:06.327: %OIR-SP-6-CONSOLE: Changing console ownership to route processor

I hit Ctrl-Break (Fn-Break on my ThinkPad):

System Bootstrap, Version 12.2(33r)SRD5, RELEASE SOFTWARE (fc1)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 2009 by cisco Systems, Inc.

C7600-RSP720/RP pltform with 2097152 Kbytes of main memory

rommon 1 >

So I guess I'm in RP ROMMON now.

I configure confreg 0x2102 and then reset:

rommon 1 > confreg 0x2102

rommon 2 > sync

rommon 3 > set

PS1=rommon ! >

RELOAD_TYPE=1

LOG_PREFIX_VERSION=1

SLOTCACHE=cards;

BOOT=sup-bootdisk:c7600rsp72043-advipservicesk9-mz.151-3.S4.bin,1;sup-bootdisk:c7600rsp72043-ipservicesk9-mz.122-33.SRE1.bin,1;

?=0

RET_2_RTS=12:06:24 CEST Thu Aug 8 2013

RET_2_RCALTS=1375956389

CRASHINFO=bootdisk:crashinfo_20130808-120624-CEST


rommon 4 > confreg

           Configuration Summary

   (Virtual Configuration Register: 0x2102)

enabled are:

[ 0 ] load rom after netboot fails

[ 1 ] console baud: 9600

boot: ...... image specified by the boot system commands or default to: cisco2-C7600-RSP720/RP

do you wish to change the configuration? y/n  [n]:  n

rommon 5 > reset

Resetting .......

From there on, it auto-reboots three times and crashes each time (software-forced reload) with the following log.

The reload always happens right after "Running Minimal Diagnostics":

000013: *Aug  8 10:28:27.846: %SYS-SP-5-RESTART: System restarted --

Cisco IOS Software, c7600rsp72043_sp Software (c7600rsp72043_sp-IPSERVICESK9-M), Version 12.2(33)SRE1, RELEASE SOFTWARE (fc2)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 1986-2010 by Cisco Systems, Inc.

Compiled Mon 29-Mar-10 22:56 by prod_rel_team

000014: *Aug  8 10:28:29.883: %OIR-SP-6-INSPS: Power supply inserted in slot 1

000015: *Aug  8 10:28:29.883: %C7600_PWR-SP-4-PSOK: power supply 1 turned on.

000016: *Aug  8 12:28:31.879: %SNMP-5-COLDSTART: SNMP agent on host br01.gva-cogent is undergoing a cold start

000017: *Aug  8 12:28:31.899: %FABRIC-SP-5-CLEAR_BLOCK: Clear block option is off for the fabric in slot 1.

000018: *Aug  8 12:28:31.979: %FABRIC-SP-5-FABRIC_MODULE_ACTIVE: The Switch Fabric Module in slot 1 became active.

000019: *Aug  8 12:28:33.991: %DIAG-SP-6-RUN_MINIMUM: Module 1: Running Minimal Diagnostics...

%Software-forced reload

12:28:41 CEST Thu Aug 8 2013: Unexpected exception to CPU: vector 1500, PC = 0xAE94130 , LR = 0xAE940F4

-Traceback= 0xAE94130 0xAE940F4 0xAF27184 0xAE85F2C 0xAF41170 0xAF41170 0xAE85FE8 0x98EA348 0xAE900F8 0xAE91874 0xAE8D178

CPU Register Context:

MSR = 0x00029200  CR  = 0x20000022  CTR = 0x0B0A6B04  XER   = 0x00000000

R0  = 0x0AE940F4  R1  = 0x158A99B8  R2  = 0xFFFCFFFC  R3    = 0x156C92B0

R4  = 0x78070442  R5  = 0xDC00003C  R6  = 0x78070442  R7    = 0x00000001

R8  = 0x00029200  R9  = 0x00000000  R10 = 0x14FA117C  R11   = 0xFFB40000

R12 = 0x00000FF9  R13 = 0x04044000  R14 = 0x0EFF1F40  R15   = 0x0EFF202C

R16 = 0x00000001  R17 = 0x00000001  R18 = 0x00000000  R19   = 0x0D470000

R20 = 0x00000001  R21 = 0x0F015758  R22 = 0x0D470000  R23   = 0x0F0158B0

R24 = 0x0F015600  R25 = 0x0D470000  R26 = 0x0000FFFF  R27   = 0xFFB40000

R28 = 0x00000005  R29 = 0x0C2E8340  R30 = 0x00021200  R31   = 0x00000000

Writing crashinfo to bootdisk:crashinfo_20130808-122841-CEST

1076 Unused bytes of context save space

*** System received a Software forced crash ***

signal= 0x17, code= 0x1500, context= 0xe5b73c8

PC = 0xae94130, Vector = 0x1500, SP = 0x158a99b8


*Aug  8 12:28:43.067: %SYS-SP-3-LOGGER_FLUSHING: System pausing to ensure console debugging output.

*Aug  8 12:28:41.899: %EARL-SP-2-PATCH_INVOCATION_LIMIT: 10 Recovery patch invocations in the last 30 secs have been attempted. Max limit reached

*Aug  8 12:28:43.067: %OIR-SP-6-CONSOLE: Changing console ownership to switch processor

*** System received a Software forced crash ***

signal= 0x17, code= 0x1500, context= 0xd247660

PC = 0x835948c, Vector = 0x1500, SP = 0x16026a18

System Bootstrap, Version 12.2(33r)SRD5, RELEASE SOFTWARE (fc1)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 2009 by cisco Systems, Inc.

C7600-RSP720/SP platform with 1048576 Kbytes of main memory

Autoboot executing command: "boot bootdisk:c7600rsp72043-ipservicesk9-mz.122-33.SRE1.bin"

Initializing ATA monitor library...

Self extracting the image... [OK]

Self decompressing the image : ########################################################################################################################################################################################################################################################################## [OK]

              Restricted Rights Legend

Use, duplication, or disclosure by the Government is

subject to restrictions as set forth in subparagraph

(c) of the Commercial Computer Software - Restricted

Rights clause at FAR sec. 52.227-19 and subparagraph

(c) (1) (ii) of the Rights in Technical Data and Computer

Software clause at DFARS sec. 252.227-7013.

           cisco Systems, Inc.

           170 West Tasman Drive

           San Jose, California 95134-1706

Cisco IOS Software, c7600rsp72043_sp Software (c7600rsp72043_sp-IPSERVICESK9-M), Version 12.2(33)SRE1, RELEASE SOFTWARE (fc2)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 1986-2010 by Cisco Systems, Inc.

Compiled Mon 29-Mar-10 22:56 by prod_rel_team

Active crashed three times, disabling auto-boot and dropping to rommon

%Software-forced reload

10:30:05 UTC Thu Aug 8 2013: Unexpected exception to CPU: vector 1500, PC = 0x835948C , LR = 0x8359450

-Traceback= 0x835948C 0x8359450 0x8C28ABC 0x8BE6BB8 0x8BE5C08 0x835AFD0 0x83552E8

CPU Register Context:

MSR = 0x00029200  CR  = 0x40000002  CTR = 0x08E24314  XER   = 0x00000000

R0  = 0x08359450  R1  = 0x13010ED0  R2  = 0xFFFCFFFC  R3    = 0x11453958

R4  = 0x00000008  R5  = 0x09EA4F30  R6  = 0x0E712700  R7    = 0x12FB7F20

R8  = 0x00029200  R9  = 0x00000000  R10 = 0x12FB7F20  R11   = 0x0000014A

R12 = 0x000013BC  R13 = 0x04044000  R14 = 0x08BE5BD4  R15   = 0x00000000

R16 = 0x00000000  R17 = 0x00000000  R18 = 0x00000000  R19   = 0x00000000

R20 = 0x00000000  R21 = 0x00000000  R22 = 0x00000000  R23   = 0x00000000

R24 = 0x00000000  R25 = 0x00000000  R26 = 0x0D220000  R27   = 0x00000000

R28 = 0x00000000  R29 = 0x00000007  R30 = 0x00000000  R31   = 0x00000000

------------------ show chunk failures ------------------

------------------ show redundancy states ------------------

File _20130808-103005-UTC Device Error :No such device

1085 Unused bytes of context save space

*Aug  8 10:30:06.939: %SYS-3-LOGGER_FLUSHING: System pausing to ensure console debugging output.

*Aug  8 10:30:05.675: scp assert failure: queue != NULL: ../const/native/scp_const.c: 929

*Aug  8 10:30:05.675: -Traceback= 8176198 85A3A54 85A40F4 8AFA244 835B97C 835BB50 8338FF4 834AAA0 8331954 8AFA2D8 835AFD0 83552E8

*Aug  8 10:30:05.675: %SCHED-7-WATCH: Attempt to monitor uninitialized watched queue (address 0). -Process= "slcp process", ipl= 0, pid= 135

-Traceback= 81745F8 8333068 85A4104 8AFA244 835B97C 835BB50 8338FF4 834AAA0 8331954 8AFA2D8 835AFD0 83552E8

*Aug  8 10:30:06.939: %OIR-6-CONSOLE: Changing console ownership to switch processor

*** System received a Software forced crash ***

signal= 0x17, code= 0x1500, context= 0xd247660

PC = 0x835948c, Vector = 0x1500, SP = 0x13010ed0

System Bootstrap, Version 12.2(33r)SRD5, RELEASE SOFTWARE (fc1)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 2009 by cisco Systems, Inc.

C7600-RSP720/SP platform with 1048576 Kbytes of main memory

rommon 1 >

At the end it gives up and drops back to ROMMON.

How do I know which rommon (SP or RP) I'm in?

How much time do I have to break?

I've tried the same with the 15.1.3-S4 release and the same happens.

I don't know what to do from here.

Should we look for replacement hardware?

Thanks,

I've found another post from one month ago (Jul 3, 2013 8:55) in this forum about a similar crash:

https://supportforums.cisco.com/thread/2226462

%Software-forced reload

12:44:53 UTC Thu Jan 27 2000: Unexpected exception to CPU: vector 1500, PC = 0xAE94130 , LR = 0xAE940F4

-Traceback= 0xAE94130 0xAE940F4 0xAF27184 0xAE85F2C 0xAF41170 0xAF41170 0xAE85FE8 0x98E6168

which is exactly the same exception as in my case:

%Software-forced reload

12:23:51 CEST Thu Aug 8 2013: Unexpected exception to CPU: vector 1500, PC = 0xAE94130 , LR = 0xAE940F4

-Traceback= 0xAE94130 0xAE940F4 0xAF27184 0xAE85F2C 0xAF41170 0xAF41170 0xAE85FE8 0x9973158 0xAE90C30 0xAE8D178

The PC and LR are the same.

This user was also running 12.2(33)SRE1.

These values are, of course, different when booting 15.1.3-S4.
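Comparing crash signatures by eye is error-prone, so here is a minimal Python sketch that extracts the vector, PC and LR from an "Unexpected exception to CPU" line and lists the leading traceback frames two crashes share. The function names are hypothetical, and the parsing assumes the log format quoted in this thread, including the optional trailing "z" on frame addresses.

```python
import re
from typing import List, Tuple

EXC_RE = re.compile(
    r"vector (\w+), PC = (0x[0-9A-Fa-f]+)\s*, LR = (0x[0-9A-Fa-f]+)"
)


def crash_signature(exception_line: str) -> Tuple[str, int, int]:
    """Return (vector, PC, LR) from an 'Unexpected exception to CPU' line."""
    match = EXC_RE.search(exception_line)
    if match is None:
        raise ValueError("not an 'Unexpected exception to CPU' line")
    return match.group(1), int(match.group(2), 16), int(match.group(3), 16)


def traceback_frames(traceback_line: str) -> List[int]:
    """Parse '-Traceback= 0x... 0x...' into integers (a trailing 'z' is dropped)."""
    body = traceback_line.split("=", 1)[1]
    return [int(tok.rstrip("z"), 16) for tok in body.split()]


def shared_prefix(tb_a: str, tb_b: str) -> List[int]:
    """Frames two tracebacks have in common, counted from the faulting frame."""
    shared = []
    for a, b in zip(traceback_frames(tb_a), traceback_frames(tb_b)):
        if a != b:
            break
        shared.append(a)
    return shared


mine = "12:23:51 CEST Thu Aug 8 2013: Unexpected exception to CPU: vector 1500, PC = 0xAE94130 , LR = 0xAE940F4"
other = "12:44:53 UTC Thu Jan 27 2000: Unexpected exception to CPU: vector 1500, PC = 0xAE94130 , LR = 0xAE940F4"
tb_mine = "-Traceback= 0xAE94130 0xAE940F4 0xAF27184 0xAE85F2C 0xAF41170 0xAF41170 0xAE85FE8 0x9973158 0xAE90C30 0xAE8D178"
tb_other = "-Traceback= 0xAE94130 0xAE940F4 0xAF27184 0xAE85F2C 0xAF41170 0xAF41170 0xAE85FE8 0x98E6168"

print(crash_signature(mine) == crash_signature(other))  # True
print(len(shared_prefix(tb_mine, tb_other)))  # 7
```

On the two crashes quoted here, the signatures match and the first seven frames are identical; only the frames further up the call chain differ.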

I don't know if he managed to solve his issue.

Thanks

Hello,

%ONLINE-SP-6-DNLDFAIL: Module 1, Proc. 1, Runtime image download failed because of scp send failure

This message indicates that the system was unable to download the runtime image to the RSP.


%EARL-SP-2-PATCH_INVOCATION_LIMIT: 10 Recovery patch invocations in the last 30 secs have been attempted. Max limit reached


The EARL-SP-2-PATCH_INVOCATION_LIMIT error message is related to the EARL ASIC. The EARL is a processing chip that handles packets coming to and from the chassis bus or the switch fabric. The "EARL PATCH LIMIT" feature is a recovery mechanism that reloads a card if it fails to take control of the bus 10 times within a 30-second interval (the limit quoted in the log message).
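The limit described above behaves like a sliding-window counter. The Python sketch below illustrates that idea using the numbers from the log message (10 invocations in 30 seconds); the class and method names are hypothetical, and this is not how the EARL recovery code is actually implemented.

```python
from collections import deque


class PatchInvocationLimiter:
    """Illustrative sliding-window counter: reports the limit as reached once
    `limit` recovery attempts fall inside any `window`-second span."""

    def __init__(self, limit: int = 10, window: float = 30.0) -> None:
        self.limit = limit
        self.window = window
        self.events: deque = deque()

    def record(self, now: float) -> bool:
        """Record one recovery attempt at time `now` (seconds).

        Returns True when the maximum number of attempts is reached."""
        self.events.append(now)
        # Discard attempts that are older than the window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        return len(self.events) >= self.limit


limiter = PatchInvocationLimiter()
print([limiter.record(float(t)) for t in range(10)][-1])  # True
```

Ten attempts one second apart trip the limit on the tenth, while attempts spaced five seconds apart never accumulate more than six inside the window.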

Seeing these two messages every time the RSP crashes leads me to conclude that the RSP on this device has gone faulty. There is no option left other than replacing it. Please go ahead and replace the RSP.

Thanks & Regards,

Vignesh R P

OK.

That's really bad news, considering how long this RSP has served us.

Anyway, we're going for a replacement.

I'll update this post when the replacement card is running.

Thanks
