Re: c7609 arbitrary reboot

aleks · ‎07-30-2018

Hello.

Can somebody help me, please?

ROM: System Bootstrap, Version 12.2(17r)S4, RELEASE SOFTWARE (fc1)
BOOTLDR: Cisco IOS Software, c7600s72033_rp Software (c7600s72033_rp-ADVIPSERVICESK9-M), Version 15.1(2)S, RELEASE SOFTWARE (fc1)

METROAGG1 uptime is 3 hours, 23 minutes
Uptime for this control processor is 3 hours, 23 minutes
System returned to ROM by s/w reset at 22:07:14 UTC Mon Feb 27 2012 (SP by bus error at PC 0x4048B1F4, address 0x0)
System restarted at 13:03:39 EEDT Mon Jul 30 2018
System image file is "disk0:c7600s72033-advipservicesk9-mz.151-2.S.bin"
Last reload type: Normal Reload

cisco CISCO7609-S (R7000) processor (revision 1.0) with 983008K/65536K bytes of memory.
Processor board ID FOX1343GPS9
SR71000 CPU at 600MHz, Implementation 1284, Rev 1.2, 512KB L2 Cache
Last reset from s/w reset
2 Virtual Ethernet interfaces
242 Gigabit Ethernet interfaces
8 Ten Gigabit Ethernet interfaces
1917K bytes of non-volatile configuration memory.
8192K bytes of packet buffer memory.

I have had two reboots with the following reasons (the supervisor was an RSP720-3C):

1.EEDT: %C7600_PLATFORM-2-PEER_RESET: RP is being reset by the SP %Software-forced reload

2.EEDT: %CPU_MONITOR-3-PEER_EXCEPTION: CPU_MONITOR peer has failed due to exception , reset by [5/0]

%Software-forced reload

Then I changed the supervisor to a SUP720-3BXL, and it rebooted again with this reason:

3.EEDT: %SYS-SP-6-MEMDUMP: 0x80839F0: 0x1 0x0 0x1000001 0x517415F8

%Software-forced reload

Daniele Giordano · ‎07-30-2018

The RP (Route Processor) is being reset by the SP (Switch Processor).

Please attach the crashinfo files so we can better understand the reason.

Regards.

aleks · ‎07-30-2018

Hello!

_20180728-121918 - %CPU_MONITOR-3-PEER_EXCEPTION

20180728-000112 - %C7600_PLATFORM-2-PEER_RESET

Since the problem came back after I replaced the supervisor, I can think of only a couple of remaining suspects:

- slot?

- chassis?

Thanks.

Daniele Giordano · ‎07-30-2018

Hi, in the first file I see

1700511: Jul 27 07:22:07.624 EEDT: %MAC_MOVE-SP-4-NOTIF: Host e446.da52.4b3c in vlan 172 is flapping between port Gi9/47 and port Gi9/31
1700512: Jul 27 08:50:48.055 EEDT: %MAC_MOVE-SP-4-NOTIF: Host 901b.0eed.0383 in vlan 124 is flapping between port Gi9/27 and port Gi9/32
1700513: Jul 28 00:01:12.411 EEDT: %C7600_PLATFORM-2-PEER_RESET: RP is being reset by the SP

MAC flapping is typically caused by a loop in the network.
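If the full log holds many of these notifications, tallying them per VLAN and port pair shows where to start looking for a loop. A minimal sketch in Python, assuming the message layout is exactly as in the two lines quoted above; the name `flap_counts` is made up for illustration:

```python
import re
from collections import Counter

# The two MAC_MOVE lines quoted from the first crashinfo file.
LOG = """\
1700511: Jul 27 07:22:07.624 EEDT: %MAC_MOVE-SP-4-NOTIF: Host e446.da52.4b3c in vlan 172 is flapping between port Gi9/47 and port Gi9/31
1700512: Jul 27 08:50:48.055 EEDT: %MAC_MOVE-SP-4-NOTIF: Host 901b.0eed.0383 in vlan 124 is flapping between port Gi9/27 and port Gi9/32
"""

MAC_MOVE = re.compile(
    r"%MAC_MOVE-SP-4-NOTIF: Host (?P<mac>\S+) in vlan (?P<vlan>\d+) "
    r"is flapping between port (?P<a>\S+) and port (?P<b>\S+)"
)

def flap_counts(text):
    """Count flap notifications per (vlan, unordered port pair)."""
    counts = Counter()
    for m in MAC_MOVE.finditer(text):
        # Sort the two ports so A<->B and B<->A land in the same bucket.
        pair = tuple(sorted((m["a"], m["b"])))
        counts[(int(m["vlan"]), pair)] += 1
    return counts

for (vlan, ports), n in flap_counts(LOG).items():
    print(f"vlan {vlan}: {ports[0]} <-> {ports[1]} ({n} notification(s))")
```

With only one notification per pair, as here, the tally says little by itself; it becomes useful when the same pair keeps showing up.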

In the second file I see

000093: Jul 28 12:19:18.690 EEDT: %CPU_MONITOR-3-PEER_EXCEPTION: CPU_MONITOR peer has failed due to exception , reset by [5/0]

Similar behaviour is described in bug CSCti22719, although that bug seems to be related to a traffic pattern.

Check the network for loops and try a different IOS release.

Regards.

Leo Laohoo · ‎07-30-2018

I am fairly certain this crash is due to a memory leak, CSCtw80533. (As of 31 July 2018, there are >205 TAC Cases, so this bug is very, very well known.)

The chassis is running very, very old code: 15.1(2)S. This is version "0" (no number after the letter "S").
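The release and the number after the "S" can be read straight off the image file name from the show version output. A rough sketch, assuming the 15.x S-train naming seen in this thread (`.<major><minor>-<maintenance>.S<rebuild>.bin`); `ios_version_from_image` is a hypothetical helper, not a Cisco tool:

```python
import re

def ios_version_from_image(filename):
    """Turn e.g. '...-mz.151-2.S.bin' into ('15.1(2)S', 0).

    Assumes a two-digit major, one-digit minor, '-<maintenance>',
    then '.S' followed by an optional rebuild number.
    """
    m = re.search(r"\.(\d{2})(\d)-(\d+)\.S(\d*)\.bin$", filename)
    if m is None:
        raise ValueError(f"unrecognised image name: {filename}")
    major, minor, maint, rebuild = m.groups()
    version = f"{major}.{minor}({maint})S{rebuild}"
    # No digits after the 'S' is treated as rebuild 0.
    return version, (int(rebuild) if rebuild else 0)

print(ios_version_from_image("disk0:c7600s72033-advipservicesk9-mz.151-2.S.bin"))
```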

1700512: Jul 27 08:50:48.055 EEDT: %MAC_MOVE-SP-4-NOTIF: Host 901b.0eed.0383 in vlan 124 is flapping between port Gi9/27 and port Gi9/32
1700513: Jul 28 00:01:12.411 EEDT: %C7600_PLATFORM-2-PEER_RESET: RP is being reset by the SP

Look at the time and date: about 15 hours pass between the last MAC_MOVE entry and the reset, with no log entries in between. It looks like a silent leak.
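The quiet period can be worked out from the two timestamps quoted above. A small sketch, assuming the year is 2018 (the log lines carry no year) and an English month abbreviation; `parse_ts` is an illustrative name:

```python
from datetime import datetime

LAST_ENTRY = "1700512: Jul 27 08:50:48.055 EEDT: %MAC_MOVE-SP-4-NOTIF: Host 901b.0eed.0383 in vlan 124 is flapping between port Gi9/27 and port Gi9/32"
RESET = "1700513: Jul 28 00:01:12.411 EEDT: %C7600_PLATFORM-2-PEER_RESET: RP is being reset by the SP"

def parse_ts(line, year=2018):
    """Extract the timestamp from a sequence-numbered log line."""
    # Layout assumed: '<seq>: <Mon> <DD> <HH:MM:SS.mmm> <TZ>: %FACILITY-...'
    _, month, day, clock = line.split()[:4]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S.%f")

gap = parse_ts(RESET) - parse_ts(LAST_ENTRY)
print(gap)  # 15:10:24.356000
```

Both lines use the same time zone (EEDT), so it can be ignored for the subtraction.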

My recommendation is to upgrade the firmware of the chassis to something more recent.

aleks · ‎07-30-2018

Thanks for your help!

aleks · ‎08-13-2018

Hello!

I upgraded the software to c7600rsp72043-adventerprisek9-mz.155-3.S4.bin. The system then ran for about 2 weeks and rebooted again.

016252: Aug 13 22:29:46.192 EEDT: %PFREDUN-SP-6-ACTIVE: Standby processor removed or reloaded, changing to Simplex mode
016253: Aug 13 22:29:46.192 EEDT: %OIR-SP-3-PWRCYCLE: Card in module 6, is being power-cycled (Module reset)
016254: Aug 13 22:29:48.344 EEDT: %SNMP-5-MODULETRAP: Module 6 [Down] Trap
016255: Aug 13 22:30:23.404 EEDT: %C7600_PLATFORM-2-PEER_RESET: RP is being reset by the SP

%Software-forced reload

Any ideas?

Leo Laohoo · ‎08-13-2018

Post the latest crashinfo file.

aleks · ‎08-13-2018

crashinfo_Active sup - from active sup

crashinfo_STANDBY HOT - from standby hot sup

Thanks for help.

Leo Laohoo · ‎08-14-2018

000364: *Aug 13 23:44:00.715 EEDT: %CONST_DIAG-SP-3-HM_TEST_FAIL: Module 6 TestSPRPInbandPing consecutive failure count:26
000365: *Aug 13 23:44:00.715 EEDT: %CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=8% RP=2% Traffic=1%
netint_thr_active[0], Tx_Rate[2301], Rx_Rate[6688], dev=1[IPv4, fail=1], 2[IPv4, fail=10], 3[IPv4, fail=21]
000366: *Aug 13 23:44:00.715 EEDT: %CONST_DIAG-SP-4-HM_TEST_WARNING: Sup switchover will occur after 10 consecutive failures

Raise a TAC case. Whatever is in module 6 is causing the supervisor card to reload.
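To pull the failing module, the test name, and the CPU load reported at the same moment out of a longer crashinfo, something like the following can help. It is only a sketch that assumes the message layout matches the lines quoted above; `summarize` is a made-up name:

```python
import re

# HM_TEST_FAIL and HM_TEST_INFO lines quoted from the crashinfo above.
LOG = """\
000364: *Aug 13 23:44:00.715 EEDT: %CONST_DIAG-SP-3-HM_TEST_FAIL: Module 6 TestSPRPInbandPing consecutive failure count:26
000365: *Aug 13 23:44:00.715 EEDT: %CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=8% RP=2% Traffic=1%
"""

FAIL = re.compile(r"HM_TEST_FAIL: Module (\d+) (\S+) consecutive failure count:(\d+)")
CPU = re.compile(r"CPU util\(5sec\): SP=(\d+)% RP=(\d+)% Traffic=(\d+)%")

def summarize(text):
    """Return the failing module/test and the CPU load logged alongside it."""
    fail = FAIL.search(text)
    cpu = CPU.search(text)
    if fail is None or cpu is None:
        raise ValueError("expected HM_TEST_FAIL and HM_TEST_INFO lines")
    return {
        "module": int(fail[1]),
        "test": fail[2],
        "consecutive_failures": int(fail[3]),
        "sp_cpu_pct": int(cpu[1]),
        "rp_cpu_pct": int(cpu[2]),
        "traffic_pct": int(cpu[3]),
    }

print(summarize(LOG))
```

For this excerpt it reports module 6 failing TestSPRPInbandPing 26 times in a row while SP and RP CPU are at 8% and 2%.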

aleks · ‎08-14-2018

Module 6 is the second supervisor, which runs as STANDBY HOT.

Leo Laohoo · ‎08-14-2018

Eject module 6 and see if the system is stable.

aleks · ‎08-22-2018

Hello!

I ejected module 6. The system ran for 10 days and then rebooted again :(

Aug 23 09:05:39.193 EEDT: %SYS-SP-3-OVERRUN: Block overrun at 3C06A7D0 (red zone 04657188)
-Traceback= 81E8CD0z 83DC5D0z 83DD5D0z 83DD900z 83DDAE0z 840F648z 840A854z
Aug 23 09:05:39.193 EEDT: %SYS-SP-6-MTRACE: mallocfree: addr, pc
67E7F60,83E5554 67E7F60,83E55D8 67E7F60,30000084 67E7F60,83E5554
4403F00,6000001E 4403E10,83E5554 4403E10,83E55D8 4403E10,40000060
Aug 23 09:05:39.193 EEDT: %SYS-SP-6-MTRACE: mallocfree: addr, pc
67E7F60,83E55D8 67E7F60,30000084 67E7F60,83E5554 4403F00,6000001E
4403E10,83E5554 4403E10,83E55D8 4403E10,40000060 67E7F60,83E55D8
Aug 23 09:05:39.193 EEDT: %SYS-SP-6-BLKINFO: Corrupted redzone blk 3C06A7D0, words 136, alloc 824C83C, InUse, dealloc 0, rfcnt 1
-Traceback= 81E8CD0z 83B797Cz 83DC5E8z 83DD5D0z 83DD900z 83DDAE0z 840F648z 840A854z
Aug 23 09:05:39.193 EEDT: %SYS-SP-6-MEMDUMP: 0x3C06A7D0: 0xAB1234CD 0xFFFE0000 0x0 0xA3C94DC
Aug 23 09:05:39.193 EEDT: %SYS-SP-6-MEMDUMP: 0x3C06A7E0: 0x824C83C 0x3C06A910 0x3C06A6A4 0x80000088
Aug 23 09:05:39.193 EEDT: %SYS-SP-6-MEMDUMP: 0x3C06A7F0: 0x1 0xFFCA11F6 0x1000001 0x113A1654

%Software-forced reload

Aug 23 09:05:39.225 EEDT: %DIAG-SP-3-NO_DIAG_RUNNING: Module 5: Diagnostic is not running

09:05:39 EEDT Thu Aug 23 2018: Unexpected exception to CPU: vector 1500, PC = 0x840EEAC , LR = 0x840EE44

-Traceback= 0x840EEACz 0x840EE44z 0x83DD5D0z 0x83DD900z 0x83DDAE0z 0x840F648z 0x840A854z

Can it still be the fault of the chassis?

Unfortunately we do not have Cisco TAC support.
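Without TAC, one sanity check that can be done by hand is to compare the BLKINFO line with the MEMDUMP words that follow it. The sketch below does that in Python. The word positions used (4, 7, 8) are not taken from Cisco documentation; they are simply where the values from the BLKINFO line happen to appear in this particular dump, so treat the field meanings as an assumption:

```python
import re

LOG = """\
Aug 23 09:05:39.193 EEDT: %SYS-SP-6-BLKINFO: Corrupted redzone blk 3C06A7D0, words 136, alloc 824C83C, InUse, dealloc 0, rfcnt 1
Aug 23 09:05:39.193 EEDT: %SYS-SP-6-MEMDUMP: 0x3C06A7D0: 0xAB1234CD 0xFFFE0000 0x0 0xA3C94DC
Aug 23 09:05:39.193 EEDT: %SYS-SP-6-MEMDUMP: 0x3C06A7E0: 0x824C83C 0x3C06A910 0x3C06A6A4 0x80000088
Aug 23 09:05:39.193 EEDT: %SYS-SP-6-MEMDUMP: 0x3C06A7F0: 0x1 0xFFCA11F6 0x1000001 0x113A1654
"""

def dump_words(text):
    """Flatten all MEMDUMP rows into one list of integers, in address order."""
    words = []
    for line in text.splitlines():
        if "MEMDUMP:" in line:
            payload = line.split("MEMDUMP:", 1)[1]
            # First token is the row address ('0x3C06A7D0:'); the rest are data.
            words += [int(tok, 16) for tok in payload.split()[1:]]
    return words

def blkinfo(text):
    """Parse the fields of the BLKINFO line."""
    m = re.search(
        r"BLKINFO: Corrupted redzone blk (\w+), words (\d+), alloc (\w+), "
        r"(\w+), dealloc (\w+), rfcnt (\d+)",
        text,
    )
    return {"blk": int(m[1], 16), "words": int(m[2]), "alloc": int(m[3], 16),
            "state": m[4], "rfcnt": int(m[6])}

words = dump_words(LOG)
info = blkinfo(LOG)

# Assumed positions, inferred from this dump only.
checks = {
    "alloc PC matches word 4": words[4] == info["alloc"],
    "low 31 bits of word 7 match 'words'": (words[7] & 0x7FFFFFFF) == info["words"],
    "top bit of word 7 set while InUse": bool(words[7] >> 31) == (info["state"] == "InUse"),
    "rfcnt matches word 8": words[8] == info["rfcnt"],
}
for name, ok in checks.items():
    print(f"{'OK ' if ok else 'BAD'} {name}")
```

All four checks pass for this dump, so the start of the block reads consistently with the BLKINFO line; the message itself reports the corruption in the red zone.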

Leo Laohoo · ‎08-23-2018

@aleks wrote:

%Software-forced reload

Looks like a software bug.

Eject whatever is in slot 6 and leave it out. Let's see if there are any more issues over the next, say, 30 days.