Cisco 6509 with dual Sup720s: random ROMMON boots

letsrollusa11
Level 1

Looking for some support here. We have a lingering, year-long issue that has spanned about eight supervisor cards and a complete chassis swap. The 6509 acts as an ITN in our facility. At random points of traffic load, the active sup card boots into ROMMON mode, which seriously impacts our operations. I'm able to move the two fiber pairs from the active supervisor card over to the secondary, and usually that works for another random stretch of time. Today, however, it occurred within minutes of hooking up the fiber links: after sitting there for about five minutes, the sup booted into ROMMON. When this happens, I'm able to boot the sup card back to a good state. Previous remedial actions, other than replacing sup cards and the chassis, included checking the configuration register and making sure it was 0x2102. Previously it was not; we corrected it, reloaded, and the setting took. We thought that would fix the problem, until today. Please let me know what you see in these log files. Thank you.
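
For reference, here is a minimal sketch of the commands typically used to verify and correct the configuration register, and to recover a supervisor that is already sitting at ROMMON. The hostname is a placeholder, and exact prompts and output vary by supervisor and IOS release.

! Verify the current configuration register (0x2102 is the normal setting)
Router# show version | include register
! Correct it if needed, then save the configuration
Router# configure terminal
Router(config)# config-register 0x2102
Router(config)# end
Router# write memory
! If the supervisor is already at the ROMMON prompt, this is the usual recovery path
rommon 1 > confreg 0x2102
rommon 2 > boot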

The point at which I believe the first supervisor card went into ROMMON:

6668: 12/28/12 15:27:58    5/-1: MAJ, GOLD, loopback_unused_port[1/43]: check_ether_packet failed

6669: 12/28/12 15:27:58    5/-1: MAJ, GOLD, check_ether_packet_core [1/43]: diag_hit_sys_limit returns TRUE! The current mod/port test may be skipped.

6670: 12/28/12 15:27:58    5/-1: MAJ, GOLD, diag_check_fabchan_flow_control_status_ok [1/43]: fabric buffer rate low in lc_chan 9 with flow control status in_lo 1 in_hi 0 out_lo 0 out_hi 0! Flow control is occuring  in the system. Please reduc

6671: 12/28/12 15:27:58    5/-1: MAJ, GOLD, diag_check_fabchan_flow_control_status_ok [1/43]: fabric buffer rate low in sup_chan 19 with flow control status in_lo 1 in_hi 0 out_lo 0 out_hi 0! Flow control is occuring  in the system. Please red

6672: 12/28/12 15:27:58    5/-1: MAJ, GOLD, DIAG_SYS_LIMIT (sp_last_busy_percent, tx_rate, rx_rate, sp_netint_thr) NOT HIT

6673: 12/28/12 15:27:58    5/-1: MAJ, GOLD, check_ether_packet_core[1/-1/43]: newpak is NULL!

6674: 12/28/12 15:27:57    5/-1: MAJ, GOLD, loopback_unused_port[1/43]: check_ether_packet failed

6675: 12/28/12 15:27:57    5/-1: MAJ, GOLD, check_ether_packet_core [1/43]: diag_hit_sys_limit returns TRUE! The current mod/port test may be skipped.

6676: 12/28/12 15:27:57    5/-1: MAJ, GOLD, diag_check_fabchan_flow_control_status_ok [1/43]: fabric buffer rate low in lc_chan 9 with flow control status in_lo 1 in_hi 0 out_lo 0 out_hi 0! Flow control is occuring  in the system. Please reduc

6677: 12/28/12 15:27:57    5/-1: MAJ, GOLD, diag_check_fabchan_flow_control_status_ok [1/43]: fabric buffer rate low in sup_chan 19 with flow control status in_lo 1 in_hi 0 out_lo 0 out_hi 0! Flow control is occuring  in the system. Please red

6678: 12/28/12 15:27:57    5/-1: MAJ, GOLD, DIAG_SYS_LIMIT (sp_last_busy_percent, tx_rate, rx_rate, sp_netint_thr) NOT HIT

6679: 12/28/12 15:27:57    5/-1: MAJ, GOLD, check_ether_packet_core[1/-1/43]: newpak is NULL!

6680: 12/28/12 15:27:55    5/-1: MAJ, GOLD, diag_get_fabric_status[8]: diag_hit_sys_limit: Test skipped!

6681: 12/28/12 15:27:55    5/-1: MAJ, GOLD, diag_check_fabchan_flow_control_status_ok [8/1]: fabric buffer rate low in lc_chan 16 with flow control status in_lo 1 in_hi 0 out_lo 0 out_hi 0! Flow control is occuring  in the system. Please reduc

6682: 12/28/12 15:27:55    5/-1: MAJ, GOLD, diag_check_fabchan_flow_control_status_ok [8/1]: fabric buffer rate low in sup_chan 19 with flow control status in_lo 1 in_hi 0 out_lo 0 out_hi 0! Flow control is occuring  in the system. Please redu

6683: 12/28/12 15:27:55    5/-1: MAJ, GOLD, DIAG_SYS_LIMIT (sp_last_busy_percent, tx_rate, rx_rate, sp_netint_thr) NOT HIT

6684: 12/28/12 15:27:55    5/-1: MAJ, GOLD, loopback_internal_port[8/6]: check_bpdu_packet failed (err-code 0)

6685: 12/28/12 15:27:55    5/-1: MAJ, GOLD, check_bpdu_packet_core[8/6]: diag_hit_sys_limit returns TRUE! The current mod/port test may be skipped.

6686: 12/28/12 15:27:55    5/-1: MAJ, GOLD, diag_check_fabchan_flow_control_status_ok [8/6]: fabric buffer rate low in lc_chan 16 with flow control status in_lo 1 in_hi 0 out_lo 0 out_hi 0! Flow control is occuring  in the system. Please reduc

6687: 12/28/12 15:27:55    5/-1: MAJ, GOLD, diag_check_fabchan_flow_control_status_ok [8/6]: fabric buffer rate low in sup_chan 19 with flow control status in_lo 1 in_hi 0 out_lo 0 out_hi 0! Flow control is occuring  in the system. Please redu

6688: 12/28/12 15:27:55    5/-1: MAJ, GOLD, DIAG_SYS_LIMIT (sp_last_busy_percent, tx_rate, rx_rate, sp_netint_thr) NOT HIT

6689: 12/28/12 15:27:55    5/-1: MAJ, GOLD, check_bpdu_packet_core[8/6]: newpak is NULL!

6690: 12/28/12 15:27:54    5/-1: MAJ, GOLD, loopback_internal_port[8/6]: check_bpdu_packet failed (err-code 0)

6691: 12/28/12 15:27:54    5/-1: MAJ, GOLD, check_bpdu_packet_core[8/6]: diag_hit_sys_limit returns TRUE! The current mod/port test may be skipped.

6692: 12/28/12 15:27:54    5/-1: MAJ, GOLD, diag_check_fabchan_flow_control_status_ok [8/6]: fabric buffer rate low in lc_chan 16 with flow control status in_lo 1 in_hi 0 out_lo 0 out_hi 0! Flow control is occuring  in the system. Please reduc

6693: 12/28/12 15:27:54    5/-1: MAJ, GOLD, diag_check_fabchan_flow_control_status_ok [8/6]: fabric buffer rate low in sup_chan 19 with flow control status in_lo 1 in_hi 0 out_lo 0 out_hi 0! Flow control is occuring  in the system. Please redu

6694: 12/28/12 15:27:54    5/-1: MAJ, GOLD, check_bpdu_packet_core[8/6]: newpak is NULL!

6695: 12/28/12 15:09:39    5/-1: MAJ, GOLD, diag_get_port_tx_count_changed: Tx count is unchanged for mod=4 port=8 count=0 prev_count=0

6696: 12/28/12 15:09:39    5/-1: MAJ, GOLD, diag_get_port_tx_count_changed: Tx count is unchanged for mod=4 port=7 count=0 prev_count=0

6697: 12/28/12 15:09:39    5/-1: MAJ, GOLD, diag_get_port_tx_count_changed: Tx count is unchanged for mod=4 port=6 count=0 prev_count=0

6698: 12/28/12 15:09:39    5/-1: MAJ, GOLD, diag_get_port_tx_count_changed: Tx count is unchanged for mod=4 port=5 count=0 prev_count=0

6699: 12/28/12 15:09:39    5/-1: MAJ, GOLD, diag_get_port_tx_count_changed: Tx count is unchanged for mod=4 port=4 count=0 prev_count=0

6700: 12/28/12 15:09:39    5/-1: MAJ, GOLD, diag_get_port_tx_count_changed: Tx count is unchanged for mod=4 port=3 count=0 prev_count=0

5 Replies

Leo Laohoo
Hall of Fame

Are there any crashinfo files? If there are, please post the latest three.

If you've done a complete chassis and supervisor swap, then the only suspect left is the IOS.
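
For anyone following along, here is a sketch of how crashinfo files are usually located and pulled off a Sup720. The filesystem names and the example filename are assumptions; adjust them to whatever "show file systems" and "dir" actually list on the box.

! List the filesystems available on the active and standby supervisors
Router# show file systems
Router# dir bootflash:
Router# dir sup-bootdisk:
Router# dir slavebootflash:
! Copy the newest crashinfo_* file off-box for review (the filename below is only an example)
Router# copy bootflash:crashinfo_20121228-154823 tftp: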

It's going to be a couple of days until I can get the crashinfo, but I do have some info from another log.

NVRAM log:

01. 10/17/2012 15:47:15: pf_redun_negotiate:Initialized as ACTIVE processor

02. 10/17/2012 15:54:34: pf_redun_negotiate:Initialized as ACTIVE processor

03. 10/17/2012 15:57:40: oir_reload_chassis:Reload Chassis requested

04. 10/17/2012 16:00:08: pf_redun_negotiate:Initialized as ACTIVE processor

05. 10/27/2012 03:10:23: pf_redun_negotiate:Initialized as ACTIVE processor

06. 10/27/2012 06:17:05: pf_redun_announce_standby_init:Initialized as STANDBY processor

07. 10/27/2012 06:41:00: pf_redun_announce_standby_init:Initialized as STANDBY processor

08. 12/03/2012 00:06:52: pf_redun_track_switchover:STANDBY Supervisor switching-over to ACTIVE

09. 12/03/2012 00:06:52: oir_reload_peer:Card in slot 6, is being power-cycled

10. 12/03/2012 00:10:57: oir_reload_peer:Card in slot 6, is being power-cycled

11. 12/04/2012 11:56:22: oir_reload_peer:Card in slot 6, is being power-cycled

12. 12/15/2012 11:02:38: oir_cycle_bay:Card in module 7, is being power-cycled - off (Diagnostic Failure) (22)

13. 12/15/2012 11:06:19: oir_cycle_bay:Card in module 7, is being power-cycled - off (Diagnostic Failure) (22)

14. 12/20/2012 21:17:40: oir_reload_chassis:Reload Chassis requested

15. 12/21/2012 05:05:18: pf_redun_negotiate:Initialized as ACTIVE processor

16. 12/21/2012 00:28:08: oir_cycle_bay:Card in module 7, is being power-cycled - off (Diagnostic Failure) (22)

17. 12/21/2012 00:31:36: oir_cycle_bay:Card in module 7, is being power-cycled - off (Diagnostic Failure) (22)

18. 12/28/2012 15:48:17: nvlog_crash_hook:System crashed on slot 5. Check value of Rommon variable CRASHINFO

19. 12/29/2012 00:15:58: pf_redun_announce_standby_init:Initialized as STANDBY processor

20. 12/28/2012 17:03:11: pf_redun_track_switchover:STANDBY Supervisor switching-over to ACTIVE

21. 12/28/2012 17:03:11: oir_reload_peer:Card in slot 6, is being power-cycled

22. 12/28/2012 17:07:16: oir_reload_peer:Card in slot 6, is being power-cycled

23. 12/28/2012 17:13:36: fabric_error_handle:recurring errors: 1(ch1,side 6), 4(ch2,side 4) for this slot
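
As a side note, the switchover and power-cycle entries above can usually be correlated with the redundancy state reported on the box. A minimal sketch, assuming a standard SSO setup (command availability varies by release):

! Current redundancy mode and state of the two supervisors
Router# show redundancy states
! Summary of recent supervisor switchovers and the reported causes
Router# show redundancy switchover history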

Version Info:

------------------ show version ------------------

Cisco IOS Software, s72033_rp Software (s72033_rp-ADVIPSERVICESK9_WAN-M), Version 12.2(33)SXI7, RELEASE SOFTWARE (fc1)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 1986-2011 by Cisco Systems, Inc.

Compiled Mon 18-Jul-11 05:49 by prod_rel_team

ROM: System Bootstrap, Version 12.2(17r)SX7, RELEASE SOFTWARE (fc1)

ITN0243 uptime is 1 week, 21 hours, 8 minutes

Uptime for this control processor is 1 hour, 57 minutes

Time since ITN0243 switched to active is 1 hour, 24 minutes

System returned to ROM by Stateful Switchover at 15:48:23 PST Fri Dec 28 2012 (SP by bus error at PC 0x417CA66C, address 0x0)

System restarted at 16:31:40 PST Fri Dec 28 2012

System image file is "sup-bootdisk:s72033-advipservicesk9_wan-mz.122-33.SXI7.bin"

Last reload reason: reload

063941: Dec 28 15:38:48.641 PST: %CONST_DIAG-SP-3-HM_TEST_FAIL: Module 5 TestSPRPInbandPing consecutive failure count:5

063942: Dec 28 15:38:48.641 PST: %CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=42% RP=5% Traffic=0%

netint_thr_active[0], Tx_Rate[33942], Rx_Rate[1276], dev=1[IPv4, fail=5], 2[IPv4, fail=5], 3[IPv4, fail=5], 4[IPv6, fail=5]

063943: Dec 28 15:38:48.641 PST: %CONST_DIAG-SP-4-HM_TEST_WARNING: Sup switchover will occur after 10 consecutive failures

063944: Dec 28 15:40:32.841 PST: %SVCLC-5-FWTRUNK: Firewalled VLANs configured on trunks

063945: Dec 28 15:41:40.946 PST: %PIM-5-DRCHG: DR change from neighbor 0.0.0.0 to 132.33.195.2 on interface Vlan195

063946: Dec 28 15:42:21.174 PST: %SVCLC-5-FWTRUNK: Firewalled VLANs configured on trunks

063947: Dec 28 15:43:21.019 PST: %PIM-5-DRCHG: DR change from neighbor 0.0.0.0 to 132.33.195.2 on interface Vlan195

063948: Dec 28 15:43:28.003 PST: %CONST_DIAG-SP-3-HM_TEST_FAIL: Module 5 TestSPRPInbandPing consecutive failure count:10

063949: Dec 28 15:43:28.011 PST: %CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=37% RP=7% Traffic=0%

netint_thr_active[0], Tx_Rate[36545], Rx_Rate[6618], dev=1[IPv4, fail=10], 2[IPv4, fail=10], 3[IPv4, fail=10], 4[IPv6, fail=10]

063950: Dec 28 15:43:28.011 PST: %CONST_DIAG-SP-4-HM_TEST_WARNING: Sup switchover will occur after 10 consecutive failures
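
The TestSPRPInbandPing failures above come from GOLD health monitoring. A sketch of the commands usually used to dig into those results (keywords may differ slightly by release):

! Per-test results for the supervisor in slot 5
Router# show diagnostic result module 5 detail
! GOLD event history for that module, errors only
Router# show diagnostic events module 5 event-type error
! Which diagnostic tests are configured and scheduled on the module
Router# show diagnostic content module 5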

The crashinfo file you attached reads like a sad country-western song. Is this a FWSM/ASA? It's causing issues with your entire chassis.

I've got a strange feeling you have a backplane issue.

After a little digging: the FWSM is the Firewall Services Module, one of the blades in the 6509.

12. 12/15/2012 11:02:38: oir_cycle_bay:Card in module 7, is being power-cycled - off (Diagnostic Failure) (22)

13. 12/15/2012 11:06:19: oir_cycle_bay:Card in module 7, is being power-cycled - off (Diagnostic Failure) (22)

And this looks like something that is affecting the entire chassis' backplane.
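
To follow up on that suspicion, here is a sketch of the commands commonly used to check module and switch-fabric health (the FWSM sits in slot 7 here, per the NVRAM log above; keywords vary by release):

! Overall module inventory and status (the FWSM should show Ok rather than PwrDown or Faulty)
Router# show module
! Diagnostic results for the module that kept getting power-cycled
Router# show diagnostic result module 7
! Switch-fabric channel status and error counters
Router# show fabric status
Router# show fabric errors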
