I have seen this issue in all its various forms, whether relating to the SFPs, the stacking modules, or the switch itself. I was recently upgrading all of our 2960X stacks to remediate a security vulnerability. I successfully upgraded 49 stacks at various sites without incident; I am unsure how many switches that represents, but at least 150.

Then I moved on to a site in Europe with 22 switches in very small stacks of 1 - 3 switches. I loaded them all with 15.2(4)E7, and before my maintenance window came, one of the stacks of 2 switches reloaded due to a power failure. On one of them I received the "POST: ACT2 Authentication : End, Status Failed" boot message followed by the "%ILET-1-AUTHENTICATION_FAIL: This Switch may not have been manufactured by Cisco" IOS message, while the other one upgraded successfully. So they grabbed a brand new spare off the shelf and I did an IOS upgrade on that before adding it to the stack. It also failed with the same error. A second spare upgraded successfully.

I tried everything I could think of to resurrect the 2 failed switches: the published power cycle trick, a factory reset, and various IOS images including the original one, all with no success. Finally I called TAC and had the 2 failed switches RMA'ed. Through some mixup they sent 4 switches. One of those was DOA.

Since then I have upgraded 10 additional switches one stack at a time and have had 2 more failures. I have 10 more to go, but we need to RMA some more switches first. I need to be sure that I have enough switches on hand to cover a failure of every stack member. Luckily, at this site the most switches in any one stack is 3. My biggest fear is an extended power outage at the plant: they will all reload, and how many will survive is anyone's guess.

It would appear that once this occurs there is no software remedy, so by definition this is a hardware failure of some sort. I have to believe Cisco knows what causes the failure and what serial numbers might be affected, but I have never seen this published. My theory is that there is such a huge number of possibly defective switches that they are just not willing to proactively replace them all. So that means I have to slog through these upgrades one at a time and hope for the best.

-Jeff
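P.S. For reference, the upgrades themselves were done the usual way; a minimal sketch, assuming a reachable TFTP server (the server address is a placeholder, and I am inferring the E7 tar file name from Cisco's normal naming):

archive download-sw /overwrite tftp://10.0.0.5/c2960x-universalk9-tar.152-4.E7.tar

In my case the error only ever surfaced at a reload after the upgrade, never during the transfer itself.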
I changed the address on a pair of 5508's in HA mode this weekend. I was unable to find a detailed procedure to do this so I thought I would post it here. These are running 184.108.40.206.

First, I changed the primary controller for all of the access points to my new address using Prime. If you do not have Prime you would need to issue this command for each AP:

config ap primary-base <wlc name> <wlc address>

Or from the GUI, select an AP, click on the High Availability tab, and enter the WLC name and new address as the primary address. You could also use your current address as primary and the new address as secondary. I also enabled SSH for the APs globally so I could connect to them remotely if I had problems getting them to join on the new address. Luckily I didn't.

Procedure

PRIMARY: Disable WLANs
config wlan disable all

PRIMARY: Disable HA mode (controllers will reboot)
config redundancy mode disable

STANDBY: Change the management IP address
config interface address management 10.0.0.19 255.255.255.240 10.0.0.17

STANDBY: Change the VLAN assignment on the management interface
config interface vlan management 23

STANDBY: Change the redundancy management address
config interface address redundancy-management 10.0.0.21 peer-redundancy-management 10.0.0.20

STANDBY: Enable all ports
config port adminmode all enable

PRIMARY: Change the management IP address
config interface address management 10.0.0.18 255.255.255.240 10.0.0.17

PRIMARY: Change the VLAN assignment on the management interface
config interface vlan management 23

PRIMARY: Change the redundancy management address
config interface address redundancy-management 10.0.0.20 peer-redundancy-management 10.0.0.21

PRIMARY & STANDBY: Enable HA mode (controllers will reboot). Issue on the primary first, then the standby. No need to wait for the primary to complete bootup before issuing on the standby.
config redundancy mode sso

PRIMARY: Enable WLANs
config wlan enable all

Verify

(WLC1) >show interface summary

Number of Interfaces.......................... 5

Interface Name                   Port Vlan Id  IP Address      Type    Ap Mgr Guest
-------------------------------- ---- -------- --------------- ------- ------ -----
management                       LAG  23       10.0.0.18       Static  Yes    No
redundancy-management            LAG  23       10.0.0.20       Static  No     No
redundancy-port                  -    untagged 169.254.0.20    Static  No     No
service-port                     N/A  N/A      0.0.0.0         Static  No     No
virtual                          N/A  N/A      220.127.116.11  Static  No     No

(WLC1) >show redundancy summary
Redundancy Mode = SSO ENABLED
Local State = ACTIVE
Peer State = STANDBY HOT
Unit = Primary
Unit ID = 4C:00:82:71:E6:40
Redundancy State = SSO
Mobility MAC = 4C:00:82:71:E6:40
BulkSync Status = Complete
Average Redundancy Peer Reachability Latency = 428 Micro Seconds
Average Management Gateway Reachability Latency = 2099 Micro Seconds

Don't forget to change the network device address for the WLC in ISE. After I did this it still would not authenticate wireless users. I was getting this error for everything in the live log:

5441 Endpoint started new session while the packet of previous session is being processed. Dropping new session.

I had seen a similar problem in the past, though I can't remember what caused it. I restarted ISE and authentications started working again. I think there may be a command to clear the cache so that a restart isn't necessary, but I am not sure what that is.

So just thought this might help someone. I invite and welcome any improvements to this procedure.

-Jeff
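P.S. One verification step that may be worth adding (a suggestion, not something I ran at the time): after pushing the new primary address, spot-check an AP from the WLC CLI and confirm that the Primary Cisco Switch IP Address field shows the new management address:

(WLC1) >show ap config general <ap name>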
I just upgraded a pair of 4500-X VSS switches at a remote site from 3.5.2 to 3.8.7. In the process I upgraded rommon from 15.0(1r)SG(10) to 15.0(1r)SG(15). I configured the bootvar thusly:

BOOT variable = bootflash:firmwareupgradeallK10-150_1r_SG15.SPA,1;bootflash:cat4500e-universalk9.SPA.03.08.07.E.152-4.E7.bin,1;

Then I performed a "redundancy reload shelf". After about 45 minutes I called the site and was told that all the LEDs on the front of both switches were dark, while the PSU LEDs were flashing green. I had him pull the power cords from both supplies on one of the switches and reconnect them. Everything came up normally with the new rommon and new IOS. I suspect that the switches just powered themselves off after the rommon upgrade. So just beware that this may happen to you.
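For anyone repeating this, the bootvar above was built with ordinary boot system statements before the reload; a sketch from memory using my file names (yours will differ):

Switch(config)# no boot system
Switch(config)# boot system flash bootflash:firmwareupgradeallK10-150_1r_SG15.SPA
Switch(config)# boot system flash bootflash:cat4500e-universalk9.SPA.03.08.07.E.152-4.E7.bin
Switch(config)# end
Switch# write memory
Switch# redundancy reload shelf

The promupgrade image listed first runs the rommon upgrade at boot, then the switch falls through to the real IOS image.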
Time to get off that 15.0 train. It may happen when you reload after updating the IOS to a new version. It seems to me that the bug manifests based on the original version you are upgrading from, not the new version. That would explain why a power cycle is needed to clear it, too.
I just discovered this post which sheds some light on this condition.
I sure wish the link to the whitepaper still worked.
Here is some detail from the whitepaper.
Cisco IOS XE Software Release 3.6.0E - 15.2(2)E contains infrastructure changes in the Cisco Catalyst 4500-E switch software. Because of these changes, high-availability (HA) synchronization cannot be done between the supervisor engines loaded with Cisco IOS XE Software Release 3.6.0E - 15.2(2)E and any image version prior to Cisco IOS XE Software Release 3.6.0E - 15.2(2)E.
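In practical terms this means that an active supervisor on 03.02.02.SG and a standby booting 3.6.0E or later can never sync, which matches exactly what I was seeing. A quick pre-check before staggering supervisor upgrades (standard commands, nothing exotic):

show module
show redundancy states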
Unfortunately the whitepaper link does not work. It appears to have been taken down. Does anyone have access to this document? The quoted excerpt is some very important information that I do not see reflected in the release notes.
OK, progress. I reset the standby supervisor using "redundancy reload peer", then interrupted the boot process using Ctrl-C from the standby console and put it into rommon. With the standby held in rommon, the configuration lock on the active was released, so I was able to access config mode and remove the "ntp clock period" statement. Removing this allowed me to boot 3.2.2SG into SSO mode.
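For reference, the statement actually appears in the config as "ntp clock-period <value>" (the value is auto-generated and varies per switch), and the removal was just the standard negation:

Switch(config)# no ntp clock-period
Switch(config)# end
Switch# write memory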
So now that it was recovered, I upgraded the standby rommon to 15.0(1r)SG15 and then booted it with the 3.8.6E image. It still never establishes communication with the active supervisor. On the standby console, this is the last sequence of messages that I see:
Exiting to ios...
Loading gsbu64atomic as gdb64atomic
Loading isp1362_hcd_k10
Using 6 for MTS slot
Platform Manager: starting in standalone mode (standby)
And on the active console:
Nov 14 15:32:27: %C4K_REDUNDANCY-6-DUPLEX_MODE: The peer Supervisor has been detected
Nov 14 15:34:27: %C4K_REDUNDANCY-2-HANDSHAKE_TIMEOUT_ACTIVE: The handshake messaging between active and standby has not yet started.

(The HANDSHAKE_TIMEOUT message repeats every 5 minutes.)
I didn't expect it to come up in SSO mode, but I at least expected it to come up in RPR mode. I even tried to set the redundancy mode to RPR explicitly, but it didn't change the behavior.
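For completeness, the explicit attempt was just the standard redundancy configuration:

Switch(config)# redundancy
Switch(config-red)# mode rpr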
Switch(config-red)#do sho red states
         my state = 13 -ACTIVE
       peer state = 1  -DISABLED
             Mode = Duplex
             Unit = Primary
          Unit ID = 5

Redundancy Mode (Operational) = RPR
Redundancy Mode (Configured)  = RPR
Redundancy State              = RPR
     Manual Swact = disabled (the peer unit is still initializing)
   Communications = Down      Reason: Failure

     client count = 66
 client_notification_TMR = 240000 milliseconds
          keep_alive TMR = 9000 milliseconds
        keep_alive count = 0
    keep_alive threshold = 18
           RF debug mask = 0
Changing the redundancy mode to SSO and booting back to 3.2.2SG brings it up as "peer state = 8 -STANDBY HOT".
Is the initialization failure expected with the mismatched IOS? If I forced a switchover, would the standby complete initialization? I am very hesitant to try this since this is a production switch and I don't have physical access to it. The release notes do not address this condition.
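(To be precise, the switchover I am contemplating would be the standard "redundancy force-switchover" command; I have not run it given the risk.)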
Thanks for all the help!
Thanks for the reply BB.
Switch#show redundancy states
         my state = 13 -ACTIVE
       peer state = 1  -DISABLED
             Mode = Duplex
             Unit = Primary
          Unit ID = 5

Redundancy Mode (Operational) = Stateful Switchover
Redundancy Mode (Configured)  = Stateful Switchover
Redundancy State              = Stateful Switchover
     Manual Swact = disabled (the peer unit is still initializing)
   Communications = Down      Reason: Failure

     client count = 66
 client_notification_TMR = 240000 milliseconds
          keep_alive TMR = 9000 milliseconds
        keep_alive count = 0
    keep_alive threshold = 18
           RF debug mask = 0
Switch#show redundancy
Redundant System Information :
------------------------------
       Available system uptime = 1 year, 30 weeks, 1 day, 6 hours, 28 minutes
Switchovers system experienced = 0
              Standby failures = 57
        Last switchover reason = none

                 Hardware Mode = Duplex
    Configured Redundancy Mode = Stateful Switchover
     Operating Redundancy Mode = Stateful Switchover
              Maintenance Mode = Disabled
                Communications = Down      Reason: Failure

Current Processor Information :
------------------------------
               Active Location = slot 5
        Current Software state = ACTIVE
       Uptime in current state = 1 year, 30 weeks, 1 day, 6 hours, 25 minutes
                 Image Version = Cisco IOS Software, IOS-XE Software, Catalyst 4500 L3 Switch Software (cat4500e-UNIVERSALK9-M), Version 03.02.02.SG RELEASE SOFTWARE (fc3)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2011 by Cisco Systems, Inc.
Compiled Wed 07-Dec-11 19:55 by prod
                          BOOT = bootflash:/cat4500e-universalk9.SPA.03.08.06.E.152-4.E6.bin,12;
        Configuration register = 0x102

Peer (slot: 6) information is not available because it is in 'DISABLED' state
Switch#show module
Chassis Type : WS-C4510R+E

Power consumed by backplane : 40 Watts

Mod Ports Card Type                              Model              Serial No.
---+-----+--------------------------------------+------------------+-----------
 1    12  1000BaseX (SFP)                        WS-X4612-SFP-E     JAE154608EW
 2    12  1000BaseX (SFP)                        WS-X4612-SFP-E     JAE1546086T
 5     4  Sup 7-E 10GE (SFP+), 1000BaseX (SFP)   WS-X45-SUP7-E      CAT1543L00H
 6        Supervisor
 7    48  10/100/1000BaseT (RJ45)                WS-X4648-RJ45-E    JAE1539057U
 9    48  10/100/1000BaseT (RJ45)                WS-X4648-RJ45-E    JAE15390561

 M  MAC addresses                     Hw  Fw           Sw                Status
--+---------------------------------+---+------------+-----------------+---------
 1  649e.f31a.4f44 to 649e.f31a.4f4f 1.1                                Ok
 2  70ca.9b13.587c to 70ca.9b13.5887 1.1                                Ok
 5  ccef.481e.2200 to ccef.481e.2203 1.0 15.0(1r)SG2  03.02.02.SG       Ok
 6  Unknown                              Unknown      Unknown           Other
 7  ccef.483a.96c2 to ccef.483a.96f1 1.0                                Ok
 9  ccef.483a.9b8a to ccef.483a.9bb9 1.0                                Ok

Mod  Redundancy role     Operating mode      Redundancy status
----+-------------------+-------------------+----------------------------------
 5   Active Supervisor   SSO                 Active
 6   Standby Supervisor   SSO                 Disabled
I still don't have a path forward with this. In order to get the standby supervisor up and running I need to make some configuration changes, but the configuration is locked. Is there any way to get around this configuration lock?
Switch(config)#I should have been a lawyer
Config mode locked out until standby initializes
configuration mode locked. 'Please try later.'
Switch(config)#
Do you suppose if we slide the standby out of the chassis it will kick the config mode open?
There is no boot variable pointing to a 2960X image. The configuration statement referencing that image is:
tftp-server bootflash:/c2960x-universalk9-tar.152-4.E6.tar alias ios
This seems to be causing some confusion. I was simply using the 4500 as a TFTP server to upgrade the 2960Xs at the site.
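For context, with that alias in place a 2960X at the site can pull the image directly from the 4500 with something like the following (the 4500's management address here is just a placeholder):

archive download-sw /overwrite tftp://10.1.1.1/ios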
As to reading the release notes, you have a point. I did not notice this very important note:
If you are upgrading to Cisco IOS XE Release 3.8.xE and using Supervisor Engine 7-E or 7L-E, you must use ROMMON version 15.0(1r)SG10 or a higher version (if available).
I have 15.0(1r)SG2. 15.0(1r)SG15 is available.
Thank you for pointing that out. I will certainly perform that upgrade just as soon as I figure out how to get the standby supervisor back online.
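(For anyone else checking before they upgrade: "show version | include ROM" shows the running rommon version, and "show module" lists it per slot under the Fw column.)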