VSS force-switchover failed

vokabakov
Level 1

Hi all,

I need to move the currently active chassis (switch 1) of our VSS, so I wanted to use force-switchover to swap their roles. Unfortunately, after "redundancy force-switchover" the former active chassis did not come up properly. I could see it in "sh switch virtual" as Standby and the former Standby was Active, but I did not see any modules ("show module switch 1" was empty) and all lines of switch 1 (sh ip int b) were down.

On switch 2 I saw that its VSL control link Te2/5/4 was in an up/down state.

After a reload of the whole VSS it came back to the original configuration (switch 1 active, switch 2 standby) and everything works properly.

 

Could it be caused by the higher priority of switch 1 (switch 1 priority 110)? If not, what could cause this issue?
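
For context, the priority part of our VSS config is roughly like this (the domain number below is just a placeholder, not our real one):

switch virtual domain 100
 switch 1 priority 110
 switch 2 priority 100

As far as I understand, without a preempt statement the higher priority alone should not pull switch 1 back to active after a switchover, so I am not sure the priority explains it.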

Richard

 


14 Replies

Leo Laohoo
Hall of Fame

I suspect it has gone into ROMmon, probably because the IOS versions are not the same.

Hi Leo,
thank you for your response.
I don't think it is that kind of issue. Both switches are currently in the VSS and working without any problem. I would expect that if one of them had a different IOS they would not be able to form the virtual switch at all. Right now switch 1 is active and switch 2 is standby, and I would like to just swap them, because I need to move switch 1 to a different location on Monday and I would like to minimize the impact on our users.
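
For what it's worth, both supervisors report the same image under show redundancy, so I don't think it's an IOS mismatch; the filtered output looks something like this:

RTR#sh redundancy | include Image Version
Image Version = Cisco IOS Software, s72033_rp Software (s72033_rp-IPSERVICESK9_WAN-M), Version 12.2(33)SXJ6, RELEASE SOFTWARE (fc3)
Image Version = Cisco IOS Software, s72033_rp Software (s72033_rp-IPSERVICESK9_WAN-M), Version 12.2(33)SXJ6, RELEASE SOFTWARE (fc3)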
Richard

Is the IOS loaded in bootflash and slave bootflash on the switch that failed?

Yes, it is.

boot system flash sup-bootdisk:/s72033-ipservicesk9_wan-mz.122-33.SXJ6.bin

 

RTR#sh sup-bootflash:
-#- --length-- -----date/time------ path
1 175267636 Apr 9 2011 09:26:08 +02:00 s72033-ipservicesk9_wan-vz.122-33.SXI5.bin
2 33554432 Apr 9 2011 09:20:32 +02:00 sea_console.dat
3 33554432 Apr 9 2011 12:07:56 +02:00 sea_log.dat
4 140062020 Oct 21 2015 20:08:12 +02:00 s72033-ipservicesk9_wan-mz.122-33.SXJ6.bin
5 1769720 Apr 26 2018 12:04:18 +02:00 tftp

640319488 bytes available (384237568 bytes used)

 

RTR#sh slavesup-bootflash:
-#- --length-- -----date/time------ path
1 175267636 Apr 9 2011 09:26:06 +02:00 s72033-ipservicesk9_wan-vz.122-33.SXI5.bin
2 33554432 Apr 9 2011 09:20:26 +02:00 sea_console.dat
3 33554432 Apr 9 2011 12:15:46 +02:00 sea_log.dat
4 19907 Apr 3 2013 10:03:46 +02:00 Vc_Qos_Tests_2413
5 19069 Dec 23 2013 15:55:52 +01:00 Before_Sense_Chg_231213
6 140062020 Oct 21 2015 19:59:56 +02:00 s72033-ipservicesk9_wan-mz.122-33.SXJ6.bin

642039808 bytes available (382517248 bytes used)

 

RTR#sh sup-bootdisk:
-#- --length-- -----date/time------ path
1 175267636 Apr 9 2011 09:26:08 +02:00 s72033-ipservicesk9_wan-vz.122-33.SXI5.bin
2 33554432 Apr 9 2011 09:20:32 +02:00 sea_console.dat
3 33554432 Apr 9 2011 12:07:56 +02:00 sea_log.dat
4 140062020 Oct 21 2015 20:08:12 +02:00 s72033-ipservicesk9_wan-mz.122-33.SXJ6.bin
5 1769720 Apr 26 2018 12:04:18 +02:00 tftp

640319488 bytes available (384237568 bytes used)

RTR#sh slavesup-bootdisk:
-#- --length-- -----date/time------ path
1 175267636 Apr 9 2011 09:26:06 +02:00 s72033-ipservicesk9_wan-vz.122-33.SXI5.bin
2 33554432 Apr 9 2011 09:20:26 +02:00 sea_console.dat
3 33554432 Apr 9 2011 12:15:46 +02:00 sea_log.dat
4 19907 Apr 3 2013 10:03:46 +02:00 Vc_Qos_Tests_2413
5 19069 Dec 23 2013 15:55:52 +01:00 Before_Sense_Chg_231213
6 140062020 Oct 21 2015 19:59:56 +02:00 s72033-ipservicesk9_wan-mz.122-33.SXJ6.bin

642039808 bytes available (382517248 bytes used)

Any crash files when you run "dir all-filesystems"?
They are sometimes located in slavecrashinfo; you may see a crash file with a timestamp matching the time you had your issue.
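
Something along these lines, for example - the crashinfo filesystem names can differ slightly depending on the supervisor, so adjust as needed:

dir all-filesystems
dir crashinfo:
dir slavecrashinfo: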

Nope. There is no crash file anywhere :-/

 

It is quite strange, because when I issued the force-switchover, switch 2 became active without any issue, and after a while I could also see that switch 1 had booted up and become Standby as I needed. So the VSL had to be communicating with both chassis and see switch 1 up, but there was some issue on the protocol layer. The VSL interfaces on switch 2 had Status up but Protocol down.

When I forced a reload of the whole VSS, both chassis rebooted and came back to the original state - switch 1 Active and switch 2 Standby - and everything started to work again :-/
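
If it happens again I will also try to capture the VSL state itself while it is broken, e.g.:

show switch virtual link
show interfaces tenGigabitEthernet 2/5/4

(Te2/5/4 being one of our VSL links, as mentioned above.)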

The fact it didn't generate a crash file may mean it was just in some hung state, a jammed process etc. Now that it has rebooted, that has most likely wiped/reset the issue, so finding the actual problem may be very difficult unless you can replicate it and debug it while it takes place, capturing everything.

You could check with TAC, or check the release notes for the version you're running to see if any known bugs match what you saw. Running a clean show tech now may not provide anything after the reboot.

I will try it again tomorrow during the day and see. Maybe it was caused just by some strange circumstances. I also tried reloading only switch 1 with "redundancy reload shelf 1" but it did not help. Maybe something was stuck on switch 2, which became active.

If you're able to replicate it, set the terminal to record everything from the start.
Has this worked for you previously, or have you never initiated a failover this way before? If it's untested there may be something in the config causing it (or missing) - it may be worth checking everything again.

Make sure nothing looks off in the outputs from these commands; just some things I'd check to be sure it's not on your end:
show redundancy and sh vsl 1 lmp status / sh vsl 2 lmp status / sh switch virtual ro

If it happens again and the config is all good, move off the version and retest if there is no TAC support.
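
As a rough sketch, something like this at the start of the capture session so nothing gets cut off (your terminal client does the actual logging; run the same set again once the issue shows up):

terminal length 0
show redundancy
sh switch virtual ro
sh vsl 1 lmp status
sh vsl 2 lmp status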


Thank you Mark.

I just tried the commands you posted and maybe there is an issue there. When I run "sh vsl 1 lmp status" it reports everything working properly with no failure, but for 2 it returns an empty result :-/. I guess that is not a correct result, right?

 

RTR#sh vsl 1 lmp status

  LMP Status

          Last operational        Current packet          Last Diag   Time since
Interface Failure state           State                   Result      Last Diag
-------------------------------------------------------------------------------
Te1/5/4   No failure              Hello bidir             Never ran   --
Te1/5/5   No failure              Hello bidir             Never ran   --

 

RTR#sh vsl 2 lmp status

RTR#

 

RTR#sh switch virtual ro

        Switch  Switch Status  Priority  Role     Session ID
        Number  Oper(Conf)                        Local  Remote
------------------------------------------------------------------
LOCAL   1       UP             110(110)  ACTIVE   0      0
REMOTE  2       UP             100(100)  STANDBY  6034   3756


In dual-active recovery mode: No

 

RTR#sh redundancy
Redundant System Information :
------------------------------
Available system uptime = 15 hours, 0 minutes
Switchovers system experienced = 0
Standby failures = 0
Last switchover reason = none

Hardware Mode = Duplex
Configured Redundancy Mode = sso
Operating Redundancy Mode = sso
Maintenance Mode = Disabled
Communications = Up

Current Processor Information :
-------------------------------
Active Location = slot 1/5
Current Software state = ACTIVE
Uptime in current state = 14 hours, 59 minutes
Image Version = Cisco IOS Software, s72033_rp Software (s72033_rp-IPSERVICESK9_WAN-M), Version 12.2(33)SXJ6, RELEASE SOFTWARE (fc3)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2013 by Cisco Systems, Inc.
Compiled Fri 19-Jul-13 03:30 by prod_rel_team
BOOT = sup-bootdisk:/s72033-ipservicesk9_wan-mz.122-33.SXJ6.bin,1;
Configuration register = 0x2102

Peer Processor Information :
----------------------------
Standby Location = slot 2/5
Current Software state = STANDBY HOT
Uptime in current state = 14 hours, 55 minutes
Image Version = Cisco IOS Software, s72033_rp Software (s72033_rp-IPSERVICESK9_WAN-M), Version 12.2(33)SXJ6, RELEASE SOFTWARE (fc3)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2013 by Cisco Systems, Inc.
Compiled Fri 19-Jul-13 03:30 by prod_rel_team
BOOT = sup-bootdisk:/s72033-ipservicesk9_wan-mz.122-33.SXJ6.bin,1;
Configuration register = 0x2102

Yes, that doesn't look right - it should be like the output below from one of my VSS; you should see both switches. Check that the VSL port config on the second switch is correct; maybe you're using ports 2/5/4 & 2/5/5.

 

 


Executing the command on VSS member switch role = VSS Active, id = 1



  LMP Status

          Last operational        Current packet          Last Diag   Time since
Interface Failure state           State                   Result      Last Diag
-------------------------------------------------------------------------------
Te1/3/7   No failure              Hello bidir             Never ran   --
Te1/3/8   No failure              Hello bidir             Never ran   --


Executing the command on VSS member switch role = VSS Standby, id = 2


1#sh vsl 2 lmp status

Executing the command on VSS member switch role = VSS Active, id = 1



Executing the command on VSS member switch role = VSS Standby, id = 2



  LMP Status

          Last operational        Current packet          Last Diag   Time since
Interface Failure state           State                   Result      Last Diag
-------------------------------------------------------------------------------
Te2/3/7   No failure              Hello bidir             Never ran   --
Te2/3/8   No failure              Hello bidir             Never ran   --

 

 

 

 

##################################################

 

interface TenGigabitEthernet1/3/7
 description Peer VSL Do Not Move
 switchport mode trunk
 switchport nonegotiate
 no lldp transmit
 no lldp receive
 no cdp enable
 channel-group 100 mode on
 service-policy output VSL-Queuing-Policy
end

xir-b101uas01#sh run int Te2/3/7
Building configuration...

Current configuration : 241 bytes
!
interface TenGigabitEthernet2/3/7
 description Peer VSL Do Not Move
 switchport mode trunk
 switchport nonegotiate
 no lldp transmit
 no lldp receive
 no cdp enable
 channel-group 101 mode on
 service-policy output VSL-Queuing-Policy

My fault ... I had to run it directly on the standby switch and not from the active one.

RTR-sdby-sp#sh vslp lmp status

Instance #2:
  LMP Status

          Last operational        Current packet          Last Diag   Time since
Interface Failure state           State                   Result      Last Diag
-------------------------------------------------------------------------------
Te2/5/4   No failure              Hello bidir             Never ran   --
Te2/5/5   No failure              Hello bidir             Never ran   --

But if you ran a forced switchover I would expect to see the output below - last time I did this on my 6509s this was what it showed. You rebooted the chassis afterwards, though, which is probably why it doesn't show it.

I think replicate it and try to capture as much as you can as the issue occurs; if it doesn't happen again it may have been a one-off process issue.

sh vslp lmp status

Instance #1:


  LMP Status

          Last operational         Current packet   Last Diag   Time since
Interface Failure state            State            Result      Last Diag
-------------------------------------------------------------------------------
Te1/5/4   Dis:Peer Reload Request  Hello bidir      Never ran   --
Te1/5/5   Dis:Peer Reload Request  Hello bidir      Never ran   --



Hi Mark,
today I tried the force-switchover again and it was successful :-). It was probably really just some hung state / jammed process etc., as you mentioned yesterday. Anyway, thank you very much for your hints and support :-)
Richard