9800-CL - Active role switch issues in HA SSO deployment

Jegan Rajappa · ‎02-01-2021

Hi,

I am not sure how many of you have noticed active role switchover issues in a 9800-CL HA SSO deployment. In my case I have 2 x 9800-CL on the same ESXi host. The Gig 2 interface is connected to vSwitch1, which is trunked with promiscuous mode set to accept. Gig 3 (RP link) is connected to vSwitch2, which is host-only, and both 9800-CLs are connected to the same vSwitch.

The problem is that the active role keeps switching between chassis 1 and 2 continuously. All the VMware features listed in the deployment guide (such as snapshots, vMotion, DRS) are already disabled.

This problem is seen more often in 17.3.1 and less often in 17.3.2a.

Is anyone else experiencing the same thing? TAC is still investigating, with no resolution so far.

WLC#show redundancy switchover history
Index  Previous  Current  Switchover             Switchover
       active    active   reason                 time
-----  --------  -------  ----------             ----------
   1       1        2     active unit removed    23:16:17 UTC Sat Jan 9 2021
   2       2        1     active unit removed    01:13:17 UTC Sun Jan 10 2021
   3       1        2     active unit removed    03:12:11 UTC Sun Jan 10 2021
   4       2        1     active unit removed    08:33:30 UTC Sun Jan 10 2021
   5       1        2     active unit removed    07:21:46 UTC Tue Jan 19 2021
   6       2        1     active unit removed    08:08:30 UTC Sat Jan 23 2021
   7       1        2     active unit removed    13:03:52 UTC Sun Jan 24 2021
   8       2        1     active unit removed    19:23:34 UTC Sun Jan 24 2021
   9       1        2     active unit removed    22:58:37 UTC Sun Jan 24 2021
   10      2        1     active unit removed    00:14:49 UTC Mon Jan 25 2021

WLC#
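
To see how far apart the switchovers are, the output above can be parsed offline. The following is only a minimal sketch, assuming the output is pasted as text in exactly the column layout shown; the names `parse_switchovers`, `gaps_in_hours` and `SAMPLE` are hypothetical helpers for illustration, not part of any Cisco tool.

```python
import re
from datetime import datetime

# Month abbreviations as printed by the CLI; mapped by hand to avoid locale issues.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# One data row: index, previous active, current active, reason, timestamp.
# Assumption: the reason text is separated from the time by two or more spaces.
ROW = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<prev>\d+)\s+(?P<curr>\d+)\s+(?P<reason>.+?)\s{2,}"
    r"(?P<hh>\d{2}):(?P<mm>\d{2}):(?P<ss>\d{2}) UTC \w{3} "
    r"(?P<mon>[A-Z][a-z]{2}) (?P<day>\d{1,2}) (?P<year>\d{4})\s*$"
)


def parse_switchovers(text):
    """Return one dict per switchover row; header and prompt lines are skipped."""
    events = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        when = datetime(int(match["year"]), MONTHS[match["mon"]], int(match["day"]),
                        int(match["hh"]), int(match["mm"]), int(match["ss"]))
        events.append({
            "index": int(match["index"]),
            "prev": int(match["prev"]),
            "curr": int(match["curr"]),
            "reason": match["reason"],
            "time": when,
        })
    return events


def gaps_in_hours(events):
    """Hours elapsed between each pair of consecutive switchovers."""
    times = [event["time"] for event in events]
    return [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(times, times[1:])]


# First three rows of the history above, pasted as-is.
SAMPLE = """\
WLC#show redundancy switchover history
Index  Previous  Current  Switchover             Switchover
       active    active   reason                 time
-----  --------  -------  ----------             ----------
   1       1        2     active unit removed    23:16:17 UTC Sat Jan 9 2021
   2       2        1     active unit removed    01:13:17 UTC Sun Jan 10 2021
   3       1        2     active unit removed    03:12:11 UTC Sun Jan 10 2021
"""

if __name__ == "__main__":
    for gap in gaps_in_hours(parse_switchovers(SAMPLE)):
        print(f"{gap:.2f} hours since previous switchover")
```

Listing the gaps this way makes it easier to compare the switchover times against events on the ESXi host, or to show TAC whether the failures cluster on particular days.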

marce1000 · ‎02-01-2021

- https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvq66554, check whether the provided workaround helps

M.

-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
When the mirror will then always respond to me with ' The only thing that exceeds your brilliance is your beauty! '

Jegan Rajappa · ‎02-01-2021

Thank you marce!

Just made the changes, I will monitor and update.

Scott Fella · ‎02-01-2021

I have the same setup with no issues, and I have tried the majority of the 17.x code versions. Have you tried simply rebuilding it? With virtual, it’s pretty easy to build one offline, take one of the active ones down and replace it. When you add it back, it will sync with the active. I have been testing the 9800s for a while and have done multiple builds just to test how I would recover from a failure or rebuild. Since my design is FlexConnect, what I have done now is run two 9800-CLs in N+1, because I don’t see any need for SSO.
Have you tried spinning up new 9800s in SSO on another ESXi host, just to make sure the problem isn’t with the host or with the 9800s on that host?

-Scott
*** Please rate helpful posts ***

Jegan Rajappa · ‎02-01-2021

Hi Scott, yes. I have five HA SSO deployments across the globe and have noticed this problem in four of them. Luckily one deployment is not affected, and I am not sure why.

Of those four deployments, three have seemed stable for the last 3 weeks after upgrading from 17.3.1 to 17.3.2a. However, one deployment is still giving trouble. I have changed the timers as per the workaround mentioned in CSCvq66554.

Scott Fella · ‎02-01-2021

If TAC hasn’t identified the bug, that possibly means there is no relationship between what you are seeing and that bug. Keep working with TAC and see. You should also look at your VMware environment and see if something is different there: versions, services running, etc. I will be honest, if 17.3.2a is working, I would spin up a brand new instance and migrate to it. In the past I have seen problems on installs that started on previous versions and were upgraded to newer ones over and over, including beta. It’s probably because I test beta images, I don’t know, but something was wrong with the controller, and upgrading or downgrading didn’t help. Spinning up a new instance takes 15-30 minutes and is cleaner.

-Scott