unexpected reboot Active Unit HA WLC 5508

Sergej Barkovski · ‎01-08-2015

Hello,

we had an unexpected reboot of the active WLC 5508 in our HA cluster two weeks ago. The standby unit became active, and because of the Christmas and New Year holidays we only noticed it yesterday.

The controller generated no crash file during the reboot. The only thing we have is the log messages the WLC sent to the syslog server at the time of the reboot:

Dec 19 00:34:39 wlc2 wlc2: *rmgrMain: Dec 19 00:34:40.113: #RMGR-0-RED_HA_RELOAD: rmgr_utils.c:216 System reboot: reason: category Peer reload req object redundancy management interface and redundancy port are down
Dec 19 00:39:57 141.79.131.4 wlc2: *rfacMain: Dec 19 00:39:58.957: #RMGR-0-RED_HA_RELOAD: rmgr_utils.c:216 System reboot: reason: category New XML downloaded object rsyncmgrXferTrasport
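To pull the reload reason out of messages like the two above, a small parser can help when searching a larger syslog archive. This is only a sketch: the regular expression is inferred from the lines quoted in this thread, not from any Cisco documentation, and the names `RELOAD_RE` and `parse_reload` are made up for illustration.

```python
import re

# Pattern inferred from the RMGR-0-RED_HA_RELOAD lines quoted in this thread
# (an assumption, not an official message format).
RELOAD_RE = re.compile(
    r"\*(?P<task>\w+): "                                      # e.g. *rmgrMain
    r"(?P<stamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}): "   # controller timestamp
    r"#RMGR-0-RED_HA_RELOAD: "
    r"(?P<source>\S+:\d+) "                                   # e.g. rmgr_utils.c:216
    r"(?P<message>.*)$"
)


def parse_reload(line):
    """Return task, timestamp, source location and message, or None if no match."""
    match = RELOAD_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "Dec 19 00:34:39 wlc2 wlc2: *rmgrMain: Dec 19 00:34:40.113: "
        "#RMGR-0-RED_HA_RELOAD: rmgr_utils.c:216 System reboot: reason: "
        "category Peer reload req object redundancy management interface "
        "and redundancy port are down"
    )
    info = parse_reload(sample)
    print(info["stamp"], "|", info["task"], "|", info["message"])
```

Lines that do not contain the reload message simply return `None`, so the function can be applied to every line of a log file.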

The first timestamp matches the time of the redundancy switchover exactly:

(Cisco Controller) >show redundancy summary
....
Switchover Reason = Active controller failed, Switchover Time = Fri Dec 19 00:34:40 2014

Both WLCs in the HA pair run 7.6.130.0.

I found two bugs involving unexpected reboots for this software version:

CSCur86730 and CSCuq97965

But neither mentions the log messages sent by the controller.

One detail that may or may not be important: the WLC that rebooted/crashed had been replaced by Cisco 1 or 2 weeks before the reboot.

Has anybody had the same or a similar issue?

Thank you and kind regards

Sergej

marce1000 · ‎01-08-2015

- It looks like your peer redundancy link may have been interrupted, making the HA standby think it had to take over. Check these links and their status during the holiday period. Some offices or rooms may have had outages during the holidays, for instance.

M.

-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
And the mirror always responds with ' The only thing that exceeds your brilliance is your beauty! '

Sergej Barkovski · ‎01-08-2015

Hello marce1000,

Yesterday we manually switched the controller that rebooted on Dec 19 back to the active role. Today I checked the uptime on both WLCs in the HA cluster:

active wlc (xxx.xxx.131.4)

(Cisco Controller) >show sysinfo

Manufacturer's Name.............................. Cisco Systems Inc.
Product Name..................................... Cisco Controller
Product Version.................................. 7.6.130.0
Bootloader Version............................... 1.0.20
Field Recovery Image Version..................... 7.6.95.16
Firmware Version................................. FPGA 1.7, Env 1.8, USB console 2.2
Build Type....................................... DATA + WPS

System Name...................................... wlc2
System Location..................................
System Contact...................................
System ObjectID.................................. 1.3.6.1.4.1.9.1.1069
Redundancy Mode.................................. SSO (Both AP and Client SSO)
IP Address....................................... xxx.xxx.131.3
Last Reset....................................... Software reset
System Up Time................................... 20 days 14 hrs 44 mins 7 secs

and the standby wlc (xxx.xxx.131.6)

(Cisco Controller-Standby) >show sysinfo

Manufacturer's Name.............................. Cisco Systems Inc.
Product Name..................................... Cisco Controller
Product Version.................................. 7.6.130.0
Bootloader Version............................... 1.0.20
Field Recovery Image Version..................... 7.6.95.16
Firmware Version................................. FPGA 1.7, Env 1.8, USB console 2.2
Build Type....................................... DATA + WPS

System Name...................................... wlc2
System Location..................................
System Contact...................................
System ObjectID.................................. 1.3.6.1.4.1.9.1.1069
Redundancy Mode.................................. SSO (Both AP and Client SSO)
IP Address....................................... xxx.xxx.131.3
Last Reset....................................... Watchdog reset
System Up Time................................... 0 days 23 hrs 56 mins 33 secs

It looks like the active unit really crashed on Dec 19 because of an unknown software error, and the standby unit rebooted yesterday after the switchover to the other WLC, with a watchdog reset and this log message:

Jan 7 15:25:18 xxx.xxx.131.6 wlc2: *rfacMain: Jan 07 15:25:19.710: #RMGR-0-RED_HA_RELOAD: rmgr_utils.c:216 System reboot: reason: category New XML downloaded object rsyncmgrXferTrasport
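The "System Up Time" values in the two `show sysinfo` outputs above can be turned into approximate boot times and compared with the syslog timestamps. The following is a rough sketch only: the function names are made up, and the observation time passed in the usage example is a hypothetical value, since the thread does not say exactly when the commands were run.

```python
import re
from datetime import datetime, timedelta

# Matches the uptime format shown in the "show sysinfo" output in this thread.
UPTIME_RE = re.compile(r"(\d+) days (\d+) hrs (\d+) mins (\d+) secs")


def parse_uptime(text):
    """Convert a 'System Up Time' line into a timedelta."""
    match = UPTIME_RE.search(text)
    if match is None:
        raise ValueError(f"no uptime found in: {text!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def boot_time(observed_at, sysinfo_line):
    """Estimate when the unit booted, given when the command was run."""
    return observed_at - parse_uptime(sysinfo_line)


if __name__ == "__main__":
    # Hypothetical observation time; replace with the real time of the CLI session.
    observed_at = datetime(2015, 1, 8, 15, 0, 0)
    line = "System Up Time................................... 20 days 14 hrs 44 mins 7 secs"
    print(boot_time(observed_at, line))
```

If the estimated boot time of a unit falls a few minutes after a reload message in syslog, that reload is very likely the one that reset the uptime counter.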

Is it actually normal for the active WLC to reboot itself when a switchover is started manually or the HA link goes down?

Thanks.

Sergej

Sergej Barkovski · ‎01-08-2015

I just forgot to mention: the switchover was done with the command "redundancy force-switchover" over an SSH connection.

tonyp8581 · ‎07-31-2015

Hi Sergej, I'm wondering if you eventually resolved your issue?

I also have WLC 5508s in HA running v7.6.130.0, with a twist: if my primary is the active WLC, it reboots after a couple of days with the same error as yours. However, if my backup is the active WLC, it doesn't reboot and is pretty much stable. Very strange.

I would appreciate an update from you.

Thanks !

Tony

eahmed007 · ‎12-02-2015

Dear All ,

We have configured HA using WLC 5508s running version 8.0. But after a few days the controller rebooted unexpectedly, and we saw the message below on the syslog server:

[root@syslog ~]# tail -f /var/log/messages
Nov 29 03:29:01 nms rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="16103" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Dec 2 06:44:00 10.21.21.16 bkash-WLC-Primary: *rmgrMain: Dec 03 00:38:55.034: #RMGR-0-RED_HA_RELOAD: rmgr_utils.c:239 System is rebooting, reason: XMLs were not trasferred from Active to Standby

Can anyone help me find a solution? I am eagerly waiting for your reply.

Thanks and regards

Erfan

Scott Fella · ‎12-02-2015

v8.0.120.0? Dumb question, but was everything set up right, and how long had this been working before you noticed the reboots? Do you have a back-to-back Ethernet cable between the RP ports?

-Scott

*** Please rate helpful posts ***

eahmed007 · ‎12-02-2015

Hi Scott, we already have a back-to-back cable connected on the RP ports. We have configured the peer service port in the Redundancy settings. We have already faced this kind of problem 3 to 4 times.

The WLC software version is 8.0.115.0.

Can you tell us why this unexpected reboot is happening and why it reports that the XMLs are not being transferred from active to standby?

For your information, we see the unexpected reboot happen when the secondary WLC is in active mode.

I look forward to your prompt reply and support.

Thanks and regards

Erfan

Scott Fella · ‎12-02-2015

I would not use that code. Go with v8.0.121.0 and verify that the image on both controllers is the same. It might be the image you have. I have SSO working in many environments with v7.6.130.0, v8.0.120.0, and v8.0.121.0.
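Checking that both controllers run the same image can be scripted from saved `show sysinfo` output. This is a minimal sketch, assuming the "Product Version" line has the format shown earlier in this thread; the helper names are made up for illustration.

```python
import re

# Matches e.g. "Product Version.................................. 7.6.130.0"
VERSION_RE = re.compile(r"Product Version\.*\s*(\d+(?:\.\d+)+)")


def product_version(sysinfo_text):
    """Return the product version as a tuple of ints, e.g. (7, 6, 130, 0)."""
    match = VERSION_RE.search(sysinfo_text)
    if match is None:
        raise ValueError("no 'Product Version' line found")
    return tuple(int(part) for part in match.group(1).split("."))


def same_image(active_sysinfo, standby_sysinfo):
    """True if both outputs report the same product version."""
    return product_version(active_sysinfo) == product_version(standby_sysinfo)


if __name__ == "__main__":
    active = "Product Version.................................. 8.0.115.0"
    standby = "Product Version.................................. 8.0.115.0"
    print("same image:", same_image(active, standby))
    # Tuples compare element by element, so this checks for an older release.
    print("older than 8.0.121.0:", product_version(active) < (8, 0, 121, 0))
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "8.0.99.0" as newer than "8.0.121.0".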

-Scott


eahmed007 · ‎12-03-2015

Hi Scott ,

Thanks for your reply.

Is it a bug in version 8.0.115.0 of the WLC software?

For your information, it actually works fine for a few days after each reboot and then suddenly reboots again. We couldn't find any reason for this strange behaviour of the WLC in HA mode.

We have configured the peer service port IP in the Redundancy tab. Is that required to configure HA on the WLC?

It would be highly appreciated if you reply.

Thanks and regards

Erfan

Scott Fella · ‎12-03-2015

I don't know for sure if it's a bug, but that code isn't recommended. I also have to assume that when you set up SSO you followed the guide and that it is properly set up.

http://www.cisco.com/c/en/us/td/docs/wireless/controller/technotes/7-5/High_Availability_DG.html

There are two ways to set up HA: one is SSO, which you are doing, and the other is N+1.

http://www.cisco.com/c/en/us/td/docs/wireless/technology/hi_avail/N1_High_Availability_Deployment_Guide.pdf

I would break up the SSO and upgrade both controllers to v8.0.121.0. But first I would probably factory reset the secondary HA controller and go through the startup wizard just to bring it online. Then, after you have upgraded your code and installed any third-party certs and webauth portal pages, set up SSO again. This is what I would do since you've been having these reboot issues for a long time. I don't think you're going to fix that with the code you're running, and you would probably have to break up SSO anyway. The steps to break up SSO are in the guide. You pretty much just have to disable SSO on both units via CLI.

-Scott


eahmed007 · ‎12-08-2015

Hi scott ,

Can you tell us whether we need to enable or disable the gateway reachability option in HA mode on the WLC, given that we are getting unexpected reboots?

Today we changed the RP port cable to check whether the cable is causing it.

Please have a look at the details below for your reference:

(Cisco Controller) >show redundancy summary ?

(Cisco Controller) >show redundancy summary
            Redundancy Mode = SSO ENABLED
                Local State = ACTIVE
                 Peer State = STANDBY HOT
                       Unit = Primary
                    Unit ID = 74:A2:E6:C7:6F:E0
           Redundancy State = SSO
               Mobility MAC = 74:A2:E6:C7:6F:E0
Management Gateway Failover = DISABLED
            BulkSync Status = Complete
Average Redundancy Peer Reachability Latency = 436 Micro Seconds
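Output like the `show redundancy summary` above can be checked automatically, for example from a periodic monitoring job. The sketch below parses the `key = value` lines into a dictionary. The two health checks (peer in STANDBY HOT, BulkSync complete) are my own assumption about what a healthy SSO pair looks like, based on the output quoted here, and the function names are made up.

```python
def parse_summary(text):
    """Parse 'key = value' lines of 'show redundancy summary' into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' sign
            fields[key.strip()] = value.strip()
    return fields


def sso_problems(fields):
    """Return a list of human-readable problems; empty if nothing looks wrong."""
    problems = []
    # Assumed healthy values, taken from the output quoted in this thread.
    if fields.get("Peer State") != "STANDBY HOT":
        problems.append(f"peer state is {fields.get('Peer State')!r}")
    if fields.get("BulkSync Status") != "Complete":
        problems.append(f"bulk sync is {fields.get('BulkSync Status')!r}")
    return problems


if __name__ == "__main__":
    sample = """\
            Redundancy Mode = SSO ENABLED
                Local State = ACTIVE
                 Peer State = STANDBY HOT
            BulkSync Status = Complete
"""
    print(sso_problems(parse_summary(sample)) or "no problems found")
```

Logging the result together with a timestamp would show whether the peer state or sync status changed shortly before one of the unexpected reboots.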

Need your feedback on this

Thanks and regards

Erfan

Scott Fella · ‎12-08-2015

You do not need this since you have a back to back cable. Did you upgrade the controller to v8.0.121.0?

-Scott


eahmed007 · ‎12-10-2015

Hi scott ,

We will now start the upgrade process. Should we disable SSO from the active WLC and then do the upgrade as per the deployment guide?

We are eagerly waiting for your reply.

Thanks and regards

Erfan

Scott Fella · ‎12-13-2015

Sorry for the late response. Yes you need to disable SSO on the primary. This will make you reboot the controllers so that they can come back up as separate units.

Hope that helps

-Scott

*** Please rate helpful post ***
