
CSCvr86743 - CMX 10.6.2 - Redis check failed for master

trapasso
Level 1

Does anyone have more information about this bug? There are no usable notes and no workaround notes. It is marked as fixed, but there are no documented versions under the Known Fixed Releases section.

 

I am seeing this alert, and getting a bunch of emails about it, after restarting the services on the active member. The secondary member took over, but I have received this critical alert over 400 times in the last 24 hours. I am currently running CMX 10.6.2-66.

5 Replies

joshua Slaney
Level 1

There is a documented fix listed now in 10.6(2.72). However, I installed that yesterday and got my servers paired up in HA, and within 5.5 hours it broke again with this error.

Wed Apr 22 2020 4:21:44 PM | Primary Active | Successfully enabled high availability. Primary is syncing with secondary.
Wed Apr 22 2020 9:52:25 PM | Primary Active | Redis check failed for master. Attempt to restart redis
Wed Apr 22 2020 9:52:27 PM | Primary Active | Redis check failed for master even after a restart of the agent
Wed Apr 22 2020 9:52:28 PM | Primary Failover Invoked | Attempting to failover to secondary. Reason: Redis check get writeable failed for port: 6383

 

Then it successfully failed over to the secondary server.

Hi Joshua,

I upgraded our virtual HA CMX servers to CMX 10.6.2-72 in February. The server lasted 45 days before CMX started tracking only about 10% of the devices in our environment. I manually forced a switchover to resolve that, but then the secondary CMX server started emailing dozens of messages similar to the ones I pasted below.

I have had 3 previous TAC cases with similar issues in prior versions of CMX. The last TAC engineer, on TAC case 4, stated that I was being affected by bug CSCvr16016 and that the fix would only take effect after my 10.6.2-72 servers were rebooted. But my servers were rebooted after the upgrade. If you look at the bug, the workaround states: "reload of box resolves issue for some time". One of the suggestions was to reduce the RSSI cutoff to -65 (the default is -85; my servers are set to -75), which basically reduces the number of devices that CMX will track. A value of -65 just doesn't make sense to me, so I left my servers where I had them.

My CMX instance is running on a university campus, and with COVID-19 we just don't have the same load on these servers, so I decided to close out the last TAC case. I do expect these servers to eventually fail again.

----------------------------------------------------------------------------------------------------------
You have a new alert from CMX!

analytics Connection failed

Host: istcmxprd-xxxx
Service: analytics
Description: failed protocol test [HTTP] at
[analytics.service.consul]:5556/api/services/analytics/status [TCP/IP] --
HTTP: Error receiving data -- Resource temporarily unavailable
Date: Thu, 19 Mar 2020 05:23:01
---------------------------------------------------------------------------------------------------------------------------------------------------

FileDescriptors Status failed

Host: istcmxprd-xxxx
Service: FileDescriptors
Description: status failed (2) -- 56640 total file handles open
Total file descriptors above bounds
650 open files by process cassandra
33 open files by process redis_6380
17 open files by process redis_6383.pid
41 open files by process consul
27 open files by process redis_6378.pid
27 open files by process gateway
21 open files by process redis_6382.pid
32 open files by process redis_6381
65 open files by process postgres
652 open files by process connect
71 open files by process redis_6384
1886 open files by process location
21 open files by process redis_6382
55 open files by process influxd
17 open files by process redis_6383
33 open files by process redis_6380.pid
27 open files by process redis_6378
71 open files by process redis_6384.pid
1034 open files by process matlabengine
91 op
Date: Thu, 19 Mar 2020 05:26:11

---------------------------------------------------------------------------------------------------------------------------------------------------

analytics Connection succeeded

Host: istcmxprd-xxxx
Service: analytics
Description: connection succeeded to
[analytics.service.consul]:5556/api/services/analytics/status [TCP/IP]
Date: Thu, 19 Mar 2020 05:26:14

---------------------------------------------------------------------------------------------------------------------------------------------------

analytics Connection failed

Host: istcmxprd-xxxx
Service: analytics
Description: failed protocol test [HTTP] at
[analytics.service.consul]:5556/api/services/analytics/status [TCP/IP] --
HTTP: Error receiving data -- Resource temporarily unavailable
Date: Thu, 19 Mar 2020 05:43:14

---------------------------------------------------------------------------------------------------------------------------------------------------

analytics Connection failed

Host: istcmxprd-xxxx
Service: analytics
Description: failed protocol test [HTTP] at
[analytics.service.consul]:5556/api/services/analytics/status [TCP/IP] --
HTTP: Error receiving data -- Resource temporarily unavailable
Date: Thu, 19 Mar 2020 05:49:36

They have a section in the release notes you may want to try:

Important Notes

Tip: To clean up long queues and long-running processes, we recommend that you schedule a full restart of Cisco CMX once a month during a low-activity time, such as late at night or early in the morning. You can either manually restart Cisco CMX or apply the root patch and create a scheduled CRON job to restart Cisco CMX. The restart takes approximately 5 minutes to complete.

 

To restart Cisco CMX services, follow these steps:

1. Enter the cmxctl stop -a command.

2. Enter the cmxctl start -a command.
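
For what it's worth, the scheduled CRON job the release notes mention could look roughly like the entry below. Treat this as a sketch only: it assumes the root patch is installed (so you have a root crontab), that cmxctl is at /opt/cmx/bin/cmxctl (check the path on your own box), and that 2 a.m. on the first of the month is a quiet window for your site.

# root crontab entry (crontab -e as root): restart CMX at 02:00 on the 1st of each month
0 2 1 * * /opt/cmx/bin/cmxctl stop -a && /opt/cmx/bin/cmxctl start -a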

 

I don't see how that will help given that mine didn't last longer than 6 hours before it failed over. I've had to rebuild my appliances at least 4 times due to issues like the following:

1. Hard disk size doesn't match up - HA pairing failed. This is even after I confirmed with TAC that the sizing was identical.

2. Can't log in via SSH with any username/password. Authentication fails when the secondary has been active for an extended period of time.

3. Some appliances failed HA pairing until I swapped the roles between primary and secondary.

4. API becomes inaccessible after a period of time.

 

I used to think the MSE was really bad, but these appliances take the issues to another level of bad. 

 

 

I have an update on this. I had a TAC case with Cisco to determine why we were having these issues. The engineer mentioned that they have seen performance issues when the unique client count approaches 90,000. We had between 120,000 and 140,000 unique clients. You can run the following to see what your unique client count is:

 

To determine the number of unique clients, SSH to the active box as cmxadmin and run:

shell
su
cd /opt/cmx/var/log/location
grep -i "unique device" server* | grep 2020-05

(Note: grep for whatever dates you are looking for. The example shows May 2020.)

 

You will see an output similar to this:

server-1.log:2020-05-01T05:00:00,001 [pool-64-thread-1] INFO com.cisco.mse.location.intf.ElementCounters - Cleaning up element counts, unique devices 30308, locally administered macs 289 as partof daily midnight job
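
If you only want a daily tally rather than the full log lines, a rough one-liner like the following will pull the date and count out of output in the format above. It's just a sketch that assumes GNU sed and that your lines match the sample (timestamp at the start, "unique devices N" in the middle):

grep -i "unique device" server* | grep 2020-05 | sed -n 's/.*\(2020-05-[0-9][0-9]\).*unique devices \([0-9]*\).*/\1 \2/p'

That prints one date/count pair per matching line, e.g. 2020-05-01 30308 for the sample above.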

 

In our case we also adjusted the minimum number of detecting APs from 1 to 2, as well as setting our RSSI cutoff to -75. We felt okay changing the minimum number of detecting APs to 2 because we have a dense deployment. To change the minimum number of detecting APs, SSH to the active box and run the following from the command line:

 

shell
su
curl -X POST -H "Content-Type: application/json" -d '{"minapwithvalidrssi":2}' http://localhost/api/config/v1/filteringParams/1
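
If you want to confirm the change took, it may be worth reading the same config object back. I'm assuming here that the filteringParams endpoint also answers GET requests; I haven't verified that, so treat it as a guess:

curl http://localhost/api/config/v1/filteringParams/1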

 

Hopefully this helps.

This is extraordinarily good info, thank you very much.

 

The only problem is:

From Cisco CMX Release 10.5.0, you must install the root patch to access the root user account.

 

And there have been no updates to fix any of these Redis-related bugs (CSCvs86719, CSCvu10693) since April. I'm getting the feeling Cisco is abandoning CMX in favor of DNA Spaces, which is unfortunate, since not all customers want their user location data sent to a cloud service.