cancel
Showing results for 
Search instead for 
Did you mean: 
cancel
24491
Views
20
Helpful
4
Replies

Unity Connection Cluster issue

spaansj05
Level 1
Level 1

Hello,

I have two Unity servers, one at each site that recently started displaying the message "Communication is not functioning correctly between the servers in the Cisco Unity Connection cluster."

Users recently began reporting MWI issues at the remote site as well, that site has a Unity Connection Subscriber as well as two Call Manager Subscribers.

Our main site has a CM Publisher and Subscriber as well as a Unity Connection Publisher. 

I have tried restarting the Unity Subscriber and then the Publisher via Settings>Version in the Unified OS Administration but that didn't seem to resolve anything. 

On the Subscriber, under Unity Connection Serviceability, it says that the server status of the Publisher is "Not Reachable". I have verified the devices have IP connectivity between them. 

Any help is appreciated, I've recently been given the reigns of our VoIP system and I only know enough to be dangerous. 

1 Accepted Solution

Accepted Solutions

Ryan Huff
Level 4
Level 4

Hello there!

When you logged into the Unified Serviceability page of the HA node (subscriber) and you saw it say Not Reachable for the publisher, did you also notice if it said, "Split Brain Recovery" as the server status for the HA (subscriber) node? If so, please see http://ryanthomashuff.com/2015/12/cisco-unity-connections-split-brain/. Assuming that IS NOT the case, I have included the steps I would go through to troubleshoot.

If you could log into the CLI (via SSH with the platform administration credentials) of both server nodes; here are some of the tasks I would go through.

  • Are you able to get to the GUI Web page of both Unity Connection server nodes? If you cannot get to the GUI, issue the command, "utils service restart Cisco Tomcat" from the CLI of whichever node(s) you cannot get to the web page on. My guess is, you can (since you did say you restarted both servers, the tomcat service is likely okay).
    • Please note: this can be a service impacting command (specifically if users use PCA, Web Inbox ... etc); do this during a maintenance window or at least with declaration to your user population. After you restart this service, wait about 15-20 minutes and then test if you can get to the GUI web page.

Check to see if the reachable issue is resolved and cluster operation is restored.

  • Type in the following command, "utils diagnose module validate_network" from the CLI of the primary server. You're looking for this command to return the output, "passed". If it returns any other sort of output, please copy/paste in this thread.

PASSED SAMPLE

FAILED SAMPLE

    • If it DOES NOT pass network validation:
      • You need to determine what failed, why it failed and steps to resolve before moving on to the next step.

Check to see if anything you resolved from the failed network validation corrected the reachable issue and if cluster operation is restored.

    • If it DOES pass network validation:
      • The next thing I would is look for excessive network delay between nodes (i.e > 80Ms). Again from the CLI of the publisher; issue the command, "utils dbreplication runtimestate". Look at the ping column and make sure each server is under 80Ms. Also, for good measure, look at the last column on the right and verify each node shows (2) Setup Completed under the Replication Setup column.

      • If you do have excessive delay, you'll need to investigate the who/what/when/where/why on your network.
  • From the CLI of both nodes issue the command, "utils ntp server list". Verify that each NTP server in the list is specified by IP address and not hostname or FQDN. While it should work with hostname/FQDN it is my belief and practice not to; in addition to the fact there have been a few bugs with NTP server addresses specified by hostname/FQDN.
  • From the CLI of the primary server issue the command, "utils ntp status" and verify that it reads, "Syncronized to NTP server (xxx.xxx.xxx.xxx) as stratum 3". You can have a lower Stratum number but Stratum three (3) is actual recomendation from Cisco. You may find the primary at Stratum 3 and the HA node sync'ed to the primary at Stratum 4. This is the default implementation of Unity Connections and is generally OK. The big deal here is to make sure the primary node is at Stratum 3 (or lower) but you could also put the HA node at Stratum 3 (or lower).

Check to see if anything you resolved from the network delay and NTP resolved the reachable issue and if cluster operation is restored.

OK! I realize this may seem like a lot; however, since you stated that you checked the actual network communication between the node segments (which you said is working) and that you have already rebooted both servers (some of the first things I would recommend), that only leaves the business end of troubleshooting!

So, run through all this and hopefully something here will help shed a light on what is going on with the cluster.

Thanks,

Ryan

(: ... Please rate helpful posts ...:)

View solution in original post

4 Replies 4

Live2 Bicycle
Level 3
Level 3

I'm going to assume when you validated ip connectivity you did it either from the OS Administration page of each Unity Connection server or from the CLI of each server. 

from either the OS Administration page or the CLI did were you able to ping both host name and IP address?

Have you validated DNS and reverse DNS is working correctly?

From the CLI of each server run -- utils diagnose test

The utils diagnose test may give you more information to help resolve your issue.

***Don't forget to rate helpful feedback***

Ryan Huff
Level 4
Level 4

Hello there!

When you logged into the Unified Serviceability page of the HA node (subscriber) and you saw it say Not Reachable for the publisher, did you also notice if it said, "Split Brain Recovery" as the server status for the HA (subscriber) node? If so, please see http://ryanthomashuff.com/2015/12/cisco-unity-connections-split-brain/. Assuming that IS NOT the case, I have included the steps I would go through to troubleshoot.

If you could log into the CLI (via SSH with the platform administration credentials) of both server nodes; here are some of the tasks I would go through.

  • Are you able to get to the GUI Web page of both Unity Connection server nodes? If you cannot get to the GUI, issue the command, "utils service restart Cisco Tomcat" from the CLI of whichever node(s) you cannot get to the web page on. My guess is, you can (since you did say you restarted both servers, the tomcat service is likely okay).
    • Please note: this can be a service impacting command (specifically if users use PCA, Web Inbox ... etc); do this during a maintenance window or at least with declaration to your user population. After you restart this service, wait about 15-20 minutes and then test if you can get to the GUI web page.

Check to see if the reachable issue is resolved and cluster operation is restored.

  • Type in the following command, "utils diagnose module validate_network" from the CLI of the primary server. You're looking for this command to return the output, "passed". If it returns any other sort of output, please copy/paste in this thread.

PASSED SAMPLE

FAILED SAMPLE

    • If it DOES NOT pass network validation:
      • You need to determine what failed, why it failed and steps to resolve before moving on to the next step.

Check to see if anything you resolved from the failed network validation corrected the reachable issue and if cluster operation is restored.

    • If it DOES pass network validation:
      • The next thing I would is look for excessive network delay between nodes (i.e > 80Ms). Again from the CLI of the publisher; issue the command, "utils dbreplication runtimestate". Look at the ping column and make sure each server is under 80Ms. Also, for good measure, look at the last column on the right and verify each node shows (2) Setup Completed under the Replication Setup column.

      • If you do have excessive delay, you'll need to investigate the who/what/when/where/why on your network.
  • From the CLI of both nodes issue the command, "utils ntp server list". Verify that each NTP server in the list is specified by IP address and not hostname or FQDN. While it should work with hostname/FQDN it is my belief and practice not to; in addition to the fact there have been a few bugs with NTP server addresses specified by hostname/FQDN.
  • From the CLI of the primary server issue the command, "utils ntp status" and verify that it reads, "Syncronized to NTP server (xxx.xxx.xxx.xxx) as stratum 3". You can have a lower Stratum number but Stratum three (3) is actual recomendation from Cisco. You may find the primary at Stratum 3 and the HA node sync'ed to the primary at Stratum 4. This is the default implementation of Unity Connections and is generally OK. The big deal here is to make sure the primary node is at Stratum 3 (or lower) but you could also put the HA node at Stratum 3 (or lower).

Check to see if anything you resolved from the network delay and NTP resolved the reachable issue and if cluster operation is restored.

OK! I realize this may seem like a lot; however, since you stated that you checked the actual network communication between the node segments (which you said is working) and that you have already rebooted both servers (some of the first things I would recommend), that only leaves the business end of troubleshooting!

So, run through all this and hopefully something here will help shed a light on what is going on with the cluster.

Thanks,

Ryan

(: ... Please rate helpful posts ...:)

It seems to be unfortunately pervasive that people choose to provide links with no context. It's fine to link to an article, but considering 90% are broken this forum loses quite a bit of usefulness if what you're looking for is presented that way. Even more so when the URL isn't a Cisco URL; it's hard enough to get a working one, if it was before the makeover.

 

dlcharville
Level 4
Level 4

I'm having the same issue on a customers 11.0.1.21900-11 system.  There is a bug CSCuu15688 which is close to what I am seeing but not exact.  I've seen the issue happen twice on this cluster/environment where something caused Primary CUC server to think the Secondary CUC server went down.  The Secondary came back and SBR wasn't able to recover.  The first time I saw this it took a reboot of both servers before everything came back to normal.  I'm waiting to decide about rebooting the servers again to see if it fixes the issue a second time.  

The only issue the customer noticed is MWIs not in sync.  They never mentioned about messages not being delivered.


Ryan - any input to this?  Honestly it seems odd that you just wrote a document about this. :)  I was hoping you worked at Cisco and had some insight to a potential known issue with SBR and version 11.

Thanks,
Dan