Cisco Unity Connection Cluster Operation and Troubleshooting

Markus Schneider · ‎08-25-2009

Cisco Unity Connection Cluster Operation and Troubleshooting
Introduction
UC Failover in Operation
Cluster Status
SRM Traces
Split-brain Recovery (SBR)
Other Conditions

Cisco Unity Connection Cluster Operation and Troubleshooting

This document discusses Unity Connection Clustering, the Unity Connection voicemail system's redundancy option. Before proceeding to this, please first read through the Cisco Unity Connection Cluster Configuration and Administration Guide.

Introduction

The Cisco Unity Connection (CUC) 7.0 failover feature--at the database layer--uses IDS' Enterprise Replication (ER). It utilizes a multimaster database replication topology for all databases except for the mailbox databases. To avoid the possibility of a message being received simultaneously on two servers (and one message effectively overwriting the other), only one copy of the mailbox database will be writable at a time. Since the directory database could also encounter such collisions, however, a "last write wins" mechanism is employed.

On top of that, UC has two roles, the Primary and Secondary that are independent of the server's pub/sub status in the cluster. In normal operation, the Publisher should be the Primary and the Subscriber the Secondary. The Primary server is responsible for running services that can only run on one node in a cluster at a time (the so-called "singleton" processes), such as Notifier, MTA, and SysAgent-tasks. The Primary handles all writes to message store database(s) and certain master files such as the encrypt key and certificates are managed in the primary and then replicated to secondary. The benefit is that in this method the system is completely manageable and allows for changes and user updates, even if the Publisher node is offline. Since singleton processes only run on one server, notifications for MWI or other outside numbers as well as outbound VPIM traffic or email notifications will always originate from the Primary node. Inbound VPIM traffic must always reach the Primary server. This is an important consideration for any design.

These roles are managed via the Server Redundancy Manager (SRM). This service, which runs on both nodes, implements the failover logic and figures out which side is the Primary. It maintains a heartbeat between the servers and is responsible for running Split-brain recovery (SBR), which occurs when both servers assumed the Primary role (e.g. they couldn't talk to one another; became Primary; and now they can again). SRM talks to ServM to determine the status of critical processes and requests ServM to start singleton processes after a failover.

Besides database redundancy, Connection uses a file replication mechanism (the "Connection File Syncer" or CuFileSync) to replicate messages and audio files, which are not stored in the database. This is done via SFTP. The following files must be replicated:

Message audio files (/var/opt/cisco/connection/mail/*)
Spoken names/greeting (/var/opt/cisco/connection/lib/*)
Certificates (/var/opt/cisco/connection/security/*)
CuEncrypt key files (/opt/cisco/connection/.security/*)

This document is not intended to go over all aspects of database replication and file synchronization, but rather to look at the operation of the SRM, since it triggers many of these other functions.

UC Failover in Operation

The Connection Server Role Manager (SRM) is the service that runs on both servers that keeps track of the current role of each server. It has a 5-second keepalive mechanism. The communication as per UC7.1.3 is via TCP where the subscriber chooses a random source port and connects to port 22001 of the publisher.

From the Serviceability page, one can access Tools > Cluster Management to view the cluster status or issue the 'show cluster status' command from CLI.

Cluster Status

NOTE: These show commands are taken from UC7.1.3. Prior releases did not indicate the replication status (and it is not possible to easily find the replication status of the CUC databases from the CLI).

admin:show cuc cluster status

Server Name  Member ID  Server State  Internal State  Reason
-----------  ---------  ------------  --------------  ------
cuc1a     0          Primary       Pri Active      Normal
cuc1b     1          Secondary     Sec Active      Normal

SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a  100 Active   Local           0                
g_ciscounity_cuc1b  101 Active   Connected       0 Jul 14 15:13:14

This output is from the "cuc1a" node (note the replication status is "Local"). We see that it is the primary and is the publisher, since it its Member ID is 0. Furthermore, replication is active and it is connected to the sub's database. The cluster status (Primary/Secondary is learned via replication, so if replication is not working, this output may not be correct).

For example, let's say the network between the two servers is interrupted. Now the status may appear as follows. Note that this command what the local server where the command is run is seeing. The status from the other side is learned via replication, so if there's a network disconnect that's obviously not happening.

admin:show cuc cluster status

Server Name  Member ID  Server State  Internal State           Reason
-----------  ---------  ------------  -----------------------  ------
cuc1a     0          Primary       Pri Active Disconnected  Normal
cuc1b     1          Secondary     Sec Active               Normal

SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a  100 Active   Local           0                
g_ciscounity_cuc1b  101 Active   Connecting   2259 Jul 29 10:29:52

This output is from the publisher again. Note that its internal state is Primary Active Disconnected, but the state about the subscriber is basically incorrect. The replication status is "Connection", so it's clear it isn't getting updates. When we look at the same command on the secondary, we see:

Server Name  Member ID  Server State  Internal State                Reason
-----------  ---------  ------------  ----------------------------  ------
cuc1a     0          Primary       Pri Active                    Normal
cuc1b     1          Primary       Sec Act Primary Disconnected  Normal

SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a  100 Active   Connecting   2938 Jul 29 10:29:59
g_ciscounity_cuc1b  101 Active   Local           0

Both servers show the Primary status.

Once things come back, one may initially see the following on the Primary:

admin:show cuc cluster status

Server Name  Member ID  Server State            Internal State  Reason
-----------  ---------  ----------------------  --------------  ------
cuc1a     0          Split Brain Resolution  Pri SBR         Normal
cuc1b     1          Secondary               Sec Active      Normal

SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a  100 Active   Local           0                
g_ciscounity_cuc1b  101 Active   Dropped    5554742 Jul 14 15:08:14

There the Pub/Primary has kicked off SBR. Replication is still down, but as things start working, both server's state will become 'Split Brain Resolution'.

admin:show cuc cluster status

Server Name  Member ID  Server State            Internal State  Reason
-----------  ---------  ----------------------  --------------  ------
cuc1a     0          Split Brain Resolution  Pri SBR         Normal
cuc1b     1          Split Brain Resolution  Sec SBR         Normal

SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a  100 Active   Local           0                
g_ciscounity_cuc1b  101 Active   Connected   17905 Jul 15 09:49:58

There are several intermediate states before returning back to normal, however these should be completed fairly quickly.

SRM Traces

First thing to look at when troubleshooting failover-related issues is to look at the SRM logs. The following is a portion of the trace from the Primary/Publisher, which was disconnected from the network:

07/15/2009 10:36:14.268 |18039,,,SRM,3,<Timer-241> Ending session due to missing heartbeats|
...
07/15/2009 10:36:14.322 |15252,,,-1,-1,alarm id 9202|
07/15/2009 10:36:14.322 |GenAlarm: AlarmName = NoConnectionToPeer, DeviceName = cuc1a, AlarmMsg = NoConnectionToPeer
AppID : CuSrm
ClusterID :
NodeID : cuc1a
|
07/15/2009 10:36:14.322 |15252,,,SRM,5,<evt> [9202] Lost communication with the remote server cuc1b.vnt.cisco.com in the cluster. The remote server may be down.|
...
07/15/2009 10:36:14.631 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [disconnect] [PUB_PRIMARY_DISCONNECTED]|
07/15/2009 10:36:14.711 |15252,,,SRM,5,<evt> PUB_PRIMARY_DISCONNECTED - Role_activate Start - PUBLISHER_PRIMARY|
07/15/2009 10:36:14.711 |15252,,,SRM,5,<evt> PUB_PRIMARY_DISCONNECTED - Role_activate Done!! - PUBLISHER_PRIMARY [ 0secs. ]|
07/15/2009 10:36:15.011 |15252,,,SRM,5,<evt> [PUB_PRIMARY_DISCONNECTED] [db_state_change] ignored|
...

On the subscriber it looks like this. Note that it starts the MTA and Notifier services and makes itself primary.

07/15/2009 10:36:14.964 |28031,,,SRM,3,<Timer-9> Ending session due to missing heartbeats|
...
07/15/2009 10:36:14.975 |GenAlarm: AlarmName = NoConnectionToPeer, DeviceName = cuc1b, AlarmMsg = NoConnectionToPeer
AppID : CuSrm
ClusterID :
NodeID : cuc1b
|
...
07/15/2009 10:36:14.975 |14161,,,SRM,5,<evt> [9202] Lost communication with the remote server cuc1a.vnt.cisco.com in the cluster. The remote server may be down.|
07/15/2009 10:36:15.078 |14161,,,SRM,5,<evt> [SUB_SECONDARY] [disconnect] [SUB_PRIMARY_DISCONNECTED]|
07/15/2009 10:36:15.095 |14161,,,SRM,5,<evt> Entering SUB_PRIMARY_DISCONNECTED without token|
07/15/2009 10:36:15.101 |14161,,,SRM,5,<evt> SUB_PRIMARY_DISCONNECTED - Role_activate Start - SUBSCRIBER_PRIMARY|
07/15/2009 10:36:15.267 |13952,,,SRM,5,<dbmon> Cluster information has been changed in the database.  ClusterMasterObjectId: c435dcac-d7b8-45ba-b879-785db246f060|
07/15/2009 10:36:15.499 |14161,,,SRM,5,<evt> Activate requested for the following service(s): Connection Message Transfer Agent, Connection Notifier, Connection Groupware Caching Service, |
07/15/2009 10:37:05.504 |14161,,,SRM,5,<evt> SUB_PRIMARY_DISCONNECTED - Role_activate Done!! - SUBSCRIBER_PRIMARY [ 50secs. ]|
07/15/2009 10:37:05.504 |14161,,,SRM,5,<evt> [SUB_PRIMARY_DISCONNECTED] [db_state_change] ignored|
...

When the network connectivity is restored, the primary shows the following:

07/15/2009 10:43:14.233 |15254,,,SRM,5,<listener> Received connection from /14.84.158.21:57590|
07/15/2009 10:43:14.864 |29106,,,SRM,5,<sess-241> [rcv] Type: Register ID: c435dcac-d7b8-45ba-b879-785db246f060 HavePrimaryToken: false NeedSbr: true Version: 7.1.2.39000-35|
07/15/2009 10:43:16.396 |29106,,,SRM,5,<sess-241> [snd] Type: Register ID: ece97bfd-f8b1-409a-8dca-4402311555cb HavePrimaryToken: true NeedSbr: false Version: 7.1.2.39000-35|
07/15/2009 10:43:16.396 |15252,,,SRM,5,<evt> [snd] Type: State State: PRIMARY|
07/15/2009 10:43:16.396 |15252,,,SRM,5,<evt> [PUB_PRIMARY_DISCONNECTED] [connect] ignored|
07/15/2009 10:43:16.396 |15252,,,SRM,5,<evt> [PUB_PRIMARY_DISCONNECTED] [sbr_required] [PUB_SBR]|
07/15/2009 10:43:16.397 |15252,,,-1,-1,alarm id 9216|
07/15/2009 10:43:16.397 |15252,,,SRM,5,<evt> [9216] Split-brain resolution procedure has been initiated.|
07/15/2009 10:43:16.398 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: PRIMARY|
07/15/2009 10:43:16.565 |15252,,,SRM,5,<evt> PUB_SBR - Role_activate Start - PUBLISHER_PRIMARY_SBR|
07/15/2009 10:43:16.871 |15252,,,SRM,5,<evt> Deactivate requested for the following service(s): Connection File Syncer, Connection Message Transfer Agent, Connection System Agent, |
07/15/2009 10:44:06.674 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: IN_SBR|
07/15/2009 10:44:06.882 |15252,,,SRM,5,<evt> PUB_SBR - Role_activate Done!! - PUBLISHER_PRIMARY_SBR [ 50secs. ]|
07/15/2009 10:44:06.882 |15252,,,SRM,5,<evt> [snd] Type: State State: IN_SBR|
...
07/15/2009 11:01:33.670 |15048,,,SRM,5,<CM> Command: /opt/cisco/connection/bin/sbrscript execution completed successfully|
07/15/2009 11:01:33.671 |15252,,,SRM,5,<evt> [PUB_SBR] [sbr_done] [PUB_CHOOSE_ROLE]|
07/15/2009 11:01:33.671 |15252,,,-1,-1,alarm id 9217|
07/15/2009 11:01:33.671 |15252,,,SRM,5,<evt> [9217] Split-brain resolution procedure run successfully.|
07/15/2009 11:01:34.409 |15252,,,SRM,5,<evt> [snd] Type: SbrComplete|
07/15/2009 11:01:34.409 |15252,,,SRM,5,<evt> [snd] Type: State State: INITIALIZING|
07/15/2009 11:01:34.409 |15252,,,SRM,5,<evt> [PUB_CHOOSE_ROLE] [go_primary] [PUB_PRIMARY]|
07/15/2009 11:01:34.477 |15252,,,SRM,5,<evt> PUB_PRIMARY - Role_activate Start - PUBLISHER_PRIMARY|
07/15/2009 11:01:34.638 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: INITIALIZING|
07/15/2009 11:01:34.991 |15252,,,SRM,5,<evt> Activate requested for the following service(s): Connection Message Transfer Agent, Connection System Agent, Connection File Syncer, |
07/15/2009 11:02:19.993 |15252,,,SRM,5,<evt> PUB_PRIMARY - Role_activate Done!! - PUBLISHER_PRIMARY [ 45secs. ]|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [snd] Type: State State: PRIMARY|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [peer_initializing] ignored|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:33.172 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:40.232 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: IN_DB_SYNC|
07/15/2009 11:02:40.234 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [peer_in_db_sync] ignored|
07/15/2009 11:02:40.721 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:40.800 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:03:30.832 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: SECONDARY|
07/15/2009 11:03:30.833 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [peer_secondary] [PUB_PRIMARY]|

The Secondary shows the following:

07/15/2009 10:43:14.238 |14162,,,SRM,5,<session> Connection to cuc1a.vnt.cisco.com/14.84.158.20:22001|
07/15/2009 10:43:14.828 |14162,,,SRM,5,<session> [snd] Type: Register ID: c435dcac-d7b8-45ba-b879-785db246f060 HavePrimaryToken: false NeedSbr: true Version: 7.1.2.39000-35|
07/15/2009 10:43:16.402 |14162,,,SRM,5,<session> [rcv] Type: Register ID: ece97bfd-f8b1-409a-8dca-4402311555cb HavePrimaryToken: true NeedSbr: false Version: 7.1.2.39000-35|
07/15/2009 10:43:16.403 |14161,,,SRM,5,<evt> [snd] Type: State State: PRIMARY|
07/15/2009 10:43:16.403 |14162,,,SRM,5,<session> [rcv] Type: State State: PRIMARY|
07/15/2009 10:43:16.403 |14161,,,SRM,5,<evt> [SUB_PRIMARY_DISCONNECTED] [connect] ignored|
07/15/2009 10:43:16.404 |14161,,,SRM,5,<evt> [SUB_PRIMARY_DISCONNECTED] [sbr_required] [SUB_SBR]|
07/15/2009 10:43:16.414 |14161,,,-1,-1,alarm id 9216|
07/15/2009 10:43:16.414 |14161,,,SRM,5,<evt> [9216] Split-brain resolution procedure has been initiated.|
07/15/2009 10:43:16.436 |14161,,,SRM,5,<evt> SUB_SBR - Role_activate Start - SUBSCRIBER_SECONDARY_SBR|
07/15/2009 10:43:16.675 |14161,,,SRM,5,<evt> Deactivate requested for the following service(s): Connection File Syncer, Connection System Agent, Connection Groupware Caching Service, Connection Message Transfer Agent, Connection Notifier, |
07/15/2009 10:44:06.679 |14161,,,SRM,5,<evt> SUB_SBR - Role_activate Done!! - SUBSCRIBER_SECONDARY_SBR [ 50secs. ]|
07/15/2009 10:44:06.679 |14161,,,SRM,5,<evt> [snd] Type: State State: IN_SBR|
07/15/2009 10:44:06.694 |14161,,,-1,-1,alarm id 9224|
07/15/2009 10:44:06.694 |14161,,,SRM,5,<evt> [9224] Regained communication with the remote server cuc1a.vnt.cisco.com in the cluster.|
07/15/2009 10:44:06.694 |14161,,,SRM,5,<evt> [SUB_SBR] [peer_primary] ignored|
07/15/2009 10:44:06.694 |14161,,,SRM,5,<evt> [SUB_SBR] [db_state_change] ignored|
...
07/15/2009 10:44:06.928 |14162,,,SRM,5,<session> [rcv] Type: State State: IN_SBR|
07/15/2009 10:44:06.928 |14161,,,SRM,5,<evt> [SUB_SBR] [peer_in_sbr] ignored|
07/15/2009 10:44:07.090 |14161,,,SRM,5,<evt> [SUB_SBR] [db_state_change] ignored|
07/15/2009 10:44:07.100 |14161,,,SRM,5,<evt> [SUB_SBR] [db_state_change] ignored|
07/15/2009 11:01:34.420 |14162,,,SRM,5,<session> [rcv] Type: SbrComplete|
07/15/2009 11:01:34.420 |14161,,,SRM,5,<evt> [SUB_SBR] [sbr_done] [SUB_CHOOSE_ROLE]|
07/15/2009 11:01:34.421 |14161,,,-1,-1,alarm id 9217|
07/15/2009 11:01:34.421 |14161,,,SRM,5,<evt> [9217] Split-brain resolution procedure run successfully.|
07/15/2009 11:01:34.421 |14162,,,SRM,5,<session> [rcv] Type: State State: INITIALIZING|
07/15/2009 11:01:34.615 |14161,,,SRM,5,<evt> [snd] Type: State State: INITIALIZING|
07/15/2009 11:01:34.616 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [peer_initializing] ignored|
07/15/2009 11:01:34.650 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [db_state_change] ignored|
07/15/2009 11:01:34.663 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [db_state_change] ignored|
07/15/2009 11:01:34.675 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [db_state_change] ignored|
07/15/2009 11:02:20.153 |14162,,,SRM,5,<session> [rcv] Type: State State: PRIMARY|
07/15/2009 11:02:20.174 |14161,,,-1,-1,alarm id 9224|
07/15/2009 11:02:20.174 |14161,,,SRM,5,<evt> [9224] Regained communication with the remote server cuc1a.vnt.cisco.com in the cluster.|
07/15/2009 11:02:20.174 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [peer_primary] [DB_SYNC]|
07/15/2009 11:02:40.199 |14161,,,SRM,5,<evt> [snd] Type: State State: IN_DB_SYNC|
07/15/2009 11:02:40.199 |14161,,,SRM,5,<evt> [DB_SYNC] [db_sync_unnecessary] [SUB_SECONDARY]|
07/15/2009 11:02:40.258 |14161,,,SRM,5,<evt> Cluster information has been changed in the database.  ClusterMasterObjectId: ece97bfd-f8b1-409a-8dca-4402311555cb|
07/15/2009 11:02:40.264 |14161,,,SRM,5,<evt> SUB_SECONDARY - Role_activate Start - SUBSCRIBER_SECONDARY|
07/15/2009 11:02:40.778 |14161,,,SRM,5,<evt> Activate requested for the following service(s): Connection System Agent, Connection File Syncer, |
07/15/2009 11:03:30.784 |14161,,,SRM,5,<evt> SUB_SECONDARY - Role_activate Done!! - SUBSCRIBER_SECONDARY [ 50secs. ]|
07/15/2009 11:03:30.800 |14161,,,SRM,5,<evt> [snd] Type: State State: SECONDARY|
07/15/2009 11:03:30.800 |14161,,,SRM,5,<evt> [SUB_SECONDARY] [db_state_change] [SUB_SECONDARY]|

Split-brain Recovery (SBR)

SBR is the process that runs when both publisher and subscriber have figured out that they are both primary (i.e. they weren't able to communicate to one another). A script is run, /opt/cisco/connection/bin/sbrscript which calls a check-aborted-transaction script followed by another script to resolve the mailbox databases. This whole process can take a few minutes up to an hour. After an hour, the process is interrupted. If the process (due to a defect or some other issue) takes the full hour, you'll see the FsmTimer and the SRM logs will look like this:

07/29/2009 07:15:52.949 |9380,,,SRM,5,Thread=EH  [9216] Split-brain resolution procedure has been initiated.|
...
07/29/2009 08:16:58.318 |9379,,,SRM,5,Thread=FsmTimer  [9217] Split-brain resolution procedure run successfully.|

Note that it's clearly an hour later and the FsmTimer expired. When it works normally, it will appear as "SRM,5,<evt> [9217] Split-brain resolution procedure run successfully."

check-aborted-transactions

The bulk of time spent in SBR is running this script. Logs for this are written to the aborted-transactions-resolution-<date/time> logs (file list activelog cuc/aborted-transactions-resolution*). They do not currently have timestamps for the commands other than the log file name itself. If there are aborted transactions, they are written and compressed to aborted-transactions-<date/time>.tgz. So what does this script do?

It waits for CDR to process its replication queues
Checks for aborted transactions. You'll see in the logs something like "Transactions have been aborted. Processing 959 ATS files."
It tar's and zip's up those aborted transactions and puts them into the aborted-transactions-<date/time>.tgz file.
Next it repairs the databases. This is the longest step and the databases are merged. Upon completion, a "Repair successful" message is logged. This is the end of the script. For each database, the repair
1. disables the database constraints
2. runs a cdr check replset which compares data on the servers and repairs any inconsistencies. For each,you'll get a report showing any inconsistencies.
3. re-enables constraints

This entire process is logged in a file ""aborted-transactions-resolution-yy-mm-ddT<timestamp>""

SBR Errors during mailbox reconciliation

At the end of SBR, the mailbox databases must be consolidated to ensure no messages are lost and they are numbered properly. It follows the following logic for each mailbox database:

Find the time of the last known good communication
Find all mailstores
Unmount the first one
Connect to the mailstore
Renumber messages
Refresh mail counts and mailbox sizes
Remount mailstore
Repeat for other mailbox store databases
Initiate a full MWI resync

Obviously, when the mailbox database is down, a user won't be able to access their messages. Even on large databases, this typically doesn't take longer than about a minute or two. The user will hear "your messages are currently not available" and similar prompts.

If there are errors during the mailbox resolution process, in UC7.1.3 they are written to the mbxsbr.txt file. Errors could potentially look something like this:

Error resolving splitbrain, database: UnityMbxDb1 objectid: dfacd9cf-2fad-4b86-8a04-31daae6b61d8
SQLCODE -458 in EXECUTE:
IX000: Long transaction aborted.
IX000: RSAM error: Long transaction detected.

Error resolving splitbrain, database: UnityMbxDb2 objectid: 09bef2f6-31b7-416f-adbc-89b70b835f54
SQLCODE -387 in CONNECT:
IX000: No connect permission.
IX000: ISAM error:  no record found.

Other Conditions

Replication Queue Filling up

By default, if the servers are disconnected so long that the replication queue grows to 90% of the maximum threshold, then replication is stopped. The SRM diagnostics will show something like:

07/28/2009 23:53:56.159 |10334,,,SRM,3,<Timer-0> Replication queue size: 93.0 has exceeded the maximum threshold value. Stopping replication.|
...
07/28/2009 23:54:04.452 |10333,,,SRM,5,<CM> Command: /opt/cisco/connection/lib/config-modules/dbscripts/repl-helper stop_replication execution completed successfully|
07/28/2009 23:54:06.161 |10334,,,SRM,5,<Timer-0> DB replication has been stopped.|
07/28/2009 23:54:06.224 |10334,,,SRM,5,<Timer-0> [9222] Database replication queue size has exceeded the maximum threshold. Replication between redundant servers has been stopped.|

An alarm is also sent so this condition is visible in RTMT. After this, the replication has to be set up again by accessing the Serviceability page and activating the subscriber again under Tools > Cluster Management or with the utils cuc cluster activate command.

Drives filling up

Any time the drives fill up, it's a bad thing. RTMT generates alerts for this. Also, from the CLI, show diskusage <partition> will also help with identify this issue.

Replication Processes hung or coredump

In some instances, the replication processes themselves may get hung or coredump. A core dump (from CDR, for example), will be in /var/log/active/core, the same place as other cores. It is possible for a cdr process to hang and SRM to continue working fine, so it is often important to look for hung processes (ps auxf | grep cdr will grab cdr-related stuff). If something's been running for ages, then chances are, it's hung.

utils cuc cluster CLI commands

utils cuc cluster activate - described earlier
utils cuc cluster deactivate - disables replication and forces node to deactivated. Sometimes used when replacing the publisher.
utils cuc cluster makeprimary - run from the node that is currently the primary. Makes the other server the primary.
utils cuc cluster overwritedb - copies data from the publisher to the subscriber
utils cuc cluster renegotiate - used when a publisher is being replaced to join the new publisher to the cluster and then copy over the database from the subscriber to the publisher

RTMT

Alarms

In many cases, the only indication that something went wrong is when RTMT generates an alarm on on each node ("No Connection To Peer") from CuSrm (obviously the failed node may not be in a position to report a failure). There should also be a system-level ServerDown alarm. The subscriber will also indicate an AutoFailoverSucceeded alarm. Keep in mind that if the nodes are not communicating with one another, then RTMT traffic will be affected, too. So often only information from the node being monitored will be visible. There is currently no alarm that SBR has initiated or completed.

Performance Counters

The Replication state for UC databases is not tracked in RTMT. Currently the counter (""Number of Replicates Created and State of ReplicationCount)\Replicate_State"") exists, however it refers to the CUCM database and therefore not relevant to CUC. There is also no way to directly see the number of messages in the drop folder (which is where messages get queued before entering the MTA queue to be processed), although one can monitor \CUC Message Store\Queued Messages Current on the primary. This is the queue where messages are moved to when they're ready to be delivered (and is drained at 100 messages/minute). So when the drop folder is full, this queue will stay around 300 (its max) for the period. Once the drop folder is cleared, then this will drop down to typically some small number, depending on the load on the system.

Mark Turpin · ‎07-09-2012

It should be noted that for this command:

utils cuc cluster overwritedb - copies data from the publisher to the subscriber

It should be executed on the subscriber in order to copy the data from the publisher to the subscriber.

Cisco Unity Connection Clustering

Cisco Unity Connection Cluster Operation and Troubleshooting

Introduction

UC Failover in Operation

Cluster Status

SRM Traces

Split-brain Recovery (SBR)

check-aborted-transactions

SBR Errors during mailbox reconciliation

Other Conditions

Replication Queue Filling up

Drives filling up

Replication Processes hung or coredump

utils cuc cluster CLI commands

RTMT

Alarms

Performance Counters