12-06-2009 06:30 PM - edited 03-12-2019 09:24 AM
Steps to Troubleshoot CUCM Database Replication Problems in 5.x and 6.x
(written by Bill Benninghoff, heavily borrowing from material written by Laurie Dotter and Nancy Balsbaugh)
If installation has proceeded correctly, then the Informix CDR service should be running on the publisher and on each sub in the cluster. “CDR” in this context means “Continuous Data Replication”, not call detail records.
To set up replication, scripts run during the install process and do these things:
a. define replication on the pub
b. define the template on the pub and realize it (tells pub what to replicate)
c. define replication for each sub
d. realize the template on each sub
e. synch the data between the pub and subs using “cdr sync” or “cdr check”
It is possible that this process broke down at one of the steps.
If you look at the RTMT replication counter and see that the replication state counter is 3 or 4 for a given server, that means replication has failed for that server.
Here are some suggested steps to troubleshoot replication.
Run “utils dbreplication status” (described in the command list at the end of this article). This will generate an output file. Study the file to see if replication is set up for each server and if the data is in sync among the servers.
For example:
SERVER                 ID STATE    STATUS      QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm         2 Active   Local          0
g_bldr_ccm5_ccm         3 Active   Connected      0 Sep  6 16:27:15
This section above means that replication is working on the pub (local) and on the sub (connected).
Node                  Rows     Extra   Missing  Mismatch Processed
---------------- --------- --------- --------- --------- ---------
g_bldr_ccm4_ccm          0         0         0         0         0
g_bldr_ccm5_ccm          0         0         0         0         0
This section above means that there are no rows missing between the databases on the two servers. They are in perfect synch.
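As a rough illustration, the row-count section can also be checked mechanically. The sketch below is not part of any Cisco tool. It assumes the section has been copied out of the output file as plain text with whitespace-separated columns, and the function name out_of_sync_nodes is made up for this example.

```python
def out_of_sync_nodes(report):
    """Return node names whose Extra, Missing, or Mismatch count is non-zero."""
    bad = []
    for line in report.splitlines():
        fields = line.split()
        # Data rows have exactly six fields: the node name plus five integer counters.
        # The header and the dashed separator fail the isdigit() test and are skipped.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            continue
        node, rows, extra, missing, mismatch, processed = fields
        if int(extra) or int(missing) or int(mismatch):
            bad.append(node)
    return bad


sample = """\
Node Rows Extra Missing Mismatch Processed
---------------- --------- --------- --------- --------- ---------
g_bldr_ccm4_ccm 0 0 0 0 0
g_bldr_ccm5_ccm 0 0 0 0 0
"""

print(out_of_sync_nodes(sample))  # prints []
```

An empty list corresponds to the "perfect synch" case shown above; any name in the list is a candidate for a repair.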
Use the “utils dbreplication repair” command if replication is set up but some tables are out of sync. If only one sub is out of sync, you can run “utils dbreplication repair <nodename>” for that one node; otherwise use “utils dbreplication repair all” to fix it for all nodes.
Here is an example of a problem with replication from the output file:
SERVER                 ID STATE    STATUS      QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm         2 Active   Local          0
g_bldr_ccm5_ccm         3 Active   Dropped      636 Sep 11 14:01:20
If you see that a server’s status is “Dropped” or “Quiescent”, or the server is missing from the table altogether, then you will need to troubleshoot the network connection between the pub and subs.
Another useful diagnostic command is “cdr list serv”. You have to be root to run this command. You can run it on the pub and on each sub to show which servers have been defined from the perspective of the server you are on, and what state those defined servers are in. Here is an example of the output of that command:
SERVER                 ID STATE    STATUS      QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_acopup01_ccm          2 Active   Local          0
g_acopus01_ccm          3 Active   Connected      0 Dec  6 19:31:46
g_acopus02_ccm          4 Active   Connected      0 Dec  6 19:31:47
g_acopus03_ccm          5 Active   Connected      0 Dec  6 19:31:47
g_acoput01_ccm          6 Active   Connected      0 Dec  6 19:31:46
g_erepus04_ccm          7 Active   Connected      0 Dec 14 11:08:53
g_erepus05_ccm          8 Active   Connected      0 Dec  6 19:31:46
g_erepus06_ccm          9 Active   Connected      0 Dec  6 19:31:47
g_ereput02_ccm         10 Active   Connected      0 Dec  6 19:31:46
g_londos07_ccm         11 Active   Connected      0 Dec  6 19:31:47
A state of “Active” with a status of “Local” or “Connected” is good. A bad status would be “Dropped” or “Quiescent”. If the server is missing from this list, then it is not yet defined.
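The same reading of the table can be sketched in code. This is only an illustration, assuming the “cdr list serv” output has been saved as text and that its columns are whitespace-separated as in the examples above. The helper names and the third server, g_bldr_ccm6_ccm, are hypothetical and used only to show the "missing from the list" case.

```python
GOOD_STATUSES = {"Local", "Connected"}


def parse_cdr_list_serv(output):
    """Map server name -> (state, status) for each data row of the table."""
    servers = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows: name, numeric ID, state, status, queue, optional timestamp.
        # The header ("ID" is not a number) and the dashed line are skipped.
        if len(fields) >= 5 and fields[1].isdigit():
            servers[fields[0]] = (fields[2], fields[3])
    return servers


def find_problems(output, expected_servers):
    """Return {server: reason} for servers that are missing or not healthy."""
    servers = parse_cdr_list_serv(output)
    problems = {}
    for name in expected_servers:
        if name not in servers:
            problems[name] = "not defined"
            continue
        state, status = servers[name]
        if state != "Active" or status not in GOOD_STATUSES:
            problems[name] = f"{state}/{status}"
    return problems


sample = """\
SERVER ID STATE STATUS QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm 2 Active Local 0
g_bldr_ccm5_ccm 3 Active Dropped 636 Sep 11 14:01:20
"""

# g_bldr_ccm6_ccm is a made-up name for a server that was never defined.
expected = ["g_bldr_ccm4_ccm", "g_bldr_ccm5_ccm", "g_bldr_ccm6_ccm"]
print(find_problems(sample, expected))
# prints {'g_bldr_ccm5_ccm': 'Active/Dropped', 'g_bldr_ccm6_ccm': 'not defined'}
```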
ping <pub name> -s 1500
ping <sub name> -s 1500
CLM is responsible for adding hosts to the iptables rules. CLM on the sub and CLM on the pub exchange handshakes. CLM on the pub then puts the sub into the policy injected state and adds the host to the iptables rules, which allows replication to work. So, if iptables is blocking replication, the CLM instances are not talking to each other. CLM communicates over 8500/udp, often with large packets that get fragmented. If PMTU discovery is broken (i.e., ICMP packets are dropped or not sent) or fragments are not allowed through the network, then CLM cannot communicate, iptables is not opened, and as a result replication does not work.
In the CLM logs on the pub, look for entries about communications with the sub, most importantly one saying that the sub was put into the policy injected state.
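To see why a large ping is a meaningful test for this problem, it helps to do the packet-size arithmetic. The sketch below assumes IPv4 without options, a 1500-byte path MTU, and that -s sets the ICMP payload size (as it does for the Linux ping command). The 2000-byte CLM datagram is a made-up figure for illustration only.

```python
IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes
UDP_HEADER = 8    # bytes


def needs_fragmentation(payload, l4_header, mtu=1500):
    """True if IP header + transport header + payload exceeds the MTU."""
    return IP_HEADER + l4_header + payload > mtu


# "ping -s 1500": 20 + 8 + 1500 = 1528 bytes on the wire, so it must be fragmented.
print(needs_fragmentation(1500, ICMP_HEADER))  # prints True

# The largest ICMP payload that fits unfragmented is 1500 - 20 - 8 = 1472 bytes.
print(needs_fragmentation(1472, ICMP_HEADER))  # prints False

# A hypothetical 2000-byte CLM datagram on 8500/udp would also be fragmented.
print(needs_fragmentation(2000, UDP_HEADER))   # prints True
```

So if the 1500-byte ping fails while a small ping succeeds, fragments are probably being dropped somewhere on the path, which is the same condition that stops CLM from communicating.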
dbl rpchello <pub name>
If that command returns an error, then check whether the dbl rpc service is running by doing this:
ps -ef | grep dblrpc
If you don’t see anything, that is a problem. Dblrpc must be running on the sub in order for replication to be set up. Once replication is established, dblrpc no longer needs to run.
To start up the dbl rpc service on the sub do this as root:
controlcenter.sh "A Cisco DB Replicator" start
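One caveat when reading the “ps -ef | grep dblrpc” output: the grep process itself usually appears in the list, so a single matching line does not prove the service is up. The sketch below shows that check on captured text. The function name and the sample ps lines (users, PIDs, times) are made up for illustration and are not real CUCM output.

```python
def process_running(ps_output, name):
    """True if some line other than the grep command itself mentions name."""
    for line in ps_output.splitlines():
        if name in line and "grep" not in line:
            return True
    return False


# Made-up "ps -ef" lines for illustration only.
only_grep = "root 5150 5100 0 10:05 pts/0 00:00:00 grep dblrpc\n"
with_service = (
    "database 4242 1 0 10:00 ? 00:00:01 dblrpc\n"
    "root 5150 5100 0 10:05 pts/0 00:00:00 grep dblrpc\n"
)

print(process_running(only_grep, "dblrpc"))     # prints False
print(process_running(with_service, "dblrpc"))  # prints True
```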
/etc/init.d/iptables stop
dbaccess
This runs a program in which you can select “connect” and try to connect to the Informix database on each server in the cluster. If you are able to connect from the sub to the pub, then the network connection is good and your problem is something else.
1. /etc/hosts
2. /etc/services (very bottom of the file)
3. /home/informix/.rhosts
4. /usr/local/active/cm/db/informix/etc/sqlhosts
a. on the sub run this as admin: utils dbreplication stop
b. on the pub run this as admin: utils dbreplication stop
c. on the pub run this as admin: utils dbreplication reset <name of sub that is not working>
/var/log/active/cm/trace/dbl
Run this command:
ls -lrt
This will list all the files in the directory, with the most recent files at the bottom. Scan that list to see if there is a file with the word “define” in the filename along with the name of the server or servers that are having trouble.
For example:
-rw-rw-rw- 1 root root 9392 Sep 15 15:55 dbl_repl_cdr_define_nw104a_202-2007_09_15_15_53_27.log
If this file shows up it is a good sign. The system is attempting to define replication for that server. Open the file and make sure there are no errors in the file.
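The directory scan can be sketched as follows. This is a minimal illustration, not a Cisco utility: the function names are invented, and the case-insensitive search for the word "error" is only an assumption about how problems show up in these logs, so a clean result here does not replace reading the file.

```python
from pathlib import Path


def find_define_logs(trace_dir, server_name):
    """Return define logs that mention server_name, oldest first (like ls -lrt)."""
    matches = [
        p for p in Path(trace_dir).iterdir()
        if p.is_file() and "define" in p.name and server_name in p.name
    ]
    return sorted(matches, key=lambda p: p.stat().st_mtime)


def lines_mentioning_error(log_path):
    """Return lines containing 'error' in any letter case (a naive heuristic)."""
    with open(log_path, errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if "error" in line.lower()]


if __name__ == "__main__":
    trace_dir = Path("/var/log/active/cm/trace/dbl")
    if trace_dir.is_dir():  # only meaningful when run on the server itself
        for log in find_define_logs(trace_dir, "nw104a"):
            print(log.name, lines_mentioning_error(log))
```

The most recent define attempt for the server is the last entry in the returned list.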
onstat -m
This will display the tail of the ccm.log. This is the main Informix replication log, which shows what is currently happening with replication. Look for possible errors in that file.
To see the CLI commands, type:
admin: utils dbreplication ?
The four most commonly used commands are:
utils dbreplication status
(checks each table on all servers to see whether any tables are out of sync)
utils dbreplication repair
(use to sync tables, run it on the pub. Can be run for a sub, or all.
“utils dbreplication repair <nodename>” will sync one sub.
“utils dbreplication repair all” will sync all.)
utils dbreplication stop
(use this command on the sub and the pub before a “utils dbreplication reset”.
Run “stop” locally on the sub, then on the pub. If you are going to reset all, run stop on each sub, then on the pub)
utils dbreplication reset
(use to restart replication on one sub or all nodes.
Use “utils dbreplication reset <nodename>” to reset replication on one sub.
Use “utils dbreplication reset all” to reset replication on pub and all subs.)
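The ordering of stop and reset matters, so here is a small sketch that lays out the sequence as a checklist. It only builds a list of (node, command) pairs and does not execute anything. The node names are hypothetical, and running “reset all” from the pub is an assumption carried over from the single-sub steps earlier in this article.

```python
STOP = "utils dbreplication stop"


def reset_plan(pub, subs, target="all"):
    """Return ordered (node, command) pairs: stop on sub(s), stop on pub, reset on pub."""
    if target != "all" and target not in subs:
        raise ValueError(f"{target!r} is not one of the subs")
    to_stop = list(subs) if target == "all" else [target]
    plan = [(sub, STOP) for sub in to_stop]   # subs first
    plan.append((pub, STOP))                  # then the pub
    plan.append((pub, f"utils dbreplication reset {target}"))
    return plan


# Hypothetical node names, for illustration only.
for node, command in reset_plan("pub1", ["sub1", "sub2"], target="sub2"):
    print(f"{node}: {command}")
# prints:
# sub2: utils dbreplication stop
# pub1: utils dbreplication stop
# pub1: utils dbreplication reset sub2
```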
I like this article; however, I have managed to become 'root' on my CUCM only in my lab. The Cisco support agreement probably does not allow editing /etc/shadow, passwd, securetty files, etc. on production (that is, supported) systems.
But thanks anyway, nice entry!
Excellent document! Thank you
Hi vivkalra,
Would you know whether the same process and explanation apply to CUCM 8.6 and 9.x?
Best regards,
Daniel
Hello,
Thank you for this article, but it does not solve all of my client's issues, as some commands must be run as root:
- show tech dbstateinfo
- cdr list serv
- dbl rpchello 172.20.0.1
But we don't have access to root account: apparently, root access to CUCM Publisher is reserved to Cisco TAC Agents.
Could you please help me?
Clément DESHAYES
Unified Communication Engineer
Dimension Data Belgium
Tel: +32 (0) 2 745 04 56
clement.deshayes_AT_dimensiondata.com