07-25-2012 07:29 AM - edited 03-16-2019 12:22 PM
Hi guys,
I have a problem with my CUCM cluster. It is an 8.6 cluster with a Pub and one Sub, clustered over the WAN (a VPN between the two locations). Until a couple of days ago everything was fine. Then the router that was taking care of the connection between the two locations stopped working and was replaced about an hour later with another one.
Since then the cluster has had some big problems. The two servers are able to ping each other, but it seems that the Pub cannot connect to the Sub DB anymore.
In the attached doc you can find more info; I have extracted some of the data below:
PUBLISHER:
admin:
admin:show tech network hosts
-------------------- show platform network --------------------
/etc/hosts File:
#This file was generated by the /etc/hosts cluster manager.
#It is automatically updated as nodes are added, changed, removed from the cluster.
127.0.0.1 localhost
::1 localhost
10.1.55.5 CMP
10.1.157.6 CMS
admin:utils diagnose module validate_network
Log file: platform/log/diag2.log
Starting diagnostic test(s)
===========================
test - validate_network : Passed
Diagnostics Completed
admin:show network cluster
10.1.55.5 cmp Publisher not authenticated - INITIATOR since Wed Jul 25 15:47:18 2012
10.1.157.6 cms Subscriber authenticated
admin:utils dbreplication runtimestate
DB and Replication Services: ALL RUNNING
Cluster Replication State: Only available on the PUB
DB Version: ccm8_6_1_20000_1
Number of replicated tables: 0
Cluster Detailed View from SUB (2 Servers):
                            PING          REPLICATION  REPL.  DBver&  REPL.  REPLICATION SETUP
SERVER-NAME   IP ADDRESS    (msec)  RPC?  STATUS       QUEUE  TABLES  LOOP?  (RTMT)
-----------   ------------  ------  ----  -----------  -----  ------- -----  -----------------
CMP           10.1.55.5     1.35    No    Off-Line     N/A    ?       No     (?)
CMS           10.1.157.6    0.048   Yes   Off-Line     N/A    match   Yes    (0)
Unified CM Database Access
Local and publisher databases accessible. |
Server | Publisher DB Reachable | Local DB Reachable |
10.1.55.5 | true | true |
10.1.157.6 | true | true |
Unified CM Database Status
RTMT Counter Information |
Connection to RTMT on 10.1.157.6 could not be established |
All servers have a replication count of 541. |
Not all servers have a good replication status. See the details. |
Server | Number of Replicates Created | Replicate_State |
10.1.55.5 | 541 | 3 - bad |
10.1.157.6 | N/A | N/A |
See also Database Summary Screen in RTMT. |
Run CLI command (show tech dbstateinfo) for more detail. |
Replication Server List (cdr list serv) from every server for debugging purposes only. |
Server | cdr list serv |
10.1.55.5 | SERVER ID STATE STATUS QUEUE CONNECTION CHANGED ----------------------------------------------------------------------- g_cmp_ccm8_6_1_20000_1 2 Active Local 0 |
10.1.157.6 | |
Replication Server Template (cdr list template) from every server for debugging purposes only. |
Database Prefs File |
Unified CM Hosts
All servers have equivalent host files |
Server | Host Information |
10.1.55.5 | #This file was generated by the /etc/hosts cluster manager. #It is automatically updated as nodes are added, changed, removed from the cluster. 127.0.0.1 localhost ::1 localhost 10.1.55.5 CMP 10.1.157.6 CMS |
10.1.157.6 | #This file was generated by the /etc/hosts cluster manager. #It is automatically updated as nodes are added, changed, removed from the cluster. 127.0.0.1 localhost ::1 localhost 10.1.55.5 CMP 10.1.157.6 CMS |
Unified CM Rhosts
All servers have equivalent rhosts files. |
Server | rhosts File |
10.1.55.5 | localhost CMS CMP |
10.1.157.6 | localhost CMP CMS |
Unified CM Sqlhosts
All servers have equivalent sqlhosts files. |
Server | sqlhosts File |
10.1.55.5 | g_hdr group - - i=1 g_cmp_ccm8_6_1_20000_1 group - - i=2 cmp_ccm8_6_1_20000_1 onsoctcp 10.1.55.5 cmp_ccm8_6_1_20000_1 g=g_cmp_ccm8_6_1_20000_1 b=32767,rto=300 g_cms_ccm8_6_1_20000_1 group - - i=3 cms_ccm8_6_1_20000_1 onsoctcp 10.1.157.6 cms_ccm8_6_1_20000_1 g=g_cms_ccm8_6_1_20000_1 b=32767,rto=300 ###NOTE: Need to use ipv4 address in host column of sqlhosts file and not hostname cmp_car8_6_1_20000_1 onsoctcp 10.1.55.5 cmp_car8_6_1_20000_1 b=32767 |
10.1.157.6 | g_hdr group - - i=1 g_cmp_ccm8_6_1_20000_1 group - - i=2 cmp_ccm8_6_1_20000_1 onsoctcp 10.1.55.5 cmp_ccm8_6_1_20000_1 g=g_cmp_ccm8_6_1_20000_1 b=32767,rto=300 g_cms_ccm8_6_1_20000_1 group - - i=3 cms_ccm8_6_1_20000_1 onsoctcp 10.1.157.6 cms_ccm8_6_1_20000_1 g=g_cms_ccm8_6_1_20000_1 b=32767,rto=300 |
Please help. I read a lot about the subject, but could not reach a solution.
Thank you.
07-25-2012 08:03 AM
I don't think splitting a cluster across a VPN is supported, so I would definitely try to mitigate that.
Here is a great doc on troubleshooting and resolving DB replication issues:
https://supportforums.cisco.com/docs/DOC-13672
HTH,
Chris
07-26-2012 12:09 AM
Hi,
I read the document and tried the solutions suggested there, but they did not fix my problem.
In my case the Pub can't connect to the Sub, although it seems to know the Sub exists and can ping it.
Pub cli:
admin:utils dbreplication runtimestate
DB and Replication Services: ALL RUNNING
Cluster Replication State: REPLICATION RESET all Started at 2012-07-25-16-41
DB Version: ccm8_6_1_20000_1
Number of replicated tables: 541
Cluster Detailed View from PUB (2 Servers):
                            PING          REPLICATION  REPL.  DBver&  REPL.  REPLICATION SETUP
SERVER-NAME   IP ADDRESS    (msec)  RPC?  STATUS       QUEUE  TABLES  LOOP?  (RTMT) & details
-----------   ------------  ------  ----  -----------  -----  ------- -----  -----------------
CMP           10.1.55.5     0.044   Yes   Connected    0      match   Yes    (3) PUB
CMS           10.1.157.6    1.28    No    Off-Line     N/A    ?       No     (?) N/A
admin:show network cluster
10.1.157.6 cms Subscriber not authenticated - INITIATOR since Wed Jul 25 15:51:16 2012
10.1.55.5 cmp Publisher authenticated
admin:run sql select name,node id from process node
The specified table (process) is not in the database.
Similar output on the Sub.
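For anyone comparing several of these runtimestate dumps, the per-server rows can be checked mechanically. The following is only a rough sketch: it assumes the column layout shown in this thread, and it assumes that a healthy node shows RPC "Yes" and setup state (2), which is what the working cluster later in this thread displays. The names `ROW`, `parse_rows` and `unhealthy` are made up for illustration and are not part of any Cisco tool.

```python
import re

# One server row of "utils dbreplication runtimestate", as pasted in this thread:
# name, IP, ping, RPC?, status, queue, tables, loop?, (RTMT state), optional details.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s+(?P<ping>\S+)\s+"
    r"(?P<rpc>Yes|No)\s+(?P<status>\S+)\s+(?P<queue>\S+)\s+"
    r"(?P<tables>\S+)\s+(?P<loop>Yes|No)\s+\((?P<state>[^)]*)\)\s*(?P<details>.*)$"
)


def parse_rows(text):
    """Return one dict per server row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def unhealthy(rows):
    """Names of servers whose RPC is not 'Yes' or whose setup state is not 2 (assumed good)."""
    return [r["name"] for r in rows if r["rpc"] != "Yes" or r["state"] != "2"]


# Rows copied from the publisher output above.
SAMPLE = """\
CMP 10.1.55.5 0.044 Yes Connected 0 match Yes (3) PUB
CMS 10.1.157.6 1.28 No Off-Line N/A ? No (?) N/A
"""

if __name__ == "__main__":
    print(unhealthy(parse_rows(SAMPLE)))
```

With the rows from the publisher output, both servers are flagged: the Pub because its setup state is 3, the Sub because RPC is "No" and its state is unknown.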
The cluster used to work over this WAN connection (it is actually a tunnel with crypto maps). Could this be the problem? Could the Pub be unable to connect to the Sub because of some network problem, even though utils diagnose module validate_network says it is OK?
I am thinking about deleting the subscriber and then adding it again. Is there a procedure for that?
Thank you for your help.
Silviu.
11-12-2012 06:59 PM
Hi ronin,
Were you able to resolve this issue? I am having the same problem in the same scenario: my WAN connection goes through an IPsec VPN (ASA 8.2.5 to ASA 8.2.5), and my CUCM cluster shows the same errors in the dbreplication runtimestate output and in all the other commands.
Please, I need urgent help.
11-13-2012 01:57 AM
As I recall, I did nothing to the CUCMs. The problem was the network.
The packets sent by the CUCM are too large (close to 1500 bytes), so once the IPsec overhead is added the network needs to fragment them, but you probably do not have the DF bit set to 0.
I see three solutions to your problem (see if one matches your requirements):
1) Stop using IPsec. Without the added overhead the packets no longer need to be fragmented, so the CUCMs will see each other fine.
2) Fragment the packets and adjust the MSS to a lower value:
route-map DF-BIT permit 10
 match ip address 1455
 set ip df 0
interface GigabitEthernet0/1.157
 description == The Subinterface on which you can find the Sub which is in a different location than the Pub ==
 encapsulation dot1Q 157
 ip address x y
 ip tcp adjust-mss 1300
 ip policy route-map DF-BIT
3) When you install a CUCM, the installer asks you at a certain point to set the MTU, so set it to a lower value (this means you must reinstall the Sub that is causing problems). I think this is the recommended path.
This sounds like trial and error, so my recommendation is to involve TAC; I am sure they have hit this problem many times and will probably give you a fast and correct answer.
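To make the arithmetic behind options 2 and 3 a bit more concrete, here is a minimal sketch. The 60-byte tunnel overhead is only an assumed, illustrative figure (the real IPsec overhead depends on the mode and ciphers in use), and the function names are made up for this example.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def needs_fragmentation(packet_len: int, link_mtu: int, tunnel_overhead: int) -> bool:
    """True if an IP packet no longer fits the link MTU once the tunnel overhead is added."""
    return packet_len + tunnel_overhead > link_mtu


def largest_safe_mss(link_mtu: int, tunnel_overhead: int) -> int:
    """Largest TCP MSS whose full packet still fits the link MTU after encapsulation."""
    return link_mtu - tunnel_overhead - IPV4_HEADER - TCP_HEADER


if __name__ == "__main__":
    LINK_MTU = 1500        # typical Ethernet MTU
    ASSUMED_OVERHEAD = 60  # hypothetical IPsec overhead, for illustration only

    # A full-size 1500-byte packet from the CUCM does not fit once encapsulated.
    print(needs_fragmentation(1500, LINK_MTU, ASSUMED_OVERHEAD))

    # Upper bound for the MSS under this assumption.
    print(largest_safe_mss(LINK_MTU, ASSUMED_OVERHEAD))

    # The adjust-mss value of 1300 from the config above stays below that bound.
    print(1300 <= largest_safe_mss(LINK_MTU, ASSUMED_OVERHEAD))
```

Under this assumption the bound works out to 1400 bytes, so an MSS of 1300 leaves some margin. If packets arrive with the DF bit set and are larger than what fits, they are not fragmented, which is why clearing DF (option 2) or lowering the MTU on the server (option 3) both get around the problem.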
Hope this helps,
Silviu.
11-05-2012 05:01 AM
Hello all,
Please take a look at the blog and video I posted on dbreplication runtimestate:
Please feel free to provide your feedback and any additional questions you may have on this topic.
Thanks,
Harmit.
05-28-2013 06:28 AM
Hey guys,
I have faced the same problem in my cluster. The only difference is that replication shows as good in the reports.
However, the following message appears in the sqlhosts file:
###NOTE: Need to use ipv4 address in host column of sqlhosts file and not hostname
admin:show tech network hosts
-------------------- show platform network --------------------
/etc/hosts File:
#This file was generated by the /etc/hosts cluster manager.
#It is automatically updated as nodes are added, changed, removed from the cluster.
127.0.0.1 localhost
::1 localhost
10.0.55.194 BRDC1CUP0003
10.0.55.196 BRDC1CUP0001.name BRDC1CUP0001
10.0.55.197 BRDC1CUP0002
10.0.55.195 BRDC1CUP0004
admin:show network cluster
10.0.55.194 brdc1cup0003 Publisher authenticated
10.0.55.196 brdc1cup0001.name brdc1cup0001 Subscriber authenticated using UDP since Tue May 28 09:43:14 2013
10.0.55.197 brdc1cup0002 Subscriber authenticated using UDP since Tue May 28 09:43:14 2013
10.0.55.195 brdc1cup0004 Subscriber authenticated using TCP since Tue May 28 09:49:28 2013
admin:utils dbreplication runtimestate
DB and Replication Services: ALL RUNNING
Cluster Replication State: Replication status command started at: 2013-05-28-10-11
Replication status command COMPLETED 541 tables checked out of 541
No Errors or Mismatches found.
Use 'file view activelog cm/trace/dbl/sdi/ReplicationStatus.2013_05_28_10_11_29.out' to see the details
DB Version: ccm8_6_2_20000_2
Number of replicated tables: 541
Cluster Detailed View from PUB (2 Servers):
                            PING          REPLICATION  REPL.  DBver&  REPL.  REPLICATION SETUP
SERVER-NAME   IP ADDRESS    (msec)  RPC?  STATUS       QUEUE  TABLES  LOOP?  (RTMT) & details
-----------   ------------  ------  ----  -----------  -----  ------- -----  -----------------
BRDC1CUP0003  10.0.55.194   0.075   Yes   Connected    0      match   Yes    (2) PUB Setup Completed
BRDC1CUP0004  10.0.55.195   0.269   Yes   Connected    0      match   Yes    (2) Setup Completed
Server | cdr list serv |
---|---|
10.0.55.194 | SERVER ID STATE STATUS QUEUE CONNECTION CHANGED ----------------------------------------------------------------------- g_brdc1cup0003_ccm8_6_2_20000_2 2 Active Local 0 g_brdc1cup0004_ccm8_6_2_20000_2 3 Active Connected 0 May 28 09:49:30 |
10.0.55.195 | SERVER ID STATE STATUS QUEUE CONNECTION CHANGED ----------------------------------------------------------------------- g_brdc1cup0003_ccm8_6_2_20000_2 2 Active Connected 0 May 28 09:49:30 g_brdc1cup0004_ccm8_6_2_20000_2 3 Active Local 0 |
Server | sqlhosts File |
---|---|
10.0.55.194 | g_hdr group - - i=1 g_brdc1cup0003_ccm8_6_2_20000_2 group - - i=2 brdc1cup0003_ccm8_6_2_20000_2 onsoctcp 10.0.55.194 brdc1cup0003_ccm8_6_2_20000_2 g=g_brdc1cup0003_ccm8_6_2_20000_2 b=32767,rto=300 g_brdc1cup0004_ccm8_6_2_20000_2 group - - i=3 brdc1cup0004_ccm8_6_2_20000_2 onsoctcp 10.0.55.195 brdc1cup0004_ccm8_6_2_20000_2 g=g_brdc1cup0004_ccm8_6_2_20000_2 b=32767,rto=300 ###NOTE: Need to use ipv4 address in host column of sqlhosts file and not hostname brdc1cup0003_car8_6_2_20000_2 onsoctcp 10.0.55.194 brdc1cup0003_car8_6_2_20000_2 b=32767 |
10.0.55.195 | g_hdr group - - i=1 g_brdc1cup0003_ccm8_6_2_20000_2 group - - i=2 brdc1cup0003_ccm8_6_2_20000_2 onsoctcp 10.0.55.194 brdc1cup0003_ccm8_6_2_20000_2 g=g_brdc1cup0003_ccm8_6_2_20000_2 b=32767,rto=300 g_brdc1cup0004_ccm8_6_2_20000_2 group - - i=3 brdc1cup0004_ccm8_6_2_20000_2 onsoctcp 10.0.55.195 brdc1cup0004_ccm8_6_2_20000_2 g=g_brdc1cup0004_ccm8_6_2_20000_2 b=32767,rto=300 |