06-28-2017 12:10 AM - edited 03-17-2019 10:39 AM
Hello
Has anyone come across a similar issue and if so what was the fix?
Currently I have a cluster (version 10.5.2) that shows the runtime state on all subscribers as "Syncing...".
When I look at the Cisco Unified Reporting pages I see all the subs as Initializing. However, normal functionality is fine and backups also run without issue. When I run a status command on dbreplication I can see everything connected.
I have gone through troubleshooting dbreplication with a repair, etc., as well as restarting all the servers. I am unsure where to go next in troubleshooting, other than upgrading the cluster. It seems very strange, as it is having no impact on daily use.
The only anomaly I see is that NTP is at stratum 5, and that does give me a message saying it is not recommended.
Thank you all
FF
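(For anyone reading this later: the number in parentheses in the `utils dbreplication runtimestate` output is the RTMT "Replication Setup" state. A minimal sketch of the documented code-to-meaning mapping; the helper name below is mine, not a Cisco tool:)

```python
# RTMT "Replication Setup" state codes as documented for CUCM dbreplication.
# State 2 is the only healthy state; 0 means setup is in progress (or stuck).
REPL_STATES = {
    0: "Initializing -- replication setup in progress or not working",
    1: "Number of replicates is not correct",
    2: "Replication is good",
    3: "Tables are suspect -- replication is bad",
    4: "Setup failed / replication not set up",
}

def describe_repl_state(code: int) -> str:
    """Translate an RTMT replication state code into a description."""
    return REPL_STATES.get(code, "Unknown state code")
```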
06-28-2017 12:50 AM
Hi FF,
You would have to fix NTP first; try keeping the stratum below 3. Can you attach the output of the commands below?
utils diagnose test
utils ntp status
JB
07-13-2017 04:20 AM
Hi JB
We have had a network outage since we last spoke, and interestingly I am now seeing a change for my local cluster:
PING DB/RPC/ REPL. Replication REPLICATION SETUP
SERVER-NAME IP ADDRESS (msec) DbMon? QUEUE Group ID (RTMT) & Details
----------- ---------- ------ ------- ----- ----------- ------------------
PUB01 10.PPP.1P3.PPP 0.015 Y/Y/Y 0 (g_2) (2) Setup Completed
1SUB02 10.PPP.1P3.PPP 0.182 Y/Y/Y 0 (g_3) (2) Setup Completed
1TFTP03 10.PPP.1P3.PPP 0.140 Y/Y/Y 0 (g_4) (2) Setup Completed
2SUB01 10.MMM.1M3.MMM 43.499 Y/Y/Y -- (-) (0) Syncing...
2SUB02 10.MMM.1M3.MMM 45.028 Y/Y/Y 0 (g_6) (0) Syncing...
2TFTP03 10.MMM.1M3.MMM 43.610 Y/Y/Y 0 (g_7) (0) Syncing...
-------
The last three members now seem to hang on Syncing. Those three are remote to my location.
When I run a status on the replication I can see one member missing:
SERVER ID STATE STATUS QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_2_ccm10_5_2_13900_12 2 Active Local 0
g_3_ccm10_5_2_13900_12 3 Active Connected 0 Jun 28 15:31:21
g_4_ccm10_5_2_13900_12 4 Active Connected 0 Jun 28 15:31:24
g_6_ccm10_5_2_13900_12 6 Active Connected 0 Jul 12 07:37:45
g_7_ccm10_5_2_13900_12 7 Active Connected 0 Jul 13 20:34:26
--------
Diagnostics test:
admin:utils diagnose test
Log file: platform/log/diag3.log
Starting diagnostic test(s)
===========================
test - disk_space : Passed (available: 7099 MB, used: 12529 MB)
skip - disk_files : This module must be run directly and off hours
test - service_manager : Passed
test - tomcat : Passed
test - tomcat_deadlocks : Passed
test - tomcat_keystore : Passed
test - tomcat_connectors : Passed
test - tomcat_threads : Passed
test - tomcat_memory : Passed
test - tomcat_sessions : Passed
skip - tomcat_heapdump : This module must be run directly and off hours
test - validate_network : Passed
test - raid : Passed
test - system_info : Passed (Collected system information in diagnostic log)
test - ntp_reachability : Warning
The host 10.2M2.MMM.6 is not reachable, or it's NTP service is down.
The host 10.2P0.1P1.101 is not reachable, or it's NTP service is down.
Some of the configured external NTP servers are not reachable.
It is recommended that for better time synchronization all of
the NTP servers be reachable.
Please use the OS Admin GUI to add/remove NTP servers.
test - ntp_clock_drift : Passed
test - ntp_stratum : Failed
The reference NTP server is a stratum 5 clock.
NTP servers with stratum 5 or worse clocks are deemed unreliable.
Please consider using an NTP server with better stratum level.
Please use OS Admin GUI to add/delete NTP servers.
skip - sdl_fragmentation : This module must be run directly and off hours
skip - sdi_fragmentation : This module must be run directly and off hours
Diagnostics Completed
--------
NTP Details:
ntpd (pid 27758) is running...
remote refid st t when poll reach delay offset jitter
==============================================================================
+10.2X2.XXX.1 10.2A0.AAA.8 6 u 281 1024 377 42.281 -2.090 1.852
*10.1X0.XXX.4 130.88.200.6 4 u 283 1024 377 220.381 0.128 3.925
+10.2X0.XX.1 10.1A0.AAA.4 5 u 1015 1024 377 1.290 0.466 0.853
10.2M2.MMM.6 .XFAC. 16 u - 1024 0 0.000 0.000 0.000
10.2P2.1P1.101 .XFAC. 16 u - 1024 0 0.000 0.000 0.000
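For reference when reading the peer table above: the first character is the tally code (`*` = selected system peer, `+` = candidate), `st` is the stratum, and a peer sitting at stratum 16 with `reach` 0 is effectively dead (which matches the last two rows). A rough sketch of parsing one peer line, assuming the standard `ntpq -p` column layout; the helper names are mine:

```python
from dataclasses import dataclass

@dataclass
class Peer:
    tally: str    # '*' selected sys peer, '+' candidate, ' ' not selected
    remote: str
    stratum: int  # 16 means unsynchronized/unreachable
    reach: int    # 8-bit reachability register (octal 377 = fully reachable)

def parse_peer(line: str) -> Peer:
    """Parse one peer line of standard `ntpq -p` / `utils ntp status` output."""
    tally = line[0] if line[0] in "*+#o-x. " else " "
    fields = line.lstrip("*+#o-x. ").split()
    # fields: remote refid st t when poll reach delay offset jitter
    return Peer(tally=tally, remote=fields[0],
                stratum=int(fields[2]), reach=int(fields[6], 8))

def unusable(p: Peer) -> bool:
    """A stratum-16 or never-reached peer contributes nothing to sync."""
    return p.stratum == 16 or p.reach == 0
```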
Thanks again
FF
07-13-2017 04:26 AM
Hi,
You can clearly see the issue with your NTP:
test - ntp_reachability : Warning
The host 10.2M2.MMM.6 is not reachable, or it's NTP service is down.
The host 10.2P0.1P1.101 is not reachable, or it's NTP service is down.
Some of the configured external NTP servers are not reachable.
It is recommended that for better time synchronization all of
the NTP servers be reachable.
Please use the OS Admin GUI to add/remove NTP servers.
test - ntp_clock_drift : Passed
test - ntp_stratum : Failed
Cisco recommends the stratum stay below 4. Fix NTP first and then issue "utils dbreplication reset all" on the publisher.
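That rule of thumb can be codified as a quick sanity check before running the reset. This is a sketch only, taking hypothetical (tally, stratum, reach) tuples read off the `utils ntp status` peer table:

```python
def ntp_healthy(peers):
    """peers: list of (tally, stratum, reach) tuples from the NTP peer table.

    True only if a system peer ('*') is selected at stratum 4 or better,
    and every configured peer is reachable (stratum < 16, reach > 0).
    """
    has_good_sys_peer = any(t == "*" and st <= 4 for t, st, _ in peers)
    all_reachable = all(st < 16 and reach > 0 for _, st, reach in peers)
    return has_good_sys_peer and all_reachable
```

Against the output posted above (two stale stratum-16 entries), this would say NTP is not yet healthy even though a stratum-4 system peer is selected.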
(Rate if it helps)
JB
07-17-2017 11:11 PM
I have finally fixed the issue.
It looks like stratum level 5 does not cause an impact by itself, as all the servers were still syncing to the same level.
What I have done since is remove the legacy NTP servers mentioned above, but I have also rebooted the servers in the cluster. Whether it was one of these or a combination of both, I'm not sure.
Thanks for the advice.