Re: Problem UCCX HA Intermittently in state UNKNOWN

herus · 12-16-2021

Good day,

Recently I encountered a problem with UCCX HA (11.5.1.11001-34 (ES03-18)).

From the UCCX primary, I can see that the UCCX secondary's services go to UNKNOWN and then return to IN SERVICE within a few seconds.

I collected the MCVD log from the primary, as follows:

7198392: Dec 16 14:06:50.273 WIB %MCVD-CVD-7-UNK:[hz._hzInstance_1_UccxCvdCluster-1396580705000.cached.thread-24] ClusterServiceImpl: Hazelcast.memberRemoved: member=Member [192.168.100.172]:5900
7198393: Dec 16 14:06:50.280 WIB %MCVD-CVD-4-HEARTBEAT_SUSPECT_NODE_CRASH:[hz._hzInstance_1_UccxCvdCluster-1396580705000.cached.thread-24] ClusterViewManager: CVD suspects node crash: state=HEARTBEAT_HAZELCAST,nodeInfo=Node[nodeId=2, ip=192.168.100.172],dt=null
7198394: Dec 16 14:06:50.280 WIB %MCVD-CVD-7-UNK:[MCVD_CVD_DISPATCHER-5-0-com.cisco.cluster.impl.cvd.Dispatcher1] Dispatcher1: >> try to process HeartbeatNodeLeaveCmdImpl nodeId=2
7198395: Dec 16 14:06:50.281 WIB %MCVD-CVD-3-NODE_LEAVE_CLUSTER:[MCVD_CVD_DISPATCHER-5-0-com.cisco.cluster.impl.cvd.Dispatcher1] Dispatcher: Node leave cluster: nodeId=2
7198396: Dec 16 14:06:50.525 WIB %MCVD-DB_MGR-7-UNK:[Thread-36] EntityDataSource: EntityDataSource.checkConnectivity for IDUCCXPrimary is true
7198397: Dec 16 14:06:50.527 WIB %MCVD-CLUSTER_MGR-7-UNK:[MCVD_CLUSTER_MGR_DISPATCHER-7-0-com.cisco.cluster.impl.manager.Dispatcher] Log: try to process NodeLeaveCmdImpl, nodeId=2
7198398: Dec 16 14:06:50.631 WIB %MCVD-CLUSTER_MGR-7-UNK:[MCVD_CLUSTER_MGR_DISPATCHER-7-0-com.cisco.cluster.impl.manager.Dispatcher] Log: process Node Leave, id=2
7198399: Dec 16 14:06:50.631 WIB %MCVD-CLUSTER_MGR-7-UNK:[MCVD_CLUSTER_MGR_DISPATCHER-7-0-com.cisco.cluster.impl.manager.Dispatcher] Log: Node 2 change state from IN SERVICE to UNKNOWN
7198400: Dec 16 14:06:50.631 WIB %MCVD-CVD-7-UNK:[MCVD_CVD_DISPATCHER-5-0-com.cisco.cluster.impl.cvd.Dispatcher1] PublisherImpl: removeSubscriber 2
7198401: Dec 16 14:06:50.631 WIB %MCVD-DB_MGR-7-UNK:[EventQueue.DispatchThread-0-2] NodeEvent: CuicDataSourceUpdateImpl.nodeLeaving() - invoked for Node 2
7198402: Dec 16 14:06:50.631 WIB %MCVD-DB_MGR-7-UNK:[DatabaseManagerImpl-Listener-Handler] DbServiceStateEventHandlerTask: DatabaseManagerImpl.DbServiceStateListener[UNKNOWN] - received for NodeId=2
7198403: Dec 16 14:06:50.631 WIB %MCVD-DB_MGR-7-UNK:[DatabaseManagerImpl-Listener-Handler] DatabaseManagerImpl: DatabaseManagerImpl.logCurState() - localNodeId=1, localHostName=IDUCCXPrimary, remoteHostName=IDUCCXSecondary, localDBServerName=iduccxprimary_uccx, remoteDBServerName=iduccxsecondary_uccx, localCDSEnable=true, remoteCDSEnable=true, isLocalDbUp=true, isRemoteDbUp=false, masterDBHostName=IDUCCXPrimary
7198404: Dec 16 14:06:50.631 WIB %MCVD-DB_MGR-7-UNK:[DatabaseManagerImpl-Listener-Handler] EntityDataSource: Ignoring mastership change event as IDUCCXPrimary is the current master
7198405: Dec 16 14:06:50.631 WIB %MCVD-BOOTSTRAP_MGR-7-UNK:[EventQueue.DispatchThread-0-2] BootstrapManagerImpl: BootstrapNodeListenerImpl.nodeLeaving() - node leave received for node=192.168.100.172
7198406: Dec 16 14:06:50.631 WIB %MCVD-CFG_MGR-7-UNK:[EventQueue.DispatchThread-0-2] BSAccessor: BootstrapAccessor.repositoryShutdown->with IP address 192.168.100.172
7198407: Dec 16 14:06:50.632 WIB %MCVD-CVD-7-UNK:[MCVD_CVD_DISPATCHER-5-0-com.cisco.cluster.impl.cvd.Dispatcher1] Dispatcher: Processing HeartbeatConvergenceCompletedCmdImpl, activeNodes = {}
7198408: Dec 16 14:06:50.632 WIB %MCVD-DB_MGR-7-UNK:[Thread-4468] CuicDataSourceUpdateImpl: CuicDataSourceUpdateImpl.updateCuicDatasource() - nodeIp=192.168.100.171. Its DB state=IN SERVICE, isDbMaster=true
7198409: Dec 16 14:06:50.632 WIB %MCVD-CVD-7-UNK:[MCVD_CVD_DISPATCHER-5-0-com.cisco.cluster.impl.cvd.Dispatcher1] Dispatcher: Processing HeartbeatConvergenceCompletedCmdImpl, activeNodes = {}, activeNodes.length=0, nmStarted=true
7198410: Dec 16 14:06:50.632 WIB %MCVD-DB_MGR-7-UNK:[Thread-4468] CuicDataSourceUpdateImpl: CuicDataSourceUpdateImpl.updateCuicDatasource() - nodeIp=192.168.100.172. Its DB state=UNKNOWN, isDbMaster=false
7198411: Dec 16 14:06:50.632 WIB %MCVD-CVD-7-UNK:[EventQueue.DispatchThread-0-2] BootstrapListenerImpl: BootstrapListenerImpl cvd IGNORE repositoryShutdown() arg0=192.168.100.172
7198412: Dec 16 14:06:50.632 WIB %MCVD-DB_MGR-7-UNK:[Thread-4468] CuicDataSourceUpdateImpl: CuicDataSourceUpdateImpl.updateCuicDatasource() - dataSourceServer: 192.168.100.171
7198413: Dec 16 14:06:50.632 WIB %MCVD-BOOTSTRAP_MGR-7-UNK:[EventQueue.DispatchThread-0-2] BootstrapManagerImpl: BootstrapNodeListenerImpl.nodeLeaving() - repositoryShutdown delivered with shutdown addr=192.168.100.172
7198414: Dec 16 14:06:50.632 WIB %MCVD-REST_CLIENT-7-UNK:[Thread-4468] CUICRestClient: CUICRestClient() - restServerIpAddr=localhost
7198415: Dec 16 14:06:50.633 WIB %MCVD-CVD-7-UNK:[MCVD_CVD_DISPATCHER-5-0-com.cisco.cluster.impl.cvd.Dispatcher1] CVDMasterDatabaseImpl: DB service bestCandidate runs on nodeId=1 because master engine is running on this node
7198416: Dec 16 14:06:50.633 WIB %MCVD-DB_MGR-7-UNK:[DatabaseManagerImpl-Listener-Handler] NodeStateEventHandlerTask: DatabaseManagerImpl.NodeListener[NODE_LEAVING] - received for NodeId=2
7198417: Dec 16 14:06:50.838 WIB %MCVD-DB_MGR-7-UNK:[DatabaseManagerImpl-Listener-Handler] DatabaseManagerImpl: DatabaseManagerImpl.logCurState() - localNodeId=1, localHostName=IDUCCXPrimary, remoteHostName=IDUCCXSecondary, localDBServerName=iduccxprimary_uccx, remoteDBServerName=iduccxsecondary_uccx, localCDSEnable=true, remoteCDSEnable=true, isLocalDbUp=true, isRemoteDbUp=false, masterDBHostName=IDUCCXPrimary
7198418: Dec 16 14:06:50.838 WIB %MCVD-DB_MGR-7-UNK:[DatabaseManagerImpl-Listener-Handler] EntityDataSource: Ignoring mastership change event as IDUCCXPrimary is the current master

I don't know what causes the UCCX secondary to be removed from the cluster and then added back again.

When I logged in to the UCCX secondary, no services were down and the system had not rebooted.

Has anyone experienced this issue? I would appreciate your advice.

Thank you

Ravi Shankar Pandit · 12-17-2021

Hi Herus,

Can you please confirm whether this is HA over WAN (HAoWAN) or HA over LAN?

7198392: Dec 16 14:06:50.273 WIB %MCVD-CVD-7-UNK:[hz._hzInstance_1_UccxCvdCluster-1396580705000.cached.thread-24] ClusterServiceImpl: Hazelcast.memberRemoved: member=Member [192.168.100.172]:5900
7198393: Dec 16 14:06:50.280 WIB %MCVD-CVD-4-HEARTBEAT_SUSPECT_NODE_CRASH:[hz._hzInstance_1_UccxCvdCluster-1396580705000.cached.thread-24] ClusterViewManager: CVD suspects node crash:

From the logs, it is evident that the two nodes lost communication with each other.
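To see how often this happens and at what times, the MCVD log can be scanned for the few messages that matter. The following is only a minimal sketch: the line layout in `LINE_RE` is inferred from the lines quoted in this thread, and the names `Event` and `extract_events` are made up for illustration, not part of any Cisco tool.

```python
import re
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    timestamp: str  # as printed in the log, e.g. "Dec 16 14:06:50.631"
    kind: str
    detail: str


# Layout inferred from the quoted lines (an assumption, not a documented format):
# "<seq>: <Mon> <day> <HH:MM:SS.mmm> <TZ> %<TAG>:[<thread>] <message>"
LINE_RE = re.compile(
    r"^\d+: (?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) \w+ "
    r"%(?P<tag>[A-Z0-9_-]+):\[(?P<thread>.*?)\] (?P<msg>.*)$"
)
STATE_RE = re.compile(r"Node (\d+) change state from (.+?) to (.+)$")


def extract_events(lines: Iterable[str]) -> List[Event]:
    """Keep only member removals, crash suspicions and node state changes."""
    events: List[Event] = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a log line in the expected layout
        ts, tag, msg = m.group("ts"), m.group("tag"), m.group("msg")
        state = STATE_RE.search(msg)
        if state:
            node, old, new = state.groups()
            events.append(Event(ts, "STATE_CHANGE", f"node {node}: {old} -> {new}"))
        elif "HEARTBEAT_SUSPECT_NODE_CRASH" in tag:
            events.append(Event(ts, "SUSPECT_CRASH", msg))
        elif "Hazelcast.memberRemoved" in msg:
            events.append(Event(ts, "MEMBER_REMOVED", msg))
    return events
```

Running `extract_events(open(path))` over a collected log file (where `path` is wherever the log was saved) gives a short timeline that can be compared against network or CPU monitoring for the same period.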

Possible reasons:

==> Network issue (NTP, DNS, packet drops, etc.)

==> High CPU utilization on one node, which would also delay the heartbeat between the two nodes

==> Hardware issue (which I doubt, because the problem is intermittent)
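To narrow down the network possibility, one option is to time TCP connections to a node from another machine on the same LAN and log the results with timestamps. This is only a sketch under assumptions: the host and port are placeholders (5900 is just the port that appears in the Hazelcast log line above and may not be reachable from a third host), and `probe_once` and `monitor` are illustrative names.

```python
import socket
import time
from datetime import datetime
from typing import Optional


def probe_once(host: str, port: int, timeout: float = 1.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None on failure or timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:  # includes timeouts and refused connections
        return None


def monitor(host: str, port: int, interval: float = 1.0, count: int = 60) -> None:
    """Print one timestamped result per probe so it can be lined up with the MCVD log."""
    for _ in range(count):
        rtt = probe_once(host, port)
        status = "FAILED" if rtt is None else f"{rtt:.1f} ms"
        print(f"{datetime.now().isoformat(timespec='milliseconds')} {host}:{port} {status}")
        time.sleep(interval)
```

For example, `monitor("192.168.100.172", 5900)` would probe the secondary once per second for a minute. Failures or latency spikes at the same timestamps as the node-leave messages would point toward the network rather than CPU or hardware.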

Regards

Ravi

herus · 12-17-2021

Hi Ravi,

Thank you for your reply. This is HA over LAN.

When the problem occurs, pings from my PC to both the UCCX primary and the UCCX secondary are unstable: a few replies have high latency and one or two requests time out, and then the pings return to normal.

Could this network instability cause the HA problem?
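One way to check whether the unstable pings really line up with the cluster events is to compare timestamps. The sketch below assumes the ping results have already been collected as `(time, rtt_ms)` pairs, with `None` for a timed-out request. The helper names, the 10-second window and the 100 ms threshold are arbitrary choices for illustration. Note that the log timestamps carry no year, so it has to be supplied, and `%b` expects English month names.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Sample = Tuple[datetime, Optional[float]]  # (time of ping, RTT in ms or None if timed out)


def parse_log_time(ts: str, year: int) -> datetime:
    """Parse an MCVD-style timestamp such as 'Dec 16 14:06:50.273'."""
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S.%f")


def degraded_near(
    samples: List[Sample],
    event_time: datetime,
    window_s: float = 10.0,
    slow_ms: float = 100.0,
) -> List[Sample]:
    """Return pings within +/- window_s of the event that timed out or were slow."""
    lo = event_time - timedelta(seconds=window_s)
    hi = event_time + timedelta(seconds=window_s)
    return [
        (t, rtt)
        for t, rtt in samples
        if lo <= t <= hi and (rtt is None or rtt > slow_ms)
    ]
```

If `degraded_near` returns timed-out or slow pings around each "Node leave cluster" timestamp, and nothing at other times, that would support the network being the trigger.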