3921 Views | 0 Helpful | 3 Replies

Finesse Connection Lost

Talha Anjum
Level 1

Hello Everyone,

Platform: UCCX 10.x

I am facing a "connection lost" issue in Finesse again and again (snapshot attached).

I think this is a network issue. Can anyone help me prove this with logs?

I have collected UCCX Engine logs, Cisco Finesse logs, and MIVR logs, but found nothing useful in them.

Waiting for a response, thanks.

3 Replies

Deepak Rawat
Cisco Employee

To start with, please provide the following information:

Exact version of UCCX?

Is this a single node or HA? If HA, what type of HA: LAN or WAN?

Exact version of Call Manager?

Since when did the problem appear?

Is the problem affecting all agents or only some of the agents?

Are the agents remote?

When the agents see this Lost Connection problem, do they connect to the other server (if you have HA there), or do they connect back to the original server through which they were working fine earlier?

If you have HA, have you ever tried an intentional failover, that is, shutting down the CCX Engine manually on the Master server to see whether the agents get registered with the secondary server or not? If that does not work, then you will probably have to get the workaround applied for this defect as well:
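Coming back to the intentional failover test mentioned above, a minimal sketch of it from the CLI (assuming the standard VOS service commands; confirm the exact service name with "utils service list" first) would be:

admin:utils service list
admin:utils service stop Cisco Unified CCX Engine
admin:utils service start Cisco Unified CCX Engine

Stop the engine on the Master node, watch whether the agents fail over to the secondary, then start the engine again once the test is done.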


Status of NTP on the UCCX servers: is it synchronized properly or not? You can check this by running utils ntp status from the UCCX server CLI.
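For example, on both nodes (the exact wording of the output may differ slightly by version):

admin:utils ntp status
admin:utils ntp config

In the status output, "synchronised to NTP server (x.x.x.x) at stratum N" is what you want to see; "unsynchronised" or a very high stratum indicates a problem, and utils ntp config shows which NTP servers the node is actually pointed at.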

Are you getting the below kind of error messages in the logs? You will need to capture the Application/System logs as well, along with the Finesse, MIVR and MCVD logs, to see these error messages:

test - ntp_reachability : Warning

The NTP service is restarting, it can take about 5 minutes.

test - ntp_clock_drift : Warning

The local clock is not synchronised.

None of the designated NTP servers are reachable/functioning or legitimate.

test - ntp_stratum : Warning

The local clock is not synchronised.

None of the designated NTP servers are reachable/functioning or legitimate.
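For reference, warnings in that format (test - ntp_reachability / ntp_clock_drift / ntp_stratum) come from the platform diagnostics, and as far as I recall you can also run those checks on demand from the CLI:

admin:utils diagnose list
admin:utils diagnose test

The first command lists the available diagnostic modules (including the ntp_* tests) and the second runs them all, reporting Passed/Warning/Failed per test.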

If you have an HA setup and the two servers are losing network connectivity to each other, causing this issue, then you will see something like the below in the MCVD logs:

%MCVD-CVD-5-HEARTBEAT_MISSING_HEARTBEAT:CVD does not receive heartbeat from node long enough: nodeId=1,dt=1656
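Since the original question was how to prove a network problem, one approach (assuming CLI access on both nodes) is to correlate the timestamps of these missing-heartbeat messages with a reachability test or a packet capture between the two nodes during the problem window, for example:

admin:utils network ping <other-node-ip>
admin:utils network capture eth0 file wan_loss count 100000 size all host ip <other-node-ip>

The exact capture options vary a bit by version; the resulting capture file can be collected afterwards (for example through RTMT) and compared against the gaps in the MCVD heartbeats.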

Rather than just proving it, I think you should work towards resolving this, else it will keep causing more issues for the agents and your setup. It would be good if you can open a TAC case so that they can set the trace configuration accordingly, capture the required logs once the issue happens again, and provide a detailed RCA to get to the root of this, whatever it turns out to be: network issues, server issues, PC problems, or something else.


Regards

Deepak

- Rate Helpful Posts -


Hello Deepak,

Thank you for your detailed reply.

Talha is my colleague, and below is the information you required.

Is this a single node or HA? This is an HA cluster over the WAN.

Exact version of Call Manager? 9.1.2.12900-11

Since when did the problem appear? It started a few weeks ago, but the customer reports that the issue appears only around noon.

Is the problem affecting all agents or only some of the agents? Some of the agents.

Are the agents remote? Currently the agents are on the local site.

We have an HA cluster, but in the past, when we were uploading a script on the publisher UCCX, it gave us a message that the UCCX Engine was not running on the remote node. To upload the script temporarily, we disabled replication, but when we tried to enable replication again, it gave the below error message.

Enabling of subscriber config datastore and historical datastore Failed : Config Controller for CRS Config Datastore on node 2 : can not execute enable() because node 2 is not active
Although node 2's status shows as active.
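For what it's worth, besides the Datastore Control Center page in Serviceability, I believe the replication state can also be checked from the CLI (please double-check the exact commands in the CLI reference for your UCCX version):

admin:utils uccx dbreplication status
admin:utils uccx dbreplication repair all

The first should show the replication state of the config/historical datastores; the second attempts a repair, which only makes sense once both nodes can actually reach each other again.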


We have also tried to test the failover scenario, but the CTI service on the UCCX subscriber goes into a partial state. After digging further, we found that no CTI ports were registered with the subscriber UCCX, so we added half of the CTI ports to the publisher and half to the subscriber. However, due to the short downtime window, we could not test failover at that time, and after that this replication issue appeared.

I have checked the MIVR, MCVD and Application/System logs and found the entries below.


"The local NTP client is off by more than the acceptable threshold of 3 seconds from its remote NTP system peer. The normal remedy is for NTP Watch Dog to automatically restart NTP. However, an unusual number of automatic NTP restarts have already occurred on this node. No additional automatic NTP restarts will be done until NTP time synchronization stabilizes."


I have seen the above message continuously in the logs.
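Based on that message, we plan to re-check NTP from the CLI on both nodes, roughly like this (since the automatic NTP restarts have been suppressed, a manual restart may be needed once the upstream time source is stable again):

admin:utils ntp status
admin:utils ntp restart

If the offset keeps growing even after a restart, the designated NTP server itself is probably what needs fixing first.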


MCVD Logs

Dec 02 14:44:41.546 EAT %MCVD-LIB_TPL-7-EXCEPTION: at sun.nio.ch.Net.connect(Native Method)
34683374: Dec 02 14:44:41.546 EAT %MCVD-LIB_TPL-7-EXCEPTION: at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:532)
34683375: Dec 02 14:44:41.546 EAT %MCVD-LIB_TPL-7-EXCEPTION: at com.hazelcast.nio.SocketConnector.tryToConnect(SocketConnector.java:105)
34683376: Dec 02 14:44:41.546 EAT %MCVD-LIB_TPL-7-EXCEPTION: at com.hazelcast.nio.SocketConnector.run(SocketConnector.java:53)
34683377: Dec 02 14:44:41.546 EAT %MCVD-LIB_TPL-7-EXCEPTION: at com.hazelcast.util.executor.ManagedExecutorService$Worker.run(ManagedExecutorService.java:166)
34683378: Dec 02 14:44:41.546 EAT %MCVD-LIB_TPL-7-EXCEPTION: at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
34683379: Dec 02 14:44:41.546 EAT %MCVD-LIB_TPL-7-EXCEPTION: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
34683380: Dec 02 14:44:41.546 EAT %MCVD-LIB_TPL-7-EXCEPTION: at java.lang.Thread.run(Thread.java:662)
34683381: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-EXCEPTION: at com.hazelcast.util.executor.PoolExecutorThreadFactory$ManagedThread.run(PoolExecutorThreadFactory.java:59)
34683382: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.nio.SocketConnector, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Connection refused to address /10.95.100.72:5900: Exception=java.net.SocketException: Connection refused to address /10.95.100.72:5900
34683383: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-EXCEPTION:java.net.SocketException: Connection refused to address /10.95.100.72:5900
34683384: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-EXCEPTION: at sun.nio.ch.Net.connect(Native Method)
34683385: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-EXCEPTION: at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:532)
34683386: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-EXCEPTION: at com.hazelcast.nio.SocketConnector.tryToConnect(SocketConnector.java:105)
34683387: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-EXCEPTION: at com.hazelcast.nio.SocketConnector.run(SocketConnector.java:53)
34683388: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-EXCEPTION: at com.hazelcast.util.executor.ManagedExecutorService$Worker.run(ManagedExecutorService.java:166)
34683389: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-EXCEPTION: at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
34683390: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-EXCEPTION: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
34683391: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-EXCEPTION: at java.lang.Thread.run(Thread.java:662)
34683392: Dec 02 14:44:41.547 EAT %MCVD-LIB_TPL-7-EXCEPTION: at com.hazelcast.util.executor.PoolExecutorThreadFactory$ManagedThread.run(PoolExecutorThreadFactory.java:59)
34683393: Dec 02 14:44:42.236 EAT %MCVD-DB_MGR-7-UNK:EntityDataSource.checkConnectivity for NIC-contactcenter1 is true
34683394: Dec 02 14:44:43.242 EAT %MCVD-DB_MGR-7-UNK:EntityDataSource.checkConnectivity for NIC-contactcenter1 is true
34683395: Dec 02 14:44:44.248 EAT %MCVD-DB_MGR-7-UNK:EntityDataSource.checkConnectivity for NIC-contactcenter1 is true
34683396: Dec 02 14:44:45.257 EAT %MCVD-DB_MGR-7-UNK:EntityDataSource.checkConnectivity for NIC-contactcenter1 is true
34683397: Dec 02 14:44:46.263 EAT %MCVD-DB_MGR-7-UNK:EntityDataSource.checkConnectivity for NIC-contactcenter1 is true
34683398: Dec 02 14:44:47.268 EAT %MCVD-DB_MGR-7-UNK:EntityDataSource.checkConnectivity for NIC-contactcenter1 is true
34683399: Dec 02 14:44:48.278 EAT %MCVD-DB_MGR-7-UNK:EntityDataSource.checkConnectivity for NIC-contactcenter1 is true
34683400: Dec 02 14:44:49.284 EAT %MCVD-DB_MGR-7-UNK:EntityDataSource.checkConnectivity for NIC-contactcenter1 is true
34683401: Dec 02 14:44:50.289 EAT %MCVD-DB_MGR-7-UNK:EntityDataSource.checkConnectivity for NIC-contactcenter1 is true
34683402: Dec 02 14:44:51.295 EAT %MCVD-DB_MGR-7-UNK:EntityDataSource.checkConnectivity for NIC-contactcenter1 is true
34683403: Dec 02 14:44:51.547 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.cluster.TcpIpJoiner, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Address[10.90.100.72]:5900 is local? true
34683404: Dec 02 14:44:51.547 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.cluster.TcpIpJoiner, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Address[10.90.100.72]:5900 is local? true
34683405: Dec 02 14:44:51.547 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.cluster.TcpIpJoiner, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Address[10.95.100.72]:5900 is local? false
34683406: Dec 02 14:44:51.547 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.cluster.TcpIpJoiner, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Address[10.95.100.72]:5900 is local? false
34683407: Dec 02 14:44:51.547 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.cluster.TcpIpJoiner, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Address[10.90.100.72]:5900 is connecting to Address[10.95.100.72]:5900
34683408: Dec 02 14:44:51.547 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.cluster.TcpIpJoiner, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Address[10.90.100.72]:5900 is connecting to Address[10.95.100.72]:5900
34683409: Dec 02 14:44:51.547 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.nio.SocketConnector, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Starting to connect to Address[10.95.100.72]:5900
34683410: Dec 02 14:44:51.547 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.nio.SocketConnector, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Starting to connect to Address[10.95.100.72]:5900
34683411: Dec 02 14:44:51.548 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.nio.SocketConnector, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Connecting to /10.95.100.72:5900, timeout: 0, bind-any: true
34683412: Dec 02 14:44:51.548 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.nio.SocketConnector, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Connecting to /10.95.100.72:5900, timeout: 0, bind-any: true
34683413: Dec 02 14:44:51.549 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.nio.SocketConnector, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Could not connect to: /10.95.100.72:5900. Reason: SocketException[Connection refused to address /10.95.100.72:5900]
34683414: Dec 02 14:44:51.549 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.nio.SocketConnector, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Could not connect to: /10.95.100.72:5900. Reason: SocketException[Connection refused to address /10.95.100.72:5900]
34683415: Dec 02 14:44:51.549 EAT %MCVD-LIB_TPL-7-UNK:source=com.hazelcast.nio.SocketConnector, message=[10.90.100.72]:5900 [UccxCvdCluster-1413196896000] Connection refused to address /10.95.100.72:5900: Exception=java.net.SocketException: Connection refused to address /10.95.100.72:5900
34683416: Dec 02 14:44:51.549 EAT %MCVD-LIB_TPL-7-EXCEPTION:java.net.SocketException: Connection refused to address /10.95.100.72:5900

10.95.100.72 is the second node in the cluster, from which replication has stopped.
Yes, we are now thinking of opening a TAC case so an engineer can check the servers.
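In the meantime, the repeated "Connection refused" (rather than a timeout) suggests the second node is reachable on the network but nothing is listening on port 5900 there (or a firewall is rejecting the connection), i.e. the cluster service on node 2 is most likely down rather than the WAN being broken at that exact moment. We still need to confirm this from node 2, roughly along these lines:

admin:utils service list
admin:utils network ping 10.90.100.72

That is, check whether the CCX services (the engine and the Cluster View Daemon, also visible in the Serviceability Control Center pages) show as started on node 2, and that node 2 can still ping the publisher.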


Regards,
Muhammad Ali
