03-04-2014 02:16 PM - edited 03-16-2019 10:00 PM
I have a client who is getting random disconnect errors from his CUCM servers. They are accompanied by other network errors at times, but not always. I'm having a hard time telling whether it's CUCM or a network issue in general.
I pulled his core logs and at the time of disconnect messages from RTMT I'm seeing this:
|StationInit: TCPPid = [1.100.9.58957]Socket Broken. DeviceName=SEP002584189723,IPAddr=10.100.7.49, Port=0xc702, Device Controller=[1,51,5397]|1,100,50,1.20646898^10.100.7.49^SEP002584189723
11:35:48.006 |StationInit: TCPPid = [1.100.9.58948]Socket Broken. DeviceName=,IPAddr=10.101.1.42, Port=0xcbf2, Device Controller=[0,0,0]|1,100,55,3.1^*^*
11:35:48.050 |StationInit: TCPPid = [1.100.9.58250]Socket Broken. DeviceName=,IPAddr=10.101.1.112, Port=0xc436, Device Controller=[0,0,0]|1,100,55,3.1^*^*
11:35:48.067 |StationInit: TCPPid = [1.100.9.59376]Socket Broken. DeviceName=,IPAddr=10.101.1.16, Port=0xc55e, Device Controller=[0,0,0]|1,100,55,3.1^*^*
11:35:48.099 |StationInit: TCPPid = [1.100.9.58840]Socket Broken. DeviceName=,IPAddr=10.101.1.8, Port=0xc577, Device Controller=[0,0,0]|1,100,55,3.1^*^*
The Cisco UP Presence Engine service on the peer node of a subcluster has failed AppID : Cisco Syslog Agent ClusterID :
There are lots of these for a few minutes, and then they stop. When I search for solutions I see "contact TAC" repeatedly. Can anyone help me determine whether I have a CM problem or a network problem? He rebooted the cluster 12 days ago and it didn't stop the problems. The disconnect errors relate to all components of the phone system (CUCM, CUC, UCCX, CUPS) and have included "SDL link to remote application is out of service", "OUT-OF-SERVICE AppID : Cisco UP Presence Engine ClusterID", "The Cisco UP Presence Engine service on the peer node of a subcluster has failed AppID : Cisco Syslog Agent ClusterID", and "user 2 ntpRunningStatus.sh: Primary node NTP server, SVUCCX01, is currently inaccessible or down".
So it's as if all communication breaks and then comes back again. I'd like to learn as much as possible from this and not just run to TAC. Any suggestions?
03-04-2014 04:04 PM
Just added full sdi trace file - if it helps
03-04-2014 07:48 PM
Hi Dustin,
This needs to be tackled by concentrating on one specific error at a time. For the timestamp of that error, check whether there was any impact on the performance of the server or device named in the error. Collect the corresponding traces (detailed CUCM traces, or others depending on the error) covering a few minutes before the error and leading up to it. If you have all the specific details for one event as described above, please post them so they can be looked at.
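As a rough illustration of isolating one event, the "Socket Broken" lines quoted in the original post can be tallied by source subnet to see whether the drops cluster on one part of the network. This is a minimal Python sketch, not a Cisco tool: the regular expression is inferred only from the sample lines in this thread, the names `parse_socket_broken` and `count_by_subnet` are hypothetical, and treating the first three octets as the subnet assumes /24 addressing.

```python
import re
from collections import Counter

# Pattern inferred from the sample trace lines in this thread (an assumption,
# not an official trace format). The leading timestamp is optional because
# the first quoted line does not carry one.
LINE_RE = re.compile(
    r"^(?:(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s*)?\|StationInit:.*?Socket Broken\."
    r"\s*DeviceName=(?P<device>[^,]*),IPAddr=(?P<ip>\d+\.\d+\.\d+\.\d+)"
)

# Sample lines copied from the original post.
SAMPLE = """\
|StationInit: TCPPid = [1.100.9.58957]Socket Broken. DeviceName=SEP002584189723,IPAddr=10.100.7.49, Port=0xc702, Device Controller=[1,51,5397]|1,100,50,1.20646898^10.100.7.49^SEP002584189723
11:35:48.006 |StationInit: TCPPid = [1.100.9.58948]Socket Broken. DeviceName=,IPAddr=10.101.1.42, Port=0xcbf2, Device Controller=[0,0,0]|1,100,55,3.1^*^*
11:35:48.050 |StationInit: TCPPid = [1.100.9.58250]Socket Broken. DeviceName=,IPAddr=10.101.1.112, Port=0xc436, Device Controller=[0,0,0]|1,100,55,3.1^*^*
11:35:48.067 |StationInit: TCPPid = [1.100.9.59376]Socket Broken. DeviceName=,IPAddr=10.101.1.16, Port=0xc55e, Device Controller=[0,0,0]|1,100,55,3.1^*^*
11:35:48.099 |StationInit: TCPPid = [1.100.9.58840]Socket Broken. DeviceName=,IPAddr=10.101.1.8, Port=0xc577, Device Controller=[0,0,0]|1,100,55,3.1^*^*
"""


def parse_socket_broken(lines):
    """Return one dict per 'Socket Broken' line; other lines are skipped."""
    events = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match:
            events.append({
                "time": match.group("time"),              # None if no timestamp
                "device": match.group("device") or None,  # None if unregistered
                "ip": match.group("ip"),
            })
    return events


def count_by_subnet(events, prefix_octets=3):
    """Count events per subnet, taking the first N octets as the subnet."""
    counts = Counter()
    for event in events:
        subnet = ".".join(event["ip"].split(".")[:prefix_octets])
        counts[subnet] += 1
    return counts


if __name__ == "__main__":
    events = parse_socket_broken(SAMPLE.splitlines())
    for subnet, count in count_by_subnet(events).most_common():
        print(subnet, count)
```

Run against a full trace file, a tally like this shows at a glance whether the breaks are spread across all subnets or concentrated behind one switch or link, which is the kind of detail worth posting alongside the traces.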
Couple of important links:
Set Up Cisco CallManager Traces for Cisco Technical Support
Troubleshooting Guide for Cisco Unified Communications Manager
HTH
Manish
03-05-2014 12:00 AM
Hi Dustin,
Please see the explanation for the error you are getting:
CCM_CALLMANAGER-CALLMANAGER-1-SDLLinkOOS : SDL link to remote application is out of service Remote Application IP Address [String] Unique Link ID [String] Local Node ID [UInt] Local Application ID [Enum] Remote Node ID [UInt] Remote Application ID [Enum]
Explanation This alarm indicates that the local Unified CM has lost communication with the remote Unified CM. This alarm usually indicates that a node has gone out of service (whether intentionally for maintenance or to install a new load for example; or unintentionally due to a service failure or connectivity failure).
Recommended Action In the Cisco Unified Reporting tool, run a CM Cluster Overview report and check to see if all servers can communicate with the Publisher. Also check for any alarms that might have indicated a CallManager failure and take appropriate action for the indicated failure. If the node was taken out of service intentionally, bring the node back into service.
Reason Code - Enum Definitions
Enum Definitions - LocalApplicationID
Value | Definition |
---|---|
100 | CallManager |
Enum Definitions - RemoteApplicationID
Value | Definition |
---|---|
100 | CallManager |
The most common causes of lost communication between CallManager nodes are a CallManager server hang, network problems, or high CPU usage.
You can use RTMT to monitor this.
CM servers keep a TCP connection to other servers in the cluster. When that TCP connection is broken due to network connectivity or lack of server resources, the above error is generated.
Also, the SDI traces you provided won't be useful for analyzing this problem; please collect detailed SDL traces instead.
If possible, collect the detailed SDL traces together with a sniffer capture filtered on TCP port 8002, in order to find out whether this is a network problem.
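Alongside a capture, a simple reachability probe can help line up timestamps with the alarms. The sketch below is a Python example under stated assumptions: it only tests whether a TCP connection to port 8002 can be opened from the machine where it runs, which is not the same as the node-to-node SDL link, and it assumes that machine is allowed to reach that port. The function names and the host names `cucm-pub.example.com` and `cucm-sub1.example.com` are hypothetical placeholders.

```python
import socket
import time
from datetime import datetime


def tcp_probe(host, port=8002, timeout=3.0):
    """Try to open a TCP connection; return (success, seconds_elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False, time.monotonic() - start


def watch(hosts, port=8002, interval=5.0, rounds=3):
    """Probe each host once per round; return a list of (timestamp, host) failures."""
    failures = []
    for round_index in range(rounds):
        for host in hosts:
            ok, elapsed = tcp_probe(host, port)
            stamp = datetime.now().strftime("%H:%M:%S.%f")[:-3]
            status = "ok" if ok else "FAILED"
            print(f"{stamp} {host}:{port} {status} {elapsed * 1000:.1f} ms")
            if not ok:
                failures.append((stamp, host))
        if round_index < rounds - 1:
            time.sleep(interval)
    return failures


if __name__ == "__main__":
    # Hypothetical host names; replace with the real cluster nodes.
    watch(["cucm-pub.example.com", "cucm-sub1.example.com"])
```

If the probe fails at the same moments the alarms fire, that points toward the network path; if it keeps succeeding while the alarms fire, server-side causes such as CPU or disk load become more likely. Either way, the packet capture remains the more reliable evidence.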
Sometimes a highly fragmented disk can cause heavy disk I/O that consumes all the CPU.
Was any upgrade or migration activity carried out in the network?
Please check that the server NIC, the switch port, and any other NICs in the path have the same speed/duplex settings.
Regards,
Nishant Savalia