Connectivity issues for random phones in CUCM 8.6 on VMware

adix1 · ‎01-13-2015

I have a client that has problems with around half of their IP phones (6921 & 7945) after an overheating incident in the server room.

The solution was a one-Publisher, one-Subscriber setup, each on a separate VMware server.

The Publisher is connected to the core switch, which connects to 4 distribution switches, which in turn connect to 5 more distribution switches. Around 200 IP phones are connected to these switches and get their IP addresses via DHCP.

The physical disk that the Subscriber was on was destroyed in the incident. They recovered the Publisher, but to do so they had to recreate the VM descriptor file and attach it to the flat file.

So the current situation is that the system is running with only 1 Publisher and no Subscriber.

There is also a license warning in the CCM Admin section, stating a License Overage (2 nodes used, but only 1 licensed.)

The license status is not invalid though, and the license state is "Uploaded". I suppose this second node might be the Subscriber that no longer exists? The phones have more than enough licenses.

When powered up the system seems to run as it should, but only about half the phones have connectivity to and can register with CUCM.

I have tried to reboot some phones remotely by cutting the power on the switch interfaces where they are attached, but that made no difference.

The console log on the phones that are down shows TFTP Timeout & File Transfer Error.

The phones that are up and running can be pinged successfully from the CUCM cli, but when pinging the others I get "Destination Host Unreachable".

The strange thing is that it seems completely random which phones are up or down. Every switch has phones with both working and non-working connections to CUCM.

To try to pinpoint the fault I chose 2 devices on the same switch and compared the config for each interface, one that has connectivity and is registered, and one that does not have connectivity and is unreachable.

Everything seems to be identical so I can't see what causes this error on the one, but not on the other.

Also when I ping the ip phone with no connectivity to CUCM from any of the switches, the ping is successful.

Anyone know what could be the cause of this behaviour?

Wilson Samuel · ‎01-13-2015

>the current situation is that the system is running with only 1 Publisher and no Subscriber.

This should not be an issue as long as the Publisher has enough resources. It may throw some warning messages but should not cause anything else (for now).

Here is what I would want to know:

1. The phones that are not registering, are they getting IP addresses from the DHCP server?

2. Do all the phones belong to the same Device Pool and/or CCM Group? It is possible that some were mistakenly assigned to a CCM group that contains only the Subscriber.

3. Finally, could you please use RTMT to check what kinds of error messages you see there?

HTH

adix1 · ‎01-13-2015

Hello and thank you for your time.

The phones get their IP addresses from a DHCP server (not the one built into CUCM, but an external server). I have tested that the DHCP process works by powering off a phone (one of the phones with connectivity issues), deleting the lease, and then powering the phone back on and seeing the device request and receive a new lease.
There is only one CCM group active and both servers are in this group, with the Publisher as the highest-priority server. All phones are in the same Device Pool attached to this group.
There are a lot of warnings regarding dbReplication and a missing node because the Subscriber is gone. There is also an issue with the NTP server (see below). I will post more on this tomorrow morning.

I have also attached the console log from one of the phones that can't register.

NTP Alert:

At Tue Jan 13 21:24:10 BRST 2015 on node 192.168.50.2; the following SyslogSeverityMatchFound events generated:

SeverityMatch : Critical
MatchedEvent : Jan 13 21:23:49 CUCMPUB user 2 ntpRunningStatus.sh: Primary node NTP server; 192.168.50.6; is currently inaccessible or down. Verify the network between the primary and secondary nodes. Check the status of NTP on both the primary and secondary nodes via CLI 'utils ntp status'. If the network is fine; try restarting NTP using CLI 'utils ntp restart'.
AppID : Cisco Syslog Agent
ClusterID :
NodeID : CUCMPUB
TimeStamp : Tue Jan 13 21:23:49 BRST 2015

SeverityMatch : Critical
MatchedEvent : Jan 13 21:23:53 CUCMPUB user 2 ntpRunningStatus.sh: The local NTP client is off by more than the acceptable threshold of 3 seconds from its remote NTP system peer. The normal remedy is for NTP Watch Dog to automatically restart NTP. However; an unusual number of automatic NTP restarts have already occurred on this node. No additional automatic NTP restarts will be done until NTP time synchronization stabilizes. This is likely due to an excessive number of VMware Virtual Machine migrations or Storage VMotions. Please consult your VMware Infrastructure Support Team.
AppID : Cisco Syslog Agent
ClusterID :
NodeID : CUCMPUB
TimeStamp : Tue Jan 13 21:23:53 BRST 2015

Terry Cheema · ‎01-13-2015

1) Can you post the following from the CLI of your Pub:

utils dbreplication runtimestate

2) Can you confirm that the Publisher is listed in option 150 of the DHCP scope (preferably first, in this case)?

3) You can try restarting the TFTP service on the Publisher.

-Terry

Aman Soi · ‎01-13-2015

Hi,

The attached logs for the non-working IP phone point to the domain.

Was any domain-related change made? Also check that DNS and NTP are reachable.

If possible, attach the logs of a working 6921 IP phone.

regds,

aman

Dennis Mink · ‎01-13-2015

Assuming this is a layer 3/4 issue, can you ping one of the failing phones from your Pub's CLI?

It would also be interesting to SPAN the switchport of a failed phone, run Wireshark, and see whether the phone is trying to connect to your CUCM (by means of the option 150 it received).

In addition to this, if you move the failing phone to a port that you know is working, does the problem follow the phone?

Factory-reset the phone before you try all this.


adix1 · ‎01-14-2015

Hi all,

I found the problem.

There are 2 interfaces connected to the VMware server in question, and the second interface is in standby mode in VMware.

On the switch both interfaces are up and belong to a port-channel group in active mode, which load-balances traffic between the two interfaces. Because the second interface is in standby mode in VMware, traffic from devices that the switch sent over this interface never reached the server.

I have disabled the second interface on the switch and the phones are now registering with CUCM, as all traffic now goes through the same interface.

Alternatively, I can activate the second NIC in VMware as well and bring the interface on the switch back up to restore load-balancing across the two interfaces.
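
To illustrate why roughly half the phones were affected and why it looked random, here is a minimal Python sketch. It assumes a simplified source/destination IP hash (XOR the two addresses and keep the low-order bits); the thread does not state which load-balancing method the switch actually uses, and real hashing varies by platform. The phone addresses are hypothetical; only the Publisher address 192.168.50.2 comes from the NTP alert above.

```python
import ipaddress

def link_for_flow(src_ip: str, dst_ip: str, n_links: int = 2) -> int:
    """Pick a port-channel member link for a flow.

    Simplified stand-in for a src-dst-ip hash: XOR both addresses and
    reduce to the number of links. Not the exact algorithm of any switch.
    """
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % n_links

def reachable_phones(phone_ips, server_ip, active_links):
    """Return the phones whose traffic hashes onto a link the host accepts."""
    return [ip for ip in phone_ips
            if link_for_flow(ip, server_ip) in active_links]

SERVER = "192.168.50.2"                           # Publisher, from the NTP alert
phones = [f"10.0.0.{i}" for i in range(10, 30)]   # hypothetical phone addresses

# Link 0 reaches the active NIC; link 1 lands on the standby NIC and is dropped.
up = reachable_phones(phones, SERVER, active_links={0})
print(f"{len(up)} of {len(phones)} phones reach the server")

# With the standby NIC activated (or the second switch port shut down so
# only one link exists), every flow reaches the server.
all_up = reachable_phones(phones, SERVER, active_links={0, 1})
print(f"{len(all_up)} of {len(phones)} phones reach the server")
```

Because the link is chosen per flow from address bits rather than per switch or per port, neighbouring phones on the same access switch can land on different links, which matches the seemingly random pattern described above.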

Thank you all for your input!

Aman Soi · ‎01-14-2015

thanks for update[+5]

regds,

aman