Solved: When subscriber is up, cannot make calls to one site

khanfawaz · ‎11-25-2012

Hi,

I have upgraded CUCM from 4.2.3 to 7.1.5, and the upgrade was a success. But once the subscriber is also upgraded and live, calls to 2 or 3 sites fail. When we take the subscriber down, the calls go through with no problem. There are 2 nodes in the cluster. The subscriber's replication status stays at 0.

Regards,

Fawaz

Harmit Singh · ‎11-25-2012

Hi Fawaz,

Good day! :-)

Sounds like you got yourself a dbreplication issue. RTMT Status of 0 is not healthy. Can you send me "utils dbreplication runtimestate" from the Publisher CLI please?
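For reference, here is a minimal Python sketch of how the RTMT replication status values are usually read. The meanings are the commonly documented ones for CUCM; the names `REPLICATION_STATES`, `describe_state` and `is_healthy` are made up for this illustration and are not part of any Cisco tool.

```python
# Commonly documented meanings of the RTMT replication status counter.
# The dictionary and helper names are illustrative assumptions.
REPLICATION_STATES = {
    0: "Initializing: replication setup is still in progress",
    1: "Number of replicates is incorrect",
    2: "Good: replication is set up and healthy",
    3: "Bad: tables are mismatched",
    4: "Setup failed or dropped",
}


def describe_state(value: int) -> str:
    """Return a human-readable description of an RTMT replication value."""
    return REPLICATION_STATES.get(value, "Unknown state")


def is_healthy(value: int) -> bool:
    """Only a value of 2 indicates healthy replication."""
    return value == 2


if __name__ == "__main__":
    for value in (0, 2, 4):
        print(value, describe_state(value), "healthy" if is_healthy(value) else "not healthy")
```

A node that stays at 0 for a long time, as in this case, is stuck in setup rather than replicating.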

Also, please go through my blog and tech-talk video on dbreplication to help you get familiarized with the dbreplication commands:

https://supportforums.cisco.com/community/netpro/collaboration-voice-video/ip-telephony/blog/2012/10/26/understanding-cucm-dbreplication-runtimestate

Regards,

Harmit.

khanfawaz · ‎11-25-2012

Hi Harmit,

You again. Great!! Thanks for the reply.

The result of the utils dbreplication runtimestate is given below.

admin:utils dbreplication runtimestate

DB and Replication Services: ALL RUNNING

Cluster Replication State: BROADCAST SYNC Started on 1 server(s) at: 2012-11-25-15-07

Sync Progress: 0 tables sync'ed out of 427

Sync Errors: NO ERRORS

DB Version: ccm7_1_5_30000_1

Number of replicated tables: 427

Cluster Detailed View from PUB (2 Servers):

                           PING          REPLICATION     REPL.  DBver&    REPL.  REPLICATION SETUP
SERVER-NAME  IP ADDRESS    (msec)  RPC?  STATUS          QUEUE  TABLES    LOOP?  (RTMT) & details
-----------  ------------  ------  ----  -----------     -----  --------  ----   -----------------
AUCCM01      10.11.10.235  0.048   Yes   Connected       0      match     Yes    ( 3) PUB Setting Subs
AUCCM02      10.11.10.236  Failed  No    Active-Dropped  0      ?         No     ( ?) Setup in Progress

Thanks for the link as well. I have now started rebuilding the subscriber server once again. This is the third time I am rebuilding the subscriber today!! There was a network connectivity issue in between: it was not able to connect to the gateway.

One more thing I would like to ask: I once read somewhere that Cisco CUCM does not support network port teaming. Here we have set up network port teaming on the CUCM subscriber server, but not on the Publisher. Please advise.

Regards,

Fawaz

Harmit Singh · ‎11-25-2012

Hi Fawaz,

Ok, if you still have dbreplication issues after rebuilding the Subscriber, let me know and we can look into it.

As for the NIC teaming, it is definitely supported.

NIC Teaming for Network Fault Tolerance

The NIC teaming feature allows a server to be connected to the Ethernet via two NICs and, therefore, two cables. NIC teaming prevents network downtime by transferring the workload from the failed port to the working port. NIC teaming cannot be used for load balancing or increasing the interface speed.

Hewlett-Packard (HP) and IBM server platforms with dual Ethernet network interface cards can support NIC teaming for Network Fault Tolerance.

Hewlett-Packard server platforms with dual Ethernet network interface cards can support NIC teaming for Network Fault Tolerance with Cisco Unified CM 5.0(1) or later releases.

IBM server platforms with dual Ethernet network interface cards can support NIC teaming for Network Fault Tolerance with Cisco Unified CM 6.1(2) and later releases.

set network failover

This command enables and disables Network Fault Tolerance on the Media Convergence Server network interface card.

Command Syntax

failover {enable | disable}

Parameters

• enable enables Network Fault Tolerance.

• disable disables Network Fault Tolerance.

Requirements

Command privilege level: 1

Allowed during upgrade: No

The referenced guides:

SRND

http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/srnd/7x/callpros.html#wp1043624

CLI Ref

http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/cli_ref/7_1_2/cli_ref_712.html#wp40017

HTH.

Regards,

Harmit.

khanfawaz · ‎11-25-2012

Hi Harmit,

Thanks for the reply.

I saw your video. It's a very good one. Thanks for sharing it.

I have now rebuilt the server. After the rebuild, it still showed a replication error: the Sub showed a replication value of 0. So I ran utils dbreplication clusterreset and, on top of that, utils dbreplication reset all.

While dbreplication clusterreset was running, the replication value was stable at 2 for both nodes for a while, and then the sub stayed at 2 and the pub became 4. I restarted the publisher first and then the subscriber.

After both restarts, the replication status of the pub is 2 and the sub is 0. How do I move ahead now?

Regards,

Fawaz.

Harmit Singh · ‎11-25-2012

Hi Fawaz,

Thanks! Glad to hear you liked it. As for this issue, sounds like the syscdr db is not getting created on the Sub. I would request you to run the following reports in the Unified Reporting Tool GUI page and attach them here so I can see what's going on:

Unified CM Cluster Overview

Unified CM Database Status

Also, kindly attach the latest output of:

++ "utils dbreplication runtimestate", "utils diagnose test", "show tech network hosts", "utils service list" from the Pub

++ "utils diagnose test", "show tech network hosts", "utils service list", "utils network connectivity" from the Sub.

Regards,

Harmit.

khanfawaz · ‎11-25-2012

Hi Harmit,

Thanks for the reply.

I have attached the logs and reports you asked for. Since I can attach only 5 files per reply, I am sending them in separate replies.

Regards,

Fawaz

khanfawaz · ‎11-25-2012

Hi Harmit,

Remaining attachments.

Regards,

Fawaz.

khanfawaz · ‎11-25-2012

Hi Harmit,

Remaining attachments.

Regards,

Fawaz.

Harmit Singh · ‎11-25-2012

Hi Fawaz,

Thank you for the logs. From the gathered info, I can see that almost everything is good and healthy. The one thing that stands out is that there is a connectivity issue between Pub and Sub. You can see this in the "utils diagnose test" and "utils network connectivity" outputs:

test - validate_network : Error, intra-cluster communication is broken, unable to connect to [10.11.10.235]

admin:utils network connectivity

This command can take up to 3 minutes to complete.

Continue (y/n)?y

Running test, please wait ...

.

Test failed with AUCCM01: Could not send/receive UDP packets.

The strange thing is that in the reports you uploaded, I see that the connectivity test is successful:

Unified CM Connectivity

Runs the command 'utils network connectivity' on each node to ensure connectivity.

Connectivity Success for auccm01

Connectivity Success for auccm02

This shouldn't be the case if the CLI for the same test failed. I am not sure whether connectivity was temporarily down when you ran the commands, or whether there is a network issue between the nodes where the link keeps going down and coming back up.

You can also see that the runtimestate reflects the following:

                           PING          REPLICATION     REPL.  DBver&    REPL.  REPLICATION SETUP
SERVER-NAME  IP ADDRESS    (msec)  RPC?  STATUS          QUEUE  TABLES    LOOP?  (RTMT) & details
-----------  ------------  ------  ----  -----------     -----  --------  ----   -----------------
AUCCM01      10.11.10.235  0.048   Yes   Connected       0      match     Yes    ( 3) PUB Setting Subs
AUCCM02      10.11.10.236  0.262   No    Active-Dropped  0      ?         No     (

Active-Dropped means that the Cluster Manager is denying access, the DB is down, or the entire server is down. Since the Cluster Manager is responsible for keeping a check on network connectivity (which is failing), that is why it shows this replication status.
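As a rough illustration of reading the per-node rows, here is a minimal Python sketch that picks out nodes whose RTMT value is not 2. It assumes the row layout shown in this thread (single-word status, Yes/No columns); the regex and the function name `nodes_needing_attention` are my own and other CUCM versions print different columns.

```python
import re

# One node row, as printed in the 7.1.5 output in this thread (assumed layout).
ROW = re.compile(
    r"^(?P<name>\S+)\s+"
    r"(?P<ip>\d+\.\d+\.\d+\.\d+)\s+"
    r"(?P<ping>\S+)\s+"
    r"(?P<rpc>Yes|No)\s+"
    r"(?P<status>\S+)\s+"
    r"(?P<queue>\S+)\s+"
    r"(?P<tables>\S+)\s+"
    r"(?P<loop>Yes|No)\s+"
    r"\(\s*(?P<rtmt>\d+|\?)\)\s*(?P<details>.*)$"
)


def nodes_needing_attention(output: str) -> list[dict]:
    """Return the parsed rows whose RTMT replication value is not 2."""
    flagged = []
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match and match.group("rtmt") != "2":
            flagged.append(match.groupdict())
    return flagged


SAMPLE = """\
AUCCM01 10.11.10.235 0.048 Yes Connected 0 match Yes ( 3) PUB Setting Subs
AUCCM02 10.11.10.236 Failed No Active-Dropped 0 ? No ( ?) Setup in Progress
"""

if __name__ == "__main__":
    for node in nodes_needing_attention(SAMPLE):
        print(node["name"], node["status"], node["rtmt"], node["details"])
```

Header and separator lines do not match the pattern, so they are skipped without special handling.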

Can you confirm if both the Pub and Sub are connected to the same switch and are in the same location? What is the MTU for both these servers? The default is 1500. You can confirm by running "show network eth0 detail" on the Pub and "show network failover" on the Sub, since fault tolerance is configured on it. By the way, why do you have it enabled on the Sub and not the Pub? Ideally, you want to keep this consistent for all nodes in the cluster. Until this connectivity issue is sorted out, no matter what command you run, the replication won't stay up.

Hope the above helps.

Regards,

Harmit.

khanfawaz · ‎11-25-2012

Hi Harmit,

Thanks for such a good explanation.

Yes, both servers are connected to the same switch and are in the same location. I did not change the MTU values during server installation. I have attached the outputs of "show network eth0" from the publisher and "show network failover" from the subscriber.

I have not enabled fault tolerance anywhere.

So how do I proceed further? Any advice?

Regards,

Fawaz

Harmit Singh · ‎11-25-2012

Hi Fawaz,

Thanks for the info. I might have gotten confused with something you mentioned earlier then.

++ Can you check to see and make sure there is no speed/duplex mismatch between the servers and the switchports? Kindly pull "show network eth0 detail" output from both Pub and Sub. The Pub's output was not detailed.

++ Check for any interface errors on the switchports where the 2 servers are connected.

++ Try restarting the Cluster Manager service: "utils service restart Cluster Manager" on both servers.

++ Try disabling CSA on both servers. The command is "utils csa disable". Please note: The server will reboot when you disable CSA:

admin:utils csa disable

Need a system restart for changes to take effect

Enter "yes" to continue and restart or any other key to abort

:

After doing so, run "utils diagnose test" and "utils network connectivity" from the Sub and upload those outputs. We need to make sure the connectivity issue is sorted out before we attempt to fix the replication.

Regards,

Harmit.

khanfawaz · ‎11-26-2012

Hi Harmit,

Thanks for the reply.

I have worked with TAC and resolved it. There were actually two issues.

1) The security passwords given on the Publisher and Subscriber during installation were different. These passwords are typed in by the customer, who does not share those details with us. TAC found that the hash values for the security password in the root login of the two servers were different. So we changed both security passwords.

2) There was an NTP issue. The upgrade of the servers was done on a lab network, where we had given a different NTP address. When it came to production, we changed the NTP address to the real one. TAC found that there was an NTP certificate issue, which TAC also resolved later.

Now the subscriber is up and all calls are going through. There is another issue now: all the phones are registered, but under Device > Phone both the IP address and the status show as unknown. There are no problems with any of the calls, though. I need to look into that later today.

Regards,

Fawaz

Harmit Singh · ‎11-26-2012

Hi Fawaz,

Great info! Thanks for sharing (+5 for doing so)!

I would have ideally needed access myself to resolve this for you, good thing you opened a TAC case for it. For the NTP issue, what I can tell you is that if there is a time difference of over 30 minutes between the nodes, the replication will go down.
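To make the 30-minute point concrete, here is a minimal Python sketch that compares two clock readings against that limit. The function name `skew_exceeds_limit` and the sample timestamps are hypothetical; this only illustrates the arithmetic and is not how CUCM itself checks time.

```python
from datetime import datetime, timedelta

# The 30-minute limit mentioned above; the constant name is illustrative.
MAX_SKEW = timedelta(minutes=30)


def skew_exceeds_limit(pub_time: datetime, sub_time: datetime,
                       limit: timedelta = MAX_SKEW) -> bool:
    """Return True when the two clocks differ by more than the limit."""
    return abs(pub_time - sub_time) > limit


if __name__ == "__main__":
    # Hypothetical clock readings taken from each node at the same moment.
    pub = datetime(2012, 11, 26, 10, 0, 0)
    sub = datetime(2012, 11, 26, 10, 45, 0)
    print("Skew too large:", skew_exceeds_limit(pub, sub))
```

Taking the absolute difference means it does not matter which node is ahead.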

For the other issue, sounds cosmetic in nature. Please go to the Unified Serviceability page --> Tools --> Control Center - Network Services --> Restart "Cisco RIS Data Collector" service on both nodes and check to see if the phone status shows up correctly now.

HTH.

Regards,

Harmit.

khanfawaz · ‎11-26-2012

Hi Harmit,

Thanks for the reply.

I restarted the Cisco RIS Data Collector service on both nodes. It still was not fixed, so I raised another TAC case, and after his analysis the engineer said that it's a bug. Bug No: CSCtd91156.

Regards,

Fawaz.