How can I verify high availability through CUCM / RTMT logs?

elliottklein
Level 1

I have a cluster of two CUCM 10.5.1 nodes running as publisher / subscriber.  Theoretically high availability was configured by a previous engineer.  We had a short site-wide outage at the publisher location last night.  No phone issues were noticed but the outage was short.  How can I verify my subscriber took over as publisher?  The publisher never shut down but all network connections were interrupted.

I've dug through Real Time Monitoring Tool and Google searches and haven't been able to find the specific service that would answer my question.

Accepted Solutions

nathanjgrace
Level 1

Hi Elliott, to be clear, during a failover event, the subscriber doesn't "take over" as publisher.  The pub/sub designation is mostly just a reference to the authoritative source for the database during normal operations.  Depending on your topology, the subscriber should be the primary call processing server and should primarily provide some other services if configured.  That said, there are a few factors which would lead me to make the pub the primary server for a group of phones.

I suspect you want to make sure that all services would be unaffected during an event where one server is not reachable, or if services crash on one server.  You need to check a few things to make sure there will be minimum impact for these events:

1. Are all necessary services activated and running on both servers? - Check Cisco Unified Serviceability.

2. Is DB replication good? - Run 'utils dbreplication runtimestate' from the CLI.  Every node should show a replication setup state of 2 (Good).  You can also check database replication health from Cisco Unified Reporting.

3. CM groups in Device Pools - Make sure that both servers are in the CM Groups referenced in the Device Pools.  The order of the servers will determine the registration/call processing preference.  If one server is missing, the phones will not fail over to another server in an outage.

5. CM references in gateways - Depending on whether you are using MGCP or H.323/SIP, you will need to make sure that both servers are referenced.  MGCP gateways use 'ccm-manager redundant-host'; for H.323/SIP you need dial peers pointing to each CM (or you can use a 'voice class server-group').

6. Are both servers reachable by all endpoints (phones, gateways, etc)?  This is more a routing/security question.  Obviously all phones and gateways should be able to reach both servers using optimal routing, over all necessary ports, and with QoS configured if needed.

7. Is TFTP configured as an array of both servers?  If DHCP is running on MS Windows, make sure that the TFTP option (option 150) references both servers.  The order will determine which server the phone looks to first for its configuration on boot up.  Ideally both servers will be there.

8. Any other references to the CM servers should show both.  CUC SIP integrations, UC Service profiles, Attendant Console, or any other platforms which reference CM should reference both in their preferred order.
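For check 2, if you save the CLI output to a file, something like the following can flag any node that is not in state 2.  This is only a rough sketch: the sample text is an abridged, made-up illustration of the node table (hostnames and IPs are hypothetical), and the regex assumes each node row ends with a "(N) description" field, so adjust it to what your own output actually looks like.

```python
import re

# Abridged, illustrative sample of the node table; names and IPs are hypothetical.
SAMPLE_OUTPUT = """\
cucm-pub    10.0.0.1    0.021   Y/Y/Y   0   (g_2)   (2) Setup Completed
cucm-sub    10.0.0.2    0.412   Y/Y/Y   0   (g_3)   (2) Setup Completed
"""

# Assumed row shape: <name> <ipv4> ... (<single digit state>) <description>
ROW = re.compile(r"^(\S+)\s+(\d{1,3}(?:\.\d{1,3}){3})\s+.*\((\d)\)\s+(.+)$")


def parse_replication_states(text):
    """Return {server_name: replication_state} for every row that matches ROW."""
    states = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name, _ip, state, _detail = match.groups()
            states[name] = int(state)
    return states


def nodes_not_good(states, good_state=2):
    """Return the sorted names of nodes whose state differs from good_state."""
    return sorted(name for name, state in states.items() if state != good_state)


states = parse_replication_states(SAMPLE_OUTPUT)
print("states:", states)
print("needs attention:", nodes_not_good(states))
```

An empty "needs attention" list means every parsed node reported state 2.  If the dictionary itself comes back empty, the regex did not match your output and needs tweaking.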

You should ABSOLUTELY shut down one server during a maintenance window and place inbound/outbound calls, check VM, etc.  Depending on what other apps you are running, you might find that some things don't quite work properly despite following the above steps.  Bring the server back up, verify all services are running, and take down the other one and run all tests again.
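While you have one server down in the maintenance window, a quick TCP check from a phone or gateway subnet can confirm that the remaining node is still reachable on its signaling ports.  A minimal sketch, assuming the default TCP ports (2000 for SCCP, 5060 for SIP); the hostnames in the usage note are hypothetical.  A successful connect only shows the port is reachable, so it does not replace the test calls.

```python
import socket

# Default signaling ports; change these if your deployment uses other ports.
DEFAULT_PORTS = {"SCCP": 2000, "SIP": 5060}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_nodes(hosts, ports=None, timeout=2.0):
    """Return {(host, service): bool} for each host and signaling port."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {
        (host, service): port_open(host, port, timeout)
        for host in hosts
        for service, port in ports.items()
    }
```

For example, `check_nodes(["cucm-pub.example.com", "cucm-sub.example.com"])` returns one True/False entry per server and service, so you can run it before the shutdown, during it, and again after the server is back.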

Hope this helps!

-Nathan

Nathan,

Thank you for the incredibly detailed answer.  I really appreciate it.  I've gone through the steps outlined and we appear to be in a good spot.  Understanding that it's not quite the hard cut-over helped me better understand the situation and make my report.

Glad to be of assistance!

Just chiming in to thank you for this detailed answer.

Hi,

You can trace this in RTMT by navigating to System > Tools > Trace & Log Central. Collect the logs for the outage window from both the pub and the sub, then use the free TranslatorX tool to decode them. You can also open the logs in a plain text editor such as Notepad, but they are hard to read that way.

From there you can look for phone registrations and see whether the phones unregistered from the pub and registered to the sub. The rest of the phone services depend on the sub's configuration.

Keep in mind that not all services can be served from the sub when the pub isn't reachable. Basically, all non-user-facing features won't be available.
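Once you have pulled the registration events out of the traces, the pub-to-sub check can be scripted.  The sketch below assumes you have already reduced the logs to simple (timestamp, device, action, node) records; that record shape, the action names, and the device and node names are all hypothetical and are not the CUCM trace format.

```python
from datetime import datetime


def phones_failed_over(events, pub, sub, start, end):
    """Return devices that unregistered from pub and then registered to sub
    inside the [start, end] window (ISO 8601 timestamps)."""
    window_start = datetime.fromisoformat(start)
    window_end = datetime.fromisoformat(end)
    left_pub = set()
    moved = set()
    # Sorting the tuples orders them by their ISO timestamp strings.
    for ts, device, action, node in sorted(events):
        when = datetime.fromisoformat(ts)
        if not (window_start <= when <= window_end):
            continue
        if action == "unregistered" and node == pub:
            left_pub.add(device)
        elif action == "registered" and node == sub and device in left_pub:
            moved.add(device)
    return sorted(moved)


# Hypothetical records, as if already extracted from the collected traces.
EVENTS = [
    ("2024-01-10T02:14:05", "SEP001122334455", "unregistered", "cucm-pub"),
    ("2024-01-10T02:14:40", "SEP001122334455", "registered", "cucm-sub"),
    ("2024-01-10T02:14:07", "SEPAABBCCDDEEFF", "unregistered", "cucm-pub"),
    ("2024-01-10T03:30:00", "SEPAABBCCDDEEFF", "registered", "cucm-sub"),
]

moved = phones_failed_over(
    EVENTS, "cucm-pub", "cucm-sub", "2024-01-10T02:00:00", "2024-01-10T02:30:00"
)
print(moved)
```

Phones that left the pub during the window but do not appear in the result are the ones worth a closer look in the traces.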
