Solved: How to restore unity connection publisher?

srosenthal · ‎03-15-2010

I found a typo in the name of the publisher, so I changed it in OS Administration. After the server rebooted, I can no longer log in to the Administration of the unit; I get database error messages when I try. The subscriber is operating as primary right now.

So I guess I need to reload the machine completely. My question is: once I start the install process and bring this machine back online, will it take all of the database information from the subscriber, or is there something I need to do during the install to make that happen? I really do not want to recreate everything that is there, even though these units have not gone into production yet.

There are also no backups, since it is not in production. I know that during the install I get asked if this is the first server in the cluster. Do I answer no, even though eventually I want this server to be the publisher again?

Thanx, any help is greatly appreciated.

Seth

David Hailey · ‎03-17-2010

Well, if you want to go the reload route then this is the way to do it:

1) Shut down the Subscriber.

2) Rebuild the Publisher via DVD which will overwrite the hard drive. The DVD comes with the server. Do not try to use the "Recovery" CD, just the straight up application CD - CUCM version (whatever you had shipped).

3) Before you install the Subscriber, make sure you verify NTP sync on the Pub (critical).

4) Once you have the Publisher rebuilt, add the Subscriber in CU Admin as part of the cluster just as you normally would. Use the IP address of the server (recommended).

5) Rebuild the Subscriber from DVD as you did the Publisher. It should be added as the second node in the cluster so make sure you have DNS and IP connectivity to/from all resources needed BEFORE STEP 2 ever occurs.

6) Once both servers are up and running, check CUC Serviceability and look at the cluster status.

7) Then, go to the CLI on both servers and run "show tech network hosts". The /etc/hosts file should have a loopback address and both cluster servers included in the file.

8) Make a test user on the Pub, verify it replicates to the Sub. Delete it from the Sub, make sure it deletes from the Pub.

9) Test failover then test failback by going to CUC Serviceability and putting the Publisher back as active.

10) You can also look at the cluster status on both servers from the CLI - show cuc cluster status is the command to do that.

11) If all that checks out, get your license files loaded and do your configurations.

12) Make sure you set up DRS backups on a schedule (for both servers) and perform an initial manual DRS backup of both as well.

13) I would also recommend that you upgrade both systems to the latest SU for 7.1.3 which is 7.1(3b)SU2. You can download it from CCO.
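For the hosts check in step 7, here is a minimal Python sketch of what "loopback plus both cluster servers" means in practice. It assumes you have pasted the "show tech network hosts" output into a string; the function names are made up for illustration and are not part of any Cisco tool. The sample text is the hosts output posted in this thread.

```python
def parse_hosts(text):
    """Map each IP in /etc/hosts-style text to its list of (lowercased) names."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        table.setdefault(ip, []).extend(name.lower() for name in names)
    return table


def missing_entries(text, node_ips):
    """Return the expected IPs (loopback plus cluster nodes) absent from the table."""
    table = parse_hosts(text)
    expected = ["127.0.0.1", *node_ips]
    return [ip for ip in expected if ip not in table]


sample = """\
127.0.0.1 localhost
::1 localhost
172.30.56.203 TU-VOIP-UNIT1.henrico.lib.va.us TU-VOIP-UNIT1
172.30.56.204 TU-VOIP-UNITY2.henrico.lib.va.us TU-VOIP-UNITY2
"""

# An empty list means loopback and both cluster nodes are present.
print(missing_entries(sample, ["172.30.56.203", "172.30.56.204"]))
```

Run it against the output from each server; anything it returns is an address that still needs to show up in the host table.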

As for MWI, there is a specific service that enables MWI to work properly. If that service didn't start on the Sub for whatever reason when it failed over, then that may be the cause. Get the servers rebuilt and let's go from there. Sound good?

Hailey

Please rate helpful posts!

David Hailey · ‎03-15-2010

Ok - you can't log into the "Administration" so I'm assuming you mean the CU Admin interface. Can you log into the CLI on the box? If so, you may be able to mend what's broken. Let me know if you've tried getting into the CLI or not.

srosenthal · ‎03-15-2010

yes I can putty into the unit.

Thanx

David Hailey · ‎03-15-2010

SSH into the CLI. You typically do this in the reverse order: change the hostname via the CLI first and then in OS Administration.

First take a look at "show network cluster" and see if the hostname on the CLI matches the new hostname that you set in OS admin. If it does not, then you'll need to look at the "set" commands. You can change the hostname via CLI using "set hostname cluster publisher NEW_HOSTNAME", where NEW_HOSTNAME matches what you set in OS admin.

If, by chance, the name in the CLI does match up, then I have to ask: did you reboot the publisher after changing the hostname? And if so, what does the status of DB replication look like for the 2 cluster nodes in RTMT (or via command line)? You'll need to look at the Replicate State to gather that info.

Hailey

srosenthal · ‎03-15-2010

The name is correct in the cli - here is the output

admin:show network cluster
172.30.56.203 tu-voip-unit1.henrico.lib.va.us tu-voip-unit1 Publisher
172.30.56.204 tu-voip-unity2.henrico.lib.va.us tu-voip-unity2 Subscriber

RTMT is not connecting to the publisher. Here is what is on the web page for the subscriber

Communication is not functioning correctly between the servers in the Cisco Unity Connection cluster. To review server status for the cluster, go to the Tools > Cluster Management page of Cisco Unity Connection Serviceability. The Cisco Unity Connection cluster subscriber server has changed to Primary Status (failover has occurred). To review server status for the cluster, go to the Tools > Cluster Management page of Cisco Unity Connection Serviceability.

I do not know the cli command to view the replication status.

Also, the server automatically rebooted when I changed the name and then rebooted again when I tried to change it back.

Seth

David Hailey · ‎03-15-2010

OK. So, I have some CLI info to provide but first I wanted to inquire about the following:

I noticed that the Subscriber is tu-voip-unity2; however, the Publisher is tu-voip-unit1 (notice the missing "y" in the Publisher host name). Maybe this was intentional - I can't say as I don't know your environment. But, if you are relying on DNS for communication and you misnamed the host on the CUC Publisher then this would be a cause for problems.

On to the CLI:

I know that on CUCM, DB replication is stopped after you change the hostname of the Publisher. So, assuming this is still a non-production system (I think you indicated that in your first post), here are some things I'd look at.

On the subscriber, run "show tech network hosts" from the CLI and see what its host table looks like. You should also be able to set the new hostname of the Publisher via the CLI. On the Subscriber, you'd need to run "set network cluster publisher hostname " where hostname is the correct hostname of the Publisher.

To check replication, you need to run the following command on both servers:

show perf query class "Number of Replicates Created and State of Replication"

On each server you will see a value from 0 to 4. A value of 2 indicates replication is good. 0 is not started, 1 indicates an issue with replication counts, 3 means replication is bad, and 4 means replication did not succeed.
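If you end up checking this repeatedly, the state values can be decoded with a few lines of Python. This is only a sketch: the meanings come from the list above, the function names are hypothetical, and the sample text assumes the output layout seen elsewhere in this thread.

```python
import re

# Meanings of Replicate_State as described in this post.
REPLICATE_STATES = {
    0: "not started",
    1: "issue with replication counts",
    2: "replication is good",
    3: "replication is bad",
    4: "replication did not succeed",
}


def replicate_state(output):
    """Extract the Replicate_State integer from pasted command output, or None."""
    match = re.search(r"Replicate_State\s*=\s*(\d+)", output)
    return int(match.group(1)) if match else None


def describe(output):
    """Return a short human-readable summary of the replication state."""
    state = replicate_state(output)
    if state is None:
        return "no Replicate_State found"
    return f"{state}: {REPLICATE_STATES.get(state, 'unknown state')}"


sample = """\
- Perf class (Number of Replicates Created and State of Replication) has instances and values:
ReplicateCount -> Number of Replicates Created = 427
ReplicateCount -> Replicate_State = 2
"""

print(describe(sample))
```

Paste the output from each server in turn; anything other than 2 on either node means replication still needs attention.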

Replication is likely to be hosed up (if functioning at all). However, if you can get the 2 servers to be cognizant of each other again then you could attempt to reset replication from the Publisher CLI using "utils dbreplication reset all".

Hailey

srosenthal · ‎03-15-2010

Well the naming is the root of my problem. It was tu-voip-unit1 and I changed it to tu-voip-unity1. Then after the reboot it came back up with licensing errors so I thought I could just change it back to tu-voip-unit1 and that is where I am at now.

I would like it to be tu-voip-unity1 if possible as that is what I was asked to name it.

Here is the output from show tech network hosts

127.0.0.1 localhost
::1 localhost
172.30.56.203 TU-VOIP-UNIT1.henrico.lib.va.us TU-VOIP-UNIT1
172.30.56.204 TU-VOIP-UNITY2.henrico.lib.va.us TU-VOIP-UNITY2

I changed the name on the pub to tu-voip-unity1. Does case matter?

I also changed the name on the subscriber to match, using the command you listed.

Here is the output of the command you gave.

admin:show perf query class "Number of Replicates Created and State of Replication"
==>query class :

- Perf class (Number of Replicates Created and State of Replication) has instances and values:
ReplicateCount -> Number of Replicates Created = 427
ReplicateCount -> Replicate_State = 2

The pub is rebooting (and was when I got the above output), so I will update when it's done.

Seth

David Hailey · ‎03-15-2010

Case doesn't matter. Let me know what happens when the cluster reboots.
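To illustrate: DNS hostnames compare case-insensitively, so the upper-case entries in the hosts table and the lower-case names from "show network cluster" refer to the same hosts. A rough Python sketch of that comparison (the helper name is just for illustration):

```python
def same_host(a, b):
    """Compare two hostnames the way DNS does: ignoring case and a trailing dot."""
    return a.rstrip(".").lower() == b.rstrip(".").lower()


# Same name, different case: treated as the same host.
print(same_host("TU-VOIP-UNITY2.henrico.lib.va.us", "tu-voip-unity2.henrico.lib.va.us"))

# A missing letter is a different host, regardless of case.
print(same_host("TU-VOIP-UNIT1", "tu-voip-unity1"))
```

So the capitalization is harmless, but a dropped letter like unit1 versus unity1 is a real mismatch.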

srosenthal · ‎03-15-2010

Ok, the pub is back up and I can web into it, a great improvement from before.

Here is what it says after login.

The Cisco Unity Connection cluster subscriber server has changed to Primary Status (failover has occurred). To review server status for the cluster, go to the Tools > Cluster Management page of Cisco Unity Connection Serviceability.

So I guess the question is what next and how do I make the pub primary again?

Dude, thank you so much for all the help. I owe you a cold one!

Here is the output from the pub -

admin:show perf query class "Number of Replicates Created and State of Replication"
==>query class :

- Perf class (Number of Replicates Created and State of Replication) has instances and values:
ReplicateCount -> Number of Replicates Created = 0
ReplicateCount -> Replicate_State = 0

Here is from the subscriber -

admin:show perf query class "Number of Replicates Created and State of Replication"
==>query class :

- Perf class (Number of Replicates Created and State of Replication) has instances and values:
ReplicateCount -> Number of Replicates Created = 427
ReplicateCount -> Replicate_State = 2

Seth

David Hailey · ‎03-15-2010

No problem, brother. By the name of the servers, I don't think it would be impossible to buy me a cold one at some point...not too far away. So, go to Cisco Unity Connection Serviceability > Tools > Cluster Management and you'll see controls on how to change the server status between the cluster servers. BEFORE you do that, you should log in to the CLI of BOTH servers and run the DB replication command I sent you before. Make sure the replicate state is 2 on both servers. If so, swap primary status and then do some shakeout tests (add a user on one server, make sure it replicates to the other then delete it from the second server and make sure it's deleted from the first, etc). You know the drill from there.

Let me know what comes of it.
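The "state 2 on both servers before you swap" rule can be written down as a tiny Python check. Treat it as a sketch only: it assumes you paste in the "show perf query class" output from each node, and the function names are hypothetical. The sample strings are the values you posted above.

```python
import re


def replicate_state(output):
    """Extract the Replicate_State integer from pasted command output, or None."""
    match = re.search(r"Replicate_State\s*=\s*(\d+)", output)
    return int(match.group(1)) if match else None


def safe_to_swap(pub_output, sub_output):
    """True only when both nodes report Replicate_State 2."""
    return replicate_state(pub_output) == 2 and replicate_state(sub_output) == 2


pub = (
    "ReplicateCount -> Number of Replicates Created = 0\n"
    "ReplicateCount -> Replicate_State = 0\n"
)
sub = (
    "ReplicateCount -> Number of Replicates Created = 427\n"
    "ReplicateCount -> Replicate_State = 2\n"
)

# With the pub at 0 and the sub at 2, this reports that it is not yet safe.
print(safe_to_swap(pub, sub))
```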

Hailey

srosenthal · ‎03-15-2010

Ok,

I did the command on both and the subscriber told me that it had to be done from the pub.

Here is the pub output.

admin:utils dbreplication reset all
This command will try to start Replication reset and will return in 1-2 minutes.
Background repair of replication will continue after that for 1 hour.
Please watch RTMT replication state. It should go from 0 to 2. When all subs
have an RTMT Replicate State of 2, replication is complete.
If Sub replication state becomes 4 or 1, there is an error in replication setup.
Monitor the RTMT counters on all subs to determine when replication is complete.
Error details if found will be listed below
OK [172.30.56.204]
admin:
admin:show perf query class "Number of Replicates Created and State of Replication"
==>query class :

- Perf class (Number of Replicates Created and State of Replication) has instances and values:
ReplicateCount -> Number of Replicates Created = 0
ReplicateCount -> Replicate_State = 0

I will let it adjust overnight and check the show perf command tomorrow. As per your instructions, I will wait until it shows 2 before making the pub primary.

Seth

David Hailey · ‎03-15-2010

Ah, I actually was referring to the command to check the DB replication state. You are resetting the cluster replication, which is why it has to be done from the pub...but that's OK, it's probably not a bad idea after all the name changes, etc. Typically, you should do that after resetting the Publisher's host name anyway (at least with CUCM, that's the case). But yeah, let it run and see that everything returns OK. If not, then we can probably deal with that too.

Hailey

srosenthal · ‎03-16-2010

Ok, the pub still shows 0 for the replicate state.

I am not sure which other command you are talking about as I looked back and did not see it. Could be a combination of either too late at night or too early this morning.

I eagerly await your next command!

David Hailey · ‎03-16-2010

OK. So, given the scenario here is what I'd do:

Manually fail back to the Publisher to make it primary. Then use the following procedures in the attached document (Page 2-2 - Manually Change a Server from Secondary to Primary Status) to check that everything there is taken care of.

From there, I'd run the following command on each server to see if you get the same output for replication status, but opposite of what you saw last night; i.e., see if you get 2 on the Pub and 0 on the Sub. If so, this may be expected due to the HA setup of the cluster and how the SRM behaves when failover occurs. The command to check DB replication status is: show perf query class "Number of Replicates Created and State of Replication". Run it on both servers.

If switching back to the primary doesn't initiate replication on that node, then I would assume there is still an underlying problem. The replication status of that node should be 2. There are 2 options as I see it from here. You could attempt to restart the replication like you did last night (worth a shot) OR you can fail back to the Subscriber and go the TAC case route. If the Publisher has been hosed up, you can't just build a new server and tell it the Subscriber is the Publisher. You will have to rebuild both nodes from scratch. Luckily for you, these aren't in production yet, so you have the opportunity to do so without impacting anyone if it absolutely needs to be done.

Now, let's assume the Publisher is good and replication status is 2 but the Subscriber initially has a status of 0. You'll need to go to the CU Admin on the Publisher and create a new user w/mailbox. Then log in to the CU Admin of the Subscriber and verify that the new user shows up there as well. While you're in the Subscriber, delete that new user and then verify that it is removed from the CU Admin on the Pub. You should run RTMT the whole time. Use the Perf counters in RTMT to watch the Replicate State and Replicates Created counters during this test. When you create the new user w/mailbox on the Publisher, if the cluster is behaving normally then the status of replication for BOTH nodes should be 2. In other words, the Subscriber should go from 0 to 2. If it does not, then you are back to where you should probably go the TAC case route. While it would really suck to have to rebuild both nodes from scratch, you'd need someone from TAC to verify what is wrong within the cluster and if it can be fixed with the servers as-built. If it cannot, I'd assume your obligation to the client is to deliver a healthy, working system and the only way to be positive of that after these issues may be to rebuild from scratch.

If what I've told you helps you out and you think the cluster is functioning normally again, you need to test the hell out of it. I'd test manual failovers and failbacks, introduced failovers (i.e., shut down the Primary), test calls in every scenario, and monitor via RTMT as you go.

Let me know what's up.

Hailey

srosenthal · ‎03-16-2010

Hailey,

You said: "If so, this may be expected due to the HA setup of the cluster and how the SRM behaves when failover occurs."

I am unfamiliar with HA, but this came up in discussion with a co-worker today. I think my licensing is such that only one server is active at a time, not load-shared between both. He said something about how my ports should show up split, half on one server and half on the other. Is this correct?

Seth