Cisco Unity 8.5: Service Temporarily Unavailable

Hi,

This morning we got the error "Service Temporarily Unavailable" whenever someone tried to call their voicemail account or a caller was directed to voicemail.

The web interface was hanging after the credential login.

Yesterday's scheduled backup also failed.

I tried to find the root cause but didn't find much.

Also, I'm surprised I didn't get any RTMT alert emails flagging the issue. Is there a way to create a custom one?

I resolved the issue by rebooting the server. The failover server then took over until the primary server was up and running, but the swap from failover back to primary took a good 5 minutes, during which we just got an engaged tone when trying to dial Unity. The status in cluster management was "Split Brain Recovery". I'm not sure yet what it means, but I'm looking into it.

Eventually the primary server came back up and the service is now restored.

In the past I had issues with CUCM where services like EM were not available after a backup failure. Could that be the case here? Where in the logs would I find an answer?

Thanks

Accepted Solutions

davrojas
Level 3

Hello Louis,

It looks like you had several symptoms:

1- "Service Temporarily Unavailable" message

2- Web interface hanging

3- Failed scheduled backup

4- Implicit replication issues

For the "Service Temporarily Unavailable" message, it looks like the services on the Publisher (the default Primary) failed, so it would be interesting to find out whether any core dumps were generated. You can check this from the CLI by running utils core active list and then, if any were generated, utils core active analyze followed by the core file name.

You mentioned that no alarms were generated; however, some application logs could still have been written during that time. To check them, go to RTMT (Real Time Monitoring Tool) > SysLog Viewer > select a node in the top drop-down > Application Logs > Cisco Syslog.

You could also check the System logs and confirm whether there were any messages like:

EXT3-fs error (device sda6) in ext3, Journal errors, or maybe even Informix or SELinux messages.
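If you save the syslog output to a text file, a small script can do the searching for you. This is only a rough sketch: the patterns are taken from the messages mentioned above, the sample lines are made up for illustration, and the exact wording in a real log may differ.

```python
import re

# Patterns based on the messages mentioned above. Real log wording may
# differ, so treat these as a starting point, not a complete list.
PATTERNS = {
    "ext3": re.compile(r"EXT3-fs error", re.IGNORECASE),
    "journal": re.compile(r"journal", re.IGNORECASE),
    "informix": re.compile(r"informix", re.IGNORECASE),
    "selinux": re.compile(r"selinux", re.IGNORECASE),
}


def find_suspect_lines(lines):
    """Return (line_number, label, text) for each line matching a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip()))
                break  # report each line only once
    return hits


# Hypothetical sample lines, not taken from a real system.
SAMPLE_LOG = [
    "07:50:01 kernel: EXT3-fs error (device sda6) in ext3",
    "07:51:10 sshd: session opened for user admin",
    "07:52:44 kernel: Journal errors detected",
]

if __name__ == "__main__":
    for number, label, text in find_suspect_lines(SAMPLE_LOG):
        print(f"{number}: [{label}] {text}")
```

To use it on a saved log, read the file with `open(path).readlines()` and pass the result to `find_suspect_lines`.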

For next time, it would be interesting to find out in real time which and how many services are [STOPPED] or "Service Not Activated". You can see this by running: utils service list
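If you capture that output to a file during an incident, something like the following can pull out everything that is not running. It is a sketch under an assumption: each line is taken to look like "Service Name[STATE]", optionally followed by a reason. The sample text is illustrative, so adjust the pattern if your output differs.

```python
import re

# Assumed line shape: "<service name>[STATE]" optionally followed by a
# reason such as "Service Not Activated". This is an assumption about the
# output layout, not a documented format.
SERVICE_LINE = re.compile(
    r"^(?P<name>.+?)\s*\[(?P<state>[A-Z]+)\]\s*(?P<reason>.*)$"
)


def parse_service_list(text):
    """Return a list of (name, state, reason) tuples from captured output."""
    services = []
    for line in text.splitlines():
        match = SERVICE_LINE.match(line.strip())
        if match:
            services.append(
                (match.group("name"), match.group("state"), match.group("reason"))
            )
    return services


def not_running(text):
    """Return only the services whose state is anything other than STARTED."""
    return [entry for entry in parse_service_list(text) if entry[1] != "STARTED"]


# Illustrative sample only, not output from a real server.
SAMPLE_OUTPUT = """\
Cisco Tomcat[STOPPED]
Cisco DRF Local[STARTED]
Connection Conversation Manager[STOPPED]
Cisco Serviceability Reporter[STOPPED]  Service Not Activated
"""

if __name__ == "__main__":
    for name, state, reason in not_running(SAMPLE_OUTPUT):
        print(f"{name}: {state} {reason}".rstrip())
```

The reason column lets you tell a service that was never activated apart from one that stopped unexpectedly.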

Tomcat is in charge of the web interface, so that symptom could possibly have been cleared by running: utils service restart Cisco Tomcat

However, given that several services were affected, this would not have done much to solve your issue.

For the scheduled backup, I would say the failure is related to the services becoming unavailable. However, if you wish to check the traces, you would need to gather:

Table 2-3     Cisco Unified Serviceability Traces for Selected Problems

Problem Area                Traces to Set        RTMT Service to Select
Backing up and restoring    Cisco DRF Local      Cisco DRF Local
                            Cisco DRF Master     Cisco DRF Master

http://www.cisco.com/en/US/docs/voice_ip_comm/connection/8x/troubleshooting/guide/8xcuctsg010.html

As for the replication and Split Brain Recovery, here are two references that can guide you:

"When the connection between the servers is restored, the status of the servers temporarily changes to Split Brain Recovery while the data is replicated between the servers and MWI settings are coordinated. When the recovery process is complete, the publisher server has Primary status and the other server has Secondary status."

http://www.cisco.com/en/US/docs/voice_ip_comm/connection/7x/cluster_administration/guide/7xcuccag020.html

"Going into SBR should not cause calls to drop.  While SBR is running, new messages left are not delivered until SBR is complete (because the MTA process which delivers messages is stopped as part of the SBR process).

If calls are being dropped, there is most likely something going on between each individual Connection server and CUCM (assuming that's what it's integrated with).  SBR does not stop any service required to accept a new message."

https://supportforums.cisco.com/thread/2131696

I would say that the reboot was definitely the best way to go. If the services had not started afterwards, the next step would have been to run the recovery disk. As for the Split Brain Recovery, this was expected.

Finally, for the alert, go to RTMT (Real Time Monitoring Tool) > Alert Central > System. You can choose any applicable alert, but in this case it would be CriticalServiceDown. Right-click it, choose Set Alert Properties and go through the steps.

-Best Regards,

-p-e-c-k-david


Hi David,

Thanks a lot for your answer. I really appreciate your input.

No core dump was found. I went into RTMT > SysLog Viewer and took a screenshot, but I couldn't find a way to download the log as text. It seems the last successful Unity user access was at 7:54; after that everyone heard the failsafe conversation, which I assume is the "Service Temporarily Unavailable" message. I can't see any ext3 filesystem errors. After that I can see the services being stopped (I guess because I reloaded the box).

I did have a look at the DRF logs earlier today. I can see that the backup stopped while doing CONNECTION_MESSAGES_UNITYMBXDB1_MESSAGES, but there is no log for CONNECTION_MESSAGES_UNITYMBXDB1_MESSAGES :-( I can see the file on the FTP server with 0 KB.

Thanks for the clarification about SBR. When I saw it I first thought of a DB issue, but I'm glad to see it's normal behaviour.

I will look at the alert a bit later on.

In the future, I will try to make the failover server primary so that I don't have to reload the primary and can investigate while it's hot without affecting the business. Unfortunately it was in the morning, I was still not fully awake, and my main priority was to restore the service as quickly as possible :-)

Thanks,

Louis

Hi again,

I had a look at the RTMT alert and it was already set. So I tend to think no services were down at the time?

Hi Louis,

To extract the log in text format, double-click the log entry and then, in the window that pops up, click the Copy button. This is the same as pressing Ctrl+C, so you can then just Ctrl+V anywhere.

Now, if everything was configured properly, I would say the assumption that no services were down comes with a few caveats:

Just as a quick check: you do have DNS configured correctly? To check, run: show network eth0

Note

To configure RTMT to send alerts via e-mail, you must configure DNS. For information on configuring the primary and secondary DNS IP addresses and the domain name in Cisco Unified Communications Manager Server Configuration, see the "DHCP Server Configuration" chapter in the Cisco Unified Communications Manager Administration Guide.

http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/service/8_0_1/rtmt/rtalert.html

Also, by default the severity is set to Critical, so any other states would not have been reported. You can always change this, but the alert only matches one condition:

CriticalServiceDown

The CriticalServiceDown alert gets generated when the service status equals down (not for other states).

From the screenshot you sent, the service looks to be CuCsMgr, so what might have happened is that the ports unregistered during the "Service Temporarily Unavailable" event. I'm not sure whether this is a symptom or the cause, since we know the GUI was also unresponsive, which leaves us suspecting that at least two services were down (Cisco Tomcat and CuCsMgr).

Based on that, I would also check whether any network events occurred around the time users were last able to log in (about 7:54), just to take all possibilities into account and not miss something in plain sight, although that still seems unlikely.
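One way to do that correlation is to filter the exported log lines down to a window around 7:54. Here is a minimal sketch, assuming every relevant line carries an HH:MM:SS timestamp and that all lines are from the same day. The sample lines are invented for illustration.

```python
import re
from datetime import time

# Assumes each relevant line contains a timestamp in HH:MM:SS form.
TIME_RE = re.compile(r"\b(\d{2}):(\d{2}):(\d{2})\b")


def lines_in_window(lines, start, end):
    """Return the lines whose first HH:MM:SS timestamp is within [start, end]."""
    selected = []
    for line in lines:
        match = TIME_RE.search(line)
        if not match:
            continue  # no timestamp on this line
        try:
            stamp = time(*(int(part) for part in match.groups()))
        except ValueError:
            continue  # digits that are not a valid time of day
        if start <= stamp <= end:
            selected.append(line)
    return selected


# Invented sample lines, not from a real system.
SAMPLE_LINES = [
    "Mar  5 07:30:00 host app: event A",
    "Mar  5 07:54:10 host app: event B",
    "Mar  5 08:10:00 host app: event C",
    "line without a timestamp",
]

if __name__ == "__main__":
    # Ten minutes either side of the last successful login at about 07:54.
    for line in lines_in_window(SAMPLE_LINES, time(7, 44), time(8, 4)):
        print(line)
```

Running the same filter over the application log, the system log and any switch or firewall logs makes it easier to line up events from different sources.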

Regarding the DRF failure, I have sometimes seen cases where the fault [if we need to point at anyone =) ] actually lies with the SFTP software being used. In those cases the SFTP server's logs come in handy as well, for example to spot insufficient space, closed sockets, lost connectivity and so forth.
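Since you found a 0 KB file on the backup server, it may be worth checking the backup folder for empty files after each run. This is a minimal sketch that would run on the SFTP server itself. The directory and file names in the demo are made up.

```python
import tempfile
from pathlib import Path


def find_empty_files(backup_dir):
    """Return the names of zero-byte regular files directly inside backup_dir."""
    return sorted(
        entry.name
        for entry in Path(backup_dir).iterdir()
        if entry.is_file() and entry.stat().st_size == 0
    )


if __name__ == "__main__":
    # Demo against a throwaway directory with made-up file names.
    with tempfile.TemporaryDirectory() as demo_dir:
        Path(demo_dir, "component_a.tar").write_bytes(b"data")
        Path(demo_dir, "component_b.tar").write_bytes(b"")
        print(find_empty_files(demo_dir))
```

An empty file usually means the transfer started but no data arrived, which fits with lost connectivity or a closed socket on the SFTP side.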

Also be aware that there are some known caveats, although given the nature of the problem they are unlikely to apply:

CSCtd29993    DRF Master [STOPPED] by default on A/A Subscriber nodes

https://tools.cisco.com/bugsearch/bug/CSCtd29993/?reffering_site=dumpcr

CSCuj35677    DRS backup failing on pub due to hostname changed.

https://tools.cisco.com/bugsearch/bug/CSCuj35677/?reffering_site=dumpcr

This one I know to be a common issue:

CSCug68527    Backup fails with archive larger than 20 GB

https://tools.cisco.com/bugsearch/bug/CSCug68527/?reffering_site=dumpcr

-Best Regards

-d-a-v-i-d-peck

Good morning David,

For the log, I see I can export line by line, but as there are many lines I wanted to export the whole lot. Is that possible?

DNS is fine; I do get email alerts, for example when a user enters a wrong password. After reloading the box I started to get email alerts as the services were going down. The severity is set to Critical, so yes, I believe no services were affected.

I will try to find some logs about the ports.

On the network side, yes, I believe something went down. I did have issues logging in yesterday morning, but I can't find any evidence for it and I hate jumping to the conclusion "Network issue, not voice..."

A network issue would make sense:

- The backup at 22:00 on the 4th failed for no apparent reason.

- The GUI was unstable in the morning.

- We couldn't access Unity mailboxes.

Maybe restarting the box didn't fix anything and I just had to wait for the network to come back up.

Regards,

Louis

Hi Louis,

Yes, you can save it to a file and then open it with Notepad or Notepad++. Click the Save button at the bottom and you will get a pop-up window to select the location.

I know the feeling about jumping to a network conclusion. I was looking around, but I do not see any other place to look than the application or system logs to confirm any sort of outage.

-Best Regards,

d-a-v-i-d-peck