Solved: CUCM WAN REDUNDANCY

a.gooding · ‎11-01-2012

Guys,

Just wanted to get some general input on a scenario.

We have had two instances of the following and we are wondering if there is something we are missing.

QTY 1 UCS C210 at "HQ" and the other QTY 1 C210 at "DR Site"

Sites linked by 10 to 15 Mbps links (generally speaking more on the 10 Mbps side)

HQ is running PUB CUCM, SUB CUC and UCCX, and DR is only running CUCM and CUC for now

All versions at latest 8.6

CIMC version factory (I noted a bug with the FAN alert on 1.42 so I just upgraded to 1.43)

BIOS version Factory

ISSUE

We are having some intermittent breaks in communication between the two systems, but each customer is saying that their links are fine. I'll go with the customer, as I'm not actually seeing any drops or physical drops on their links. Their bandwidth varies, but only slightly, and does NOT drop lower than 6 Mbps. Applications generally take up 1.5 to 2.5 Mbps, but I'm not 100% sure of this at peak periods per month.
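One way to check the "links are fine" claim independently is to timestamp reachability from one site to the other and see whether the breaks line up with the lock-ups. A minimal Python sketch, assuming a host at one site can open a TCP connection to the node at the other site. The function names, the port 22 default, and the probe interval are illustrative and not from the thread.

```python
import socket
import time
from datetime import datetime


def probe(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_outages(samples):
    """Group consecutive failed samples into (first_fail, last_fail) intervals.

    samples is a list of (timestamp, ok) tuples in time order.
    """
    outages = []
    start = last = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts          # first failure of a new outage
            last = ts               # most recent failure so far
        elif start is not None:
            outages.append((start, last))
            start = last = None
    if start is not None:           # outage still open at end of data
        outages.append((start, last))
    return outages


def monitor(host, port=22, interval=5.0, count=12):
    """Probe host:port count times, interval seconds apart; return samples."""
    samples = []
    for _ in range(count):
        samples.append((datetime.now(), probe(host, port)))
        time.sleep(interval)
    return samples
```

Running `find_outages(monitor("192.0.2.10"))` (a placeholder documentation address) from each site would give a list of break intervals to compare against the times the servers stopped responding.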

The break causes the MAIN CUCM to "lock up" and I can't access the CLI or web page. A restart works. if this is happening the SUBSCRIBERS.

DB replication is also broken and I have to stop, drop and reset it to bring it back up again.

SUBS may also be affected, and just recently we had to do a reboot and it could no longer boot into CUCM. I used the recovery disk to temporarily recover.

Some other Symptoms

1. some phones may de-register and stay that way

2. I can no longer ping the VM

3. I cannot use the vSphere client to reboot the VM, as it says "in progress" and just stays there

I do admit that generally we upgrade all systems properly; however, we had not been upgrading the CIMC and BIOS prior to this issue.

Other notable configs

Using the two NICs on board: one for CIMC and the other for VM server

The other four NICs are combined as another vSwitch and this is used as a pool for the VMs

Has anyone encountered anything similar to this?

Thanks in advance for the replies.

Robert Thomas · ‎11-02-2012

Sounds like it might be an issue with interrupt remapping. I wrote a document on that; however, I'm not sure which FW that is fixed in, so we would have to check.

I agree the CLI should not lock up with an SDL link failure.

I would not go with the customer on their networking either. Just set a couple of SPANs on the Nexus, and configure Wireshark to do a rolling capture on both ends. You will then have packet-level data when the drop happens, so you can actually confirm or rule out the network as the cause of the drops.
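A rolling capture keeps a fixed number of files and overwrites the oldest, so it can run until the next drop without filling the disk. A minimal Python sketch, assuming Wireshark's `dumpcap` is installed on the capture host attached to the SPAN port. The helper name, interface, output path, sizes, and filter address are placeholders, not values from this thread.

```python
import shlex


def rolling_capture_cmd(interface, out_file, file_kb=102400, files=20,
                        capture_filter=None):
    """Build a dumpcap command line that captures into a ring buffer.

    -b filesize:N  switch to the next file after N kB
    -b files:N     keep at most N files, overwriting the oldest
    -f FILTER      optional capture filter (pcap filter syntax)
    """
    cmd = [
        "dumpcap",
        "-i", interface,
        "-b", f"filesize:{file_kb}",
        "-b", f"files:{files}",
        "-w", out_file,
    ]
    if capture_filter:
        cmd += ["-f", capture_filter]
    return cmd


def as_shell(cmd):
    """Render the argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in cmd)
```

For example, `as_shell(rolling_capture_cmd("eth0", "/tmp/cucm.pcapng", capture_filter="host 192.0.2.10"))` produces a command that could be run on each end, with the filter limiting the capture to traffic to or from the remote node.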

Aaron Harrison · ‎11-02-2012

Hi

WAN problems should not cause server lock ups (assuming you are not attempting to access the CLI/webpage over the WAN during congestion).

It sounds more like you have some sort of server issue.

I would check:

1) That you have used the latest Host Update CD to update all the UCS firmware to a consistent/recent level. Don't update the components individually.

2) That you have patched VMware to the latest available level. This would include various drivers/software patches that might help.

3) Ensure that you have followed the installation steps properly - i.e. you have installed VMware tools on all VMs, and have also set the LRO disabled settings (http://docwiki.cisco.com/wiki/Disable_LRO)

Regards

Aaron

a.gooding · ‎11-02-2012

Just a note. I was able to bring everything back up; however, I wasn't able to ping because although the VMs looked like they were up, they actually were NOT. I'm administering these systems remotely, so I thought it may have been something on my end. I decided to set up a local system and used that, but it was the same issue. So I tried powering down one VM and it said "in progress" and stayed there.

Eventually I rebooted the PUB and SUB physical servers and the PUB systems came back up. All my SUB systems gave an INODE orphan issue and would not boot, so I repaired them with the recovery disk. They all came back up, DB replication was checked in the CLI as well, and all was good.

I'll google to find your document on remapping, Robert, but would you be able to shoot me a link just in case?

Also, this is VM 5. I do have one other system set up with VM 4, and although it's only a few months in, we haven't had any issues with that. (Note this customer has large bandwidth and very stable links.)

I think I'll open a TAC case just to get another eye on it, since I'm thinking we still might be missing something here.

Just a NOTE that we are NOT using any fabric interconnect or top of rack or anything of that sort. It's just 2 UCS C210 M2 with 10 HDDs connected to 3750 switches on each side. We are using the 2 onboard NICs for management and VM server and the other 4 pooled together to service the 4 VMs.

Currently we have the following setup on the FIRST

CUCM

CUC

CUBAC

On the second

CUCM - SUB

CUC - SUB (currently shut down because we are awaiting HA license)

UCCX (currently shut down awaiting license)

thanks in advance

a.gooding · ‎11-02-2012

Only now getting around to reading your write-up

https://supportforums.cisco.com/docs/DOC-23667

and it does seem to point in that direction. I did upgrade CIMC to 1.43, and based on a comment on your write-up it was fixed after 1.42.

I still have the BIOS to sort out, so I'll ensure that this is done.

I definitely will post if this IS NOT the final fix.

Thanks once again.

a.gooding · ‎11-24-2012

Guys,

Just posting a final update just in case. I never got around to updating the BIOS and the same thing happened. I had only upgraded the CIMC.

In any event, just go ahead and upgrade both and all will be fine.

ISSUES

1. VM client non-responsive

2. applications crashing and failing to start for no apparent reason

3. general failure of all systems. Note that PUB/SUB setups are what I noticed it with. The standalone apps running off the system, like CUBAC and a SINGLE UCCX, seemed not to be affected.

FIX

1. UPGRADE to the latest firmware for the CIMC and BIOS (we learnt this in Network+, right?). You can download a single package file, an ISO you can boot from, that allows everything to be upgraded if required.

thanks anyway