VCS Clustering failing from time to time Error message Failed - This peer either has a different list of peers or it has a different software version installed

mguguvcevski
Level 1

The following error message appears after some time in a cluster with only two VCS Controls:

Failed - This peer either has a different list of peers or it has a different software version installed

for the remote peer.

A restart of the slave peer solves the issue.

Has anyone come across this?

16 Replies

gubadman
Level 3

That's a bit odd. What software version are you running? Also, what is the round-trip delay between each peer and every other peer?

Thanks,

Guy

ahmashar
Level 4

Are they geographically apart? The maximum round-trip delay is 30 ms; above that, clustering will fail.

Do you have the same NTP server configured on both of them?

Are they running the same software version?

Are they on the same VLAN? Are they on the same subnet? If not, have you turned off SPI on your router/firewall, if you have any in between?

Basically, you need to provide more info; there are a lot of questions that need to be answered.

Hello Ahmad,

Yes they are geographically apart - the RTT is around 30 ms

15 packets transmitted, 15 received, 0% packet loss, time 14016ms

rtt min/avg/max/mdev = 33.326/33.755/35.132/0.476 ms

The NTP server is not the same, but I will change that, if required

Yes, they are running the same sw

The firewalls are Cisco ASA and inspect is disabled
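For anyone verifying this on their own ASA, inspection status can be checked from the CLI. These are standard ASA commands, though the exact output depends on your configured policy-maps:

```
ciscoasa# show service-policy | include sip
ciscoasa# show service-policy | include h323
```

If either inspect still shows up under the global policy, it is still active for traffic between the peers.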

Another thing is the following:

In addition to all the ports listed in Appendix 3 of the VCS clustering guide, I have come across an issue with communication between the cluster peers on high ports (> 40000).

They seem to be opening them dynamically between each other.

Port pairs I can distinguish on the ASA are for example:

45600 / 43173

42981 / 46620

48399 / 48424

41386 / 43234

44599 / 45478

46122 / 47370

49243 / 40433

After two restarts, I had to open port 46854, and the replication finally occurred.

The major setbacks are:

1. What will happen when the master or one of the peers gets restarted or rebooted?

2. How come there is no configuration documentation on these ports?

Cheers

Did you use ping ipaddressOfOtherPeer? If yes, then try ping -l 4000 ipaddressOfOtherPeer (that is the Windows syntax; on Linux the equivalent is ping -s 4000). This tests with 4000-byte packets, which is more aligned with the replication packet size between VCS peers. If the round-trip delay is above 30 ms, you need to contact your network admin to fix that; otherwise we continue with the investigation.
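As a quick sketch of that check (the peer address and the captured summary line below are just examples, not from a real system), the average RTT can be pulled out of the Linux ping summary and compared against the 30 ms limit:

```shell
# Hypothetical peer address -- substitute your other cluster peer's IP.
PEER=10.0.0.2

# Large-packet ping, closer to replication packet sizes:
#   ping -s 4000 -c 15 "$PEER"     (Linux: -s sets payload size)
#   ping -l 4000 -n 15 <peer-ip>   (Windows equivalent)

# Example summary line as printed by Linux ping; in practice, capture
# the last line of the ping output above.
RTT_LINE="rtt min/avg/max/mdev = 34.262/34.372/34.699/0.265 ms"

# Field 7 (splitting on '=', '/' and spaces) is the average RTT in ms.
AVG=$(echo "$RTT_LINE" | awk -F'[=/ ]+' '{print $7}')

# Compare against the 30 ms clustering limit.
awk -v avg="$AVG" -v limit=30 'BEGIN {
  if (avg + 0 <= limit) print "RTT OK for clustering"
  else print "RTT too high: " avg " ms"
}'
```

With the sample line above, this reports the RTT as too high, matching the situation in this thread.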

regards, Ahmad


Hello Ahmad,

I tested with 4000-byte packets (+28 bytes for the IP/ICMP headers) and here is the result:

19 packets transmitted, 19 received, 0% packet loss, time 18023ms

rtt min/avg/max/mdev = 34.262/34.372/34.699/0.265 ms

It is true it is slightly above 30 ms, but constant at 34 ms.

Is the VCS really that sensitive to the round-trip time?

And why does it not fix the replication after it returns the error, if it is capable of doing so after a restart?

30 ms is an absolute maximum; 34 ms is over 110% of that.

The cluster needs to be sorted out so that even at peak times it stays within 30 ms.

It can recover, but it takes time, and if it keeps drifting outside 30 ms it won't have a chance to.

Thanks,

Guy

For clustering, the RTT must be below 30 ms (a hard limit).

The VCS (H.323) synchronizes via NTP and is very sensitive to timing.

But then, how come the other cluster (between the VCS Expressways), which also has an RTT of about 34 ms on average, works fine?

30 packets transmitted, 30 received, 0% packet loss, time 29032ms

rtt min/avg/max/mdev = 34.237/34.582/37.126/0.690 ms

I would be doubtful that they take the same route. Did you also run a traceroute between the VCS Expressways? Are they taking the same route as the VCS Controls?

Are the VCS Expressway peers also geographically apart? No back-end connection for replication?

It appears that the clustering communication between the peers of a cluster uses far more ports than the ones documented in the clustering setup guide (H.323, IPsec, IKE). The issue was finally resolved after tracking the drops on the ASA, finding a bunch of high, randomly numbered ports, and finally opening ip any any between the clustering peers.
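For illustration, an "ip any" rule between two cluster peers on an ASA could look like the fragment below. The ACL name, interface, and addresses are all hypothetical placeholders, and such a broad rule should only be scoped to the specific peer IPs as shown:

```
! Hypothetical peer addresses and ACL name -- substitute your own.
access-list VCS_CLUSTER extended permit ip host 10.1.1.10 host 10.2.2.10
access-list VCS_CLUSTER extended permit ip host 10.2.2.10 host 10.1.1.10
access-group VCS_CLUSTER in interface inside
```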

Cheers,

M.

ahmashar
Level 4

This is from the deployment document to turn off ALG/SPI/any other packet inspection:

it is highly recommended to disable SIP and H.323 ALGs on routers/firewalls carrying network traffic to or from a VCS Expressway, as, when enabled, this is frequently found to negatively affect the built-in firewall/NAT traversal functionality of the Expressway itself.

Hello,

I have the same issue, but only with ESP (SPI). Which ports need to be excluded from ESP (SPI) or protocol inspection? I mean:

- SIP: 5060

- H.323: 1719 and 1720

- SSH: 22

- ISAKMP: 500

Any more? The high ports from 40000 to 49000 too? And the SIP media ports from 20000 to 29999? And the H.323 media ports too?
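Independent of which ports you open, the deployment guide quoted above is about disabling the inspection engines themselves. On an ASA using the default policy names (global_policy and inspection_default are the factory defaults, but yours may differ), that would be along these lines:

```
policy-map global_policy
 class inspection_default
  no inspect sip
  no inspect h323 h225
  no inspect h323 ras
```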

Thanks in advance.

Best regards.

Hi,

Due to the dynamic nature of it all, I resorted to opening all ports greater than 1024 between the VCS cluster member IPs.

Best regards
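A sketch of what "greater than 1024" could look like as ASA access-list entries, assuming hypothetical peer addresses and an ACL name of your choosing (the gt operator here matches destination ports, so a mirrored pair of rules per protocol and direction is needed):

```
! Hypothetical addresses/ACL name -- substitute your own, and mirror
! the rules for the reverse direction.
access-list VCS_CLUSTER extended permit tcp host 10.1.1.10 host 10.2.2.10 gt 1023
access-list VCS_CLUSTER extended permit udp host 10.1.1.10 host 10.2.2.10 gt 1023
```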

Hi,

Every port from 1024 to 65535 between both VCSes, for both TCP and UDP protocols?

Thanks in advance.

Best regards.