Concurrent CiscoProviders - terminals going out of service for reason: rehome to higher priority callmanager

stephan.steiner
Spotlight

So I need to come up with a solution that allows my app to switch over to another CUCM with the smallest possible delay when a server goes down. There are timers on the CiscoJtapiProperties, specifically heartbeatInterval, and I'm guessing javaSocketConnectTimeout (I'm guessing it applies to all sockets that the JTAPI lib opens, correct?). While I can considerably speed up switchover from one CUCM (CTI Manager) to another by playing with these timers, I'm still stuck with a significant service interruption of several dozen seconds, which is about the amount of time it takes a normal human being to give up on a call.

 

So as @dstaudt suggested, I implemented my own connectivity check based on sockets. And my app now keeps a CiscoProvider for every CUCM and only opens them when the CTI Manager is online (currently I assume that if I can connect to the CTI Manager port, I can open a CiscoProvider).
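
A minimal sketch of such a socket-based reachability probe, assuming the default CTI Manager (QBE) port 2748; the class name, timeout value, and host names are illustrative, not from the original post:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch of a socket-based CTI Manager reachability check.
// Port 2748 is the default (non-secure) CTI QBE port on CUCM.
public class CtiReachability {
    static final int CTI_PORT = 2748;

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    public static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (Exception e) {
            return false; // unresolvable host, refused, or timed out
        }
    }
}
```

In practice this would run on a periodic timer per CUCM node, opening the CiscoProvider when the probe starts succeeding and closing it when it fails.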

 

While that works, I'm now facing an interesting behavior. If my app starts and my pub is down, my app opens a connection to the sub, and does its thing. Then the pub comes up. So I open a CiscoProvider for the pub, iterate through devices, register observers - the usual thing. And I see terminals and lines going in service on the CiscoProvider for the pub. Then in the same second, I get a bunch of events telling me that all my terminals went out of service on the sub, and if I look at the reason code, it translates to "rehome to higher priority callmanager".

 

Here are all the events I'm getting for one particular terminal (CP_PMGR_1, a CTI Port) and its address (+41997770001); srvcucm12 is the pub, srvcucm12s is the sub:

06.07 20:00:17 - Terminal CP_PMGR_1 went into service on srvcucm12.nxodev.intra
06.07 20:00:17 - +41997770001 went out of service on terminal CP_PMGR_1 on srvcucm12.nxodev.intra
06.07 20:00:17 - Terminal CP_PMGR_1 registered with the callmanager srvcucm12s.nxodev.intra
06.07 20:00:17 - +41997770001 went in service on terminal CP_PMGR_1 on srvcucm12.nxodev.intra
06.07 20:00:17 - Terminal CP_PMGR_1 deregistered from the callmanager: srvcucm12s.nxodev.intra, reason: unknown
06.07 20:00:17 - +41997770001 went out of service on terminal CP_PMGR_1 on srvcucm12.nxodev.intra
06.07 20:00:17 - +41997770001 went out of service on terminal CP_PMGR_1 on srvcucm12s.nxodev.intra
06.07 20:00:17 - Terminal CP_PMGR_1 went into service on srvcucm12s.nxodev.intra
06.07 20:00:17 - Terminal CP_PMGR_1 registered with the callmanager srvcucm12s.nxodev.intra
06.07 20:00:17 - Terminal CP_PMGR_1 went out of service on srvcucm12.nxodev.intra. Reason unregistered
06.07 20:00:17 - +41997770001 went in service on terminal CP_PMGR_1 on srvcucm12s.nxodev.intra
06.07 20:00:17 - +41997770001 went in service on terminal CP_PMGR_1 on srvcucm12.nxodev.intra
06.07 20:00:17 - Terminal CP_PMGR_1 went into service on srvcucm12.nxodev.intra
06.07 20:00:17 - Terminal CP_PMGR_1 deregistered from the callmanager: srvcucm12.nxodev.intra, reason: unknown
06.07 20:00:17 - Terminal CP_PMGR_1 registered with the callmanager srvcucm12.nxodev.intra

In the end, both address and terminal are up on both providers, but what's with the temporary deregistration? If my app is supposed to be doing something while this happens, things are going to get really messy.

 

So I'm wondering how other apps handle this. Take UCCX, for example: you wouldn't want to drop calls waiting in a queue, or calls in the process of being transferred to an agent, if the CTI Manager goes down at the worst possible time. So what am I missing?

 

When I go the other way, i.e. start with the pub online and the sub offline, then once the sub comes back online I'm seeing these events - so nothing is going down:

 

06.07 20:28:27 - Terminal CP_PMGR_1 went into service on srvcucm12s.nxodev.intra
06.07 20:28:27 - +41997770001 went out of service on terminal CP_PMGR_1 on srvcucm12s.nxodev.intra
06.07 20:28:27 - +41997770001 went in service on terminal CP_PMGR_1 on srvcucm12s.nxodev.intra
06.07 20:28:30 - Terminal CP_PMGR_1 registered with the callmanager srvcucm12.nxodev.intra

When I take down the pub with both CiscoProviders up, I get these events, so once again devices go down on the CUCM that is still operational:

06.07 20:31:55 - +41997770001 went out of service on terminal CP_PMGR_1 on srvcucm12.nxodev.intra
06.07 20:31:55 - Terminal CP_PMGR_1 went out of service on srvcucm12.nxodev.intra. Reason cti manager failure
06.07 20:32:06 - Terminal CP_PMGR_1 went out of service on srvcucm12s.nxodev.intra. Reason unregistered
06.07 20:32:06 - Terminal CP_PMGR_1 deregistered from the callmanager: srvcucm12s.nxodev.intra, reason: unknown
06.07 20:32:06 - +41997770001 went out of service on terminal CP_PMGR_1 on srvcucm12s.nxodev.intra
06.07 20:32:11 - Terminal CP_PMGR_1 went into service on srvcucm12s.nxodev.intra
06.07 20:32:11 - Terminal CP_PMGR_1 registered with the callmanager srvcucm12s.nxodev.intra
06.07 20:32:11 - +41997770001 went in service on terminal CP_PMGR_1 on srvcucm12s.nxodev.intra

So the device goes down and up again on the sub. To me there's no reason why it should go down on the sub.
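
Since the app keeps a CiscoProvider per CUCM anyway, one way to shield the rest of the application from a flap on a single provider (again a generic sketch, not a Cisco API; the class and parameter names are illustrative) is to track in-service state per (terminal, provider) and only consider a terminal down when it is out of service on every connected provider:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: merge per-provider service state so a terminal counts as available
// while it is in service on at least one provider. In the real app the
// terminal/provider names would come from the JTAPI terminal observers.
public class MergedServiceState {
    // terminal -> set of providers on which it is currently in service
    private final Map<String, Set<String>> inService = new ConcurrentHashMap<>();

    public void markInService(String terminal, String provider) {
        inService.computeIfAbsent(terminal, t -> ConcurrentHashMap.newKeySet())
                 .add(provider);
    }

    public void markOutOfService(String terminal, String provider) {
        Set<String> s = inService.get(terminal);
        if (s != null) s.remove(provider);
    }

    public boolean isAvailable(String terminal) {
        Set<String> s = inService.get(terminal);
        return s != null && !s.isEmpty();
    }
}
```

With this in place, the sub-side flap in the log above would leave isAvailable() true throughout, because the terminal stays in service on the pub's provider.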

5 Replies

dstaudt
Cisco Employee

Are you sure all calls to getProvider() are specifying only a single CUCM? If you specify more than one, then JTAPI will be implementing its own failover mechanisms, which may interfere with what you're doing...

If not, I'm at a loss as to why, in your first scenario, the original Provider to the sub would ever experience 'rehoming to higher priority CTI Manager'. Though looking at your description, you mention the message is 'rehome to higher priority callmanager' - is it possible that these devices are in fact registering to the Pub/Sub based on their device pool? Maybe you're seeing CUCM-failover-triggered messages instead of CTI-Manager-triggered messages? Are you getting ProvOutOfServiceEv on that Provider to the Sub? If not, that suggests it's a CUCM failover/failback situation, not CTI Manager triggered.

Yup, when I create different providers, I provide a single FQDN to getProvider().

 

I have to ask the guys who built the lab whether there are devices that prefer one CUCM over another.

 


@dstaudt wrote:

Are you sure all calls to getProvider() are specifying only a single CUCM? 


Yup, when I do multi-provider, I provide a single FQDN to getProvider()


@dstaudt wrote:

.is it possible that these devices are in fact registering to the Pub/Sub based on their device pool?  Maybe you're seeing CUCM failover triggered messages instead of CTI Manager triggered messages..? 


Where would I see that? I certainly didn't configure anything to specify a preferred registration target, but then I didn't build the cluster, either.


@dstaudt wrote:

Are you getting ProvOutOfServiceEv on that Provider to the Sub? - if not, that suggests its a CUCM failover/failback situation, not CTI Manager triggered.


No, I'm only getting ProvOutOfServiceEv when I take the corresponding CUCM offline.

Perhaps something like this is happening?

- Pub is down, Sub is up

- CTIP1 is configured with a device pool causing it to prefer registration to Pub with fallback to Sub - it is currently registered to Sub

- Your Provider1 is connected to Sub, everything is in-service/registered

- You bring up the Pub

- Your app detects that Pub is up, and opens a new Provider2 to the Pub - CTIP1 goes in-service for Provider2

- Meanwhile, CTIP1 now sees that its preferred registration target is available, and since it has no active calls, registration rehomes to Pub

- This causes your Provider1 to see the device go out/in-service briefly while this rehome occurs - this should only occur for idle devices

If I'm off base and your Provider2 is seeing devices go out-of-service due to CTI Manager rehoming to a CTI Manager that wasn't listed in the getProvider() string, then something is wrong and we probably need to dig into logs...

It's actually as you suspected: the device pool makes the devices switch over, and it seems this can't be turned off.

Well, you could define two device pools - one for Pub and one for Sub - that don't have any secondary/failover target, and have identical sets of CTI Ports configured with each device pool.  This would prevent any CTIP from reregistering due to rehoming, at the cost of management/maintenance overhead (and similar efforts in your app's admin/config/runtime to handle the extra duplication/complexity.)

At the end of the day, this is your app reinventing/reimplementing the CTI/CCM failover mechanisms - which as usual gives you more flexibility at the cost of increased complexity (and the reality of tussling with the native failover mechanisms...)