Questions to those using two Client Managers

Carolanne Fougerat · ‎09-27-2013

Hi, I have some questions for Tidal clients who have upgraded to TES 6.1 and are using two client managers:

Do you have each client manager (CM) in a different data center for DR? Are your Tidal servers installed across data centers?
Have you noticed any discrepancies in behavior between the two CMs?
Have you noticed sync discrepancies in rare instances? For example, one CM not showing the same number of schedules or sessions or the same job status as the other?
Have you noticed multiple sessions from a single signed-on user occurring frequently (by querying the usersession table in the TES schema)?
Do you frequently get the annoying RPC error (the one that tells you to reconnect, but you ignore it and select No anyway)?
Were some of the issues above alleviated by a configuration change in your load balancer? If so, what?
Is your load balancer session timeout the same as or greater than the TES session timeout (30 min, I think)? Of course, I never really knew if this 30 minutes worked or what I was supposed to see when the session timed out due to inactivity. Have you tried changing this default?
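For the question about multiple sessions per user, here is a rough sketch of how the count could be checked once the rows have been pulled from the usersession table. The (user, session id) row shape, the function name, and the sample rows are assumptions made up for illustration; the real schema may look different.

```python
from collections import Counter

def users_with_multiple_sessions(rows, threshold=1):
    """Return {user: session_count} for users with more than `threshold` sessions.

    `rows` is any iterable of (user_name, session_id) pairs, however they
    were fetched. The pair layout is a hypothetical stand-in for whatever
    the usersession table actually exposes.
    """
    counts = Counter(user for user, _session_id in rows)
    return {user: n for user, n in counts.items() if n > threshold}

# Made-up sample rows, not real data.
sample_rows = [("alice", "s1"), ("alice", "s2"), ("bob", "s3")]
print(users_with_multiple_sessions(sample_rows))
```

Running the same check against each CM at roughly the same time would show whether the extra sessions pile up on one CM or on both.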

binduhima · ‎10-03-2013

I'm also having the same issues. I would like to know about the DR setup.

We are also seeing a lot of RPC error messages.

Carolanne Fougerat · ‎10-03-2013

I logged a case about multi data center DR, and I was told that this version and earlier ones assume that the failover architecture is all in the same data center. I was also told that 6.2 may address multi datacenter architecture.

We have been running on two data centers since 5.3 (making sure we had redundant components in each data center), so I was surprised to learn this. I was not the Tidal admin when we instituted our failover environment in 5.3, so I don't have the background history.

In the 5.3 environment, with just the db and master (and no client manager to contend with), we did not notice a difference in performance when the master was not in the same data center as our database. So we are hoping it won't be an issue in 6.1. But every client is different and unique, so I can't really say that what worked or didn't work for us will be the same for you.

Our TES 6.1 DEV architecture uses only one data center, and we are already seeing performance issues from time to time but cannot figure out the source. That makes it more important for us to know whether multi datacenter will aggravate the issue even more. We are also looking into trying out tools like AppDynamics or New Relic to help us determine where the slowness is consistently stemming from.

The RPC errors continue to be a head scratcher. I don't know if our load balancer (GSS) is the culprit, or simply having two CMs, or the combination of both CMs and GSS, or the fact that we're using VMs, or that we're using a clustered database, or that we're undersized, etc. The variables are endless. I will try to test that more formally in the next few weeks.

Tracy Donmoyer · ‎10-03-2013

I have two comments about multi-data center DR.

I never realized the Fault Tolerant option assumed both masters were in the same data center. Kind of raises the question, what's the point? I've always struggled with the value Fault Tolerance adds to the environment since it only fails over the master. The database and clients require manual intervention.
The recommendation from Tidal for v6 is that all components (Master, Client Manager, Databases) be on the same network segment which implies the same data center. We haven't tested performance when the Master and Client Manager are in physically different locations. If anyone has tested this I would like to know the effect it had on performance.

Carolanne Fougerat · ‎10-03-2013

I know what you mean. The current TES fault tolerance architecture is apparently mainly for server component failure. Since our datacenters are in the same city, a few miles apart, and we have a robust pipe between the two of them, we've never had issues with our other systems or on 5.3. I can imagine this not being workable for cross-country datacenters (which is the DR best practice).

We have just finished setting up our PRD environment, where I am doing practice upgrade passes, and that is where I will do DR testing. I have taken pains to disable all jobs and agents, since it is mainly the CM latency I am concerned with. We also have an Oracle cluster in one DC and a standby cluster in another. I will compare the difference in latency when all components are in the same DC vs. when they are not. We also need to test for when a DC becomes unreachable. Systems with a redundant architecture should continue to function on one DC. So I put the FM where the backup master is, and put the primary master by itself, since the backup master cannot come up without the FM, but the primary can if you take FT OFF (tesm command option). We plan on having CM1 in DC1 and CM2 in DC2, so I will see how it goes next week. Actually I have it up like this now, I just haven't done the formal stopwatch comparisons yet. If it is really bad, I will have to redo the plan so that everything is in one datacenter most of the time, and when we switch over to the other datacenter, we switch over everything. That will also mean that I can only have one CM active, which was not something I originally planned on. This impacts our database and OS patching strategy.
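The stopwatch comparison could also be scripted instead of timed by hand. This is only a minimal sketch, assuming the step being timed can be wrapped in a repeatable function (for example, loading one screen through a given CM). The helper names `time_calls` and `compare` and the sample numbers are hypothetical and not part of TES.

```python
import statistics
import time

def time_calls(action, repeats=5):
    """Run action() `repeats` times and return the elapsed seconds of each run."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return samples

def compare(same_dc_samples, split_dc_samples):
    """Summarize two sets of timings by their medians and the slowdown ratio."""
    same = statistics.median(same_dc_samples)
    split = statistics.median(split_dc_samples)
    ratio = split / same if same > 0 else None  # avoid dividing by zero
    return {"same_dc_median": same, "split_dc_median": split, "ratio": ratio}

# Made-up timings in seconds, standing in for real measurements.
print(compare([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The median is used rather than the mean so that one slow outlier (such as a first load with a cold cache) does not skew the comparison.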

Our performance issues in 6.1 are mainly with navigation (though during load tests we notice queueing on the master server). As mentioned before, in 5.3 we already used the two masters in a separate datacenter configuration with no issues. Hopefully 6.1 is not too different. Well, actually I am hoping it will be better.

Carolanne Fougerat · ‎10-16-2013

Ah, I forgot to reply to Tracy's question about the FT architecture and what the value is if it is only for one datacenter. Even though we use it across two separate datacenters, the value for us is that during our monthly server/OS patching and maintenance, we are still able to maintain 24x7 availability for the master because of the FT feature. We just make sure that the FM and both masters are not patched at the same time. It will be even more important when we move to Linux, where patching downtime takes an entire hour.

But I definitely agree that the value of fault tolerance would be extended by a multi datacenter architecture for DR purposes. Of course, even if Tidal has multi data center DR, if most of the apps running the jobs don't, then it's moot. That is why most of our mission-critical apps have to have full redundancy across datacenters.

jpforums2 · ‎10-22-2013

Hi Carolanne, we are in the middle of upgrading TES from 5.3.1 to 6.1. We have set it up in lower environments and are testing the upgrade. We also have a similar setup (PM, BM, FM, 2 CMs, load balancer). I'm still looking into options on how to set up the DR. Have you installed a separate CM for DR purposes? I'm interested to know about your DR setup. Can you share your experiences and recommendations?

Thanks,
John

Carolanne Fougerat · ‎10-24-2013

We are still in the middle of testing the two CMs in a two-datacenter setup. Again, since this is not a supported configuration, I really can't give any advice other than what I have already shared. I am sharing my experience, hoping that others who have figured this out ahead of us will share too.

Once I have something more conclusive, I will share.