Looking for Load Testing tips

Carolanne Fougerat · ‎11-21-2013

Hello,

(Apologies ahead of time for the length - getting desperate here)

Curious to know how many have performed load testing of their environment to gauge system performance and stability. We have done a series of them and we definitely see performance issues with more people in the system. But in terms of successfully pinning down the source of the slowness, it's really been a challenge.

We use JConsole at a high level to determine if server memory and CPU could use increasing, and have done that. With Cisco's assistance we have tweaked some props and dsp settings as well as applied a few hotfixes between load tests.

We are about to do another load test outside of the load balancer in an attempt to rule that one out as the cause of some issues we see and I wanted to see if there is anything I am neglecting to do.

Here are the questions I have:

What tools other than JConsole do you all use during a load test? What important indicators do you monitor and watch for aside from the obvious CPU, memory, and thread count?
We don't script the load test but just have around 15 - 20 folks spawn a few sessions and just let them move around, edit, and submit jobs within the duration of the test - usually 20 minutes. I query the TES schema to monitor who is connected and for how long, how many sessions a user has spawned, and which CM they are connected to. I also have them email me any error messages or slowness observations in addition to being in a conference call with all of them. We also have hundreds of jobs that kick off/complete as part of the daily schedule during this time.
From a 'whether it is worth the trouble or not' perspective - how deeply do you look at the thread waits and blocks to know what is excessive or what is expected? I was reading the Reference guide and Performance tuning guide to glean more wisdom in deciphering these, but I am not clear about what the acceptable thresholds for blocked/wait counts, max queue time, and execute time are. For example, I ran JConsole to monitor the active master during a normal quiet time for 20 - 30 minutes and scrutinized the thread attributes:
- All CORE MD threads have 5 million blocks and waits
- All EVENT EV has around 600-700 blocks and waits.
- All SPECIAL CX has consistently around 1 million waits and blocks.
- All COMM COM has consistently around 170K blocks and 13+ million waits

So what do all these numbers mean? Do I just use them as a baseline against load test results? Can Cisco support glean anything useful out of me tracking these? I have no idea what to make of or do with these numbers myself. I mean, I understand the role that deadlocks and race conditions play, and the impact of message queues, but there is really nothing on my end I can do with them since we don't write the code.
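On the baseline question, one way to make totals like these comparable is to turn them into rates. A minimal Python sketch, assuming the blocked/waited figures are cumulative totals (as the standard JVM per-thread blocked and waited counts are) and that you note them at the start and end of each observation window. The Snapshot layout and every figure in the usage line are made up for illustration, not real TES measurements:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Snapshot:
    """Cumulative counters for one thread group at one moment (hypothetical layout)."""
    seconds: float  # when the snapshot was taken, in seconds
    blocked: int    # cumulative blocked count read from the monitoring tool
    waited: int     # cumulative waited count read from the monitoring tool


def rates_per_minute(start: Snapshot, end: Snapshot) -> tuple[float, float]:
    """Turn two cumulative snapshots into (blocked, waited) events per minute."""
    minutes = (end.seconds - start.seconds) / 60.0
    if minutes <= 0:
        raise ValueError("end snapshot must be later than start snapshot")
    return ((end.blocked - start.blocked) / minutes,
            (end.waited - start.waited) / minutes)


def ratio_to_baseline(baseline: tuple[float, float],
                      under_load: tuple[float, float]) -> tuple[float, ...]:
    """How many times higher the load-test rates are than the quiet-time rates."""
    return tuple(load / base if base else float("inf")
                 for base, load in zip(baseline, under_load))
```

For example, ratio_to_baseline(rates_per_minute(quiet_start, quiet_end), rates_per_minute(load_start, load_end)) gives one multiplier for blocks and one for waits. A multiplier well above 1 for blocks under load would be something concrete to hand to support, whereas the raw totals mostly reflect how long the process has been up.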

Our DBA says a certain statement ran over 2 million times in the span of an hour during our load test and flagged that as excessive. I brought it up with Cisco development but am really not getting any comment other than that it is expected.

For those who have two CMs, do you only use 1 CM during load test?
Is there anyone using the GSS or ACE Cisco load balancer that has effectively implemented two CMs without getting a lot of RPC errors out there? Can I get information on how your load balancer is set up, or can our network admin compare notes with yours, since there is no guidance from anywhere on this?

Marc Clasby · ‎11-21-2013

Carolanne

We have 2 CMs (different data centers but act like they are on the same network) and are load balancing with GSS

From a GSS perspective we are using Least Busy and using "Sticky" sessions.

We are using the API page to verify if the site is up for a health probe http:\\:8080\api

(if the services are turned off this goes 404 and would trigger a failover which works well.)
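For anyone who wants to check the same page by hand, here is a rough Python sketch of that kind of probe. It is not how GSS/ACE implements its probes, and the default range of status codes treated as healthy is my assumption, so adjust it to whatever your load balancer is told to expect:

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, timeout: float = 3.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the server is unreachable.

    Note: urlopen follows redirects, so a 302 is reported as the status of
    the page it redirects to.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code   # e.g. 404 when the services are turned off
    except (urllib.error.URLError, OSError):
        return None       # connection refused, timeout, DNS failure


def is_up(status: Optional[int], expected: range = range(200, 400)) -> bool:
    """Treat the server as up only if it answered with a status in expected.

    The 200-399 default is an assumption for illustration.
    """
    return status is not None and status in expected
```

Usage would look like is_up(fetch_status("http://cm-host.example:8080/api")), where cm-host.example is a placeholder for your own CM host name.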

We created an alias on our domain for the client site tidal..tld so we can give out one link.

We saw that each client connection was making multiple calls per user, which was very interesting: just navigating generated a lot of requests, and all were being load balanced from server to server until we went with a sticky configuration.

All of our servers run Windows 2008 R2 STD 64Bit and are VMs that leverage some pretty fast SAN Storage.

Our CMs are set up with the "medium" recommendation (16 GB mem, 4x CPUs)

We have installed Tidal on the D:\ drive

Multiple files are still written to the local drive where Tidal is installed for the CM

I would imagine I/O to disk might be an issue until the Derby Cache is externalized

We haven't seen performance issues since we externalized the CM Derby cache to its own SQL 2012 Server / Database and tweaked settings with a rep from Cisco... I believe both the clientmgr.props and tes-6.0.dsp files were tweaked... I'll look up what those changes were and post. Caveat: We do not have a meaningful load on the system from a job/user perspective yet. We are taking a very slow approach given what you and others are reporting.

We haven't had a need to dive into JConsole yet, but I couldn't understand those numbers either...

com.google.gwt.user.client.rpc.StatusCodeException: 12031 <-- I have a feeling these RPC errors are coming from the client browsers connecting to the CM over the company network and are not necessarily the fault of the CM itself... I am wondering if client settings like screen savers and client connection timeouts will also cause a disruption in the RPC call process / connection... For example, I had a connection from a server to the client via browser and it stayed connected, however the connection from my desktop had the RPC call error above when I came back from lunch and unlocked my desktop. There may be some IE browser settings that may help...

If I can think of more I'll post

Marc

Carolanne Fougerat · ‎11-21-2013

Thanks Marc, appreciate it. This is what I got when I asked our network admin how the LB is configured, after I saw that we don't have equal distribution of sessions in the LB during our load test:

On the ACE, there are 3 values for keepalive timing:

Probe Interval – probe interval when a server is marked as up

Pass Detect Interval – probe interval when a server is marked as down

Fail Detect – how many times a probe has to fail before a server is marked down

Our default values are 3-3-3. The first 2 values are in seconds, so 3 seconds between probes, 3 failed probes before a host is taken down, and it polls every 3 seconds after that to check if it’s back up

The load balancing method for Tidal is set to ‘least connections’, so it will send new clients to the server that is least loaded.

A specific client-to-server connection is cached for 60 minutes via sticky timeout ( this is based on unique source IP)

so if a client has visited the site in the last 60 minutes, it will go back to the same server as before regardless of the load on that server.

On top of this, there is a 5-minute TTL value on the DNS record that the client receives from the GSS.
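To convince myself how source-IP sticky can skew a 'least connections' setup, the behavior described above can be modeled with a toy Python sketch. This is only a model of the description, not the ACE implementation. The server names and IP addresses are hypothetical, and the model assumes the sticky timer restarts every time the client connects, which is an assumption and not something confirmed for our configuration:

```python
class StickyLeastConnections:
    """Toy model: least-connections choice with a source-IP sticky table."""

    def __init__(self, servers, sticky_timeout_s=3600):
        self.active = {name: 0 for name in servers}  # open connections per server
        self.sticky = {}                             # source IP -> (server, last seen)
        self.sticky_timeout_s = sticky_timeout_s

    def connect(self, source_ip, now_s):
        """Pick a server for a new connection arriving at time now_s (seconds)."""
        entry = self.sticky.get(source_ip)
        if entry is not None and now_s - entry[1] < self.sticky_timeout_s:
            server = entry[0]  # a live sticky entry wins, regardless of load
        else:
            server = min(self.active, key=self.active.get)  # least connections
        # Assumption: the sticky timer restarts on every matching connection.
        self.sticky[source_ip] = (server, now_s)
        self.active[server] += 1
        return server


lb = StickyLeastConnections(["cm1", "cm2"])
for second in range(5):
    lb.connect("10.0.0.5", second)  # five sessions from one machine
lb.connect("10.0.0.6", 5)           # one session from another machine
print(lb.active)
```

In this model, five sessions opened from one machine all land on the first server while a single user elsewhere gets the second, so a few people spawning many sessions (or sharing one source IP) would be enough to produce an unequal distribution.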

------------------------

Having listed all that - I also learned (from Cisco development) that the TES 6.1 web app is never going to be inactive (meeting the web session timeout setting) no matter how long it is left up, because it is apparently always communicating with CM behind the scenes. The only time this setting factors in is when a user closes the browser without formally logging out first. So I know now that the RPC errors we get have nothing to do with inactivity timeout. We get different RPC error codes: 0, 12030, 12019, 12002. I have seen 12029 when the CM went down because its server was bounced. We get RPC errors at any time, sometimes immediately after logging in.

The 60 min sticky timeout on the load balancer I mentioned above - I wonder if that is an inactivity setting or not. If not, I may want to up that value from 60 minutes. Because if it releases the cache after 60 minutes regardless of activity, that would also be a problem - and I don't know what error or RPC code that would be. Ideally it would be nice if I just got logged out automatically if the LB decides to put me on the other CM.

Question:

Regarding desktop screensavers - I actually do not have one on. But I get these RPC errors myself. What browser setting do you think I can verify? We are having some users use the app outside the LB to see if they don't get GSS, but we still do.
Also, why did you choose the api page rather than the signin page for the health probe?
Do you use ACE or GSS to offload SSL encryption? We do, and that is also another variable I haven't even thought about.

Marc Clasby · ‎11-29-2013

More details on our GSS/ACE setup looks like what I thought we put in place was not

using port 8080

I chose the api page simply because I didn't want to go against the client area with a probe (will need to fix with networking team)

Our ACE will probe each server every 3 seconds. If there are 5 failed probes in a row, the ACE will take the server out of service and direct all traffic to the other server. So the ACE will take the server out of service in approximately 15 seconds.

Port 8080

Probe Interval – 3

Fail Detect – 5

Pass Detect Interval – 5

Pass Detect Count - 36

TTL set up on the GSS is for 120 seconds (2 min).

Expect Status 300 302

In order for the ACE to put the server back into service, there have to be 36 consecutive successful responses to the probe. Once the server is out of service, the ACE will probe every 5 seconds. So the ACE will put the server back into service approximately 3 minutes after the first successful probe is received.
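The arithmetic behind those two estimates, as a small Python sketch. The function names are mine, and the results are approximations in the same sense as above (interval times count, ignoring probe timeouts and where in the interval the failure happens):

```python
def seconds_to_mark_down(probe_interval_s, fail_detect_count):
    """Approximate time from the first failed probe until the server is taken out."""
    return probe_interval_s * fail_detect_count


def seconds_to_mark_up(pass_detect_interval_s, pass_detect_count):
    """Approximate time of consecutive successful probes before the server returns."""
    return pass_detect_interval_s * pass_detect_count


# Values from the settings listed above.
print(seconds_to_mark_down(3, 5))   # probe every 3 s, 5 failures in a row
print(seconds_to_mark_up(5, 36))    # probe every 5 s, 36 successes in a row
```

With our values that is 15 seconds to take a server out and 180 seconds (3 minutes) to bring it back. With the 3-3-3 defaults mentioned earlier in the thread, the take-down time would be about 9 seconds.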

ACE reports the availability of the VIP via KAL-AP to the GSS, and the GSS will take that answer out of service and direct all traffic to the one VIP that is reporting an active VIP.

The GSS is set up for round robin and the ACEs are set up to use the Least Connections.

Carolanne Fougerat · ‎12-02-2013

Thanks for the additional info.

I have more to add as well since our last load test, which was done outside of the load balancer and on hotfix 400. We had about 16-19 users logged in, and some were asked to spawn multiple sessions in another browser on their machine to increase load.

The users reported a more stable experience. There was still slowness, but only one occurrence of the RPC error was reported, and the 'stuck in launched jobs' issue that one team consistently experienced in the past load tests did not recur (failing over the master fixes these stuck jobs). So now we don't know if it's the hotfix or not using the LB that made for a better experience and fewer bugs ( >_< )

I had already mentioned the additional variable of SSL offloading that GSS/ACE provides for us - we really need SSL because of the passing of the AD credentials and passwords etc but I don't know if that is contributing to the slowness or issues. I really do not want to use SSL on the webservers themselves knowing that CM is already the busiest component as it is.

The other nagging question is whether simply having two CMs running is in itself adding to the complications (for our kind of usage and environment).

I really do feel the users' pain when they complain about the slowness using the web. In their minds, why would we upgrade to a new architecture that is slower? I myself have been setting up some Tidal downtime jobs and I have to agree that the slowness and errors are frustrating when you have many jobs to add/modify.

Carolanne Fougerat · ‎12-04-2013

Marc - do you all use SSL with TES 6.1? If so, is it offloaded to the load balancer or configured in CM?

Marc Clasby · ‎12-04-2013

Currently we don't use SSL with TES 6.1. What made you go in that direction? Have you tried regular HTTP?

I would bet that is a contributing factor because CM is potentially using many short sessions. Then when you load balance with ACE/GSS you may be constantly handshaking translating into sluggish "visual" behavior.

http://stackoverflow.com/questions/149274/http-vs-https-performance

relevant section

Many, very short sessions means that handshaking time will overwhelm any other performance factors. Longer sessions will mean the handshaking cost will be incurred at the start of the session, but subsequent requests will have relatively low overhead
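A back-of-the-envelope way to see the point in that quote, as a Python sketch. The model (one handshake per new connection plus a fixed cost per request) is a simplification, and the timing numbers in the usage lines are purely hypothetical, not measurements from TES or ACE:

```python
import math


def total_seconds(requests, requests_per_connection, handshake_s, request_s):
    """Total time when every new connection pays for one handshake."""
    connections = math.ceil(requests / requests_per_connection)
    return connections * handshake_s + requests * request_s


# Hypothetical numbers: 100 requests, 50 ms per handshake, 10 ms per request.
short_sessions = total_seconds(100, 1, 0.05, 0.01)    # new connection per request
long_session = total_seconds(100, 100, 0.05, 0.01)    # one connection reused
print(short_sessions, long_session)
```

With those made-up numbers the handshakes dominate in the first case and are almost invisible in the second, which is the effect I suspect you would see if the client traffic is not reusing connections.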

Carolanne Fougerat · ‎12-04-2013

I had mentioned that since we are passing AD credentials, we really wanted it to go over SSL. Otherwise anyone can potentially sniff user passwords if they are in our network (temps/contractors etc). So I actually have not just the authentication between the CM and AD server configured in SSL, but the users also go in via https.

It really would be nice to know if we are the only ones using SSL so I can bring it up to our decision makers. I understand your point about the frequent short conversations plus encryption and decryption overhead. We use SSL offloading with PeopleSoft with no problems, but I am learning that TES 6.1 is a different beast entirely.

Carolanne Fougerat · ‎12-04-2013

Sorry to keep making this thread longer and longer, but I just wanted to be clear. You mentioned you found that CM uses multiple short sessions - how did you determine that again? Was it just by your experience with not setting sticky sessions? I wonder if with sticky sessions you really won't see multiple sessions - that the only reason the multiple sessions manifested was because the other CM didn't have the user's session credentials. With sticky on - do you still see multiple short sessions?

I just wanted to make sure - I am not a network expert so I am confused a lot and just learning. Like for example, when our network admin said this:

"A specific client-to-server connection is cached for 60 minutes via sticky timeout ( this is based on unique source IP)"

In my interpretation, this means that any connection coming from the same source IP will be cached to the same CM - it doesn't matter which session or user (for example, multiple users using a browser from the same server, or a user spawning multiple browser sessions on the same machine).

Reading http://www.cisco.com/en/US/docs/app_ntwk_services/data_center_app_services/ace_appliances/vA3_1_0/configuration/slb/guide/sticky.html#wp1003268

just learned there are different sticky methods.

What is your sticky based on?