Scenario: Two WAE-574 devices running 4.1.3b, with 99% of the workload being replication traffic between two iSeries systems running Mimix.
Everything works beautifully: we get 30:1 compression pretty consistently, no errors, no problems evident, no alarms, and CPU on the WAEs is under 50%, usually under 20%.
Except: initially we had 4.5 Mbps between the sites over MPLS, and we just changed to a 100 Mbps point-to-point link with very low latency (3-5 ms).
And performance did not go up. Well, not much. I cannot figure out why. Specifically, I cannot tell what the limiting factor is now.
The devices are set up in WCCP L2 on 3750 switches, and the ports all show clean counts. The two switches are loafing (95% idle generally). The WAEs' CPUs are also loafing: the compressing side is always under 50%, mostly under 20%. The 100 Mbps circuit (carrying the compressed traffic), as seen by the switches (it is an L2 connection), is never more than 20% loaded, usually 10% or less. No other traffic runs over this link.
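For rough context, the application-side throughput implied by those utilization and compression figures can be sketched as follows. This is back-of-envelope arithmetic based on the 30:1 ratio and the 10-20% link utilization observed above, not a measurement:

```python
# Back-of-envelope: effective (pre-compression) throughput implied by
# the observed compressed-link utilization and compression ratio.

LINK_BPS = 100e6          # 100 Mbps point-to-point link
COMPRESSION_RATIO = 30    # ~30:1 observed on the WAEs

def effective_mbps(utilization: float) -> float:
    """Application-side Mbps implied by a given compressed-link utilization."""
    compressed_bps = LINK_BPS * utilization
    return compressed_bps * COMPRESSION_RATIO / 1e6

# At the typical ~10% and peak ~20% utilization seen on the switches:
print(effective_mbps(0.10))  # 300.0 Mbps of pre-compression data
print(effective_mbps(0.20))  # 600.0 Mbps of pre-compression data
```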
No alerts, no bad counts showing on the WAEs. The only counter I do not understand is under SHOW STAT TCP: the sending WAE shows about 10% of packets as "TCP receiver collapsed," which I think just means its out-of-order stack resolved them?
I've increased the TCP buffers and turned adaptive buffering off and on; no setting I have tried has had any impact.
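For reference, the buffer-related knobs I was adjusting look roughly like this. This is from memory of the 4.x CLI, and the command names and sizes should be verified against your version's command reference; treat it as a sketch, not verified syntax:

```
! On each WAE, in config mode. Buffer sizes here are illustrative, not recommendations.
WAE(config)# tfo tcp optimized-send-buffer 2048
WAE(config)# tfo tcp optimized-receive-buffer 2048
! Toggling adaptive buffer sizing off and back on:
WAE(config)# no tfo tcp adaptive-buffer-sizing enable
WAE(config)# tfo tcp adaptive-buffer-sizing enable
```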
The NICs on the iSeries are definitely 1G. Unfortunately I have no other systems at the receiving end, so I cannot rule out the iSeries itself (e.g., by loading it up with some non-replication traffic). But this is a tiny amount of data for the iSeries to process (both ends are pretty large boxes).
Latency (measured via pings) on the interconnect is low, usually 2-5 ms even when I send large pings. The MSS is high (pings with DF set go through at 1500 bytes), and there is only this one path, so the topology is simple: iSeries -> 3750 -(WCCP)-> WAE -(ipForward)-> 3750 -> 100 Mbps link -> 3750 -(WCCP)-> WAE -(ipForward)-> 3750 -> iSeries.
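To sanity-check whether a TCP window limit could explain the ceiling, the standard single-flow bound (window size divided by round-trip time) can be worked out quickly. This is generic TCP math, not anything WAAS-specific, and the window sizes are hypothetical examples:

```python
# A single TCP flow's throughput is bounded by window_size / RTT.
# With the ~3-5 ms RTT measured above, even a modest window should
# be able to fill a 100 Mbps link.

def max_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on a single TCP flow's throughput, in Mbps."""
    return window_bytes * 8 / rtt_seconds / 1e6

# Example (hypothetical) window sizes at a 4 ms RTT:
print(max_throughput_mbps(64 * 1024, 0.004))  # ~131 Mbps: a 64 KB window already exceeds 100 Mbps
print(max_throughput_mbps(16 * 1024, 0.004))  # ~33 Mbps: only a very small window would throttle the flow
```

So at this RTT, a plain-vanilla 64 KB window would not cap a flow below the link rate, which is consistent with the buffer tuning above having no effect.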
Something, somewhere, is limiting the flow rate. I really expected the jump to 100 Mbps to swamp the WAEs' CPUs and make them the bottleneck, but that does not appear to be the case. At least not according to the CPU numbers.
How can I tell which device is slowing it down?
Did you ever find a resolution to this issue? I am looking to utilize WAAS to optimize Mimix. Thank you.
Well, yes and no. We eventually became convinced that the delay was primarily on the iSeries, in how quickly updates could be applied. That was not a network issue per se, but it is where we finally had to put the resources, and it did fix the problem. How this impacted the transmit speed I still do not understand; my only conclusion is that at some level Mimix was not able to acknowledge fast enough, and retransmission delays caused some kind of throttle-like effect.
The network side was more problematic, and I never got good answers, but I fundamentally believe that the iSeries was also simply not able to send data fast enough for the network link to become a steady limiting factor. So even after we fixed the receiving system, the sending system (which is also burdened with running production) could only send 100 Mbps or so over the link, in wildly erratic highs and lows; over time it just did not keep the link busy. Maybe.
HOWEVER, we continue to struggle with the WAAS devices and this particular link. We also run much more easily measured file copies over it. At times it runs very well: fast, with high compression. At other times it is flaky: extremely slow (much slower than the uncompressed speed), with file transfer failures (mostly on CIFS copies in Windows). There is no identifiable cause, but it happens only when compressing. Take compression off (for just that server pair) at the access list in the switch and it runs fine.
I believe we are hitting some limitation in the WAAS devices. Even though they show no alarms or errors that we can find, I believe that as they get busy they fail ungracefully. I have zero evidence for this belief, but it is the only explanation remaining. I suspect this may also have been playing a role with Mimix, just not one as easy to see.
We have (or are supposed to have) a ticket open with Cisco on the subject.
Short answer: no. After a lot of work, I find it pretty easy to tell from the WAAS devices what is wrong when things fail entirely (e.g., a connection doesn't compress), but extremely hard to find useful information when they function but run too slowly.