Solved: Shaping traffic because a port is overloaded

John Blakley · ‎03-05-2009

All,

I'm attaching a diagram for what I'm currently experiencing.

Port 5/0/5 on our 3750 connects to our 3745 router. Port 5/0/5 is constantly going from 35% to >95% utilization from this one server. It's our SAN server, and apparently it's replicating back to our DR site. Is there a way to shape this traffic, and if so, where would I create the policy? On the switch or the router, and which interface would it be applied to? NAT isn't used in this scenario.

Edit: All of our traffic to our branches goes out of this port, so whatever I do, I think it needs to be done with an ACL so it matches just the traffic from the SAN. Am I correct?

Thanks!

John

HTH, John *** Please rate all useful posts ***

Joseph W. Doherty · ‎03-06-2009

"I'm not sure of the bandwidth requirements for the SAN, and I've not heard of any complaints regarding speed. But, for ease of understanding, say that I guaranteed a min bandwidth for the SAN to replicate across the link. Would that help with the bursty nature of the SAN going through that port? "

No, it would only help ensure other traffic doesn't adversely impact SAN replication.

"Or, would that just tell the interface "SAN is allowed 10mb on 100mb port, BUT if there's more allow it to have more." I know that I can police the traffic minimums, burst, etc, but if I applied a policy that gave a minimum 10mb, would that drop all of the "available" to others down to a 90mb port or even less if we have to consider the 25% overhead for network control?"

Yes, if you've set a floor of 10 Mbps for SAN out of 100 Mbps, other traffic wouldn't be able to acquire more than 90 Mbps if it wanted it, unless SAN used less. Conversely, as a floor or minimum, SAN could use more than 10 Mbps if other traffic wasn't using it.
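As a rough illustration of the floor behaviour described above, here is a minimal Python sketch. It is only arithmetic, not switch or router configuration, and the function name and the order in which spare bandwidth is handed out are assumptions made for the example; real platforms distribute excess bandwidth in their own ways.

```python
def allocate_with_floor(link_mbps, san_floor_mbps, san_demand_mbps, other_demand_mbps):
    """Split a link between SAN traffic (with a guaranteed floor) and everything else."""
    # SAN is guaranteed its floor, but only as much of it as it actually wants.
    san_guaranteed = min(san_demand_mbps, san_floor_mbps)
    # Other traffic may take whatever the SAN guarantee leaves free.
    other = min(other_demand_mbps, link_mbps - san_guaranteed)
    # Anything still unused goes back to SAN, so the floor is not a cap.
    leftover = link_mbps - san_guaranteed - other
    san = san_guaranteed + min(san_demand_mbps - san_guaranteed, leftover)
    return san, other


if __name__ == "__main__":
    # Both sides busy: other traffic cannot push SAN below its 10 Mbps floor.
    print(allocate_with_floor(100, 10, 80, 200))
    # Other traffic light: SAN grows well past its floor.
    print(allocate_with_floor(100, 10, 80, 30))
```

The floor protects SAN replication from other traffic; it does nothing to protect other traffic from SAN bursts.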

If you don't believe there are any performance issues then, again, you likely don't need to do anything. Only if the bursty SAN traffic is causing other issues would you be compelled to do something.

Although, if SAN replication is as bursty as you note, I would expect at least brief, transitory performance issues. Many live with typical best-effort networks without knowing performance can often be better. Many assume inconsistent network performance is normal (and it often is in best-effort-only networks that are oversubscribed, especially those that don't use FQ by default [3750s, for example, don't support FQ]).

Shaping or policing can be used for many purposes, one of which is upstream control, especially when you don't have later downstream control. For example, if the link to the backup datacenter was a T-3, one might want to police the SAN replication at the edge to 45 Mbps. You know more bandwidth isn't available later, so why let it congest later? However, assuming the downstream link shares the T-3, does it make sense to further limit SAN replication to less than 45 Mbps? It might if you have no control over downstream congestion, but if we did, and since we don't know upstream what the congestion is downstream, it's better to manage the congestion there, where it forms. This doesn't preclude still limiting the SAN source to send at 45 Mbps, but then we need to manage bandwidth at two points. For the cost of such management, we avoid sending "too much" traffic before it gets to the later congestion management point, at least for one source. If the traffic is something like TCP, it won't go much beyond the downstream congestion point's bandwidth because it will self-regulate its flow rate. Given this, and the issues with managing both upstream and downstream, I've found there's often little benefit to upstream rate limiting if we're going to manage the bandwidth downstream. (Note: there are always exceptions.)

PS:

BTW, the 25% you're likely thinking of is the default bandwidth you can't explicitly allocate to defined CBWFQ classes unless you override the reserved default. Although that 25% is set aside and can't be allocated, it can still be used by other traffic. (Also, 3750s don't support CBWFQ the way your 3745 does.)


Edison Ortiz · ‎03-05-2009

I recommend placing QoS closest to the source, hence inbound from the SAN device to the switch would be ideal.

HTH,

__

Edison.

Joseph W. Doherty · ‎03-05-2009

What you could do on the 3750 is deprioritize the SAN replication traffic so that it only uses bandwidth otherwise unused. This would be done by directing this traffic to its own egress queue with minimum weight in shared mode (i.e. srr-queue bandwidth share).

If the overall load is too much for the 3745, use srr-queue bandwidth limit to "shape" the port rate.

Edison Ortiz · ‎03-05-2009

I will disagree on this approach.

On the srr-queue bandwidth share command the absolute value of each weight is meaningless, and only the ratio of parameters is used.

As for the srr-queue bandwidth limit, it will affect everyone at that location, not just the SAN device.

Best to use MQC with police inbound on the SAN port. Why take the traffic in just to drop it at egress?

Joseph W. Doherty · ‎03-05-2009

"On the srr-queue bandwidth share command the absolute value of each weight is meaningless, and only the ratio of parameters is used. "

Correct, but that's the idea. If we could use a policy map, it would be something like:

policy-map x
 class besteffort
  bandwidth remaining percent 99
 class SAN
  bandwidth remaining percent 1

(The above is treating SAN replication traffic, more or less, like scavenger class.)
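To make the "only the ratio matters" point concrete, here is a small Python sketch of work-conserving weighted sharing. It is a simplified model, not the 3750's actual SRR scheduler; the function name and the two-queue setup (best effort first, SAN second) are assumptions for illustration.

```python
def weighted_share(link_mbps, weights, demands_mbps):
    """Work-conserving weighted sharing: a queue's unused share is given to the others."""
    alloc = [0.0] * len(weights)
    active = [i for i, d in enumerate(demands_mbps) if d > 0]
    remaining = float(link_mbps)
    while active and remaining > 1e-9:
        total_w = sum(weights[i] for i in active)
        # Offer each still-hungry queue its weighted slice of what is left.
        offers = {i: remaining * weights[i] / total_w for i in active}
        satisfied = [i for i in active if demands_mbps[i] - alloc[i] <= offers[i] + 1e-9]
        if not satisfied:
            # Every queue wants more than its slice: split strictly by weight.
            for i in active:
                alloc[i] += offers[i]
            break
        # Queues that need less than their slice take only what they need;
        # the rest is re-offered to the remaining queues on the next pass.
        for i in satisfied:
            remaining -= demands_mbps[i] - alloc[i]
            alloc[i] = float(demands_mbps[i])
        active = [i for i in active if i not in satisfied]
    return alloc


if __name__ == "__main__":
    # Queue 0 = best effort, queue 1 = SAN (hypothetical demands in Mbps).
    print(weighted_share(100, [99, 1], [200, 90]))   # both congested
    print(weighted_share(100, [198, 2], [200, 90]))  # same ratio, same result
    print(weighted_share(100, [99, 1], [60, 90]))    # SAN picks up the leftover
```

With a 99:1 ratio, SAN is squeezed to almost nothing only while best-effort traffic actually wants the link; otherwise it takes whatever is left over.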

"As for the srr-queue bandwidth limit, it will affect everyone at that location, not just the SAN device."

Yes and no. It will affect everyone in that it limits the overall rate to what the 3745 can accept, but there's no point in driving the link with more traffic than the 3745 can process regardless whether it's SAN traffic or other traffic. However, within the bandwidth capacity of the 3745, SAN will effectively only have "left over" bandwidth.

"Best to use MQC with police inbound on the SAN port. Why take the traffic in just to drop it at egress?"

Because on egress we're dropping SAN against total aggregate congestion, i.e. drops more if there's other traffic that needs the bandwidth, drops less if other traffic doesn't need the bandwidth. With an inbound policer, you drop all the time and either don't fully utilize excess bandwidth or conversely allow the policed (SAN) traffic to obtain bandwidth you would prefer other traffic to obtain.
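The trade-off described above can be sketched with some simple arithmetic in Python. This is a deliberately crude steady-state model, not a simulation of either platform; the function names, the 25 Mbps police rate, and the demand figures are all hypothetical.

```python
def policed(link_mbps, police_mbps, san_demand_mbps, other_demand_mbps):
    """Fixed inbound policer on SAN: the cap applies whether or not the link is busy."""
    san_offered = min(san_demand_mbps, police_mbps)
    total = san_offered + other_demand_mbps
    unused = max(0, link_mbps - total)          # capacity nobody is allowed to use
    oversubscribed = max(0, total - link_mbps)  # load still exceeding the link
    return san_offered, unused, oversubscribed


def deprioritized(link_mbps, san_demand_mbps, other_demand_mbps):
    """Egress sharing with SAN at minimum weight: SAN gets only what others leave."""
    other = min(other_demand_mbps, link_mbps)
    san = min(san_demand_mbps, link_mbps - other)
    return san, link_mbps - other - san


if __name__ == "__main__":
    # Quiet period: the policer leaves capacity idle, sharing lets SAN use it.
    print(policed(100, 25, 90, 10), deprioritized(100, 90, 10))
    # Busy period: policed SAN still competes with other traffic for the link.
    print(policed(100, 25, 90, 90), deprioritized(100, 90, 90))
```

In the quiet case the policer wastes bandwidth SAN could have used; in the busy case it still lets SAN take bandwidth that other traffic wanted.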

Edison Ortiz · ‎03-05-2009

You aren't addressing John's concern on shaping or policing the traffic.

John isn't looking to guarantee one traffic over the other; he is looking to control the burst traffic the SAN is creating on his network.

I understand that police will limit the traffic to an X value but it's up to John to determine what's the adequate X value the SAN can burst to.

__

Edison.

Joseph W. Doherty · ‎03-06-2009

"You aren't addressing John's concern on shaping or policing the traffic.

John isn't looking to guarantee one traffic over the other; he is looking to control the burst traffic the SAN is creating on his network.

I understand that police will limit the traffic to an X value but it's up to John to determine what's the adequate X value the SAN can burst to. "

Perhaps, or perhaps not. I suspect his real concern isn't so much just the need to shape or police the SAN traffic, but as you note "control the burst". One must ask, why control the burst? Is it because we don't want a link to hit 95% utilization, could be, or is the concern really what such bursts might do to other traffic sharing the link, which is mentioned in the OP ("All of our traffic to our branches go out of this port"), or performance impact to the 3745 (not explicitly mentioned)? If the former (i.e. we don't want SAN to exceed some %), yes policing the SAN traffic could be used to define "adequate" bandwidth. If the latter (i.e. don't degrade other traffic [and/or the 3745]), bursts are controlled such that there's effectively no impact to other traffic (and/or the 3745), which isn't guaranteed with policing just SAN traffic.

BTW, there's no reason why both policing and bandwidth ratio management can't be combined, but usually there's little need to do so if bandwidth allocations between traffic can be managed. There are situations where the platform and situation don't provide the capability to manage bandwidth, and policing is your only option, but this isn't one of them with the 3750.

[edit]

Just so there's no confusion, what I'm suggesting is bandwidth traffic management first. Port limiting is optional, depending on the load impact to the 3745. Port limiting alone would guarantee the router's performance but, like policing, wouldn't guarantee traffic performance.

Also, I have much real world experience with something similar: remote sites that back up both hosts and servers across the WAN. These backup applications drive the WAN link to 100% utilization for hours during normal business hours (mainly laptop hosts that connect only during the day; server backups are scheduled in the "early hours"). On the same link, running at 100% for hours, other business applications, including VoIP, run without problem when the above approach can be used. A few sites have L3 equipment that only supports policing, so backup traffic is policed there, but there's no quality of service, and it's noticeable regardless of the "adequate" bandwidth the backup is policed to. (In fact, there doesn't have to be any backup traffic for there to be user complaints.)

Edison Ortiz · ‎03-06-2009

Our real world experience isn't up for debate. What you did for your customers may not be what John wants. I have countless designs under my belt and no two designs are ever alike. All designs need to accommodate the customer's needs first, then you use the technology to address them. Not the other way around.

You start by saying:

I suspect his real concern isn't so much just the need to shape or police the SAN traffic,

Yet, he used the word shape several times on his initial message.

Then you said:

is the concern really what such bursts might do to other traffic sharing the link

The way I read it, John wants to perform this QoS only on the SAN device and leave the remaining traffic the way it is now.

You and I are seeing John's request from different angles. We understand the technology but it seems only one of us really got his request. I will wait for John's reply and see who is closer to what he wants.

__

Edison.

Joseph W. Doherty · ‎03-06-2009

Didn't intend to debate real world experience, nor intend such now, but I have run into the (common) situation that many aren't used to the concept of managing traffic using various QoS techniques.

Policing is an obvious solution to restricting some traffic's link utilization, but I used a real world example to highlight a somewhat similar case and to demonstrate that what we may want to manage is the SAN traffic's impact, not just link utilization.

Perhaps you're correct, ". . . John wants to perform this QoS only on the SAN device and leave the remaining traffic the way it is now.", but he may not be aware of other possibilities or pitfalls. Again, just policing SAN traffic to some % can still allow the link to burst to full utilization from other traffic, although probably not as often, and/or doesn't guarantee other traffic isn't degraded by SAN bandwidth utilization. For instance, if you limit SAN to 25%, that's 25% unavailable for other traffic.

I agree we're seeing John's request from different angles. Your approach is more of a direct technical answer, i.e. you want to limit SAN traffic, do this. My approach assumes there's an underlying issue, even if not explicitly stated, which is more than just that we don't want SAN to use more than X% bandwidth; it is instead that we don't want SAN bursts to adversely impact other traffic.

In other words, you may indeed have answered John's request, and even what John wants. I've tried to provide information to assist John on what he might need, even if he doesn't realize it.

What's somewhat puzzling is why you disagreed with my suggestion or are making such a fuss. Even if you're 100% correct, i.e. your answer is exactly what John desires, so what? He can choose it, give you a 5 and mark your answer as question resolved. Is there some pitfall to my suggestion you see? Some risk to John or others using my suggestion? If there is, I welcome correction, but if what I suggest isn't what John wants, so what? If it doesn't help John, it might be someone else finds it of interest when reading these forums.

John Blakley · ‎03-06-2009

I want to thank you BOTH for such great answers and direction.

Joseph brings up a good point in that I don't think I explained in my OP the detail of the "problem" that I'm experiencing. The 3745 isn't being overloaded, as far as I can tell, but the port that the 3745 connects to in the switch does get up to 95-99% utilization when the SAN bursts.

I didn't figure that policing would be good since it would drop the traffic when the queue size gets full, and I've been told (I'm not a SAN admin) that the concern would be if the SAN can't sync up quick enough, it could cause a problem. (I have no way of verifying this unless I called EMC.)

That's why I used shaping as my other alternative, and I wanted to use shaping outbound. I don't know if I need to apply it on the port that the SAN connects to, or the port that the traffic goes out of the 3750. I would think I would want to apply it inbound on the port that the SAN connects to, but as Joseph probably has seen in my other posts, I have a hard time with the direction these should be placed in. (I try to apply them like I would an ACL.)

Now, overall, I would like the SAN to use little bandwidth during the day and as much as it wants at night. I don't know the first thing about QoS, but I do know I have a real need for it in this situation. I'm kinda doing this blindly, and, I don't want to affect everything else.

I really appreciate both of your suggestions.

Thanks!

John


Edison Ortiz · ‎03-06-2009

Hi John,

Thanks for expanding your requirements.

If the bursty nature of the SAN device is affecting throughput for other services in your network, then Joseph's approach will be ideal in this situation.

If you want to avoid bursts from the SAN device, then you need to control that traffic and the only solution is policing.

You can't shape inbound and shaping outbound in the 3750 is very cumbersome.

You would need to allocate the SAN traffic to a queue and shape that queue using SRR.

Keep in mind, just like policing, shaping drops traffic as well. Shaping stores the packet a bit longer in the buffers but in bursty situations and if your shaping % is lower than the demand, traffic will be dropped.
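The difference between the two can be sketched in a few lines of Python. This is a toy tick-by-tick model with made-up numbers: the "policer" here has no burst allowance (a real one uses a token bucket with a configured burst size), and the function names and queue limit are assumptions for illustration only.

```python
def police(arrivals, rate):
    """Policer without burst allowance: anything above the rate in a tick is dropped."""
    sent = [min(a, rate) for a in arrivals]
    dropped = sum(a - s for a, s in zip(arrivals, sent))
    return sent, dropped


def shape(arrivals, rate, queue_limit):
    """Shaper: excess waits in a finite queue and is dropped only when the queue is full."""
    backlog = 0
    sent, dropped = [], 0
    for a in arrivals:
        backlog += a
        out = min(backlog, rate)  # transmit at most `rate` units per tick
        backlog -= out
        if backlog > queue_limit:
            # Buffer overflow: the shaper tail-drops whatever does not fit.
            dropped += backlog - queue_limit
            backlog = queue_limit
        sent.append(out)
    return sent, dropped, backlog


if __name__ == "__main__":
    short_bursts = [30, 0, 0, 30, 0, 0]
    sustained = [30, 30, 30, 30]
    print(police(short_bursts, 10), shape(short_bursts, 10, 20))
    print(police(sustained, 10), shape(sustained, 10, 20))
```

Short bursts are smoothed out by the shaper with no loss, while the policer drops most of each burst. Under sustained demand above the shaped rate, the queue fills and the shaper drops as well.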

John Blakley · ‎03-06-2009

Can I not shape outbound on the port that the SAN is connected to?

Where are the "SRR" commands held: on the 3750 or the 3745?

Can you point me in the direction of good documentation to do this?

Thanks!

John


Edison Ortiz · ‎03-06-2009

Shaping outbound on the port that is connected to the SAN will control traffic coming from the remote SAN.

You need to see the traffic flow from the switch's perspective.

The SRR commands we are talking about will be on the switch.

QoS on the 3750 can be found here:

http://www.cisco.com/en/US/docs/switches/lan/catalyst3750/software/release/12.2_46_se/configuration/guide/swqos.html

HTH,

__

Edison.

John Blakley · ‎03-06-2009

Will I need to do anything in the router, or will the traffic stay controlled to the destination?

Thanks,

John


Edison Ortiz · ‎03-06-2009

We've discussed several designs. Which design are you selecting?

You are concerned about the router interface showing 95% utilization but are the users in the location complaining about slowness during that period of time?

___

Edison.