Solved: Re: Shaping traffic because a port is overloaded - Page 2

John Blakley · ‎03-05-2009

All,

I'm attaching a diagram for what I'm currently experiencing.

Port 5/0/5 on our 3750 connects to our 3745 router. Port 5/0/5 is constantly going from 35% to >95% utilization from this one server. It's our SAN server, and apparently it's replicating back to our DR site. Is there a way to shape this traffic, and if so, where would I create the policy? On the switch or the router, and which interface would it be applied to? NAT isn't used in this scenario.

Edit: All of our traffic to our branches goes out of this port, so whatever I do, I think it needs to be done with an ACL so it matches just the traffic from the SAN. Am I correct?

Thanks!

John

HTH, John *** Please rate all useful posts ***

John Blakley · ‎03-06-2009

"...are the users in the location complaining about slowness during that period of time?"

Actually, no they're not. I figured that it would be better than having the high utilization. =)

Thanks!

John


Joseph W. Doherty · ‎03-06-2009

"I didn't figure that policing would be good since it would drop the traffic when the queue size gets full, and I've been told (I'm not a SAN admin) that the concern would be if the SAN can't sync up quick enough, it could cause a problem. (I have no way of verifying this unless I called EMC.) "

Yes, that's a valid concern. If the backup replication can't keep up with the original, the replica SAN device can lose sync with the original. (In fact, from a QoS perspective, there could be a need to guarantee a minimum amount of bandwidth to keep the backup replica current.)

If there isn't any "problem" beyond seeing the link hit high utilization, you really don't need to do anything. But you write, "I do know I have a real need for it in this situation." So, other than seeing the link get busy, what's your concern? If you only want to avoid seeing the link busy, then policing is a simple solution. If you want to keep the SAN replication from adversely impacting non-SAN traffic, policing can help but it wouldn't be as "good" as queue management.

If SAN replication does have a minimum bandwidth requirement, that can be accomplished by how the egress queues are weighted. At the queue level, as Edison mentions, shaped mode (SRR) can be used, but if the bandwidth is available, i.e. not otherwise needed, why not allow SAN to utilize it if it wants, regardless of the time of day?
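As a rough sketch of the difference between the two SRR modes on a 3750 egress port: the fragment below assumes that QoS is enabled globally (mls qos) and that the SAN traffic has already been classified and mapped to egress queue 4, neither of which is shown. The interface type and the weights are placeholders, not values from this thread.

```
! Assumption: SAN traffic is already mapped to egress queue 4 (not shown)
interface FastEthernet5/0/5
 ! Turn shaping off on all four queues so the share weights apply to each
 srr-queue bandwidth shape 0 0 0 0
 ! Shared mode: under congestion queue 4 gets about 10/100 of the port,
 ! but it can use idle bandwidth beyond that
 srr-queue bandwidth share 30 30 30 10
 ! Shaped-mode alternative (hard cap): a weight of 10 limits queue 4 to
 ! 1/10 of the port rate even when the link is otherwise idle
 ! srr-queue bandwidth shape 0 0 0 10
```

Shared mode gives the "floor, but more if available" behavior; shaped mode gives a ceiling regardless of what else is using the port.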

What I would expect to be important: a) SAN doesn't adversely impact other traffic, b) SAN obtains at least the bandwidth it needs to maintain sync.

PS:

Looking at your diagram, one could also consider traffic flows in/out the 3745, but so far, you've only mentioned replica SAN traffic and a busy link between the 3750 and 3745 being of concern.

Edison Ortiz · ‎03-06-2009

Hey Joseph, great post and I agree with everything you said ;)

Joseph W. Doherty · ‎03-06-2009

Thank you.

I thought yours with "Ok, I won't disagree with any of your posts in the future... " was even better!

It demonstrates one of the benefits of these forums; how it helps people to improve, to learn . . .

ROFL

John Blakley · ‎03-06-2009

Joseph,

I'm not sure of the bandwidth requirements for the SAN, and I've not heard of any complaints regarding speed. But, for ease of understanding, say that I guaranteed a min bandwidth for the SAN to replicate across the link. Would that help with the bursty nature of the SAN going through that port?

Or, would that just tell the interface "SAN is allowed 10mb on 100mb port, BUT if there's more allow it to have more." I know that I can police the traffic minimums, burst, etc, but if I applied a policy that gave a minimum 10mb, would that drop all of the "available" to others down to a 90mb port or even less if we have to consider the 25% overhead for network control?

Thanks,

John


Joseph W. Doherty · ‎03-06-2009

"I'm not sure of the bandwidth requirements for the SAN, and I've not heard of any complaints regarding speed. But, for ease of understanding, say that I guaranteed a min bandwidth for the SAN to replicate across the link. Would that help with the bursty nature of the SAN going through that port? "

No, it would only help ensure other traffic doesn't adversely impact SAN replication.

"Or, would that just tell the interface "SAN is allowed 10mb on 100mb port, BUT if there's more allow it to have more." I know that I can police the traffic minimums, burst, etc, but if I applied a policy that gave a minimum 10mb, would that drop all of the "available" to others down to a 90mb port or even less if we have to consider the 25% overhead for network control?"

Yes, if you've set a floor of 10 Mbps for SAN out of 100 Mbps, other traffic wouldn't be able to acquire more than 90 Mbps if it wanted it, unless SAN used less. Conversely, as a floor or minimum, SAN could use more than 10 Mbps if other traffic wasn't using it.
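On a router that supports CBWFQ, a floor like this might be sketched as follows. It assumes a class-map named SAN-REPL already matches the SAN traffic; that name, the policy name SAN-FLOOR, and the interface are hypothetical.

```
! Assumption: class-map SAN-REPL already matches the SAN traffic
policy-map SAN-FLOOR
 class SAN-REPL
  ! Minimum of 10 Mbps under congestion (the value is in kbps);
  ! the class can still use more when other traffic leaves it free
  bandwidth 10000
 class class-default
  fair-queue
!
interface FastEthernet0/0
 service-policy output SAN-FLOOR
```

The bandwidth statement is a guarantee under congestion, not a cap, which is why it does nothing to smooth the bursts themselves.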

If you don't believe there are any performance issues, you, again, likely don't need to do anything. Only if the bursty SAN traffic is causing other issues might you be compelled to do something.

Although, if SAN replication is as bursty as you note, I would expect at least brief transitory performance issues; but many live with typical best-effort networks without knowing it can often be better. Many assume inconsistent network performance is normal (and it often is in best-effort-only networks that are oversubscribed and, especially, don't use FQ by default [3750s, for example, don't support FQ]).

Shaping or policing can be used for many purposes, one of which is upstream control, especially when you don't have later downstream control. For example, if the link to the backup datacenter was a T-3, one might want to police the SAN replication at the edge to 45 Mbps. You know more bandwidth isn't available later, so why let it congest later? However, assuming the downstream link shares the T-3, does it make sense to further limit SAN replication to less than 45 Mbps? It might if you have no control over downstream congestion, but if we did, and since we don't know upstream what the congestion is downstream, it's better to manage the congestion there, where it forms. This doesn't preclude still limiting the SAN source to send at 45 Mbps, but then we need to manage bandwidth at two points. For the cost of such management, we avoid sending "too much" traffic before it gets to the later congestion management point, at least for one source. If the traffic is something like TCP, it won't go much beyond the downstream congestion point's bandwidth because it will self-regulate its flow rate. Given this, and the issues with managing both upstream and downstream, I've found there's often little benefit to upstream rate limiting if we're going to manage the bandwidth downstream. (Note: there are always exceptions.)

PS:

BTW, the 25% you're likely thinking of is the default bandwidth you can't explicitly allocate to defined CBWFQ classes unless you override the reserved default. Although 25% is set aside from bandwidth allocations, it can still be used by other traffic. (Also, 3750s don't support CBWFQ like your 3745 does.)
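For reference, overriding that reserved default is done per interface. A minimal sketch, assuming an IOS release that still honors the command (later releases changed this behavior) and a placeholder interface:

```
interface FastEthernet0/0
 ! Allow explicit class allocations up to 90% of the interface
 ! bandwidth instead of the default 75%
 max-reserved-bandwidth 90
```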

John Blakley · ‎03-06-2009

Thanks Joseph!

So, do we lose downstream control when we have an inbound connection that we don't own? In other words, is my control lost at an edge router connected to an ISP, but wouldn't be lost if I had router control at both ends of a P2P T1 link? I've been a little confused as to why we can't really shape traffic coming to us, say from the internet. I guess we just police that traffic? If I've got an FTP site on a 20mb connection, but I don't want it to ever use more than 2mb because I also have a game server, would you normally police on that port inbound?

Sorry if it seems like I got off topic. QoS is a really big subject, and I'm trying to get a grasp on it.

Thanks!

John


Joseph W. Doherty · ‎03-06-2009

A shaper requires a queue to store overspeed packets, which is why shapers are configured outbound only. (In theory you could do it on the inbound interface, but since you can already do it on an outbound interface, there's not much point.)

A policer doesn't require a queue, so it works about the same inbound or outbound. (Another purpose for a policer might not be to drop overspeed traffic, but to tag it based on its bandwidth usage. This might be one reason why a policer is supported on input where a shaper isn't.)

Some traffic will regulate its flow rate when it sees drops (or ECN [or even a jump in RTT]), some will not. For the latter, neither a shaper nor a policer will keep such a flow from sending as much traffic as it desires upstream of the control point (both would control downstream).

The former, i.e. traffic that regulates its bandwidth based on seeing drops, e.g. TCP, will slow its transmission rate, but only after seeing one or more drops. Plus, at least with TCP, it actually sends traffic as quickly as possible because it doesn't manage the actual transmission rate, but rather how many packets to send back-to-back. Something like TCP can then "burst" into a large share of bandwidth before it knows to slow, and/or the actual burst can fill a link (for a time) beyond the downstream policer/shaper's bandwidth setting.

What this means is that it's difficult, if not impossible, to regulate transmission rate upstream of our control point, although again, we can regulate it downstream of our control point.

In your question about Internet, if the WAN link was the primary congestion point to/from the Internet, we would often want to do something outbound on both ends of the link. If one end is controlled by the ISP, and they won't allow us to control their side's outbound, we cannot obtain the same level of control on our side inbound.

For instance, on their side, we can (usually) easily limit FTP to 2 Mbps of the 20. On our side, we can police the inbound FTP to 2 Mbps (or shape it to 2 Mbps outbound on the router interface heading toward our network), but FTP might still use more than 2 Mbps on the WAN link when bursting (especially if it's still in "slow start").
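A minimal sketch of the inbound policing option on our side, with hypothetical names and a placeholder WAN interface. The ACL only catches active-mode FTP data on port 20; passive-mode transfers use dynamic ports and would need different matching.

```
! Hypothetical ACL: active-mode FTP data in either direction
ip access-list extended FTP-DATA
 permit tcp any eq ftp-data any
 permit tcp any any eq ftp-data
!
class-map match-all FTP-TRAFFIC
 match access-group name FTP-DATA
!
policy-map POLICE-FTP-IN
 class FTP-TRAFFIC
  ! Drop anything beyond 2 Mbps (rate in bps, default burst sizes)
  police 2000000 conform-action transmit exceed-action drop
!
interface FastEthernet0/1
 service-policy input POLICE-FTP-IN
```

As noted above, this only limits what is forwarded past the router; the sender can still burst above 2 Mbps on the WAN link before the drops slow it down.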

PS:

BTW, for TCP, I've noticed if you police much slower than the "nominal" rate inbound, you might average your nominal rate. E.g. for 2 Mbps policing inbound, somewhere between 10 and 50% seems to come close with default burst intervals. (I haven't tried it, but I suspect tuning the burst interval down might allow more precise control.)

For TCP you can also shape outbound ACKs, but since every other packet is normally ACKed, inbound packet sizes vary, and ACKs can piggyback, it's very difficult to target a specific inbound rate.

John Blakley · ‎03-06-2009

Thanks Joseph!

John


Edison Ortiz · ‎03-06-2009

What's somewhat puzzling is why you disagreed with my suggestion or are making such a fuss

Ok, I won't disagree with any of your posts in the future...

__

Edison.

jplowick3 · ‎05-28-2009

I have a similar issue to this. We have two SANs that are set up to replicate over an 85 Mbps MAN connection. The main site has a 3745 router, and the remote site has a 2821. If I let the SAN replication traffic go unchecked, it will consume the entire link, causing problems for my users (at the remote site). Right now I am using the "traffic-shape group" command to match traffic from the SAN based on IP and limit it to 25 Mbps in the routers. This seems to work, but I would like to allocate more bandwidth without impacting our users or backups (basically they get priority over SAN traffic). What commands would I need to implement something like this? I've played around with some different policy maps, but can't seem to get them right.

Joseph W. Doherty · ‎05-28-2009

Assuming the MAN is your primary bottleneck, you would want a CBWFQ policy somewhat like this:

! child policy, defined first so the parent policy can reference it
policy-map prioritzetraffic
 ! class that matches your SAN traffic
 class SANtraffic
  ! adjust bandwidth as low as possible to meet minimum bandwidth needs
  ! remember class can use more if available
  bandwidth remaining percent 1
 class class-default
  fair-queue

! parent policy: shape everything to just under the MAN rate
policy-map 85Mbps
 class class-default
  ! allow 5 to 15% for Ethernet overhead
  shape average 77000000
  service-policy prioritzetraffic

On both routers' egress interfaces:

interface ethernet #
 service-policy output 85Mbps
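The policy above references a class named SANtraffic without showing its definition. One way to define it, assuming the SAN is matched by source IP as in the existing traffic-shape group setup, might look like the sketch below. The ACL name SAN-REPL and the address 192.0.2.10 are hypothetical placeholders.

```
! Hypothetical ACL name and address; substitute the SAN's real IP
ip access-list extended SAN-REPL
 permit ip host 192.0.2.10 any
!
! Class referenced by policy-map prioritzetraffic
class-map match-all SANtraffic
 match access-group name SAN-REPL
```

The class-map needs to exist before the policy-map that references it is entered.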

jplowick3 · ‎05-29-2009

I'll give that a shot. I didn't think of using the bandwidth remaining command. I had tried just using the bandwidth command and that filled up the MAN connection causing slowness for the users.

Joseph W. Doherty · ‎05-29-2009

Nothing special about the "bandwidth remaining". What's important is class bandwidth ratios. If you try 1%, other traffic should obtain a higher priority vs. the SAN traffic, if such traffic wants the bandwidth. Yet SAN can still use 100% of the link (if the bandwidth is available).

What I'm suggesting shouldn't (much) delay non-SAN traffic if there's SAN traffic, but SAN traffic can be delayed by non-SAN traffic.

You will need to ensure that SAN isn't too starved for bandwidth if you go with a low bandwidth %. Also know that, on most small router platforms and IOS versions, FQ is a special case within class-default, and you might not see a 1:99 bandwidth ratio.
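One way to check whether the SAN class is being starved is to watch the per-class counters once the policy is attached. A sketch, with a placeholder interface name:

```
! Per-class statistics for the policy attached outbound,
! including offered rate and drops for class SANtraffic
show policy-map interface FastEthernet0/0 output
```

If the SAN class shows sustained drops while replication falls behind, the bandwidth ratio is probably too low for it.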