LAG Speed is not as expected

David Lee
Level 1

The basic issue is that I am not getting the speed I expected out of a four-member 1 Gb LAG group.

 

I have two 3750X switches stacked.  Across the stack, I have two ports from each switch in a 4-member LAG attached to one server.  The server has Broadcom NICs, which are bonded through the Broadcom management software into an LACP LAG.  Both the server and the switch report the speed as 4 Gb/s.  I have a twin server set up exactly the same way.  When I transfer a 4.2 GB test file from one server to the other, I only get speeds as if a single NIC were plugged in.  The highest speed I saw was 110 MB/s; I was expecting 4x that.  Below is output from the switch for the port channel and one of the member interfaces.  Am I just unclear on how LACP works, or am I missing something somewhere?  This is a new cluster setup that is not operational yet, so if I messed up the port channel or interface settings I can change them at any time.

 

GGS-C3750X-05-STACK#sh int port-channel 16
Port-channel16 is up, line protocol is up (connected)
  Hardware is EtherChannel, address is 5057.a888.8390 (bia 5057.a888.8390)
  Description: SQLNode 1 LAG
  MTU 9000 bytes, BW 4000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, link type is auto, media type is unknown
  input flow-control is off, output flow-control is unsupported
  Members in this channel: Gi1/0/16 Gi1/0/17 Gi2/0/16 Gi2/0/17
  ARP type: ARPA, ARP Timeout 04:00:00
 

GGS-C3750X-05-STACK#sh run | s interface GigabitEthernet1/0/16
interface GigabitEthernet1/0/16
 description SQLNode1 -1
 switchport access vlan 5
 switchport mode access
 spanning-tree portfast
 channel-group 16 mode active

 

Thanks in advance.

David


9 Replies

Peter Paluch
Cisco Employee

David,

Your results are perfectly valid and as expected.

The link aggregation technique (or EtherChannel, as Cisco calls it) works by choosing a particular link for each passing Ethernet frame and sending the whole frame over that single link. The link selection mechanism on Catalyst switches is based on the addressing information carried in the frame, and can usually be one of the following:

  dst-ip       Dst IP Addr
  dst-mac      Dst Mac Addr
  src-dst-ip   Src XOR Dst IP Addr
  src-dst-mac  Src XOR Dst Mac Addr
  src-ip       Src IP Addr
  src-mac      Src Mac Addr

This output is taken directly from a Catalyst 3560 (the options for the port-channel load-balance global configuration command). Some higher-end switches can also take Layer 4 ports into account. These options cause the switch to perform a hashing function over the configured address field (or fields) and use the result of the hash to point to a particular link in the bundle.

Note that for a single data flow between two fixed hosts in a particular direction, all the addressing information is the same - source and destination MACs, source and destination IPs. Consequently, performing the hash on any of the available address field choices will produce the same value, and as a result, the entire flow in one direction will be carried by a single link in an EtherChannel bundle. For the data flow in the opposite direction, the situation is just the same - it will be carried by just a single link (perhaps the same as the original flow, perhaps another; it does not make a difference). The key takeaway is that with an EtherChannel, you will not see any increase in throughput for a single flow, because that flow is still carried by a single link in the bundle. Only when you have multiple flows between various sources and/or destinations will those flows be distributed across multiple links in the bundle, and the aggregate throughput will be higher.
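If you want to see this on your own stack, many Catalyst platforms (the 3750X included, as far as I know) provide a test command that reports which member link a given address pair would hash to; the IP addresses below are only placeholders for your two SQL nodes, and use the mac form of the command instead if your load-balancing method is MAC-based:

GGS-C3750X-05-STACK# test etherchannel load-balance interface port-channel 16 ip 10.0.5.11 10.0.5.12

Whatever member link the switch reports is the one and only link that every frame of that flow will use; only a change in the hashed address fields can move the flow to a different link.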

Some people argue that a simple round-robin technique would solve this issue: have the switch send frames over all links in a cyclic sequence, i.e. with two links in a bundle, send frame 1 over link 1, frame 2 over link 2, frame 3 over link 1, frame 4 over link 2, and so on. Cisco does not implement this method because of an important drawback: it can cause frames to arrive at the destination in a different order than they were sent, that is, frame reordering may occur. Because frame reordering never occurs on a plain Ethernet link, and EtherChannel is supposed to be a transparent technique, it must not introduce any new phenomena that were not present with basic Ethernet, so frame reordering is a big no-no. Some would argue that TCP will handle the reordering. That is true; however, TCP generally slows down aggressively when reordering is detected, causing poor performance, and moreover, not all traffic is TCP-based, or for that matter even IP-based. An application that uses direct Layer 2 encapsulation into Ethernet frames, relying on the fact that Ethernet preserves ordering, would be badly broken.

LACP has nothing to do with how the EtherChannel works and distributes the load across multiple links. LACP is only a signalling/negotiation protocol that allows two directly connected devices to negotiate whether the links are to be aggregated, and it performs certain sanity checks to avoid creating an EtherChannel from links that either are not bundled together at the opposite device, do not belong to the same aggregation group at the opposite device, or are not connected to the same opposite device at all. That is its only purpose - to negotiate the creation of the bundle and to make sure that the links can be validly bound without causing issues. Once the EtherChannel has been negotiated and brought up, LACP merely serves a monitoring function but has no impact on how frames are actually carried over the bundle. Using LACP is highly recommended, but the fact that you're not seeing a fourfold increase in throughput is not LACP's fault - it's just the way EtherChannel works, and neither LACP nor its proprietary predecessor PAgP can do anything about it.
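As a quick sanity check that LACP actually negotiated the bundle (this confirms only the bundle state, not how traffic is distributed), something along these lines should work, assuming your IOS release uses the same command set:

GGS-C3750X-05-STACK# show etherchannel 16 summary
GGS-C3750X-05-STACK# show lacp neighbor

All four member ports should show as bundled in Port-channel16, and the LACP neighbor output should list the server's NICs as the partner on all four ports.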

What I can recommend is that you make sure your Catalyst bases the load-balancing choice for the EtherChannel on the source and destination IP, or even on source and destination L4 ports where supported, as that choice gives you the maximum likelihood that different flows will be spread across different links in the bundle:

SW-Dist1(config)# port-channel load-balance ?
  dst-ip       Dst IP Addr
  dst-mac      Dst Mac Addr
  src-dst-ip   Src XOR Dst IP Addr
  src-dst-mac  Src XOR Dst Mac Addr
  src-ip       Src IP Addr
  src-mac      Src Mac Addr
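
On your 3750X stack, the change would look roughly like this (a sketch only - the load-balancing method is a global setting and applies to every EtherChannel on the stack):

GGS-C3750X-05-STACK# configure terminal
GGS-C3750X-05-STACK(config)# port-channel load-balance src-dst-ip
GGS-C3750X-05-STACK(config)# end
GGS-C3750X-05-STACK# show etherchannel load-balance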

However, the added value of an EtherChannel starts to show only when there are multiple flows carried by it. A single flow will not experience any improvement.

Best regards,
Peter

Cheers Peter,

So I guess my next question is: is there a way to bond 1 Gb/s links so that bandwidth is increased while still providing redundancy and failover?

 

David

 

Hi David,

is there a way to bond 1 Gb/s links so that bandwidth is increased while still providing redundancy and failover?

Unfortunately, I know of nothing of the sort. Cisco avoids round robin for EtherChannels, and to the best of my knowledge there is no other Ethernet-based technology that would split a single frame, carry it over parallel links, and then reassemble and forward it - doing so would require processing speeds difficult to achieve at wire rate. I am sorry to disappoint you here.

Best regards,
Peter

It happens.  Thanks for the information.  I was hoping I could get the best of both worlds: link resiliency and increased bandwidth.  Unfortunately, at this time my company won't let me upgrade everything to 10 Gig.  Even if they did, I would have to have at least two links from each server connected for failover, since these are database servers.

David,

It is not all bad news.  As you have noticed, the best you can do on a single transfer is 1 Gbps. However, typical network traffic consists of many sessions all going on at once.  Each session may be restricted to a single link, but the sessions themselves will be distributed between the links according to the distribution hash algorithm, and that is a matter of statistics.  With many sessions going on at once, you should be able to achieve an aggregate throughput well  in excess of your 1 Gbps, even if each individual session cannot achieve more than 1 Gbps.
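
If you want to see that statistical spread for yourself once several transfers run in parallel, watching the per-member interface rates should show it (a rough sketch - the exact output varies by IOS version):

GGS-C3750X-05-STACK# show etherchannel 16 port-channel
GGS-C3750X-05-STACK# show interfaces Gi1/0/16 | include rate
GGS-C3750X-05-STACK# show interfaces Gi1/0/17 | include rate
GGS-C3750X-05-STACK# show interfaces Gi2/0/16 | include rate
GGS-C3750X-05-STACK# show interfaces Gi2/0/17 | include rate

With enough simultaneous sessions, the rates on the four member links should add up to well over 1 Gbps, even though no single link ever exceeds its own 1 Gbps.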

Kevin Dorrell

Luxembourg

 

Kevin,

Thanks for joining! You've said that very well - to the point, clear and concise. A perfect wrap-up of all the important points.

Best regards,
Peter

David,

Peter has already answered your questions perfectly! I just want to add that if a bandwidth increase is very important to you and your servers have 10 Gig links, you could get a 10 Gig fiber module for your 3750s and use those ports to uplink to the servers. Of course, if you are already using the 10 Gig uplinks for something else, then there is no way to increase the bandwidth.

If you can wait, Cisco is coming out with a new 3850 switch that will have 24 10 Gig copper ports and 2 40 Gig uplink ports.  It is due in a few months.

HTH

Hi Reza,

Thank you for joining - and thanks for the kind words!

I agree with your suggestion - EtherChannel is, to a certain point, a poor man's "faster Ethernet". It has the very nice properties of providing high availability, redundancy, and even higher aggregate throughput, which is nice from a macroscopic view, but it all boils down to utilizing multiple Ethernet links for multiple flows, with one flow being carried by a single link only. In some environments, this is a welcome improvement and an added value. But the technology is not really a replacement for a faster Ethernet link, and if individual flows require more throughput than any single link in the EtherChannel bundle provides, then this technology just isn't the tool to make up for the lack of bandwidth.

As always, it all depends on what the requirements are.

It's nice to hear about the upcoming 3850 switches! The notion of 24x 10GigE is just ridiculous :) I remember 2400bps modems, you know :)

Best regards,
Peter

Hi Peter,

It's nice to hear about the upcoming 3850 switches! The notion of 24x 10GigE is just ridiculous :) I remember 2400bps modems, you know :)

Absolutely, Cisco is trying to keep up with the market demand and competition. 

I even remember 300bps modems and 10Mb hard drives :)

 

Thanks,

Reza
