802.3ad link aggregation switch to switch bandwidth: expecting 2Gb, getting 1Gb

Matthew Millman
Level 1

I've just encountered some behavior with dynamic link aggregation between switches which I wasn't expecting -

I have this scenario where I'm expecting 2.0 Gbps aggregate bandwidth from the server to the PCs, and I get 2.0 Gbps, as expected:

     +---------+
     |  Server | <--- Using NIC Teaming
     +---------+
        |   | GbE x2 (802.3ad)
     +---------+
     | 2960G 1 |
     +---------+
        |   |
    +--+   +--+
    |         | 
+-------+ +-------+ 
|  PC   | |  PC   |
+-------+ +-------+

But when I change the topology to this:

     +---------+
     |  Server | <--- Using NIC Teaming
     +---------+
        |   | GbE x2 (802.3ad)
     +---------+
     | 2960G 1 |
     +---------+
        |   | GbE x2 (802.3ad)
     +---------+
     | 2960G 2 |
     +---------+
        |   |
    +--+    +--+
    |          | 
+-------+ +-------+ 
|  PC   | |  PC   |
+-------+ +-------+

Now I'm down to 1.0 Gbps, and I don't understand why. Everything appears to be OK.

On switch 1:

Switch#show lacp nei

Partner's information:

(Link to server)

              LACP port                      Admin  Oper  Port    Port
Port   Flags  Priority  Dev ID          Age  key    Key   Number  State
Gi0/1  FA     0         0030.xxxx.xxxx  25s  0x0    0x0   0x2     0x3F
Gi0/2  FA     0         0030.xxxx.xxxx  26s  0x0    0x0   0x1     0x3F

Channel group 2 neighbors

Partner's information:

(Switch to switch link)

              LACP port                      Admin  Oper  Port    Port
Port   Flags  Priority  Dev ID          Age  key    Key   Number  State
Gi0/3  SA     32768     0025.xxxx.xxxx  24s  0x0    0x1   0x102   0x3D
Gi0/4  SA     32768     0025.xxxx.xxxx  10s  0x0    0x1   0x103   0x3D

Switch 2:

Channel group 1 neighbors

Partner's information:

              LACP port                      Admin  Oper  Port    Port
Port   Flags  Priority  Dev ID          Age  key    Key   Number  State
Gi0/1  SA     32768     04c5.xxxx.xxxx  13s  0x0    0x2   0x104   0x3D
Gi0/2  SA     32768     04c5.xxxx.xxxx  12s  0x0    0x2   0x105   0x3D

Grateful for any thoughts as to how to solve this!

1 Accepted Solution

The source mac load balancing may be your issue.

Change it on both switches to be:

#conf t
#port-channel load-balance src-dst-ip


20 Replies

devils_advocate
Level 7

How are you measuring the throughput?

What load balancing method are you using on the 2960G-1 switch?

It may be src-mac and/or src-ip which could be why only one link between the switches is being used.

Run the following command to check the load balancing method:

show etherchannel load-balance 

I'm measuring it just by copying files from Windows shares.

With the PCs connected to the first switch, I can sustain two 112 MB/s streams consistently, so I know that from the Windows point of view everything is set up correctly.

But when I connect the PCs behind the second switch, that halves to two 56 MB/s streams.

Both lights blink equally on the switch-to-switch link during the copy, too.

The result of show etherchannel load-balance is the same on each switch:

EtherChannel Load-Balancing Configuration:
src-mac
EtherChannel Load-Balancing Addresses Used Per-Protocol:
Non-IP: Source MAC address
IPv4: Source MAC address
IPv6: Source MAC address

The source mac load balancing may be your issue.

Change it on both switches to be:

#conf t
#port-channel load-balance src-dst-ip
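For anyone else following along, a rough sketch of the complete change, assuming the port-channels are already up and the switch supports the src-dst-ip method. Note that the setting is global, so it affects every EtherChannel on the switch, and it takes effect without a reload:

Switch# configure terminal
Switch(config)# port-channel load-balance src-dst-ip
Switch(config)# end
Switch# show etherchannel load-balance

After the change, the show command should report the IPv4 method as a source XOR destination IP address hash rather than source MAC. Apply the same change on both switches, since each switch hashes independently for the traffic it transmits onto its end of the bundle.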

Unfortunately I can't try that out at the minute, but to assist my understanding here -

Is the problem that because I have "src-mac", all the traffic ends up with the same hash, because it's coming from the same host, and therefore only uses one of the links between the switches?

Which leads me to my next concern: let's say I switch to "src-dst-ip". I've fixed my original case between two PCs and one server.

But let's say I'm copying multiple streams between two servers each with a single IP address. I assume this problem will remain?

"Is the problem that because I have "src-mac", all the traffic ends up with the same hash, because it's coming from the same host, and therefore only uses one of the links between the switches?"

Yes, my understanding is that it load-balances on a per-frame basis, so frames coming from the server will all have the same source MAC and will therefore use the same link, regardless of the fact that they have different destinations.

But let's say I'm copying multiple streams between two servers each with a single IP address. I assume this problem will remain?

Yes, I believe that will be the case. 

Some Catalyst switches have the option of load balancing on higher-level protocol fields, like the source and destination port numbers, but I am unsure if the 2960G has this ability.

Do a #conf t

Then do the following:

#port-channel load-balance ?

This will show what options you have available; some lower-end switches can only load-balance on the source or destination MAC.

Thanks

I am unsure if the 2960G has this ability.

It appears you are correct. On a 3750 I can go as high as TCP/UDP port, but the 2960G only goes to IP.
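For reference, the 2960G's full list from that command is just the MAC and IP variants; it looks roughly like this (the exact help text varies a little between IOS releases):

Switch(config)# port-channel load-balance ?
  dst-ip       Dst IP Addr
  dst-mac      Dst Mac Addr
  src-dst-ip   Src XOR Dst IP Addr
  src-dst-mac  Src XOR Dst Mac Addr
  src-ip       Src IP Addr
  src-mac      Src Mac Addr

Platforms that can hash on Layer 4 information list additional src-port / dst-port / src-dst-port style keywords.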

Assuming an L2+ switch, the only way around this would be to take the NICs on one of the hosts out of the 802.3ad team, and give them separate IP addresses so that the frames would generate different hashes as they traverse the inter-switch link.

Another wee note to anyone else who hits this problem:

It's not good enough to use src-dst-ip or src-dst-mac alone to guarantee 2.0 Gbps throughput between switches. If the source/destination MAC or IP address combinations of each end interface end up generating the same hash, even though they're different, you'll still only end up with 1.0 Gbps.

You will, by trial and error, have to change the IP address of one of the PC interfaces until the switches consider them different by hash; only then will you get 2.0 Gbps.

In my case I found that using src-dst-mac worked better, and I changed the MAC addresses of the second NICs until they generated different hashes. I was running dual-stack IPv4 and IPv6, and fixing it by IP address would have meant fiddling with the addresses on both protocols.
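One thing that can take some of the trial and error out of this: many Catalyst switches (the 3750 family, and recent 2960 IOS releases) let you ask the hardware which member link a given address pair would hash onto, without sending any traffic. The port-channel number and addresses below are just placeholders for the switch-to-switch bundle:

Switch# test etherchannel load-balance interface port-channel 2 mac 0030.aaaa.aaaa 0025.bbbb.bbbb
Would select Gi0/3 of Po2

Switch# test etherchannel load-balance interface port-channel 2 ip 192.168.1.10 192.168.1.20
Would select Gi0/4 of Po2

If the flows you care about land on the same member, adjust one address and re-run the test until they diverge, rather than re-copying files each time.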

Hello all,

I'm trying to solve a related issue with LACP links and bandwidth aggregation, so I've read this thread after trying, without success, to find any documentation about the issue.

My customer's problem is the same as the one this thread describes, but even worse. In our case the traffic is load-balanced across all the interfaces belonging to the LACP bundle, but the total traffic can't grow beyond 1 Gbps.

I've searched the related documentation but didn't find anything about bandwidth aggregation except this thread.

The production environment has the following devices connected through LACP: an ASR-1001, a WS-3750X (12 fiber ports) and a WS-6509E. The load-balance algorithm is src-dst-ip on the switches and flow-based on the ASR.

Has anyone found and solved this issue?

Thanks

If your problem is the same as mine was, all of your frames are generating the same hash, so the switch is only using a single port of the LACP bundle.

I am not sure what you mean by load balancing, but LACP does not have any concept of load balancing. Instead, each frame gets a hash, and the frame is sent down the corresponding port in the bundle.

The hash the frames end up with is determined by how you have configured the switches and what addressing you're using.
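A quick way to confirm which situation you're in is to watch the per-member counters on the switch-to-switch bundle while the transfer is running (Gi0/3 and Gi0/4 being the switch-to-switch members on switch 1 in my case):

Switch# show etherchannel summary
Switch# show interfaces Gi0/3 counters
Switch# show interfaces Gi0/4 counters

If one physical interface is carrying essentially all of the output traffic, the hashing is putting every flow onto the same link; if the traffic really is split across the members and you still can't exceed 1 Gbps in total, then something other than the hash is the limit.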

Hi Matthew,

Thank you for your reply. Let me clarify that when I speak about load balancing I am referring to the port-channel load-balance commands applied globally, and that the EtherChannels we have on the switches are port-channel interfaces established and operating through LACP.

The amazing thing (to me) about our case is that the traffic is distributed to all the ports (4 in the production environment and 2 in a test lab) and the balancing works, but the sum of the traffic is at most 1 Gbps: 500 + 500 Mbps with 2 ports and 250 + 250 + 250 + 250 Mbps with 4 ports.

Regards

That is the very same command mentioned earlier in this thread, which sets the hashing mechanism.

I think this has already been sufficiently explained in this thread!

Hi again,

I posted here because this is the only thread that describes a problem similar to ours. Maybe I didn't explain it well before.

The problem we are facing is that the traffic is going out over more than one port of the LAG, but bandwidth aggregation is not happening. We can generate more than 1 Gbps of traffic between the hosts when they are connected directly, and also when the hosts (which are bonded for this purpose) are connected to the same switch, but the LAG between the 3750-X and 6509-E switches does not carry more than 1 Gbps in aggregate; it only distributes the traffic between the ports while limiting the total to 1 Gbps.

In other words, the LAG between the switches is only giving us HA, not bandwidth aggregation (the total traffic of the EtherChannel stays below 1 Gbps), while a single switch connected to the bonded interfaces of a host works well and the aggregated traffic goes over 1 Gbps.

Regards

You have explained yourself very well, and it appears your issue is exactly what was described on this thread.

You cannot get more than 1 Gbps through a LAG link when the source is a host with an aggregated connection.

Because that host has the same IP/MAC address, all of those frames will have the same hash and will always travel up the same port on the LAG link, so you will never get more than 1 Gbps.

I'm sorry, but we have this problem in a production environment with aggregated traffic of 4 Gbps coming from more than 10 public IPv4 class C networks and a public IPv6 /32, exchanging traffic with 2 Internet providers. We have two Cisco switches (a 6509 and a 3750), an ASR1001 Cisco router and a Fortinet firewall with a maximum throughput of 8 Gbps.

Then we built a test lab on a separate VLAN of both switches, using the src-dst-ip hashing method and generating iperf traffic from different sources, and we saw that the problem is in the Cisco switches: the traffic goes across all the Ethernet ports of the EtherChannel, but there is no bandwidth aggregation between these switches; the sum of the traffic going through all the ports is limited to 1 Gbps and never reaches the total aggregated capacity of the ports.
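For anyone trying to reproduce this kind of test, the switch-side sanity checks that go with it are along these lines (the channel-group number is a placeholder):

Switch# show etherchannel summary
Switch# show etherchannel load-balance
Switch# show interfaces port-channel 1

show etherchannel summary should flag every member as (P), bundled in the port-channel, and the port-channel interface's BW figure should reflect the sum of the member links; if either of those is off, the bundle itself, rather than the hashing, is the thing to chase.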

We have the correct configuration and have read the release notes and other Cisco documentation, finding no reference to this issue of not getting aggregated bandwidth in an EtherChannel, so I've posted in the forum.

Regards
