3750X Poor Performance

Ryan Fisher
Level 1

I have a bit of an odd situation.  At my DR site, I used to have 2 3560 switches that were port-channeled together.  I recently swapped those switches out for 2 3750X switches in a stack, and copied the identical configuration from the original 3560 switches onto them.

 

I have a 1 Gb WAN connection from my main site that I use mostly for SAN replication. With the old 3560 switches, I was able to max out that circuit and push almost the full 1 Gb of bandwidth. Since the swap to the new 3750X switches, I can't get it to pass more than 200 Mb/s on that port. Like I said, these switches have the same config, so there's nothing new there, and there is no QoS either. I've checked the ports for errors: there are none, and they negotiated properly at 1000/full. I'm out of ideas on things to check and would greatly appreciate any guidance on what to look at.

 

Thanks!

 

drcore01-3750x#sh int gi1/0/48
GigabitEthernet1/0/48 is up, line protocol is up (connected) 
  Hardware is Gigabit Ethernet, address is 6c20.564d.4ab0 (bia 6c20.564d.4ab0)
  Description: cox 1gb metroE
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec, 
     reliability 255/255, txload 2/255, rxload 49/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
  input flow-control is off, output flow-control is unsupported 
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:01, output 00:00:00, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 142
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 195645000 bits/sec, 17395 packets/sec
  5 minute output rate 11115000 bits/sec, 12417 packets/sec
     317861817 packets input, 444836613365 bytes, 0 no buffer
     Received 164662 broadcasts (161104 multicasts)
     0 runts, 0 giants, 0 throttles 
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 161104 multicast, 0 pause input
     0 input packets with dribble condition detected
     237971247 packets output, 40530223909 bytes, 0 underruns
     0 output errors, 0 collisions, 1 interface resets
     0 unknown protocol drops
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 pause output
     0 output buffer failures, 0 output buffers swapped out
drcore01-3750x#sh run int gi1/0/48
Building configuration...

Current configuration : 190 bytes
!
interface GigabitEthernet1/0/48
 description cox 1gb metroE
 switchport trunk allowed vlan 1,15,501,521,920,980
 switchport trunk encapsulation dot1q
 switchport mode trunk
end
Switch Ports Model                     SW Version            SW Image                 
------ ----- -----                     ----------            ----------               
*    1 54    WS-C3750X-48              15.2(4)E6             C3750E-UNIVERSALK9-M     
     2 54    WS-C3750X-48              15.2(4)E6             C3750E-UNIVERSALK9-M    

Hello,

 

--> Gi1/0/48 Root FWD 4 128.48 P2p 

 

This means that the root switch is on the other side. Can you issue the same command on the HQ switch ?

Here you go.  Thanks!

 

SDED01-3750#sh spanning-tree vlan 1

VLAN0001
  Spanning tree enabled protocol ieee
  Root ID    Priority    32769
             Address     0022.0ca9.8900
             This bridge is the root
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec

  Bridge ID  Priority    32769  (priority 32768 sys-id-ext 1)
             Address     0022.0ca9.8900
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec
             Aging Time  300 sec

Interface           Role Sts Cost      Prio.Nbr Type
------------------- ---- --- --------- -------- --------------------------------
Gi1/0/28            Desg FWD 4         128.28   P2p 


SDED01-3750#sh spanning-tree vlan 980

VLAN0980
  Spanning tree enabled protocol ieee
  Root ID    Priority    33748
             Address     0022.0ca9.8900
             This bridge is the root
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec

  Bridge ID  Priority    33748  (priority 32768 sys-id-ext 980)
             Address     0022.0ca9.8900
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec
             Aging Time  300 sec

Interface           Role Sts Cost      Prio.Nbr Type
------------------- ---- --- --------- -------- --------------------------------
Gi1/0/28            Desg FWD 4         128.28   P2p 


SDED01-3750#sh run int gi1/0/28
Building configuration...

Current configuration : 240 bytes
!
interface GigabitEthernet1/0/28
 description cox 1gb metroE net 
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 1,15,501,521,920,980
 switchport mode trunk
 switchport nonegotiate
 speed nonegotiate
end

I guess you don't know if the old 3560 at the DR site was the root or not...?

 

Either way, judging from the graphs you posted earlier, you have a lot more outgoing than incoming traffic. Which of the two switches is more central to the network, the 3750 at the DR site or at the HQ site ?

Yeah, I couldn't say now that it's not connected anymore.

 

The outgoing traffic is heavy because it's all san replication traffic from HQ to the DR site.  I guess you could say the 3750G at HQ is the more central switch, because that's our main datacenter and all the remote sites come back to that.  All the remote sites can connect to DR through the metroE, but there's really nothing for them there because everything is hosted at HQ.

 

Thanks

In that case it makes sense that the HQ switch is the root...

 

I wonder if the problem is the SAN traffic. Is it possible, for the purpose of testing, to send a 'regular' file of considerable size across the link and see how long that takes ?

We did do that once already, but I can do it again.  I'll let you know.

 

Thanks!

Hello,

 

just to be sure that MTU is not a problem somewhere in the path, I would send a few pings to the HQ site server, with the DF bit set, at different sizes to check at what size packets get fragmented, e.g.:

 

ping -f -l 1472 192.168.1.1
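As a sanity check on the arithmetic behind that command: with a standard 1500-byte Ethernet MTU, the largest ICMP payload that fits unfragmented is 1500 minus the 20-byte IPv4 header and the 8-byte ICMP header, i.e. 1472 bytes, which is why the example uses `-l 1472`. A minimal sketch of the calculation (the sweep sizes below are just illustrative):

```python
# Largest ICMP payload that fits in a given MTU without fragmentation.
# IPv4 header = 20 bytes, ICMP header = 8 bytes.
IP_HEADER = 20
ICMP_HEADER = 8

def max_icmp_payload(mtu: int) -> int:
    """Largest 'ping -f -l <size>' (Windows) / 'ping -M do -s <size>'
    (Linux) payload that should pass with the DF bit set."""
    return mtu - IP_HEADER - ICMP_HEADER

# Sizes to sweep when hunting for a smaller MTU somewhere in the path:
for mtu in (1500, 1492, 1400):
    print(mtu, max_icmp_payload(mtu))
# 1500 -> 1472 (standard Ethernet), 1492 -> 1464 (PPPoE), 1400 -> 1372
```

If pings with `-l 1472` fail with DF set but smaller sizes succeed, something in the path has a smaller MTU than 1500.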

We can do that. We're working on file copy tests, but have been finding some really odd behavior. One issue was that server-to-server throughput contained within one of the UCS fabric interconnects was horribly slow. We failed it over to the other FI and it was as expected, so for whatever reason that one FI was messed up. We rebooted them both and now they're functioning fine.

Another thing we're finding is that Windows file copies from HQ to DR are extremely painful: an ISO file copies at about 20 Mb/s. Copies from DR to HQ are much faster, but still slower than they should be, at around 200 Mb/s.

The other really strange thing is with iperf: testing from a Windows machine at HQ to a Windows machine at DR shows about 6 Mb/s, whereas the Linux version of iperf between machines on the same networks shows anywhere from 200-500 Mb/s. Those are virtual machines, so we also tested from physical machines, both Windows and Linux; the Windows machine was still horrible at 6 Mb/s, while the Linux machine showed close to 900 Mb/s.

We also ran iperf to machines at the remote site, and both Windows and Linux were able to max out the bandwidth at 500 Mb/s (that's the circuit size to that site).
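For anyone wanting to reproduce these tests, a typical iperf run looks like the following (the hostname is a placeholder, and exact flags depend on the iperf version installed):

```
# On the DR-site machine (server side):
iperf3 -s

# On the HQ machine (client side): 30-second TCP test, report every 5 s
iperf3 -c dr-host.example.com -t 30 -i 5

# Test the reverse direction without swapping roles (iperf3 only):
iperf3 -c dr-host.example.com -t 30 -R
```

Running it in both directions from the same client is useful here, since the problem in this thread is asymmetric (HQ-to-DR much slower than DR-to-HQ).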

So, at this point I really don't have any idea what's going on! But if that one test from the physical Linux machine is correct and it's pushing 900 Mb/s, then maybe the issue isn't the switch after all. It's just that the switch was the only change made, which is why we've been focusing on it.

Thanks for the help

Yes, this is an interesting issue. Difficult to resolve without working on it interactively.
Also interesting is the change in your graphs. Assuming you swapped devices at "week 42", your sustained throughput looks more constant, i.e. without the extreme peaks and valleys.

Yes, that's because the peaks and valleys represent when the replications finish and the bandwidth frees up. Right now they're not finishing because they're not getting full bandwidth, which is why it looks steady. But yeah, you can definitely see on that chart when the switch was swapped out.

So, it appears to be a hardware problem with the switches.  We put Wireshark on one of the machines at the DR location, and even a simple speed test to the internet shows a ton of TCP retransmits, but only on the upload; the download is much better.  We also sniffed packets during iperf tests between machines on the same switch there, and between the two switches in the stack.  All tests show multiple retransmits, which dramatically affected performance.  We also disabled global mls qos, which actually improved upload performance to the internet: uploads were going about 2 Mb/s, and after QoS was off they increased to about 29 Mb/s.
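For reference, disabling QoS globally on a 3750X and re-checking the drop counters on the WAN port looks roughly like this (a sketch using standard IOS commands; the interface number comes from the output posted earlier):

```
drcore01-3750x# configure terminal
drcore01-3750x(config)# no mls qos
drcore01-3750x(config)# end
drcore01-3750x# clear counters GigabitEthernet1/0/48
drcore01-3750x# show interfaces GigabitEthernet1/0/48 | include output drops
```

With `mls qos` enabled, these platforms carve the interface buffers into per-queue pools, which can cause egress drops under bursty traffic even when the interface shows no input/output errors; `show mls qos interface GigabitEthernet1/0/48 statistics` shows the per-queue drop counters if QoS is on.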

 

I'm not quite sure what's going on, but at this point we've decided to put the old switches back in until we get new ones, probably 9300-48s.  The only thing I can think of is that there's some kind of hardware problem, because we tried three different IOS versions on them, none of which made any difference.

 

Thanks everyone for your help.

Hello,

 

how old are these switches? In any case I would get in touch with TAC (if you have a service contract) and see what they say...

I actually just bought them refurbished, so no smartnet.  :(   The manufacture date is 2012 so they're about 6 years old.  Should still work fine, as I'm running Cisco stuff older than dirt.  (those 3560's that were out there were much older than that!)

 

I've had good luck with refurbished stuff in the past, just not this time.  You win some, you lose some, I guess.

You can still get better ones on the market, and most refurb vendors will support replacements if you are a loyal customer.

BB


Any new ones I buy (9300) are going to be legit new ones.

 

Thanks
