Slow performance over network

achristian1221 · ‎08-29-2017

I have a bunch of users getting horrid transfer speeds between locations. Here is the breakdown of the network.

All locations are using 7018 switches running VPC and one major revision behind on code.

3 locations:

A

B

C

Location A is connected to B via 4 X 10 gig point to points. ECMP is in use via OSPF to the other pair in location B. The connections are 2 per switch (A1-B1 X2 and A2-B2 X2)

Location B and C are connected to each other via 14 total 10 gigabit point to points, or 7 X 7 also running ECMP via OSPF. These connections are a little better as it is 4 to one switch and then 3 to the other from B to C. Here is a show ip route between the 2 for reference:

10.91.0.0/24, ubest/mbest: 7/0
*via 192.168.53.26, Eth2/46, [110/44], 6w6d, ospf-1, intra
*via 192.168.53.114, Eth1/46, [110/44], 6w6d, ospf-1, intra
*via 192.168.53.130, Eth1/48, [110/44], 6w6d, ospf-1, intra
*via 192.168.53.134, Eth4/12/1, [110/44], 6w6d, ospf-1, intra
*via 192.168.53.138, Eth4/12/2, [110/44], 6w6d, ospf-1, intra
*via 192.168.53.150, Eth4/12/3, [110/44], 6w6d, ospf-1, intra
*via 192.168.53.154, Eth4/12/4, [110/44], 6w6d, ospf-1, intra

From those connections, the servers are all sitting on any combination of 5548, 5672, and 5696 switches and at 10 gigabit. All servers are connected via LACP and VPC.

Using Solarwinds, and netflows on the switches as well as show interface, none of the point to point connections are going over 48% utilization at any time. CPU and memory are also low on the 7k switches, maxing around 30% over historical data.

Here is where the issues start. There are two main users on the network, Avid and Scality. They both transfer massive amounts of data. Transfers internally are fine and they can pin the 10 gigabit interfaces to capacity. So, transfers that remain in location A will be fine. Transfers from B to B and C to C will also be fine. Once we traverse other locations we decline and do so rapidly.

Here is an A to A transfer:

hsepl-srnode-04.eng.homebox.com hsepl-admin01.eng.homebox.com 170815 09:24:32 1100 ***********

hsepl-srnode-05.eng.homebox.com hsepl-admin01.eng.homebox.com 170815 09:24:51 1100 ***********

hsepl-srnode-13.eng.homebox.com hsepl-admin01.eng.homebox.com 170815 09:27:20 1100 ***********

Here is a C to A transfer:

hc2pl-srnode-12.eng.homebox.com hsepl-admin01.eng.homebox.com 170815 09:27:01 379.2 ***

hc2pl-srnode-16.eng.homebox.com hsepl-admin01.eng.homebox.com 170815 09:28:16 380.3 ***

Lastly, A to B

hsepl-admin01.eng.homebox.com hc1pl-srnode-08.eng.homebox.com 170815 09:31:22 677.7 ******

As you can see, the A to A is fine. All Scality is based off the Cisco CVD minus jumbo frames. All of the storage servers are connected to the 5696 switches at this time.

The speeds for Avid are even worse at times and are more disparate, with speeds going down to 60 mb/s and up to 700 at times. This is also not consistent as far as time of day or anything like that. A test run today could be done at 9AM and get 700 mb/s and the next day it is sitting at 60.

Steps that have been taken so far to resolve this:

Added new F3 line cards to 4/6 7018. This did nothing but the previous staff indicated that this was the fix all for everything and arguing it was not an option.

Added 8 x 10 gigabit point to points using the new F3 line cards. Result was that I went from 1.2 million input discards to about 30k.

Migrated all Scality and Avid devices over to the 5696 switches. Results were that immediately, we could see that the trunks between those switches and the 7018 F3 cards started passing traffic and using approximately 45% of the bandwidth over the 40 gig twin-ax connections. Moving all of these connections from the 5548 switches I also saw the discards drop to almost nothing, save for a few hardware related issues that were addressed as they come up.

Thoughts, direction, etc? I would pull all my hair out with what I walked into here but I am bald already. Thanks.

dperezoquendo · ‎08-29-2017

Hello,

How and what is the connection between sites? Are you using QoS at all? What about your Service Provider? I've seen some Service Provides implement policers before that could essentially be dropping traffic.

achristian1221 · ‎08-30-2017

Service Provider is Lightower. There is a total of 4 point to point connections between A and B.

QOS is in place but on a limited nature. From the 5K switches, the COS value is changed from 0 to 5 to give it priority.

There is no rate limiting/policing on the circuits. This was verified from the provider and tested. We have been able to burst to roughly 120% of the circuit.

dperezoquendo · ‎08-30-2017

What kind of QoS is emplaced? Do you see any suspicious drops on the service-policies? If you're able to replicate the issue, could you try clearing the counters on the service policy and then check for drops?

Also, would it be possible to view the configuration for the point-to-point connections?

achristian1221 · ‎08-30-2017

Here is the QOS on the 5k switches:

ip access-list QOS
10 permit ip (subnet) (subnet)

class-map type network-qos mutation
match qos-group 5
!
!
policy-map type network-qos mutation
class type network-qos mutation
!
set cos 5
mtu 1500
class type network-qos class-default
!
mtu 1500
multicast-optimize
!
!
!
class-map type qos match-all classify_traffic
match access-group name QOS
!
policy-map type qos classify_traffic
class type qos classify_traffic
set qos-group 5
!
!
!
interface (port-channel###)
service-policy type qos input classify_traffic
!
system qos
service-policy type network-qos mutation

505# show policy-map interface ethernet 2/1

Global statistics status : enabled

Ethernet2/1

Service-policy (queuing) input: default-in-policy
SNMP Policy Index: 301992145

Class-map (queuing): in-q1 (match-any)
queue-limit percent 50
bandwidth percent 80
queue dropped pkts : 0
queue dropped bytes : 0
queue transmit pkts: 0 queue transmit bytes: 0

Class-map (queuing): in-q-default (match-any)
queue-limit percent 50
bandwidth percent 20
queue dropped pkts : 0
queue dropped bytes : 0
queue transmit pkts: 0 queue transmit bytes: 0

Service-policy (queuing) output: default-out-policy
SNMP Policy Index: 301992154

Class-map (queuing): out-pq1 (match-any)
priority level 1
queue-limit percent 16
queue dropped pkts : 0
queue dropped bytes : 0
queue transmit pkts: 0 queue transmit bytes: 0

Class-map (queuing): out-q2 (match-any)
queue-limit percent 1
queue dropped pkts : 0
queue dropped bytes : 0
queue transmit pkts: 0 queue transmit bytes: 0

Class-map (queuing): out-q3 (match-any)
queue-limit percent 1
queue dropped pkts : 0
queue dropped bytes : 0
queue transmit pkts: 0 queue transmit bytes: 0

Class-map (queuing): out-q-default (match-any)
queue-limit percent 82
bandwidth remaining percent 25
queue dropped pkts : 0
queue dropped bytes : 0
queue transmit pkts: 0 queue transmit bytes: 0

dperezoquendo · ‎08-30-2017

So that QoS policy is just marking traffic, doesn't appear to be doing anything else. I assume the 5k's are acting as the leaf/access switches and the 7's are the distro/spine switches in your environment. If the 5k's are marking traffic, it is possible that there may be some sort of policer or shaper on the 7k's.

Considering the issue only lies within the communication between locations and is intermittent, I can only think of investigating QoS, ECMP, and the interconnects themselves at this time.

QoS - Verify if a policer/shaper exists and if drops appear to correlate with your slow performance. Clear counters and replicate issue to observe if drops increment.
ECMP - Ensure you're using per-flow/destination load-balancing, which is default, not per-packet as that can potentially cause issues. Verify max-paths command is applied to ospf process. I believe default is 4 and it appears you should have it set to probably 8
Interconnects - Verify interface outputs for any suspicious drops/error. Clear counters and replicate issue to observe if errors/drops increment. Would also try to see if the bandwidth between locations is sufficient with the current implementation. If there is no QoS implemented, or if it is very limited, maybe there may be a need to implement a more in-depth QoS configuration?

I also reread your initial post again and it sounds like the issue occurs on servers based off the 5548 switches not the 5696? Is this true? If so, this may cause for a bit of investigation as well.

Without knowing more of the configuration, I can't think of anything else. Someone better may need to assist in this discussion if interested. I'll still try to help the best I can though.

achristian1221 · ‎08-30-2017

Thanks for the follow up.

QOS: this was implemented based on a TAC case with Cisco. I had wanted to implement it based off VLANs as the VPC connections from the 5k switches to the 7k switches are just that, layer 2. The goal was to mark it as a COS of 5 which will have the highest priority on the 7K switches since it falls into the priority queue.

ECMP: load balancing by default is 4, max is 8. Between A and B, there is effectively only 2 per switch, with a total of 4 between the pair. From B to C, there are 14, or a total of 7 per switch. A show ip route does validate all paths are there as they should be. The per-flow/destination load-balancing is still in place as no other values have been changed on the configuration.

Interconnects: As we find them, we correct them. I have been onboard for about 5 months now and have taken it from 1.2 million discards to roughly 15k. Some of it was buffers. Some of it was a result of faulty hardware. Once verified, it is addressed. The main reason I moved most of the heavy hitters over to the 5696 switches were the buffers as well as the node count on those switches. I only had 6 servers (all 40 gig) connected to each pair of 5696 switches. As far as bandwidth being sufficient, I am only sitting at around 48% at any given point in time during the day on all links. I have never seen any of the links between A and B go over 60%. True, QOS is limited at this time and the only ones falling into it are the heavy hitters, Avid and Scality.

I also reread your initial post again and it sounds like the issue occurs on servers based off the 5548 switches not the 5696? Is this true? If so, this may cause for a bit of investigation as well.

^^^

Most of this is a true statement. When first implemented, the 5548 switches were apparently great. Over time, they oversubscribed the switches with a combination of 1 and 10 gigabit FEX, with some of them housing 5 10 gig fex. With some of the limitations of the 2232 FEX, devices have either been directly connected to module 1 or 2 on the 5548 or moved to a 5696. I will shortly be installing 5 pair of 5672 switches as well to further remove more of the 5548 switches.

Unfortunately, as I have said, I have been around here for about 5 months now. This is something that has degraded over years and the improvements I have already made are only a drop in the bucket. I have finally hit that point in changes that we are not really seeing any more improvements which is a major issue as we will be closing one team from location A and moving them to B and C. Once this starts, it will be 24 hour transfers until all data has been moved.

dperezoquendo · ‎08-31-2017

Hi Christian,

Great reply, I appreciate it. Unfortunately I don't think I'll be able help to you. Only other thing I can recommend would be to conduct some analysis via wireshark during times of slow performance, if you haven't done so yet.