Cisco 3650 packet loss with Trunk port but not Access port

Dan H
Level 1

Hello!

Not sure if anyone has experienced a similar issue. We have 10 stacks of Cisco WS-C3650-24TD switches throughout our office, and each stack has 2 x 3650 switches. The uplink of each 3650 stack connects to a pair of Cisco 4500X (running VSS) with 2 x 10Gbps LACP. The downlinks of each 3650 stack connect to multiple Aerohive SRS2148P switches with 4 x 1Gbps EtherChannel; the same configuration applies to all 10 stacks of Cisco 3650 switches. Basically, the Cisco 3650 stacks run as distribution layer switches, the Cisco 4500X is the core layer, and the Aerohive switches are the access layer.

Both the uplinks and downlinks of the Cisco 3650 switches run trunks over the EtherChannels, so the Cisco 3650 is used purely as a layer 2 switch, no routing at all, basically switching traffic between the 4500X core and the Aerohive access switches across multiple VLANs.
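Each downlink EtherChannel to an Aerohive is a trunk along these lines (the port-channel number here is just an example, and the native VLAN matches the per-port config shown further below):

interface Port-channel201
 description downlink to Aerohive (example numbering)
 switchport trunk native vlan 3
 switchport mode trunk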

Network topology:

(Cisco 4500X VSS)--2x10Gbps--(Cisco 3650 stack)--4x1Gbps--(Aerohive switches)

This year we discovered a very high number of output packet drops on the Cisco 3650s, on all the interfaces connecting to the Aerohive switches, and no packet loss on the 10Gbps interfaces towards the Cisco 4500X. Initially we thought it might be the overkill QoS design (1P+7Q), so we increased the buffer to 1200, which didn't help; we then removed QoS from all the interfaces, but this didn't resolve the issue either. Long story short, in the end we discovered the problem occurs when the interface connecting to the Aerohive switch is configured as a trunk.
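For reference, the buffer increase we tried was the global soft-max multiplier (command quoted from memory, so treat it as approximate):

qos queue-softmax-multiplier 1200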

In our test, we set up a lab Aerohive switch and removed the EtherChannel configuration between the Cisco 3650 and the Aerohive switch, testing with just a single 1Gbps RJ45 cable. The results are:

0% packet loss if the link is configured as a standard access port.

interface GigabitEthernet1/0/23
switchport access vlan 7
switchport mode access
spanning-tree portfast
spanning-tree bpdufilter enable

 

A maximum of 0.28% packet loss if the link is configured as a trunk port and the network traffic (VLAN 7) runs over a tagged VLAN.

interface GigabitEthernet1/0/24
switchport trunk native vlan 3
switchport mode trunk

 

Approx. 0.02% - 0.03% packet loss if the link is configured as a trunk port and the network traffic (VLAN 7) runs over the untagged (native) VLAN.

interface GigabitEthernet1/0/24
switchport trunk native vlan 7
switchport mode trunk
end

However, if we replace the 1Gbps link between the Cisco 3650 and the Aerohive with a 10Gbps link, we get 0% packet loss!

Configuration of the 2x10Gbps uplink to the Cisco 4500X on the Cisco 3650:

interface Port-channel111
description uplink to 4500X
switchport trunk allowed vlan 1,3,7,66,68,69,116,703,707,714,991,996
switchport mode trunk
switchport nonegotiate
spanning-tree portfast trunk

The above tests were done using 2 x HP servers with 1Gbps NICs running Ubuntu Server, using iperf3 with UDP and packet sizes of 1400 bytes and below. One HP server runs as the iperf3 server and connects to the Aerohive lab switch (it is the only device on that switch); the other HP server runs as the iperf3 client and connects to another stack of Cisco 3850 (not mentioned above) behind the Cisco 4500X core switch. We are confident there is no issue on the Cisco 4500X or the Cisco 3850 stack; the root cause is between the Cisco 3650 and the Aerohive, specifically the Cisco 3650 output towards the Aerohive.

Topology during the test:

(HP server iperf3 client)--1Gbps--(Cisco 3850 stack)--3x10Gbps--(Cisco 4500X VSS)--2x10Gbps--(Cisco 3650 stack)--1Gbps--(Aerohive LAB switch)--1Gbps--(HP server iperf3 server)
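The iperf3 commands were roughly as follows (flags quoted from memory; the target bandwidth and duration varied between runs, and <iperf3-server-ip> is just a placeholder):

On the HP server behind the Aerohive lab switch:
iperf3 -s

On the HP server behind the Cisco 3850 stack:
iperf3 -c <iperf3-server-ip> -u -b 1G -l 1400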

Lastly, we have also tried upgrading the firmware; none of the following firmware versions on the Cisco 3650 resolves the issue:

03.06.06E

16.03.07

16.09.08

16.12.08

Has anyone experienced the same or a similar issue? Any suggestions? Please help!

Thank you in advance.

 

1 Accepted Solution


I've used iperf a little, but for most of my bandwidth testing I've found PCATTCP, using UDP packets, sufficient for my needs. I.e., I'm not an iperf expert.

Anyway, back to your observed results: your iperf source is generating a gig's worth of traffic while attached to a gig access port? Or are you telling iperf to generate a gig's worth of traffic on a gig or 10g access port? And if so, this traffic, when it crosses a gig trunk port, loses a small percentage of packets? Also, you're using an MTU of 1400?

If so, "But it's hard to believe Cisco 3650 can't handle additional standard trunk port header and trunk traffic." that, yes, it's hard to believe because it's unlikely.  Again, realize an access port has more "payload" bandwidth than a trunk port (no trunk overhead protocols, no VLAN tagging overhead).  Tagging VLAN overhead, as a percentage, increases as frame size decreases.  (I.e. you might try same tests with minimum size packets and see if drop rates increases (this would create tagged frames with the highest percentage of overhead).  And/or try 1500 byte frame sizes and see if drop rates decrease a [very] small amount.

Hmm, a .1Q tag adds 4 bytes, so with your 1400-byte packets a standard frame would be 1418 bytes and a tagged frame 1422 bytes, a difference of 0.282%. Does that percentage look familiar?
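Rough numbers for a few packet sizes (back-of-the-envelope only, assuming an 18-byte Ethernet header + FCS and a 4-byte 802.1Q tag):

1400-byte IP packet: 1418 untagged vs. 1422 tagged, 4 / 1418 = ~0.28% extra
1500-byte IP packet: 1518 untagged vs. 1522 tagged, 4 / 1518 = ~0.26% extra
64-byte minimum frame: 64 untagged vs. 68 tagged, 4 / 64 = ~6.25% extra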

If you're in the situation of sending more "payload" than a trunk link physically has bandwidth for, eventually packets will be dropped. The size of the buffers only determines how long before drops occur, and/or perhaps avoids them for short-duration bandwidth oversubscription bursts.

Miercom has a test report for the 3650/3850: https://miercom.com/pdf/reports/161201G.pdf, but this doesn't address port performance.

Miercom also has another test report for just the 3850: https://miercom.com/pdf/reports/20150225.pdf, which does address port performance. Unknown whether the same would apply to the 3650, but there's a good chance it would (as I suspect a 3650 is a reduced 3850).


16 Replies

Leo Laohoo
Hall of Fame

Post the complete output of the command "sh interface Gi1/0/24".

Hi Leo,

Following is the CLI output of "show interface g1/0/24". I have physically disconnected the Aerohive lab switch.

 

GigabitEthernet1/0/24 is down, line protocol is down (notconnect)
Hardware is Gigabit Ethernet, address is 0078.8875.af98 (bia 0078.8875.af98)
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Auto-duplex, Auto-speed, media type is 10/100/1000BaseTX
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output 22:28:04, output hang never
Last clearing of "show interface" counters 1d00h
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 905829371
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
7120757 packets input, 10042794252 bytes, 0 no buffer
Received 343 broadcasts (293 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 293 multicast, 0 pause input
0 input packets with dribble condition detected
48276556 packets output, 68344913669 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

 

Thanks,

Dan

 

 


@Dan H wrote:
Total output drops: 905829371

Wow.  That is eye-popping "not right".  

Unlike the previous generation, IOS-XE has QoS and it works well if it is left alone.

All that network traffic was generated by me using iperf within a couple of hours of testing.

We get drops every day. The following output is from one of the 3650 stacks, probably the one with the highest network traffic. The counters were cleared about 6 weeks ago, and the office hardly had any users during the Christmas and New Year period:

show interfaces | in drops
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 1/75/56351/0 (size/max/drops/flushes); Total output drops: 0
2676146092 unknown protocol drops
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 1319137941
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 4268856987
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 6061782
18 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 6305150
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 4026183115
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 158288
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 108062
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 824009326
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 38953609
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 542865723
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 4181920
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 6511917
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 568490825
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 9736859
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 41275676
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 618895951
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 707730
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 5262136
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 27668866
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 566904306
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 649894
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 2421069
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
4398 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 1650624453
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 1934249454
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 90686516
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 18584410
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 633550215
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 34751312
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 256410958
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
0 unknown protocol drops

 

Thanks,

Dan

What is the outcome if you remove the 3650 from the path and test?

BB

***** Rate All Helpful Responses *****

How to Ask The Cisco Community for Help

By removing the Cisco 3650, do you mean connecting the Aerohive directly to the Cisco 4500X core switches?

yes -

(HP server iperf3 client)--1Gbps--(Cisco 3850 stack)--3x10Gbps--(Cisco 4500X VSS)--1Gbps--(Aerohive LAB switch)--1Gbps--(HP server iperf3 server)

 

BB

***** Rate All Helpful Responses *****

How to Ask The Cisco Community for Help

We've never tried that, and it will need up to a week to set up, because the Cisco 4500X is managed by our vendor. I won't be able to try it until the end of January, as I am going on leave next week.

Check the MTU size on both ends. The VLAN tag adds more overhead to the frame, and you are using different switches on each end, so check the MTU value on both and try to match them. Hope this solves the issue.
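On the Cisco side, the configured system MTU can be checked with something like:

show system mtu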

Hi,

Thank you for your input. I also thought it could be MTU related, so I triple-checked the MTU setting on both the Cisco 3650 and the Aerohive switches; they are all running the default MTU of 1500. I also tried changing it from 1500 to 9000 on both ends, but it made no difference.

Another thing is, as mentioned above, there is no packet loss if I replace the 1Gbps connection with a 10Gbps connection, and the 10Gbps connection was running the default MTU of 1500. Lastly, my tests used packet sizes of 1400 and below, so they were not hitting the maximum MTU. So I concluded the issue is not related to MTU.

 

Thanks,

Dan


Joseph W. Doherty
Hall of Fame

Possibly your iperf test results are due to tagged frames using extra bandwidth, and a trunk using bandwidth (e.g., DTP, BPDUs, etc.) not used on an access port (i.e., even a trunk's native/untagged VLAN has a tad less bandwidth available than an access port).

My prior post addresses, possibly, why you got the iperf test results you did.

As to the operational issue, your numerous downlink port drops, that's most likely just due to oversubscription of your 3650's downlinks. Considering that upstream, as you note, you are using 10g links, this is "classical" large-to-small congestion.

Can the operational issue be mitigated? That's an "it depends" answer.

First, assuming I'm correct that this is basically due to bandwidth oversubscription, you would need to determine whether it is sustained oversubscription or burst oversubscription. Some mitigation approaches are more relevant to one form of oversubscription than the other.
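One low-tech way to get a feel for which it is: shorten the interface load interval on a busy downlink and compare the reported rates against the drop counters over time, e.g. something like:

interface GigabitEthernet1/0/24
 load-interval 30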

Second, mitigation could be much limited by the features and architecture of the 3650. Unfortunately for this discussion, I have no direct hands-on experience with the 3650 or 3850, so I'm not very familiar with what you can do on one versus, for example, the prior 3560/3750 series.

That said, I believe (?) the 3650/3850 series shares some traits with the earlier series, such as much smaller port buffering resources compared to, for example, your 4500X.

I think they also might, somewhat, share a similar default buffer allocation scheme with the 3560/3750 series, i.e., buffers tied/reserved to ports and egress queues.

I will say, I had a case of a 3750G having lots of drops per second on two of its ports, which had SAN appliances on them. By changing buffer allocations, I was able to reduce those two ports to just a few drops per day without any impairment to the other active ports. (NB: your mileage, err drops, may vary.)

I mention the foregoing (the 3750G case) because, sometimes, drop mitigation can be very, very effective without imposing additional hardware costs. However, not truly knowing what's happening in your environment, I cannot say what, if anything, would mitigate your drops, but they very likely can be mitigated, although possibly not inexpensively. (The Catalyst 3K switches are oriented more toward being edge switches than distribution switches.)
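For example, on the 3650/3850 QoS CLI I understand the per-queue buffer allocation can be adjusted in an egress policy-map, along these lines (a sketch only; "BUFFER-TUNE" is a made-up name, and I have no 3650 on hand to verify the exact syntax or sensible values):

policy-map BUFFER-TUNE
 class class-default
  queue-buffers ratio 100
!
interface GigabitEthernet1/0/24
 service-policy output BUFFER-TUNE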

Hi,

Thank you for your detailed sharing. Would like to share a more information with the iperf3 testing I did.

Not sure if you have ever used iperf3 for testing, but iperf3 by default runs 10 one-second test intervals (what I call "tests" below) unless you specify a longer run. Anyway.

 

As mentioned above, when the link was configured as a trunk and the network traffic (VLAN 7) was on a tagged VLAN, I was getting a maximum of 0.28% packet loss. However, the packet loss started with the 4th test, and the 4th test always had less than 0.28% packet loss; from the 4th test onwards, all the following tests saw 0.26% to 0.28% packet loss until the second-to-last test, and the last test was usually 0% packet loss.

 

When the network traffic (VLAN 7) was running over the untagged (native) VLAN, I was getting approx. 0.02% - 0.03% packet loss. The packet loss started just before the 30th test, or sometimes just after the 40th test; however, once it started, the packet loss continued until the second-to-last test, and, same as above, the last test was usually 0% packet loss.

 

It does feel buffer related, but it's hard to believe a switch like the Cisco 3650 (which I personally see as a powerful switch) can't handle the additional trunk traffic.

Rather than accepting that the issue is caused by buffering, I would prefer to rule out other possible root causes, and I would like to know whether any other Cisco 3650 owners are having a similar issue with their switches. I believe many users must have trunk ports configured on their Cisco 3650s, hence this post.

 

Thanks,

Dan