Solved: Re: Extremely slow WiFi Performance over some L3 routed links

Jan Gilhooley · ‎12-19-2019

Hi,

This is a long one - so bear with me....

This issue has been bugging me for a while now. We have a pair of Cisco 5520 Wifi controllers in an SSO configuration (just upgraded both to 8.5.151.0) and approx 300 APs (mainly a mixture of 3702s & 3802s) over a number of sites, running Centrally Switched SSIDs (i.e. we tunnel everything back to our main site). Our main campus network is L2 switched, and we have some remote sites over L3 routed links. Wifi works normally over all these links. The problem is a site belonging to another organisation that is over an internal L3 routed network - at this site the APs get around 0.2mb/s download and 6mb/s upload! Far too slow to be usable. Wired connections are absolutely fine, as is using FlexConnect to break out traffic locally (there are reasons why this isn't the "proper" solution). It's just the WiFi clients that experience this slow performance.

For troubleshooting I've created a test switch connected straight to our Cisco 6800 Distribution Switch with an L3 routed link over Cat5 cable and a similar config to the "issue" site - and when I connect an AP to the test switch I get the same poor performance - so it's definitely something in our network that is doing this.

So far I have considered:

1) Wifi Radio Interference - (everyone says this first!) unlikely given that the FlexConnect works normally on the APs and my test AP works normally when in its normal switch and has slow performance when connected to the test switch located in the same room.

2) Routing - can't see any evidence of asymmetrical routing or other routing "issues". Traceroutes from APs/PCs/Switches all go where I would expect them to go.

3) QoS - we do need to sort out QoS across the Edge/Distribution/Core - but I can't see any packet drops anywhere that would account for the poor performance. And network bandwidth is fine so QoS normally wouldn't get involved. The rates we get do kind of look like a rate limit of some description is being applied - but where? And why would it just be L3 routed?

4) Packet Size - the MTU is set to the standard of 1500 bytes on the switches, and the Path-MTU on the APs is 1485, but from packet captures it looks like the "Do not Fragment" bit has been set. From packet captures of working APs I can see that there is some fragmentation going on - one packet has the full payload, and the fragment is around 62 bytes (which is mostly header as far as I can see). I'm aware that there is the CAPWAP tunnel overhead, but the AP should be able to negotiate with the WLC for the maximum MTU. It's almost like "something" is adding a few bytes to the packets somewhere between AP and WLC that messes up the negotiated maximum MTU. I personally still think this is the most likely cause somehow.

5) Fast Packet Switching not happening - the Cisco 6800 (and the Cisco 3850s the APs are connected to) all support Fast Packet Switching - so if the packets from the APs aren't being processed at a port level but sent to the Switch's CPU instead we might have slow performance there. However I can't see that this is happening, and if it is then why just routed packets?
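
To illustrate the packet-size point (4) above, here is a minimal Python sketch of plain IPv4 fragmentation arithmetic. It shows why a datagram only slightly larger than the path MTU splits into one near-full fragment plus one very small one. The function name and the 1500-byte example size are illustrative assumptions; CAPWAP can also fragment at its own layer, which this sketch does not model, so the numbers are not expected to match the captures exactly.

```python
def fragment_sizes(total_len, mtu, ip_header=20):
    """Return the on-the-wire size of each IPv4 fragment for one datagram.

    total_len includes the original IP header; ip_header assumes no IP options.
    """
    if total_len <= mtu:
        return [total_len]  # fits, no fragmentation needed
    payload = total_len - ip_header
    # All fragments except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - ip_header) // 8 * 8
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(chunk + ip_header)
        payload -= chunk
    return sizes


# Hypothetical example: a 1500-byte datagram crossing a 1485-byte path MTU.
print(fragment_sizes(1500, 1485))  # [1484, 36]
```

Note that if the "Do not Fragment" bit is set, a router will not split the packet like this; it drops it and returns an ICMP "fragmentation needed" message instead.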

I've been going slowly crazy looking into this one - and what is frustrating is nothing I have tried has had any effect, good or bad, on the Wifi performance rates. So I'm still none the wiser as to what is going on, and what the root cause might actually be.

Any clues for things to go and look at would be very gratefully received! I have the sense I might have overlooked something really obvious here.

Thanks

Jan

vb10 · ‎12-20-2019

Hi Jan,

Thanks for the update. Actually, the main place where "no ip redirects" should be configured is the 6800 core switch, on the interface which has IP address 10.180.10.254 (VLAN200, I believe).

Sorry, I was probably not clear enough in explaining the scenario. After your comments I understand the situation better. So, as I understand it, the AP with IP 192.168.50.10 works fine when connected to sw-man, right? The problem only occurs when the AP is connected to sw-test.

I believe that the following is happening:

A) Traffic flow between AP and WLC in the working scenario, when the AP is connected to sw-man and has IP 192.168.50.10.

1. The AP sends a packet to the WLC via its default gateway. The gateway for this VLAN 50 (or 55) is the 6800 VSS switch, VLAN50 (or 55), it doesn't matter which.

2. The 6800 switch receives this packet on VLAN50 and sends it to the WLC, 10.180.10.250, in VLAN200.

3. Return traffic goes the same way. The 6800 switch, as default gateway, receives the packet from the WLC in VLAN200 and forwards it to the AP 192.168.50.10 in VLAN50.

4. Everything is working fine here

B) Traffic flow with bad performance, when the AP is connected to switch sw-test and has IP 192.168.2.10.

1. The AP sends a packet to its default gateway, which is sw-test. The packet then goes to the sw-man switch, and from an L3 perspective goes directly to the WLC, because sw-man has an interface in this subnet, 10.180.10.0/24.

2. From an L2 perspective, this traffic is switched through the 6800 VSS. This is also fine and doesn't cause any issues.

3. The issue happens when the WLC sends return traffic to the AP. The WLC sends the packet to its default gateway, which is the 6800 switch, IP 10.180.10.254, VLAN200. The 6800 needs to send it on to the AP 192.168.2.10. The 6800 VSS doesn't have an interface in this subnet; instead, this route is most likely available via sw-man, 10.180.10.10, which is also in VLAN200. So the 6800 switch receives the packet from the WLC in VLAN200 and sends it out via VLAN200. That's when the ICMP redirect is generated and software switching happens.
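
The two flows above can be sketched in Python as a longest-prefix-match lookup from the 6800's point of view, flagging the case where a packet enters and leaves on the same L3 interface. This is only a simplified model: the /24 masks for the AP subnets and the route for 192.168.2.0/24 via 10.180.10.10 are assumptions (the route is only "most likely" there, as noted above), and the function names are made up for illustration.

```python
import ipaddress

# L3 interfaces on the 6800, using the redacted addressing from the thread.
INTERFACES = {
    "Vlan200": ipaddress.ip_network("10.180.10.0/24"),
    "Vlan50": ipaddress.ip_network("192.168.50.0/24"),  # mask is an assumption
}

# Non-connected routes as (prefix, next hop). Assumed, not taken from real output.
ROUTES = [
    (ipaddress.ip_network("192.168.2.0/24"), ipaddress.ip_address("10.180.10.10")),
]


def connected_interface(addr):
    """Return the interface whose connected subnet contains addr, or None."""
    for name, net in INTERFACES.items():
        if addr in net:
            return name
    return None


def egress_interface(dst):
    """Longest-prefix match over connected subnets and routes."""
    dst = ipaddress.ip_address(dst)
    best = None  # (prefix length, interface name)
    for name, net in INTERFACES.items():
        if dst in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, name)
    for prefix, next_hop in ROUTES:
        if dst in prefix and (best is None or prefix.prefixlen > best[0]):
            best = (prefix.prefixlen, connected_interface(next_hop))
    return best[1] if best else None


def same_interface_forwarding(src, dst):
    """True if a packet from a directly attached src leaves where it came in."""
    ingress = connected_interface(ipaddress.ip_address(src))
    return ingress is not None and ingress == egress_interface(dst)


WLC = "10.180.10.250"
print(same_interface_forwarding(WLC, "192.168.50.10"))  # scenario A: False
print(same_interface_forwarding(WLC, "192.168.2.10"))   # scenario B: True
```

In scenario B the return traffic from the WLC comes in on Vlan200 and goes back out on Vlan200 towards sw-man, which is the condition that triggers ICMP redirects.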

There are a couple more pieces of evidence:

1. Software switching happens in the direction from WLC to AP, meaning the "download" direction from the WiFi client's point of view. That's why you saw download speed lower than upload.

2. When you moved the subnet directly to the 6800 VSS, you avoided same-interface forwarding, which is why it started to work.

I'm pretty sure that if you configure "no ip redirects" on the VSS 6800 VLAN200 interface (IP 10.180.10.254), it will start to work fine, even with the previous design.

Could you please confirm?

Resources (though you can find a lot of others on the Cisco site)

Regarding ICMP redirect:

https://www.cisco.com/c/en/us/support/docs/ip/routing-information-protocol-rip/13714-43.html

Regarding CoPP:

https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/15-1SY/config_guide/sup2T/15_1_sy_swcg_2T/control_plane_policing_copp.html


vb10 · ‎12-19-2019

Hello,

If the problem is seen only over L3 links, then it really might be related to software switching or MTU issues.

1. One of the most common causes of software switching is the same-interface-forwarding scenario, where traffic arrives on an L3 interface and needs to be forwarded out of that same L3 interface. You can check that the following addresses are reachable from the 6800 via different L3 interfaces (and also that traffic from those addresses arrives on the appropriate interfaces):

a) AP address, connected to your test switch

b) Wi-Fi client address, connected to the test AP

c) IP address of the WLC

d) IP address of some host in your network (the one you probably used to measure performance)

2. There might be other causes of software switching (IP options, TCAM utilization, unsupported features on interfaces). You can check the traffic rate to the CPU with the "show ibc" command. Usually, it should be quite low. Also, perform a netdr capture on the 6800, which will show you exactly what traffic goes to the CPU. If your CAPWAP traffic is there, the reason needs to be investigated.

Generate some traffic to/from the Wi-Fi client, and during that time:

Start the capture:

debug netdr capture rx

The capture will start and collect 4096 packets from the CPU into a special buffer.

Display the buffer:

show netdr capture

You will see packet information (MACs, IPs, input interface):

https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/116475-technote-product-00.html

3. Perform a SPAN on the test switch, on the AP-connected port. Do you see any fragments there? Then you can also perform a SPAN on the 6800, on the WLC-connected port, and compare the results.

Jan Gilhooley · ‎12-19-2019

Hi,
Thanks for that - I shall have a look at that in more detail - it might help this discussion if I upload a diagram of what is connected to what (with suitable redaction of course!). One point I've noticed is when you say "show ibc" and it's "usually low" - when I've just run the command on our 6800 I get these figures (which seem high to me):
Interface information:
Interface IBC0/0/0(idb 0x53B061E8)
5 minute rx rate 3218000 bits/sec 1275 packets/sec
5 minute tx rate 3255000 bits/sec 2920 packets/sec
95482663501 Packets input, 38160583111013 bytes
68373151656 broadcasts received
219935703685 Packets outputs, 30768899458297 bytes
2373 broadcasts sent
7 Inband input packet drops
0 Inband output packet drops
2 IBC resets

I've not seen these before - would this be classed as "high" then?

Jan

vb10 · ‎12-19-2019

Hi Jan,

The exact number depends heavily on the particular network baseline: how many hosts are attached, what protocols are running, etc. These numbers show everything that goes to the CPU (ARPs, pings, routing protocols, etc.).

I would say 1275 and 2920 packets/sec are not very high values if the network is big.

5 minute rx rate 3218000 bits/sec 1275 packets/sec
5 minute tx rate 3255000 bits/sec 2920 packets/sec

A high value would be something above ~5000 packets/sec.

Still, there are several features which can limit this traffic to the CPU (dropping excessive traffic), mainly CoPP (control-plane policing) and rate limiting (it's not QoS, it's a 6k platform feature which limits control-plane traffic). In that case, traffic to the CPU will not look very high in this output, but if some traffic needs to be punted to the CPU, it might still be dropped by those features.

The best thing is to perform a netdr capture during a Wi-Fi test to see if there are any packets related to CAPWAP.

Yes, a diagram could help, if it shows how L3 segmentation and routing are done. Also, if you can, please send the "show ip route <IP>" output for the IPs I mentioned.
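
As a rough way to apply this rule of thumb, the Python sketch below pulls the 5-minute rates out of the "show ibc" output posted earlier in the thread and compares them with the ~5000 packets/sec figure. The regex, the function names and the threshold constant are illustrative assumptions; the threshold is only the ballpark mentioned here, not a documented platform limit.

```python
import re

# Two lines copied from the "show ibc" output posted in this thread.
SHOW_IBC = """\
5 minute rx rate 3218000 bits/sec 1275 packets/sec
5 minute tx rate 3255000 bits/sec 2920 packets/sec
"""

RATE_RE = re.compile(r"5 minute (rx|tx) rate (\d+) bits/sec (\d+) packets/sec")

HIGH_PPS = 5000  # ballpark from this thread, not a Cisco-documented limit


def parse_ibc_rates(text):
    """Return {'rx': {'bps': ..., 'pps': ...}, 'tx': {...}} from show ibc text."""
    rates = {}
    for direction, bps, pps in RATE_RE.findall(text):
        rates[direction] = {"bps": int(bps), "pps": int(pps)}
    return rates


def looks_high(rates, threshold=HIGH_PPS):
    """Flag each direction whose packet rate exceeds the threshold."""
    return {direction: r["pps"] > threshold for direction, r in rates.items()}


rates = parse_ibc_rates(SHOW_IBC)
print(rates)
print(looks_high(rates))  # {'rx': False, 'tx': False}
```

A low rate here does not rule out punted CAPWAP traffic, since CoPP or rate limiters may be dropping it before it is counted.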

Jan Gilhooley · ‎12-19-2019

It's a redacted diagram I'm afraid - I've had to change all the IP addresses and such as a "just in case" :-) But it will give an indication of the connections.

Edit: I shall run the various checks tomorrow and post the results.

patoberli · ‎12-19-2019

I'd first test lower MTU or actually MSS for the APs.

Please see this article on how to do this:

https://mrncciew.com/2013/04/07/configuring-tcp-mss/

I'd lower it by 72 for a start.

Jan Gilhooley · ‎12-20-2019

@patoberli: I had seen that article (one of my many, many google searches on this subject!). I have tried various adjustments (everything from 576 to 1363) of the MSS on the affected APs with no discernible effect. I have also tried to lower the actual MTU on Windows wifi clients - but again with no discernible effect (and one of the main clients is iPads - and I know of no way to manipulate the MTU on those :-) ).

patoberli · ‎12-20-2019

Ok, then it's some other issue. I hope the others can find the cause.

Jan Gilhooley · ‎12-20-2019

Thank you for your input.

Leo Laohoo · ‎12-19-2019

Wait ... The switches are 3850?
What firmware are they running on?
Post the complete output to the command "sh interface <AP port> controll".
NOTE: Put the output in a Notepad and attach it.

Jan Gilhooley · ‎12-20-2019

@Leo Laohoo: The switches are Cisco 3850 (and the 6807) - firmware details are:

Sw-Test	WS-C3850-12X48U-S	16.9.1
Cab-Man	WS-C3850-12X48U-S	16.9.1
CORE-VSS	C6807-XL	15.2(1)SY0a

See attached for the output of sho int gi1/0/2 controller - note that it's showing as "down" as I've moved it back to its correct location for now. Also I'm going to post an update - but moving the routed link between sw-test and sw-man to core-vss instead has corrected the problem. The question is now of course "why".....

Leo Laohoo · ‎12-20-2019

@Jan Gilhooley wrote:

GigabitEthernet1/0/2 is down, line protocol is down (notconnect)

Not helpful.
I want to know the following:
1. Is QoS configured correctly or not - QoS is MANDATORY with 3650/3850. Without QoS, there will be output drops.
2. 16.9.1 - Not a stable OS. Try 16.9.4.

vb10 · ‎12-19-2019

Hello,

Thanks for the diagram. Can you clarify a couple of questions:

1. What is the mask of network 10.180.10.x? Is it /24?

2. What device is the default gateway (or next hop) for the WLC to reach the AP subnet? Is it the 6800 switch, 10.180.10.254?

3. What is the default gateway for the Access Point (192.168.50.10)? Is it the 3850 sw-man switch?

4. How is the AP network 192.168.50.10 reachable from the 6800 switch? Via the 3850 sw-man switch, 10.180.10.10?

If the answers to those 4 questions are as I wrote above, then I see the problem I mentioned at the very beginning, related to same-interface forwarding. In particular, traffic in the direction from the WLC towards the AP can be received and sent over the same interface, 10.180.10.254, on the 6800 switch. This causes software switching due to the ICMP redirect mechanism (if it's not disabled). When the switch receives a packet on an L3 interface and forwards it to the next hop via the same interface, it needs to generate an ICMP redirect message. In order to do this, the packet is punted to the CPU, causing software switching. Apart from the fact that software switching is much slower, some traffic to the CPU might be dropped by CoPP or rate limiters, causing the issue.

You can try to disable icmp redirects on 6800 interface with IP 10.180.10.254.

conf t

interface vlan<x>

no ip redirects

It's also better to disable it on the 3850 as well, on the same interface.

However, if I misread the diagram, then it might be some other cause, and we need to investigate further. Please confirm.

Jan Gilhooley · ‎12-20-2019

@vb10 - thanks for these updates. I feel we are closer to understanding the situation than we were before. I had another go at moving the AP from its usual location on sw-man to sw-test, but adding the "no ip redirects" command to the VLAN100 interface on sw-test, interface gi1/0/1 on sw-test and gi/0/35 on sw-man. Testing showed exactly the same performance as before. So either it wasn't icmp redirects, or I'm missing something, somewhere (more on that below).

Answers to your questions:

1. What is the mask of network 10.180.10.x? Is it /24? Correct - it's /24

2. What device is default gateway (or next-hop) for WLC to reach AP subnet? Is it 6800 switch 10.180.10.254? Correct - the 6800 switch has the VLAN IP Gateways for this VLAN (which is actually the management VLAN for all switches on the main site).

3. What is default gateway for Access Point(192.168.50.10)? Is it 3850 sw-man switch? No - when the AP is connected to sw-man it's on a dedicated AP VLAN, whose default gateway is also on the 6800 switch.

4. How AP network 192.168.50.10 is reachable from 6800 switch? Via 3850 sw-man switch 10.180.10.10? I think so - when the AP is on sw-man it's a member of a VLAN for all APs.

I've tried to understand the issue with ICMP redirects - I understand it's an L3 concept, and on my diagram only sw-test & sw-man are running routing protocols. The connections between sw-man, core-vss (the 6800) and the wlc are all L2 trunk links. So when I think about packets flowing from the AP to the WLC I can't see which interface packets go in and out of. Unless of course we are looking at the VLAN IP Interface. sw-man has an IP Interface for VL100, and there is the management VLAN200 on sw-man and sw-vss that has IP interfaces (10.180.10.10 & 10.180.10.254 respectively on my diagram).

One piece of information - we actually want the links to terminate directly on core-vss (6800) but for various reasons the live link terminated on cab-man, hence replicating the test on there as well. I've just moved the test link to terminate directly on core-vss and it works perfectly! All upload and download speeds are what we would expect. So the issue is "something" introduced with the cab-man switch. It would help to get to the bottom of this as I suspect this isn't the only place we are experiencing slow performance.

Going back to the icmp redirects - is there a good resource that explains how this works? And it does link to the cpp-policy on various switches I think. One of our third party companies was of the opinion it was a policy rate limiting the packets somewhere - but they never found where that "somewhere" was! Is there a way of detecting any evidence of drops (or rate limits) on the 3850s?

Thank you.

Jan
