Re: Can the WLC capture be trusted?

patoberli · ‎12-01-2022

Hi All

I'm currently debugging an EAP-TLS issue between some iPads, a WLC 8540 running 8.10.181.3, and 2800 APs.
It works with 2700 AP models on this WLC, but nearly always fails with 2800 APs.

I've now captured the RADIUS communication between the WLC and ISE, and it seems that either the WLC doesn't receive all (fragmented) packets or the capture on the WLC is not trustworthy. The whole setup worked fine with 8.5.140.0. Disabling 802.11r (Fast Transition) didn't help. ISE shows a lot of RADIUS communication until the client starts a new session.

I've used this information for the capture: http://wifinigel.blogspot.com/2014/08/cisco-wlc-per-client-packet-capture.html

On the left side is the capture from the WLC; on the right side is the capture from the firewall interface between the WLC and ASA. Please note that I didn't take the captures at exactly the same time, but the symptom is always the same. The ASA interface and the WLC management interface are on the same VLAN/segment. As you can see, the third fragmented RADIUS packet is either not completely captured by the WLC capture function or is indeed lost.

Have you seen something like this already?

I haven't yet had the opportunity to capture on the switch the WLC is attached to.

Thanks
Patrick

Leo Laohoo · ‎12-01-2022

There are several known (and private) bugs affecting 2800/3800/4800/1560/6000. It has something to do with the MARVELL chipset and how it would "blackhole", if not delay, random packets.

Use a different AP, like a 2700/3700, and everything works. Use the above-mentioned APs and things get really, really weird.

patoberli · ‎12-01-2022

Hi Leo

Thanks for your reply. Do you happen to have one or two Bug IDs?

I just realized, after reading the captures again, it's the traffic from WLC to ISE that is missing in the WLC capture, but is correctly arriving on the ISE. So I guess it's indeed a capture failure that I have found and not the actual issue the clients are facing. Damn, the search continues.

Leo Laohoo · ‎12-01-2022

No, I do not have those Bug IDs handy with me. I will go look for them and update this thread.

We have concluded a case with TAC in which RTP traffic gets blackholed by the 4800. We have tried 8.10MR6 and 8.10MR7 and the issue is still there. There is no fix other than upgrading to 17.X.X. We spent about 7 months and several hundred hours of OTAP capturing.

Leo Laohoo · ‎12-01-2022

List of Bug IDs (public & private) affecting exclusively 2800/3800/4800/1562

CSCvw86217, CSCwe74653, CSCwd91054, CSCvq40071 (CSCvt04753, CSCvs29318, CSCvt94652, CSCvt11851), CSCvm51356, CSCwe55390, CSCve57121, CSCvq90572, CSCvd64819, CSCvu61194, CSCwe89429, CSCvc67005, CSCwd37092, CSCvz66623, CSCwd46815, CSCvt3781, CSCvs25798, CSCvz08781, CSCvp36540/CSCvp57188/CSCvo75757, CSCwf69575, CSCwd41463, CSCwa73245 (CSCvm60915, CSCvn66715, CSCvp72309, CSCvt22353, CSCvt37815, CSCvw86217, CSCvy03507, CSCwa30802, CSCvu81597, CSCvu94488, CSCvv78719, CSCvv86336, CSCvv97317, CSCvw47752), CSCvz05686, CSCwh03842, CSCwi21214, CSCwi96089, CSCwh74663, CSCwj04146, CSCwj54973, CSCwj89538, CSCwj74832, CSCwk55224, CSCwp05354 and more.

FN - 74035 - Cisco Access Points May Not Detect Radar on the Required Levels After Channel Availability Check Time

NOTE:

This list does not include software-induced AP crashes, like FIQ/NMI, because they are "a-dime-a-dozen".

ammahend · ‎12-01-2022

Can you share the output of show ap tcp-mss-adjust <2800 AP where it fails>, and also from a 2700 AP where it works? An endpoint debug from ISE would also provide some additional insight, if you can share it.

-hope this helps-

patoberli · ‎12-01-2022

We have already set the MSS globally to 1250; before, on 8.5.140.0, the option was disabled. I verified this with the show command.

I sadly can't share the capture publicly, but I have made some screenshots from the ISE for one of the affected clients. See here:

ammahend · ‎12-02-2022

I am wondering why you have fragmentation then. With an MSS of 1250 plus normal network and CAPWAP overhead, packets should not exceed a 1500 MTU, unless there is an ISP in between that supports only a lower MTU, or something else to that effect.
After extracting the EAP response over multiple translations, and before ISE can continue with CRL verification and extract the client certificate, EAP times out on the client. 37 seconds for this is clearly not normal, especially with only one AP model. It may be one of the bugs mentioned above.
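
To put rough numbers on the MSS reasoning, here is a minimal back-of-the-envelope sketch. The 20-byte IPv4 and TCP header sizes assume no options, and the 100-byte CAPWAP allowance is purely an illustrative assumption, not a measured or documented value; `tunneled_packet_size` and `fits_in_mtu` are hypothetical helper names.

```python
# Header sizes assume IPv4 and TCP without options.
IPV4_HEADER = 20
TCP_HEADER = 20

# ASSUMPTION: illustrative allowance for the outer IP/UDP, CAPWAP and
# 802.11 headers added by the tunnel. Replace with a value measured on
# your own network.
CAPWAP_OVERHEAD_ASSUMED = 100


def tunneled_packet_size(mss, overhead=CAPWAP_OVERHEAD_ASSUMED):
    """Size of a full-sized client TCP segment once it is tunneled."""
    inner_ip_packet = mss + TCP_HEADER + IPV4_HEADER
    return inner_ip_packet + overhead


def fits_in_mtu(mss, mtu=1500, overhead=CAPWAP_OVERHEAD_ASSUMED):
    """True if the tunneled packet fits in the path MTU without IP fragmentation."""
    return tunneled_packet_size(mss, overhead) <= mtu


if __name__ == "__main__":
    for mss in (1250, 1460):
        print(mss, tunneled_packet_size(mss), fits_in_mtu(mss))
```

Under these assumptions an MSS of 1250 leaves comfortable headroom below 1500. Note that the MSS only clamps client TCP traffic; it does not change the size of the RADIUS (UDP) packets between the WLC and ISE.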

-hope this helps-

patoberli · ‎12-06-2022

I suspect the fragments are because the certificates are too large for a single packet. The WLC has the RADIUS framed MTU set to 1400, so I suspect that is the reason for the visible fragments. The network itself should support an MTU of 1500 everywhere.

TAC is still analyzing the logs at the moment.

ammahend · ‎12-06-2022

Understood, although EAP-TLS being sent in multiple packets is not fragmentation, and as long as the RADIUS framed-MTU value is less than or equal to the CAPWAP MTU, I don’t think it’s a fragmentation problem.
Keep us updated on what TAC finds.
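
As a rough illustration of why a large certificate shows up as several EAP-TLS messages regardless of IP fragmentation, here is a minimal sketch. It assumes every EAP-TLS fragment carries 10 bytes of EAP/EAP-TLS header (4-byte EAP header, type, flags, and the 4-byte TLS message length, which in practice is only required on the first fragment), and the 3000-byte TLS flight is a made-up figure, not taken from the captures in this thread. The function names are hypothetical.

```python
import math

# ASSUMPTION: EAP header (4) + Type (1) + Flags (1) + TLS Message Length (4)
# counted on every fragment for simplicity.
EAP_TLS_HEADER = 10

# A RADIUS attribute value can hold at most 253 bytes, so one EAP packet
# is split across several EAP-Message attributes.
RADIUS_ATTR_VALUE_MAX = 253


def eap_fragment_count(tls_bytes, framed_mtu):
    """Number of EAP-TLS fragments needed for a TLS flight of tls_bytes."""
    payload_per_fragment = framed_mtu - EAP_TLS_HEADER
    return math.ceil(tls_bytes / payload_per_fragment)


def eap_message_attributes(eap_packet_bytes):
    """Number of RADIUS EAP-Message attributes needed to carry one EAP packet."""
    return math.ceil(eap_packet_bytes / RADIUS_ATTR_VALUE_MAX)


if __name__ == "__main__":
    # Hypothetical 3000-byte certificate flight with a framed MTU of 1400.
    print(eap_fragment_count(3000, 1400))
    print(eap_message_attributes(1400))
```

A full-sized 1400-byte EAP packet still has to be wrapped in RADIUS attributes plus the RADIUS, UDP and IP headers, so the resulting RADIUS packet can end up close to, or above, a 1500-byte MTU depending on which other attributes are included. That would be IP-level fragmentation on the WLC-to-ISE path, separate from the EAP-level splitting sketched here.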

-hope this helps-

Packet Pusher · ‎01-17-2023

I believe I'm experiencing something very similar on 8.10.183.0 with 4800 APs (3700s do not appear to be impacted). Did you ever find a root cause or resolution?

Leo Laohoo · ‎01-17-2023

@Packet Pusher wrote:
I believe I'm experiencing something very similar on 8.10.183.0 with 4800 APs (3700s do not appear to be impacted). Did you ever find a root cause or resolution?

We have completed an exhaustive (8 months and several hundred hours of packet captures and OTA captures) investigation with TAC about issues with our fleet of 4800 and AireOS (8.10 MR6 and MR7). The issue is voice traffic fails when the handsets (CP-8821 & ASCOM i62) are joined to the 4800. However, we have no issues if the phones are joined to 9130 and the same WLC firmware.

We disabled WMM and the problem goes away. The TAC engineer hinted that we should upgrade the APs to 9130 but when I pressed him for more information, he clammed up.

The only fix was to migrate to IOS-XE.

Packet Pusher · ‎01-17-2023

You had indicated authentication-related issues on the 4800s as well. Is that correct? Was there a root cause or resolution for that?

Leo Laohoo · ‎01-17-2023

Authentication issues affecting 2800/3800/4800/1560/IW6200 are a different issue: CSCwd37092

patoberli · ‎01-20-2023

Hi Leo

TAC is now a few steps further and we have a working workaround: switch to 40 MHz channels instead of 80 MHz. Here 80 MHz normally works fine, thanks to thick walls. It also worked fine with the same infrastructure on 8.5.x.

Now it's time to find out whether it only affects Apple devices (the customer uses Apple nearly exclusively) and is caused by them, or whether it's a WLC bug. The Apple devices are configured with a special MDM, which isn't Windows capable. I'll keep you updated.