
RF Handheld Roams with IP but no connectivity

thewifidude
Level 1

Cisco 5520 in HA running 8.10.151 with (150) 2802i access points. Local mode, local controller, central switching. WPA2 AES PSK. Fast Roaming OTA. 802.11v/k enabled. Session timeout 28800. 5 GHz only, 20 MHz channels, all UNII-1/2/2e/3 enabled. No DFS issues. No auth/roaming problems. SNR is above voice grade.

We have over 10,000 of these Zebra TC51 Android 6.0 devices in our warehouse deployment. I have other branches running the same specs but they are not reporting any issue.

Issue: Users of Zebra TC51 Android 6.0 devices in our New York branch are complaining about losing connection to their picking app. The RF handheld shows it is connected to WiFi, but they have to toggle WiFi to re-connect. I currently have an RF handheld in this "broken" state: it is in a RUN state with an IP address, but it is not pingable. Everything at the core level checks out fine. It happens randomly to devices with Android 6.0. It does not happen to Android 8 devices.

My obvious answer to the problem is the one I just provided: Android 8 devices do not have it. This won't appease management, however, because they will ask why it is working at other branches. Cisco TAC is not providing many answers and is still waiting for more information.

What could be causing this? Has anyone seen this before? Even the debugs show the device roaming from AP to AP, but you cannot use it to send data and you cannot ping it. DNAC Intelligent Capture shows zero issues.

9 Replies

eglinsky2012
Spotlight

What’s the difference between this branch and the other branches where they are working? Are they on different controllers with different software/APs/settings?
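One way to answer that question systematically is to write down each branch's settings and diff them. The following is only a minimal sketch: the `diff_settings` helper and the key names are made up for illustration, and the values for the "working" branch are placeholders, not real data from any site. The New York values are the ones quoted in the original post.

```python
def diff_settings(baseline, suspect):
    """Return {key: (baseline_value, suspect_value)} for every setting that differs.

    A key that is missing on one side is reported with None on that side.
    """
    diffs = {}
    for key in sorted(set(baseline) | set(suspect)):
        a = baseline.get(key)
        b = suspect.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs


# Placeholder values for a branch that works (hypothetical, fill in your own).
working_branch = {
    "wlc_version": "FILL-IN",
    "session_timeout": 28800,
    "dot11k": True,
    "dot11v": True,
}

# Values taken from the original post for the New York branch.
new_york = {
    "wlc_version": "8.10.151",
    "session_timeout": 28800,
    "dot11k": True,
    "dot11v": True,
    "core_switch": "Catalyst 9500",
}

for key, (good, bad) in diff_settings(working_branch, new_york).items():
    print(f"{key}: working={good!r} new_york={bad!r}")
```

Anything the loop prints is a candidate difference worth checking before blaming the client or the APs.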

You should upgrade WLC software since that’s several versions and many bug fixes behind. 8.10.185.0 is recommended now.

*Cue Leo to discuss the many issues with 2800s on AireOS*

Looking at your issue, it is clearly a compatibility issue or a bug (client-side or AP-side). I would always look at the client side first in this scenario.

1. Is it possible to upgrade the OS (from Android 6) on one of the problematic TC51s at that site? If that is possible and you don't see the problem after the upgrade, it is clear where the issue lies.

2. If you have the same Android 6 client devices at other sites on the same WLC/AP (hardware & firmware) and they still do not experience the problem, there is little possibility of an infrastructure-side issue. Still, from a best-practice point of view, as others suggested, I would go with the 8.10.185.0 code upgrade to see if that helps.

HTH
Rasika
*** Pls rate all useful responses ***

This is exactly what I am thinking. Where I am currently at, management wants time spent on answers versus moving on with solutions. My next step is a code upgrade; otherwise there is no option but to expedite Android 8.0 (which they are due for in a few months).

Leo Laohoo
Hall of Fame

@thewifidude wrote:
The RF shows they are connected to WiFi but they have to toggle it to re-connect.

Let me guess: 

  • Some scanners would not get a valid IP address when roaming
  • Reboot the APs (site-wide) and the problem goes away but returns after a few days/weeks/months.

2800/3800/4800/1560 belong to the same "family" and they share one common component:  The MARVELL WiFi chipset

Over the years, people have reported bugs about 2800/3800/4800/1560 randomly dropping packets, such as DHCP, authentication, voice traffic, etc.  Because most people who reported the issue were on AireOS (plus 2800/3800/4800/1560 is approaching end-of-support date), the easiest way to "fix" this problem was to encourage people to migrate to IOS-XE (and translate to sales). 

As people slowly transition from AireOS to IOS-XE, I am seeing Bug IDs reporting very similar issues appearing in IOS-XE in the form of CSCwh03842.

The list of 2800/3800/4800/1560 Bug IDs can be found HERE.

The "mega BUG" CSCwa73245 talks about turning off MU-MIMO, and some bugs recommend turning off WMM as a workaround.

Actually they all get IP addresses and they roam just fine. Rebooting the access points did not resolve the issue, as they reported the problem the very next day. The problem is strictly layer 3. I have debugs from the device while it is non-reachable, requesting DHCP, roaming, 4-way handshake etc. It's just that layer 3 stops.

This is debug data of a roam while the device wasn't reachable. The odd part is I was told they took an RF device to another branch and they didn't have the issue. I am leaning towards there being something wrong with the Catalyst 9500 core. But again, I'm torn because Android 8 on the same device doesn't experience the issue. The bugs you listed, however, are a good find as I also have IOS-XE deployments.


@thewifidude wrote:
I have debugs from the device while it is non-reachable, requesting DHCP, roaming, 4 way handshake etc. It's just layer 3 stops.

Go through the list of Bug IDs that I have compiled.  The behaviour described really sounds like the MARVELL chipset hardware defect coming to make its presence known.

There is really no (permanent) fix.  No amount of upgrading/downgrading of the WLC firmware will fix it.  This is a hardware design fault and nothing can be done (unless you are a "whale").  Upgrading to the latest Catalyst 9k is not a fix either because we are not sure what other people will be reporting in the coming months or years.  The 2800/3800/4800/1560 chipset is made by MARVELL.  Catalyst 9120/9124 and below are made by Broadcom, and Catalyst 9130, 916X are made by Qualcomm.  And we were told that programming of the Broadcom chips is "challenging".  For example, have a look at CSCwh12413.  The AP in question is a 9120 (Broadcom) and this behaviour is very much like one of the bugs affecting the 2800/3800/4800/1560.

I'll be honest, I went down a rabbit hole with all those bugs and it gave me anxiety, knowing this would likely impact my whole environment. @Scott Fella is right, we know the fusion drivers on Android 6 are extremely out of date. It works at another branch but I was not included in validating that. At this point I have to quantify my time spent on an issue that has been going on since the branch opened and suggest they expedite the upgrade.

I will reach out to Cisco to get comment on this because it's deeply concerning. The 9120 console bug (that never got filed) is, I suspect, the reason you can now disable the serial port in the AP Join Profile on the new 9800s. This issue, which cannot be found documented anywhere, prevented us from executing any commands on a 9120 access point that had an Ethernet cable longer than 7 ft connected to the serial port.

I'm cracking the beer for now. Thanks to you all for the insights.

Scott Fella
Hall of Fame

From reading this thread: in my experience this has always been an issue with the device, and almost always the issue was with the firmware or custom firmware on the device.  Now, testing a few at a different location is a great idea, especially if they test for a few weeks with devices that have been reported as not working. You do have to be part of that testing so you can validate it.  Even if they get a few working devices from another location, whether users report issues or have no issues, that can help you narrow things down even more.
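To keep that kind of multi-week test honest, it may help to log every reported drop and tally the reports by site and OS version. A rough sketch follows; the `tally_reports` helper, the field names, and the sample rows are all invented for illustration and are not real results from this deployment.

```python
from collections import Counter


def tally_reports(reports):
    """Count drop reports per (site, android_version) pair."""
    return Counter((r["site"], r["android"]) for r in reports)


# Made-up sample rows; replace with the reports you actually collect.
sample_reports = [
    {"site": "site-A", "android": "6.0", "device": "handheld-01"},
    {"site": "site-A", "android": "6.0", "device": "handheld-02"},
    {"site": "site-B", "android": "6.0", "device": "handheld-01"},
]

for (site, android), count in sorted(tally_reports(sample_reports).items()):
    print(f"{site} Android {android}: {count} report(s)")
```

If a device that fails at one site racks up zero reports after weeks at another, that points at the site; if the reports follow the device, that points at the client.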

-Scott
*** Please rate helpful posts ***

Rich R
VIP

To eliminate those known bugs (as much as possible) upgrade to 8.10.185.3 (link below) which replaces 8.10.185.0 (mentioned above by @eglinsky2012) because DFS is broken on those APs in 8.10.185.0 (and 8.10.151.0 for that matter).
