08-17-2023 02:29 PM
We have a 1562D RAP with three child 1562D MAPs associated to 8540s. They are used as bridges with several VLANs trunked across.
This morning, I upgraded one of our 8540 HA pairs from 8.10.183.0 to 8.10.185.3. Everything was working from about 4:15am until 9:15am, at which point all three of the MAPs disassociated. Rebooting the RAP restored connectivity, for a time. Then a couple of the MAPs went down again. Instead of rebooting the RAP again, I admin-disabled the RAP and re-enabled it, which restored connectivity again.
I realized that the RAP and one of the MAPs were associated to 8540-1 (which I upgraded this morning) and the other two MAPs were associated to 8540-2 (which is still on 183.0). I moved the remaining two to 8540-1 so they're all on the same WLC and code, and they did come back up for a while, but a few hours later, now all three are down again.
Oddly enough, all three MAPs show as children of the RAP. But, they aren't associated to the WLC and aren't pingable.
I opened a TAC case and got a call from an agent who took a quick look at things. However, a higher-level engineer is currently unavailable (the phrase "under staffing" was used but was quickly corrected to "no engineers available"), and I'm waiting for a callback in 30-45 minutes.
Unfortunately, these bridges support monitoring a robotic milking machine in a dairy barn 24/7 and the end users get very upset when the network is down. If TAC can't resolve it soon, I'll probably end up moving the bridges to 8540-2 on the old code to hopefully stabilize them. Meanwhile, I thought I'd post here and see if anyone ran into the same issue and found a solution.
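For anyone trying to pin down exactly when the MAPs stop answering on their management IPs, a small reachability logger can help, since it records the timestamp of each up/down change independently of the AP logs. This is only a minimal sketch: the function names are made up for illustration, the `ping` flags assume a Linux-style `ping`, and the addresses in the usage note are placeholders, not the ones from this network.

```python
import subprocess
import time
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Tuple


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo succeeds (assumes Linux-style ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def detect_transitions(
    previous: Dict[str, bool],
    hosts: Iterable[str],
    probe: Callable[[str], bool],
) -> Tuple[Dict[str, bool], List[Tuple[str, str]]]:
    """Probe each host once and report which hosts changed state since the last pass."""
    current: Dict[str, bool] = {}
    changes: List[Tuple[str, str]] = []
    for host in hosts:
        up = probe(host)
        current[host] = up
        # Only report a change when we have an earlier observation to compare with.
        if host in previous and previous[host] != up:
            changes.append((host, "UP" if up else "DOWN"))
    return current, changes


def monitor(hosts: List[str], interval_s: int = 10,
            probe: Callable[[str], bool] = ping_once) -> None:
    """Loop forever, printing a timestamped line for every up/down transition."""
    state: Dict[str, bool] = {}
    while True:
        state, changes = detect_transitions(state, hosts, probe)
        for host, status in changes:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host} {status}")
        time.sleep(interval_s)
```

To use it, call something like `monitor(["192.0.2.11", "192.0.2.12", "192.0.2.13"])` from a machine that can normally reach the MAPs, and redirect the output to a file. The timestamps can then be lined up against whatever controller or RAP logs survive.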
08-30-2023 06:55 AM - edited 10-03-2023 01:54 PM
Switching to 20 MHz channels instead of 80 MHz has stabilized the connection. (We're using channel 149.) We only get 100 Mbps on a speed test now instead of 200+ Mbps, but it's good enough for them, and stability and newer code are more important. I don't know about 40 MHz, but at this point, I'm not willing to do any more testing at the expense of the users in the buildings.
The RAP with a single MAP is still running fine on 80 MHz on the same controller and same configuration.
I was informed by TAC that if this issue is a bug (which it certainly seems to be), they cannot pursue it as they cannot engage the development team due to AireOS being end of maintenance.
UPDATE Oct. 3, 2023 - Today, the RAP with a single MAP that was running fine on 8.10.185.3 and 80 MHz for the backhaul disconnected. Rebooting the RAP restored connectivity. Unless it was a coincidence, it seems whatever bug this is also affects RAPs with single MAPs; it just takes longer. (Or, perhaps, there was finally some strange frame that traversed the link, which may occur more frequently with the other ones that experienced the issue faster... trying to keep an open mind.)
I have changed this pair to a 20 MHz channel, since the RAP with three MAPs that was having the issue on 80 MHz has remained stable on 20 MHz.
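As general RF background on the 80 MHz versus 20 MHz change (not an explanation of the disconnects, which still look like a software bug): a narrower channel collects less thermal noise, so the theoretical noise floor drops by about 6 dB when going from 80 MHz to 20 MHz. The sketch below works that out from the standard -174 dBm/Hz thermal noise density at roughly room temperature. It ignores receiver noise figure and interference, and the function names are illustrative only.

```python
import math


def thermal_noise_dbm(bandwidth_mhz: float) -> float:
    """Thermal noise power in dBm for a given bandwidth, at about 290 K.

    Uses the usual -174 dBm/Hz density; real receivers add their own noise figure.
    """
    return -174.0 + 10.0 * math.log10(bandwidth_mhz * 1e6)


def noise_delta_db(wide_mhz: float, narrow_mhz: float) -> float:
    """How many dB lower the thermal noise floor is on the narrower channel."""
    return thermal_noise_dbm(wide_mhz) - thermal_noise_dbm(narrow_mhz)


if __name__ == "__main__":
    for width in (20, 40, 80):
        print(f"{width} MHz: {thermal_noise_dbm(width):.2f} dBm")
    print(f"80 -> 20 MHz difference: {noise_delta_db(80, 20):.2f} dB")
```

Whether that extra margin has anything to do with the improved stability here is unknown; the thread only establishes that 20 MHz has been stable where 80 MHz was not.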
08-18-2023 12:15 AM
- Provide logs from the 8540, RAPs, and MAPs when the MAPs disassociate (especially from the controller).
M.
08-19-2023 03:34 AM
Are the APs allowed to ping their default gateway?
08-21-2023 02:20 PM
Rich, yes, they are.
Marce, logs from the RAP are in the spoiler below, but I think the logs from the time the MAPs went down are gone: the earliest entries are at 15:54, and the MAPs went down at 15:47. I'm unable to access the MAP console, and once I disable/re-enable the RAP to (temporarily) re-establish the connection, the MAP logs from the time of disconnection are gone. Anyway, is there a place in the config analyzer for show log and show tech-support output from the APs themselves?
TAC was not particularly helpful the other day. We didn't see anything unusual in logs or configs. I saw that the primary channel on the 80 MHz backhaul was either 153 or 157 (I don't recall which), so we agreed to change it to 149. A few hours later, it went down again, so the next day I moved the RAP and the 3 MAPs to another WLC still on 8.10.183.0 code, which is better. Meanwhile, I moved another RAP/single MAP bridge in an empty building to the 185.3 controller also using channel 149 and 80 MHz. It has been stable.
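One note on the primary channel change: with an 80 MHz channel in this part of the 5 GHz band, channels 149, 153, 157, and 161 all belong to the same 80 MHz block, so moving the primary from 153 or 157 to 149 changes which 20 MHz slice is the primary but not the overall spectrum in use. The sketch below illustrates that with the general 5 GHz channel plan. The helper names are made up for illustration, and which blocks are actually permitted depends on the regulatory domain.

```python
from typing import List, Tuple

# Lowest 20 MHz channel in each standard 5 GHz 80 MHz block.
# Availability varies by country; this is the general channel plan only.
BLOCK_STARTS_80MHZ = [36, 52, 100, 116, 132, 149]


def block_80mhz(primary: int) -> List[int]:
    """Return the four 20 MHz channels that make up the 80 MHz block for a primary."""
    for start in BLOCK_STARTS_80MHZ:
        members = [start + 4 * i for i in range(4)]
        if primary in members:
            return members
    raise ValueError(f"channel {primary} is not part of a standard 80 MHz block")


def center_mhz(channel: int) -> int:
    """Center frequency of a 5 GHz channel number, in MHz."""
    return 5000 + 5 * channel


def span_mhz(primary: int) -> Tuple[int, int]:
    """Lower and upper edge, in MHz, of the 80 MHz block containing the primary."""
    members = block_80mhz(primary)
    return center_mhz(members[0]) - 10, center_mhz(members[-1]) + 10


if __name__ == "__main__":
    for primary in (149, 153, 157):
        print(primary, block_80mhz(primary), span_mhz(primary))
```

All three primaries map to the block 149/153/157/161 spanning 5735 to 5815 MHz, which fits with the change of primary channel making no difference to the outages.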
Today, we noticed that one of the MAPs has been sporadically dropping pings on its management IP, and its SNR was consistently lower than the other two (around 10 vs. around 30). The RAP is a 1562I and is about 1,600 feet away from the MAPs. A 1562D is better for the application, plus I suspected a possible hardware issue, so I replaced the 1562I RAP with a 1562D. Throughput went way up from 60 Mbps to over 200 Mbps both ways and SNR on all three links was around 30, but an hour and a half later, as usual, all three MAPs went down (still bridged to the RAP, but management IP not pingable and not associated to the WLC).
I'll be discussing further with TAC tomorrow.
08-30-2023 07:26 AM
LOL I have the same discussion with my colleagues every week now, including earlier today!
Looks like a bug in AireOS, and we know they won't fix it, so either find a workaround or look at migrating to 9800. At least then, if the bug is still there, there's a chance it will get fixed.
08-30-2023 07:33 AM
Honestly, that's the reason I spent this much time trying to troubleshoot. The ultimate goal is to move to the 9800, but I haven't done much testing with bridging on that yet. I thought I'd be helpful to my future self and help TAC find an issue on the off chance it would also affect IOS and get fixed there, but so much for that.