dmcgrath.ca
Beginner

AP crashing after installing AIR-RM3000AC modules into 3602i

Hi,

Recently we decided to grab a box of second-hand AC modules for our 3602i access points. Things seemed fine at first, but during the day, seemingly at random, one or several APs at once (most recently 3, all within moments of each other) time out with lines similar to:

*spamApTask7: Jul 09 14:28:26.134: %CAPWAP-3-MAX_RETRANSMISSIONS_REACHED: capwap_ac_sm.c:7673 Max retransmissions reached on AP(XX:XX:XX:XX:XX:XX),message (CAPWAP_CONFIGURATION_UPDATE_REQUEST^M ),number of pending messages(1) 

I was poking around, but couldn't find any obvious reason. Looking at the crash logs, it seems that the APs are hitting a memory corruption of sorts, but for unknown reasons. It should be noted that the APs are on a 3750 switch with a 20w port config maximum using enhanced PoE, since 15.4w isn't enough. Could this be a cause? And if so, why does it only affect some of them?

Here are a few samples of some of the crashes, at least the few lines before and after the crash:

*Jul 23 12:33:30.719: %LINEPROTO-5-UPDOWN: Line protocol on Interface Dot11Radio1, changed state to up
*Jul 23 12:33:48.603: %CLEANAIR-6-STATE: Slot 0 enabled
*Jul 23 12:33:50.363: %CLEANAIR-6-STATE: Slot 1 enabled
*Jul 23 12:34:06.819: %CDP_PD-4-POWER_OK: Full power - NEGOTIATED inline power source
*Jul 23 12:34:29.727: %DOT11-6-DFS_SCAN_COMPLETE: DFS scan complete on frequency 5540 MHz
*Jul 23 12:34:29.735: %DOT11-6-DFS_SCAN_COMPLETE: DFS scan complete on frequency 5540 MHz
*Jul 23 13:50:01.887: %SYS-3-BADMAGIC: Corrupt block at 99D9E60 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:50:01.887: %SYS-6-MTRACE: mallocfree: addr, pc
  B215238,1188D4C   B215238,40000294  A68F4F8,1188D24   A68F4F8,4000020A
  99D9E90,18B1894   A6A83F4,60000066  A6A7E18,18B6A7C   B652704,60000C5E
*Jul 23 13:50:01.887: %SYS-6-MTRACE: mallocfree: addr, pc
  B651D18,18B6A7C   B652F2C,6000084A  B652704,11F69F8   B652704,11F68B0 
  B652704,400003FC  AB95630,60000044  AB954D8,1AE99E0   AB954D8,1AE9950 
*Jul 23 13:50:01.887: %SYS-6-BLKINFO: Corrupted magic value in in-use block blk 99D9E60, words 25914424, alloc 0, Free, dealloc 0, rfcnt 99CCDA8
-Traceback= 11B72C8z 12A1370z 12C05C0z 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:50:01.887: %SYS-6-MEMDUMP: 0x99D9E60: 0x0 0x0 0x0 0x0
*Jul 23 13:50:01.887: %SYS-6-MEMDUMP: 0x99D9E70: 0x0 0x15A3C78B 0x1 0x18B6C38
*Jul 23 13:50:01.887: %SYS-6-MEMDUMP: 0x99D9E80: 0x99CCDA8 0x0 0x0 0x0

 13:50:01 UTC Thu Jul 23 2020: Unexpected exception to CPU: vector 700, PC = 0x1346F44 , LR = 0x1346EDC 

-Traceback= 0x1346F44z 0x1346EDCz 0x12A5A88z 0x2F09564z 0x18B1898z 0x1ADE228z 0x18FF274z 0x18B3410z 0x18AD8BCz 0x1842048z 0x1842200z 0x1828620z 0x13483B4z 0x132F13Cz 
*Jul 23 00:43:57.587: %LINK-5-CHANGED: Interface Dot11Radio0, changed state to administratively down
*Jul 23 00:43:57.603: %DOT11-5-EXPECTED_RADIO_RESET: Restarting Radio interface Dot11Radio0 due to the reason code 10
*Jul 23 00:43:57.607: %LINK-6-UPDOWN: Interface Dot11Radio0, changed state to up
*Jul 23 00:43:58.595: %LINEPROTO-5-UPDOWN: Line protocol on Interface Dot11Radio0, changed state to down
*Jul 23 00:43:58.623: %LINK-6-UPDOWN: Interface Dot11Radio0, changed state to down
*Jul 23 00:43:58.631: %LINK-5-CHANGED: Interface Dot11Radio0, changed state to reset
*Jul 23 00:43:59.655: %LINK-6-UPDOWN: Interface Dot11Radio0, changed state to up
*Jul 23 00:44:00.655: %LINEPROTO-5-UPDOWN: Line protocol on Interface Dot11Radio0, changed state to up
*Jul 23 11:35:25.767: %SYS-3-BADMAGIC: Corrupt block at 9926BD0 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 11:35:25.767: %SYS-6-MTRACE: mallocfree: addr, pc
  AFF3B2C,1188D4C   AFF3B2C,40000294  AB8CAE4,1188D24   AB8CAE4,4000020A
  9926C00,18B1894   AFF4108,60000066  AFF3B2C,18B6A7C   A6DEFCC,600004DE
*Jul 23 11:35:25.767: %SYS-6-MTRACE: mallocfree: addr, pc
  A6DDBF4,500004DE  A6DE5E0,18B6A7C   8D2CEBC,1AE99E0   8D2CEBC,1AE9950 
  8D2CEBC,300000AA  8D2CEBC,1AE99E0   8D2CEBC,1AE9950   8D2CEBC,300000AA
*Jul 23 11:35:25.767: %SYS-6-BLKINFO: Corrupted magic value in in-use block blk 9926BD0, words 25914424, alloc 0, Free, dealloc 0, rfcnt 991944C
-Traceback= 11B72C8z 12A1370z 12C05C0z 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 11:35:25.767: %SYS-6-MEMDUMP: 0x9926BD0: 0x0 0x0 0x0 0x0
*Jul 23 11:35:25.767: %SYS-6-MEMDUMP: 0x9926BE0: 0x0 0x15A3C78B 0x1 0x18B6C38
*Jul 23 11:35:25.767: %SYS-6-MEMDUMP: 0x9926BF0: 0x991944C 0x0 0x0 0x0

 11:35:25 UTC Thu Jul 23 2020: Unexpected exception to CPU: vector 700, PC = 0x1346F44 , LR = 0x1346EDC 

-Traceback= 0x1346F44z 0x1346EDCz 0x12A5A88z 0x2F09564z 0x18B1898z 0x1ADE228z 0x18FF274z 0x18B3410z 0x18AD8BCz 0x1842048z 0x1842200z 0x1828620z 0x13483B4z 0x132F13Cz 
*Jul 23 12:34:25.599: %CLEANAIR-6-STATE: Slot 0 enabled
*Jul 23 12:34:27.363: %CLEANAIR-6-STATE: Slot 1 enabled
*Jul 23 12:34:44.883: %CDP_PD-4-POWER_OK: Full power - NEGOTIATED inline power source
*Jul 23 12:46:26.643: %SYS-2-BADSHARE: Bad refcount in datagram_done, ptr=51B36A4, count=0
-Traceback= 11B72C8z 1228388z 1540018z 1B85810z 18660C4z 18B0774z 18B1D00z 1933074z 18DD9E4z 18B32C4z 18B5F64z 13483B4z 132F13Cz
*Jul 23 12:46:26.747: %SYS-2-BADSHARE: Bad refcount in datagram_done, ptr=5167920, count=0
-Traceback= 11B72C8z 1228388z 1540018z 1B85810z 18660C4z 18B0774z 18B1D00z 1933074z 18DD9E4z 18B32C4z 18B5F64z 13483B4z 132F13Cz
*Jul 23 12:46:26.851: %SYS-2-BADSHARE: Bad refcount in datagram_done, ptr=51B36A4, count=0
-Traceback= 11B72C8z 1228388z 1540018z 1B85810z 18660C4z 18B0774z 18B1D00z 1933074z 18DD9E4z 18B32C4z 18B5F64z 13483B4z 132F13Cz
*Jul 23 13:48:46.199: %SYS-3-BADMAGIC: Corrupt block at 9C57234 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:48:46.199: %SYS-6-MTRACE: mallocfree: addr, pc
  AA8BA50,1188D4C   AA8BA50,40000294  AA8D330,1188D24   AA8D330,4000020A
  9C57264,18B1894   AA8D330,18B6A7C   AA8BA50,18B6A7C   7B027BC,14882C4 
*Jul 23 13:48:46.199: %SYS-6-MTRACE: mallocfree: addr, pc
  7B027BC,4000010C  7B02FE4,60000DC0  7B027BC,11F69F8   7B027BC,11F68B0 
  7B027BC,400003FC  7B02FE4,60000DC0  7B027BC,11F69F8   7B027BC,11F68B0 
*Jul 23 13:48:46.199: %SYS-6-BLKINFO: Corrupted magic value in in-use block blk 9C57234, words 25914424, alloc 0, Free, dealloc 0, rfcnt 9C49AB0
-Traceback= 11B72C8z 12A1370z 12C05C0z 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:48:46.199: %SYS-6-MEMDUMP: 0x9C57234: 0x0 0x0 0x0 0x0
*Jul 23 13:48:46.199: %SYS-6-MEMDUMP: 0x9C57244: 0x0 0x15A3C78B 0x1 0x18B6C38
*Jul 23 13:48:46.199: %SYS-6-MEMDUMP: 0x9C57254: 0x9C49AB0 0x0 0x0 0x0

 13:48:46 UTC Thu Jul 23 2020: Unexpected exception to CPU: vector 700, PC = 0x1346F44 , LR = 0x1346EDC 

-Traceback= 0x1346F44z 0x1346EDCz 0x12A5A88z 0x2F09564z 0x18B1898z 0x1ADE228z 0x18FF274z 0x18B3410z 0x18AD8BCz 0x1842048z 0x1842200z 0x1828620z 0x13483B4z 0x132F13Cz 

Seems like there is a bit of variety here. That said, this didn't start until we put the AC modules in. Currently we are on 8.5.160, but this was happening on .151 as well.
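One way to check how much variety there really is would be to group the crashes by exception signature. The Python sketch below is only an illustration: the regular expressions are written against the log excerpts pasted above and may not match other IOS releases, and `summarize_crashes` is a hypothetical helper name. In the three excerpts above, the exception lines share the same vector, PC, and LR, which may point at a common code path, although it does not by itself identify the cause.

```python
import re
from collections import Counter

# Patterns written against the log excerpts in this thread (an assumption:
# other IOS releases may format these lines differently).
BADMAGIC_RE = re.compile(r"%SYS-3-BADMAGIC: Corrupt block at ([0-9A-Fa-f]+)")
EXCEPTION_RE = re.compile(
    r"Unexpected exception to CPU: vector (\w+), "
    r"PC = (0x[0-9A-Fa-f]+)\s*, LR = (0x[0-9A-Fa-f]+)"
)


def summarize_crashes(log_text):
    """Count (vector, PC, LR) signatures and collect corrupt block addresses."""
    signatures = Counter()
    corrupt_blocks = []
    for line in log_text.splitlines():
        match = BADMAGIC_RE.search(line)
        if match:
            corrupt_blocks.append(match.group(1))
        match = EXCEPTION_RE.search(line)
        if match:
            signatures[match.groups()] += 1
    return signatures, corrupt_blocks


# Two of the crashes from the post, trimmed to the relevant lines.
SAMPLE = "\n".join([
    "*Jul 23 13:50:01.887: %SYS-3-BADMAGIC: Corrupt block at 99D9E60 (magic 00000000)",
    " 13:50:01 UTC Thu Jul 23 2020: Unexpected exception to CPU: vector 700, PC = 0x1346F44 , LR = 0x1346EDC",
    "*Jul 23 11:35:25.767: %SYS-3-BADMAGIC: Corrupt block at 9926BD0 (magic 00000000)",
    " 11:35:25 UTC Thu Jul 23 2020: Unexpected exception to CPU: vector 700, PC = 0x1346F44 , LR = 0x1346EDC",
])

signatures, corrupt_blocks = summarize_crashes(SAMPLE)
for signature, count in signatures.items():
    print(signature, count)
```

Feeding it the full crash files from several APs would show whether every reset lands on the same signature or whether the BADSHARE events belong to a separate pattern.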

At first I thought maybe there was just some port-channel saturation going on that was causing the CAPWAP packets to get dropped for too long, but the AP crashes suggest that this isn't the case. Regardless, I changed the native VLAN away and removed some VLANs from the trunk, just in case some of our other traffic was bottlenecking it.

Worth mentioning is that one of the events, the one at 35 minutes past the hour, occurred at the same time as a slight spike in internet traffic on that AP's port, so it was under load when the reset happened. I am wondering if that is why the other APs, all in the same area, went at the same time. My guess is that the downloading client loaded one AP, then failed over to another one and caused load there, and so on, each time causing the AP to draw more power than it was getting and brown out, hence the memory issues.

Now, if this is the case, shouldn't 20w on a 3750E switch with enhanced PoE be sufficient to power these AC modules?

Any tips or advice would be appreciated!



@Scott Fella wrote:
It’s something that has been a pain for me for many years. Keep tabs on the show ap uptime once a week or so and track what ap bounces. If you see an ap bounce a few times a month or even a year, it may help determine what is the culprit depending on what was the root cause.
While you are at it, get some canned air and also clean the Ethernet port and the rj45 end, in case it is dusty or dirty from the install.
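The weekly uptime check suggested here could be scripted roughly as follows. This is a minimal Python sketch under stated assumptions: it assumes the uptime column of `show ap uptime` looks like `12 days, 01 h 23 m 45 s` (verify this against your own controller output), and the helper names and AP names are hypothetical. The idea is simply that an AP whose uptime is shorter than the time since the last check must have restarted in between.

```python
import re

# Assumed uptime format, e.g. "12 days, 01 h 23 m 45 s".
UPTIME_RE = re.compile(r"(\d+) days?, (\d+) h (\d+) m (\d+) s")


def uptime_to_seconds(text):
    """Convert an uptime string in the assumed format to seconds."""
    match = UPTIME_RE.search(text)
    if match is None:
        raise ValueError(f"unrecognized uptime format: {text!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def rebooted_since(uptimes, elapsed_seconds):
    """Return names of APs whose uptime is shorter than the check interval."""
    return sorted(name for name, up in uptimes.items() if up < elapsed_seconds)


ONE_WEEK = 7 * 24 * 60 * 60

# Hypothetical AP names and uptimes, as if copied from this week's output.
snapshot = {
    "ap-lobby": uptime_to_seconds("12 days, 01 h 23 m 45 s"),
    "ap-lab": uptime_to_seconds("0 days, 03 h 10 m 05 s"),
}

bounced = rebooted_since(snapshot, ONE_WEEK)
print(bounced)
```

Keeping the weekly results in a spreadsheet or CSV would make it easy to see which APs bounce repeatedly.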

Just a heads up that we tried re-seating the module, blowing it out etc., but the APs still seem extremely unstable, specifically when the 5ghz radio on the AC module itself is used. It isn't just association events, either; it seems to happen more when actual traffic flows.

That said, earlier today I made a quick SSID for some local EAP 802.1x auth test, and I even had a user cause a reset to the AP just trying to associate, yet another user was able to login just fine. It's still very hit or miss.

It's the weekend now, so in the meantime I just disabled all the 5ghz AC radios, and I will see if the APs run stable with them installed so that we don't have to go around and pull them all out.

Needless to say, it's very frustrating that we can't solve it :(

Remove the modules for a few weeks and monitor. That will tell you for sure if it's the module or not. We have had thousands of access points, and the majority never gave us issues, but then there were the 2% that caused so many issues and tech visits. It's not worth it: if the issue goes away when you remove the module, my suggestion is to leave them off and eventually replace the AP. You don't want to ruin the user experience... they will remember how bad it was for a very long time. Oh... they will also say their favorite quote, "it's a network issue"!
-Scott
*** Please rate helpful posts ***


@Scott Fella wrote:
Remove the modules for a few weeks and monitor. That will tell you for sure if it's the module or not. We have had thousands of access points, and the majority never gave us issues, but then there were the 2% that caused so many issues and tech visits. It's not worth it: if the issue goes away when you remove the module, my suggestion is to leave them off and eventually replace the AP. You don't want to ruin the user experience... they will remember how bad it was for a very long time. Oh... they will also say their favorite quote, "it's a network issue"!

Rather than spend months trying A/B tests and wasting everyone else's time, I was thinking about just disabling the 2.4ghz radios in most of the APs, as we already have coverage in nearly the entire building with 5ghz. There is a spot or two where we could use some 2.4 to get through some thick walls, but I could just turn the APs near there into non-AC APs.

I assume that if I go into the 2.4ghz radio for each AP and set it to administratively disabled, it won't use the radio in any way, and the AP will thus fit within the 15.4w budget? I checked the switch and it seems they are still pulling 20w, and another AP just restarted while in this mode. I am not sure if I need to put the AP into an AP group or do something specific in order to actually get the 2.4ghz radio 100% disabled. I noticed that some of the APs were showing some statistics on the home page despite being "off", and they are still using 20w, which makes me question a few things about how I was disabling them.

That didn’t fix the issue for us. We also disabled 2.4ghz, because we can as we have high density. However, too many bugs with the module, a bad module, or a bad module install is the reason we removed them.
-Scott


@Scott Fella wrote:
That didn’t fix the issue for us. We also disabled 2.4ghz, because we can as we have high density. However, too many bugs with the module, a bad module, or a bad module install is the reason we removed them.

Just a heads up that I tried disabling the 2.4ghz and only used 5ghz + 11ac, which should be only pulling 15.4w, according to the specs:

Screenshot_2020-07-24_17-06-24.png

It clearly states that with 2.4ghz off, we should be fine. But we are using enhanced PoE, which I guess is that "E-POE" column, and that says "n/a"; it isn't clear whether that means not available or not applicable. And why is there an umlaut in there? It needs a bit of a legend, but the point is that we should have had a proper 15.4w AP. It still tossed errors though, and the switch showed the port actually requesting 20w instead of 15.4w, but I don't know if it can actually negotiate less than 20w with CDP when the AC module is installed. That part isn't clear from the specs.

Anyway, I tried a few things. I will try setting the regulatory domain to US only and see how that goes. I did notice a mention in another bug report that multiple domains can not only be a little quirky, but also that there are situations where the AP isn't actually running 100% perfect firmware. TBH, I am rather disappointed that such a situation is even remotely possible with Cisco gear, but I guess you can't have everything perfect!

The modules are purchased for a specific regulatory domain. Just like access points, you can’t change them. Keep that in mind.
-Scott


@Scott Fella wrote:
The modules are purchased for a specific regulatory domain. Just like access points, you can’t change them. Keep that in mind.

Yes, but everything is American. Only the WLC was set to multiple regulatory domains. So basically it's as if it's just a USA-based WLC+AP+11AC setup, but the config is set to USA+EU in order to get the lowest common denominator of channels and signal strength etc.

I’m just saying that APs need to be installed in the country they were manufactured for, or else you will be breaking the regulatory domain laws. So -B, for example, can only be installed in the US, -M in Mexico, etc. Typically you match the AP and module country code for compliance with the regulatory domain.
-Scott


@Scott Fella wrote:
I’m just saying that APs need to be installed in the country they were manufactured for, or else you will be breaking the regulatory domain laws. So -B, for example, can only be installed in the US, -M in Mexico, etc. Typically you match the AP and module country code for compliance with the regulatory domain.

For sure, although I believe that so long as we comply with the EIRP limits for the regulatory domain that the items are located in, and the channels in use are restricted as well, then these devices should technically be in compliance.

Regulatory domain specifics aside, I am more curious if there is a problem with multi domain configurations.

Many folks have multi domain setups on their controller. I don’t think many mix domain on modules and aps because at the end of the day, how are you going to control that. No one ever does and that is why they state to only use the domain for the country it was manufactured for.
Anyways, I think you have all the info you need and you have a Tac case.
-Scott


@Scott Fella wrote:
Many folks have multi domain setups on their controller. I don’t think many mix domain on modules and aps because at the end of the day, how are you going to control that. No one ever does and that is why they state to only use the domain for the country it was manufactured for.
Anyways, I think you have all the info you need and you have a Tac case.

But we aren't mixing domains on modules and APs. What gave you that idea? They are -A APs and -A 11ac modules. Nowhere did I ever mention multiple domains on APs and modules.

You mentioned NL, so my assumption was you also had -A installed on those APs. Anyway, if you had issues after you installed the module, then you know the RCA. It’s up to you whether to leave them in and see if TAC can figure it out or just remove them. Like I said before, I started to remove all of them, and we did not install any modules on the 3702s that we were installing a year or two ago.
-Scott


@Scott Fella wrote:
You mentioned NL, so my assumption was you also had -A installed on those APs. Anyway, if you had issues after you installed the module, then you know the RCA. It’s up to you whether to leave them in and see if TAC can figure it out or just remove them. Like I said before, I started to remove all of them, and we did not install any modules on the 3702s that we were installing a year or two ago.

Yes, the NL is there to force the system to at least the lowest common settings, despite the hardware being -A domain. But yes, the issue is only with the modules. The system runs stable without them and has for a year now. In hindsight, 3700s would have been better, and in the -E domain, but it's too late now.

Depending on what code you have, they stopped doing the least common denominator, so keep that in mind. You would be able to have multiple country codes defined without being limited.
-Scott


@Scott Fella wrote:
Depending on what code you have, they stopped doing the least common denominator, so keep that in mind. You would be able to have multiple country codes defined without being limited.

I did a quick test today with only US (-A) domain for everything, which is what the APs and 11ac modules are meant to be, and the problem still happens.

One odd thing I noticed was that after a factory reset of one of the APs while in -A domain, I still get that error:

Jul 27 09:42:41 wlc.blender WLC-2504: *spamApTask4: Jul 27 09:42:41.668: %LWAPP-3-RD_ERR7: spam_lrad.c:12254 The system detects an invalid country code () for AP aa:aa:aa:aa:aa:aa
Jul 27 09:42:41 wlc.blender WLC-2504: *spamApTask4: Jul 27 09:42:41.669: %LWAPP-3-RD_ERR9: spam_lrad.c:13481 APs aa:aa:aa:aa:aa:aa country code changed from () to (US )

I am still curious about how good the firmware on those things is. Currently they are running:

Screenshot_2020-07-27_06-54-59.png

I will have to look into them, but it almost sounds like the regulatory domain was left out of them during a firmware update, based on what I was reading in another post where the invalid country code is showing up in the logs.
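To see whether the invalid country code message shows up for every AP or only for some, the WLC syslog could be scanned per AP. The Python sketch below is illustrative only: the patterns are written against the two `%LWAPP-3-RD_ERR7` / `%LWAPP-3-RD_ERR9` lines pasted above and may need adjusting for other releases, and `country_code_events` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Patterns written against the two syslog lines quoted in this post
# (an assumption: other WLC releases may word these messages differently).
INVALID_RE = re.compile(
    r"%LWAPP-3-RD_ERR7: .*invalid country code \((.*?)\) for AP ([0-9A-Fa-f:]{17})"
)
CHANGED_RE = re.compile(
    r"%LWAPP-3-RD_ERR9: .*APs ([0-9A-Fa-f:]{17}) "
    r"country code changed from \((.*?)\) to \((.*?)\)"
)


def country_code_events(log_text):
    """Group country-code events by AP MAC address, in log order."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = INVALID_RE.search(line)
        if match:
            code, mac = match.groups()
            events[mac].append(("invalid", code.strip(), None))
            continue
        match = CHANGED_RE.search(line)
        if match:
            mac, old, new = match.groups()
            events[mac].append(("changed", old.strip(), new.strip()))
    return dict(events)


# The two lines from the post, with the syslog prefix trimmed.
SAMPLE = "\n".join([
    "*spamApTask4: Jul 27 09:42:41.668: %LWAPP-3-RD_ERR7: spam_lrad.c:12254 The system detects an invalid country code () for AP aa:aa:aa:aa:aa:aa",
    "*spamApTask4: Jul 27 09:42:41.669: %LWAPP-3-RD_ERR9: spam_lrad.c:13481 APs aa:aa:aa:aa:aa:aa country code changed from () to (US )",
])

events = country_code_events(SAMPLE)
for mac, items in events.items():
    print(mac, items)
```

If only the APs with AC modules log an empty country code, that would be worth including in the TAC case.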
