dmcgrath.ca
Beginner

AP crashing after installing AIR-RM3000AC modules into 3602i

Hi,

Recently we grabbed a box of second-hand AC modules for our 3602i access points. Things seemed fine at first, but during the day, seemingly at random, one or several APs at once (the last time it was 3, all within moments of each other) will time out with lines similar to:

*spamApTask7: Jul 09 14:28:26.134: %CAPWAP-3-MAX_RETRANSMISSIONS_REACHED: capwap_ac_sm.c:7673 Max retransmissions reached on AP(XX:XX:XX:XX:XX:XX),message (CAPWAP_CONFIGURATION_UPDATE_REQUEST^M ),number of pending messages(1) 

I poked around but couldn't find any obvious reason. Looking at the crash logs, it seems that the APs have a memory corruption of sorts, but for unknown reasons. It should be noted that the APs are on a 3750 switch with the port maximum set to 20 W using enhanced PoE, since 15.4 W isn't enough. Could this be a cause? If so, why would it only affect some APs?

Here are a few samples of some of the crashes, at least the few lines before and after the crash:

*Jul 23 12:33:30.719: %LINEPROTO-5-UPDOWN: Line protocol on Interface Dot11Radio1, changed state to up
*Jul 23 12:33:48.603: %CLEANAIR-6-STATE: Slot 0 enabled
*Jul 23 12:33:50.363: %CLEANAIR-6-STATE: Slot 1 enabled
*Jul 23 12:34:06.819: %CDP_PD-4-POWER_OK: Full power - NEGOTIATED inline power source
*Jul 23 12:34:29.727: %DOT11-6-DFS_SCAN_COMPLETE: DFS scan complete on frequency 5540 MHz
*Jul 23 12:34:29.735: %DOT11-6-DFS_SCAN_COMPLETE: DFS scan complete on frequency 5540 MHz
*Jul 23 13:50:01.887: %SYS-3-BADMAGIC: Corrupt block at 99D9E60 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:50:01.887: %SYS-6-MTRACE: mallocfree: addr, pc
  B215238,1188D4C   B215238,40000294  A68F4F8,1188D24   A68F4F8,4000020A
  99D9E90,18B1894   A6A83F4,60000066  A6A7E18,18B6A7C   B652704,60000C5E
*Jul 23 13:50:01.887: %SYS-6-MTRACE: mallocfree: addr, pc
  B651D18,18B6A7C   B652F2C,6000084A  B652704,11F69F8   B652704,11F68B0 
  B652704,400003FC  AB95630,60000044  AB954D8,1AE99E0   AB954D8,1AE9950 
*Jul 23 13:50:01.887: %SYS-6-BLKINFO: Corrupted magic value in in-use block blk 99D9E60, words 25914424, alloc 0, Free, dealloc 0, rfcnt 99CCDA8
-Traceback= 11B72C8z 12A1370z 12C05C0z 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:50:01.887: %SYS-6-MEMDUMP: 0x99D9E60: 0x0 0x0 0x0 0x0
*Jul 23 13:50:01.887: %SYS-6-MEMDUMP: 0x99D9E70: 0x0 0x15A3C78B 0x1 0x18B6C38
*Jul 23 13:50:01.887: %SYS-6-MEMDUMP: 0x99D9E80: 0x99CCDA8 0x0 0x0 0x0

 13:50:01 UTC Thu Jul 23 2020: Unexpected exception to CPU: vector 700, PC = 0x1346F44 , LR = 0x1346EDC 

-Traceback= 0x1346F44z 0x1346EDCz 0x12A5A88z 0x2F09564z 0x18B1898z 0x1ADE228z 0x18FF274z 0x18B3410z 0x18AD8BCz 0x1842048z 0x1842200z 0x1828620z 0x13483B4z 0x132F13Cz 
*Jul 23 00:43:57.587: %LINK-5-CHANGED: Interface Dot11Radio0, changed state to administratively down
*Jul 23 00:43:57.603: %DOT11-5-EXPECTED_RADIO_RESET: Restarting Radio interface Dot11Radio0 due to the reason code 10
*Jul 23 00:43:57.607: %LINK-6-UPDOWN: Interface Dot11Radio0, changed state to up
*Jul 23 00:43:58.595: %LINEPROTO-5-UPDOWN: Line protocol on Interface Dot11Radio0, changed state to down
*Jul 23 00:43:58.623: %LINK-6-UPDOWN: Interface Dot11Radio0, changed state to down
*Jul 23 00:43:58.631: %LINK-5-CHANGED: Interface Dot11Radio0, changed state to reset
*Jul 23 00:43:59.655: %LINK-6-UPDOWN: Interface Dot11Radio0, changed state to up
*Jul 23 00:44:00.655: %LINEPROTO-5-UPDOWN: Line protocol on Interface Dot11Radio0, changed state to up
*Jul 23 11:35:25.767: %SYS-3-BADMAGIC: Corrupt block at 9926BD0 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 11:35:25.767: %SYS-6-MTRACE: mallocfree: addr, pc
  AFF3B2C,1188D4C   AFF3B2C,40000294  AB8CAE4,1188D24   AB8CAE4,4000020A
  9926C00,18B1894   AFF4108,60000066  AFF3B2C,18B6A7C   A6DEFCC,600004DE
*Jul 23 11:35:25.767: %SYS-6-MTRACE: mallocfree: addr, pc
  A6DDBF4,500004DE  A6DE5E0,18B6A7C   8D2CEBC,1AE99E0   8D2CEBC,1AE9950 
  8D2CEBC,300000AA  8D2CEBC,1AE99E0   8D2CEBC,1AE9950   8D2CEBC,300000AA
*Jul 23 11:35:25.767: %SYS-6-BLKINFO: Corrupted magic value in in-use block blk 9926BD0, words 25914424, alloc 0, Free, dealloc 0, rfcnt 991944C
-Traceback= 11B72C8z 12A1370z 12C05C0z 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 11:35:25.767: %SYS-6-MEMDUMP: 0x9926BD0: 0x0 0x0 0x0 0x0
*Jul 23 11:35:25.767: %SYS-6-MEMDUMP: 0x9926BE0: 0x0 0x15A3C78B 0x1 0x18B6C38
*Jul 23 11:35:25.767: %SYS-6-MEMDUMP: 0x9926BF0: 0x991944C 0x0 0x0 0x0

 11:35:25 UTC Thu Jul 23 2020: Unexpected exception to CPU: vector 700, PC = 0x1346F44 , LR = 0x1346EDC 

-Traceback= 0x1346F44z 0x1346EDCz 0x12A5A88z 0x2F09564z 0x18B1898z 0x1ADE228z 0x18FF274z 0x18B3410z 0x18AD8BCz 0x1842048z 0x1842200z 0x1828620z 0x13483B4z 0x132F13Cz 
*Jul 23 12:34:25.599: %CLEANAIR-6-STATE: Slot 0 enabled
*Jul 23 12:34:27.363: %CLEANAIR-6-STATE: Slot 1 enabled
*Jul 23 12:34:44.883: %CDP_PD-4-POWER_OK: Full power - NEGOTIATED inline power source
*Jul 23 12:46:26.643: %SYS-2-BADSHARE: Bad refcount in datagram_done, ptr=51B36A4, count=0
-Traceback= 11B72C8z 1228388z 1540018z 1B85810z 18660C4z 18B0774z 18B1D00z 1933074z 18DD9E4z 18B32C4z 18B5F64z 13483B4z 132F13Cz
*Jul 23 12:46:26.747: %SYS-2-BADSHARE: Bad refcount in datagram_done, ptr=5167920, count=0
-Traceback= 11B72C8z 1228388z 1540018z 1B85810z 18660C4z 18B0774z 18B1D00z 1933074z 18DD9E4z 18B32C4z 18B5F64z 13483B4z 132F13Cz
*Jul 23 12:46:26.851: %SYS-2-BADSHARE: Bad refcount in datagram_done, ptr=51B36A4, count=0
-Traceback= 11B72C8z 1228388z 1540018z 1B85810z 18660C4z 18B0774z 18B1D00z 1933074z 18DD9E4z 18B32C4z 18B5F64z 13483B4z 132F13Cz
*Jul 23 13:48:46.199: %SYS-3-BADMAGIC: Corrupt block at 9C57234 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:48:46.199: %SYS-6-MTRACE: mallocfree: addr, pc
  AA8BA50,1188D4C   AA8BA50,40000294  AA8D330,1188D24   AA8D330,4000020A
  9C57264,18B1894   AA8D330,18B6A7C   AA8BA50,18B6A7C   7B027BC,14882C4 
*Jul 23 13:48:46.199: %SYS-6-MTRACE: mallocfree: addr, pc
  7B027BC,4000010C  7B02FE4,60000DC0  7B027BC,11F69F8   7B027BC,11F68B0 
  7B027BC,400003FC  7B02FE4,60000DC0  7B027BC,11F69F8   7B027BC,11F68B0 
*Jul 23 13:48:46.199: %SYS-6-BLKINFO: Corrupted magic value in in-use block blk 9C57234, words 25914424, alloc 0, Free, dealloc 0, rfcnt 9C49AB0
-Traceback= 11B72C8z 12A1370z 12C05C0z 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:48:46.199: %SYS-6-MEMDUMP: 0x9C57234: 0x0 0x0 0x0 0x0
*Jul 23 13:48:46.199: %SYS-6-MEMDUMP: 0x9C57244: 0x0 0x15A3C78B 0x1 0x18B6C38
*Jul 23 13:48:46.199: %SYS-6-MEMDUMP: 0x9C57254: 0x9C49AB0 0x0 0x0 0x0

 13:48:46 UTC Thu Jul 23 2020: Unexpected exception to CPU: vector 700, PC = 0x1346F44 , LR = 0x1346EDC 

-Traceback= 0x1346F44z 0x1346EDCz 0x12A5A88z 0x2F09564z 0x18B1898z 0x1ADE228z 0x18FF274z 0x18B3410z 0x18AD8BCz 0x1842048z 0x1842200z 0x1828620z 0x13483B4z 0x132F13Cz 

Seems like there is a bit of variety here. That said, this didn't start until we put the AC modules in. Currently we are on 8.5.160, but this was happening on .151 as well.
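One way to check how much variety there really is would be to group the events by traceback. The following is only a rough sketch: the function name `group_by_traceback` and the two regular expressions are my own assumptions about the log layout, and the sample lines are copied from the excerpts above.

```python
import re
from collections import defaultdict

# Assumed layout: an event line containing "%FACILITY-SEV-MNEMONIC:" is
# immediately followed by a "-Traceback= ..." line.
EVENT_RE = re.compile(r"%([A-Z0-9_]+-\d-[A-Z0-9_]+):")
TRACE_RE = re.compile(r"^-Traceback= (.*)$")

SAMPLE = """\
*Jul 23 13:50:01.887: %SYS-3-BADMAGIC: Corrupt block at 99D9E60 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 11:35:25.767: %SYS-3-BADMAGIC: Corrupt block at 9926BD0 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 12:46:26.643: %SYS-2-BADSHARE: Bad refcount in datagram_done, ptr=51B36A4, count=0
-Traceback= 11B72C8z 1228388z 1540018z 1B85810z 18660C4z 18B0774z 18B1D00z 1933074z 18DD9E4z 18B32C4z 18B5F64z 13483B4z 132F13Cz
*Jul 23 12:46:26.747: %SYS-2-BADSHARE: Bad refcount in datagram_done, ptr=5167920, count=0
-Traceback= 11B72C8z 1228388z 1540018z 1B85810z 18660C4z 18B0774z 18B1D00z 1933074z 18DD9E4z 18B32C4z 18B5F64z 13483B4z 132F13Cz
*Jul 23 13:48:46.199: %SYS-3-BADMAGIC: Corrupt block at 9C57234 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
"""


def group_by_traceback(lines):
    """Map each distinct traceback (tuple of frames) to the event names that produced it."""
    groups = defaultdict(list)
    pending = None  # event name seen on the previous line, if any
    for raw in lines:
        line = raw.strip()
        trace = TRACE_RE.match(line)
        if trace and pending is not None:
            groups[tuple(trace.group(1).split())].append(pending)
        event = EVENT_RE.search(line)
        # A traceback only counts if it directly follows an event line.
        pending = event.group(1) if event else None
    return dict(groups)


groups = group_by_traceback(SAMPLE.splitlines())
for frames, events in groups.items():
    print(len(events), sorted(set(events)), "first frames:", " ".join(frames[:3]))
```

In the excerpts above, the three BADMAGIC events carry the same traceback, and the BADSHARE events carry a different one.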

At first I thought maybe there was just some port-channel saturation going on that was causing the CAPWAP packets to get dropped for too long, but the AP crashes suggest that this isn't the case. Regardless, I changed the native VLAN and removed some VLANs from the trunk, just in case some of our other traffic was bottlenecking it.

Worth mentioning is that one of the events, the one at 35 minutes past the hour, occurred at the same time as a slight spike in internet traffic on that AP's port, so it was under load when the reset happened. I am wondering if that is why the other APs, all in a similar area, went at the same time. My guess is that the downloading client loaded one AP, then failed over to another and caused load there, and so on, each time perhaps leaving the AP without enough power so that it browned out, hence the memory issues.

Now, if this is the case, shouldn't 20 W on a 3750-E switch with enhanced PoE be sufficient to power these AC modules?

Any tips or advice would be appreciated!

39 REPLIES

One more thing, although I think it's the physical problem.
In your initial post you wrote that you have capped the PoE power to 20 W? That might be a tiny bit too little if the cables are long. I would probably increase it to 21 or 22 W.


@patoberli wrote:
One more thing, although I think it's the physical problem.
In your initial post you wrote that you have capped the PoE power to 20 W? That might be a tiny bit too little if the cables are long. I would probably increase it to 21 or 22 W.

Unfortunately the limit for the 3750-E is 20 W for "enhanced PoE", so there is no option to increase it. Also, the Cisco employee at the beginning of this thread already said that the module should only need just under 20 W, and the specs that I read suggested that it should be fine for up to 100 m. I did see some mention of distance being a possible issue, though, so we didn't rule it out and even considered grabbing a cheap unmanaged PoE+ switch as a test.

We are trying to avoid wasting hundreds of hours and thousands of dollars on a solution that has a 50/50 chance of working.

That said, one of our APs is in the same room as the switch and even that had a similar problem, so I am leaning towards the cable length not being a problem.

Thanks for the suggestion though!
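For a rough feel for the cable-length question, here is a back-of-the-envelope sketch of resistive loss in the cable. Everything numeric in it is an assumption on my part rather than something measured in this thread: a PSE output of 50 V, a worst-case loop resistance of 12.5 ohms per 100 m (the channel figure commonly quoted for Cat5e in IEEE 802.3at), resistance scaling linearly with length, and the switch sourcing the full 20 W. The function name `delivered_watts` is hypothetical.

```python
def delivered_watts(pse_watts, pse_volts, cable_m, ohms_per_100m=12.5):
    """Estimate the power left at the AP after I^2 * R loss in the cable.

    All defaults are illustrative assumptions, not values from the switch.
    """
    loop_ohms = ohms_per_100m * cable_m / 100.0  # assumed linear with length
    current = pse_watts / pse_volts              # amps leaving the switch port
    loss = current ** 2 * loop_ohms              # watts dissipated in the cable
    return pse_watts - loss


for length_m in (2, 50, 100):
    print(f"{length_m:>3} m: {delivered_watts(20.0, 50.0, length_m):.2f} W at the AP")
```

With these assumed numbers, a full 100 m run loses about 2 W, while a 2 m patch lead loses well under 0.1 W, which fits with the same-room AP pointing away from cable length as the cause.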

dmcgrath.ca
Beginner

A side note: We are using multiple regulatory domains here, US and NL, and the APs and AC modules are both in the -A domain. I mention this because I did notice that when I tried a factory reset of one of the APs, I got some odd errors about invalid country codes, which I assume were more informational:

Jul 24 20:55:02 wlc.blender WLC-2504: *spamApTask4: Jul 24 20:55:01.838: %LWAPP-3-RD_ERR7: spam_lrad.c:12254 The system detects an invalid country code () for AP aa:bb:cc:dd:ee:ff
Jul 24 20:55:02 wlc.blender WLC-2504: *spamApTask4: Jul 24 20:55:01.838: %LWAPP-3-RD_ERR9: spam_lrad.c:13481 APs aa:bb:cc:dd:ee:ff country code changed from () to (US )
Jul 24 20:55:02 wlc.blender WLC-2504: *spamApTask4: Jul 24 20:55:01.838: %LWAPP-3-RD_ERR7: spam_lrad.c:12254 The system detects an invalid country code () for AP aa:bb:cc:dd:ee:ff
Jul 24 20:55:02 wlc.blender WLC-2504: *spamApTask4: Jul 24 20:55:01.839: %LWAPP-3-RD_ERR9: spam_lrad.c:13481 APs aa:bb:cc:dd:ee:ff country code changed from () to (US )
Jul 24 20:55:02 wlc.blender WLC-2504: *spamApTask4: Jul 24 20:55:01.839: %LWAPP-3-RD_ERR7: spam_lrad.c:12254 The system detects an invalid country code () for AP aa:bb:cc:dd:ee:ff
Jul 24 20:55:02 wlc.blender WLC-2504: *spamApTask4: Jul 24 20:55:01.839: %LWAPP-3-RD_ERR9: spam_lrad.c:13481 APs aa:bb:cc:dd:ee:ff country code changed from () to (US )

Just wondering if it's maybe also related to another bug I found:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCur83535/?rfs=iqvred

In the past I have noticed some odd channel assignments. Wondering if maybe there could be a bug in there to do with multiple regulatory domains? I have doubts, but worth mentioning perhaps.

You need to install devices (AP and module) with the correct country code for the specific regulatory domain. You should not mix that up.
-Scott
*** Please rate helpful posts ***


@Scott Fella wrote:
You need to install devices (AP and module) with the correct country code for the specific regulatory domain. You should not mix that up.

Yes I know this :p The APs are all US (-A), but the WLC is in the EU, specifically NL, so NL also has to be selected so that the WLC will pick the lowest common denominator in terms of channel selection and power, so that the EIRP falls within the regulatory domain. So we pick both, which causes all the APs, despite being "American", to use the same channels. Since they are 3602i's, there is no concern about link budgets as there would be with antenna gains and such on the external-antenna 3602e, so in our case, having NL,US means we don't have to worry about channel selection.

That said, I do also disable the U-NII-3 channels and such. Point is that it should "just work", but it's possible that there is some quirk I ran into here.

In 8.5 and later, there is no common denominator for multiple country codes. As for the issue you are running into, hopefully your TAC case will help provide the root cause.
-Scott
*** Please rate helpful posts ***


@Scott Fella wrote:
In 8.5 and later, there is no common denominator for multiple country codes. As for the issue you are running into, hopefully your TAC case will help provide the root cause.

Interesting. Do you have a reference to documentation that mentions when/why this was changed? Either way, we changed the system to US only and the APs still crash.

Here is the v8.5 guide.

https://www.cisco.com/c/en/us/td/docs/wireless/controller/8-5/config-guide/b_cg85/country_codes.html

This doesn’t point to your issue; your issue is a bug or just a bad module.
-Scott
*** Please rate helpful posts ***


@Scott Fella wrote:
Here is the v8.5 guide.

https://www.cisco.com/c/en/us/td/docs/wireless/controller/8-5/config-guide/b_cg85/country_codes.html

This doesn’t point to your issue; your issue is a bug or just a bad module.

Thanks for the URL, but I believe this is the document that I consulted during my initial research. Specifically, the statement:

When multiple countries are configured and the RRM auto-RF feature is enabled, the RRM assigns the channels that are derived by performing a union of the allowed channels per the AP country code.

I think I may have misread union as intersection, and thus was thinking that US + Japan, for example, would not include channel 14. Either way, I should be able to restrict the channels and power levels to at least comply as best I can given the situation.
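To make the union-versus-intersection difference concrete, here is a tiny illustration using only the 2.4 GHz band. The channel sets are the commonly cited ones (1 to 11 for the US, 1 to 14 for Japan) and are hard-coded here as an assumption; they are not read from the WLC, and this says nothing about how RRM computes its list internally.

```python
# Illustrative 2.4 GHz channel sets (assumed, not pulled from the controller).
us_channels = set(range(1, 12))  # channels 1..11
jp_channels = set(range(1, 15))  # channels 1..14

union = us_channels | jp_channels         # what the 8.5 guide describes
intersection = us_channels & jp_channels  # what I had assumed

print("union highest channel:", max(union))
print("intersection highest channel:", max(intersection))
print("channel 14 in union:", 14 in union)
print("channel 14 in intersection:", 14 in intersection)
```

So with a union, channel 14 stays in the candidate list unless it is removed by hand; with an intersection it would have dropped out on its own.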

As for the underlying problem, I tend to think it is more of a bug than power or hardware. I am going to try some AP groups and RF profiles and try to clean things up a bit. See how it goes.

Thanks for the help!


@Scott Fella wrote:
Here is the v8.5 guide.

https://www.cisco.com/c/en/us/td/docs/wireless/controller/8-5/config-guide/b_cg85/country_codes.html

This doesn’t point to your issue; your issue is a bug or just a bad module.

Sadly, this is looking more and more like the case, as we tried a power adapter just to rule out any PoE issues (cable loss, bad switch or connectors, etc.), and today it rebooted itself again. Odd that it didn't leave a crash log on the WLC, but it wouldn't be the first time that this happened.
