dmcgrath.ca
Beginner

AP crashing after installing AIR-RM3000AC modules into 3602i

Hi,

Recently we decided to grab a box of second-hand AC modules for our 3602i access points. Things seemed fine at first, but during the day, seemingly at random, one or several APs at once (the last time it was three, all within moments of each other) will time out on the WLC with lines similar to:

*spamApTask7: Jul 09 14:28:26.134: %CAPWAP-3-MAX_RETRANSMISSIONS_REACHED: capwap_ac_sm.c:7673 Max retransmissions reached on AP(XX:XX:XX:XX:XX:XX),message (CAPWAP_CONFIGURATION_UPDATE_REQUEST^M ),number of pending messages(1) 

I was poking around but couldn't find any obvious reason. Looking at the crash logs, it seems the APs are hitting memory corruption of some sort, for unknown reasons. It should be noted that the APs are on a 3750 switch with the ports capped at 20 W using enhanced PoE, since the standard 15.4 W isn't enough. Could this be a cause? And if so, why would it only affect some of the APs?

Here are a few samples of the crashes, showing at least a few lines before and after each crash:

*Jul 23 12:33:30.719: %LINEPROTO-5-UPDOWN: Line protocol on Interface Dot11Radio1, changed state to up
*Jul 23 12:33:48.603: %CLEANAIR-6-STATE: Slot 0 enabled
*Jul 23 12:33:50.363: %CLEANAIR-6-STATE: Slot 1 enabled
*Jul 23 12:34:06.819: %CDP_PD-4-POWER_OK: Full power - NEGOTIATED inline power source
*Jul 23 12:34:29.727: %DOT11-6-DFS_SCAN_COMPLETE: DFS scan complete on frequency 5540 MHz
*Jul 23 12:34:29.735: %DOT11-6-DFS_SCAN_COMPLETE: DFS scan complete on frequency 5540 MHz
*Jul 23 13:50:01.887: %SYS-3-BADMAGIC: Corrupt block at 99D9E60 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:50:01.887: %SYS-6-MTRACE: mallocfree: addr, pc
  B215238,1188D4C   B215238,40000294  A68F4F8,1188D24   A68F4F8,4000020A
  99D9E90,18B1894   A6A83F4,60000066  A6A7E18,18B6A7C   B652704,60000C5E
*Jul 23 13:50:01.887: %SYS-6-MTRACE: mallocfree: addr, pc
  B651D18,18B6A7C   B652F2C,6000084A  B652704,11F69F8   B652704,11F68B0 
  B652704,400003FC  AB95630,60000044  AB954D8,1AE99E0   AB954D8,1AE9950 
*Jul 23 13:50:01.887: %SYS-6-BLKINFO: Corrupted magic value in in-use block blk 99D9E60, words 25914424, alloc 0, Free, dealloc 0, rfcnt 99CCDA8
-Traceback= 11B72C8z 12A1370z 12C05C0z 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:50:01.887: %SYS-6-MEMDUMP: 0x99D9E60: 0x0 0x0 0x0 0x0
*Jul 23 13:50:01.887: %SYS-6-MEMDUMP: 0x99D9E70: 0x0 0x15A3C78B 0x1 0x18B6C38
*Jul 23 13:50:01.887: %SYS-6-MEMDUMP: 0x99D9E80: 0x99CCDA8 0x0 0x0 0x0

 13:50:01 UTC Thu Jul 23 2020: Unexpected exception to CPU: vector 700, PC = 0x1346F44 , LR = 0x1346EDC 

-Traceback= 0x1346F44z 0x1346EDCz 0x12A5A88z 0x2F09564z 0x18B1898z 0x1ADE228z 0x18FF274z 0x18B3410z 0x18AD8BCz 0x1842048z 0x1842200z 0x1828620z 0x13483B4z 0x132F13Cz 
*Jul 23 00:43:57.587: %LINK-5-CHANGED: Interface Dot11Radio0, changed state to administratively down
*Jul 23 00:43:57.603: %DOT11-5-EXPECTED_RADIO_RESET: Restarting Radio interface Dot11Radio0 due to the reason code 10
*Jul 23 00:43:57.607: %LINK-6-UPDOWN: Interface Dot11Radio0, changed state to up
*Jul 23 00:43:58.595: %LINEPROTO-5-UPDOWN: Line protocol on Interface Dot11Radio0, changed state to down
*Jul 23 00:43:58.623: %LINK-6-UPDOWN: Interface Dot11Radio0, changed state to down
*Jul 23 00:43:58.631: %LINK-5-CHANGED: Interface Dot11Radio0, changed state to reset
*Jul 23 00:43:59.655: %LINK-6-UPDOWN: Interface Dot11Radio0, changed state to up
*Jul 23 00:44:00.655: %LINEPROTO-5-UPDOWN: Line protocol on Interface Dot11Radio0, changed state to up
*Jul 23 11:35:25.767: %SYS-3-BADMAGIC: Corrupt block at 9926BD0 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 11:35:25.767: %SYS-6-MTRACE: mallocfree: addr, pc
  AFF3B2C,1188D4C   AFF3B2C,40000294  AB8CAE4,1188D24   AB8CAE4,4000020A
  9926C00,18B1894   AFF4108,60000066  AFF3B2C,18B6A7C   A6DEFCC,600004DE
*Jul 23 11:35:25.767: %SYS-6-MTRACE: mallocfree: addr, pc
  A6DDBF4,500004DE  A6DE5E0,18B6A7C   8D2CEBC,1AE99E0   8D2CEBC,1AE9950 
  8D2CEBC,300000AA  8D2CEBC,1AE99E0   8D2CEBC,1AE9950   8D2CEBC,300000AA
*Jul 23 11:35:25.767: %SYS-6-BLKINFO: Corrupted magic value in in-use block blk 9926BD0, words 25914424, alloc 0, Free, dealloc 0, rfcnt 991944C
-Traceback= 11B72C8z 12A1370z 12C05C0z 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 11:35:25.767: %SYS-6-MEMDUMP: 0x9926BD0: 0x0 0x0 0x0 0x0
*Jul 23 11:35:25.767: %SYS-6-MEMDUMP: 0x9926BE0: 0x0 0x15A3C78B 0x1 0x18B6C38
*Jul 23 11:35:25.767: %SYS-6-MEMDUMP: 0x9926BF0: 0x991944C 0x0 0x0 0x0

 11:35:25 UTC Thu Jul 23 2020: Unexpected exception to CPU: vector 700, PC = 0x1346F44 , LR = 0x1346EDC 

-Traceback= 0x1346F44z 0x1346EDCz 0x12A5A88z 0x2F09564z 0x18B1898z 0x1ADE228z 0x18FF274z 0x18B3410z 0x18AD8BCz 0x1842048z 0x1842200z 0x1828620z 0x13483B4z 0x132F13Cz 
*Jul 23 12:34:25.599: %CLEANAIR-6-STATE: Slot 0 enabled
*Jul 23 12:34:27.363: %CLEANAIR-6-STATE: Slot 1 enabled
*Jul 23 12:34:44.883: %CDP_PD-4-POWER_OK: Full power - NEGOTIATED inline power source
*Jul 23 12:46:26.643: %SYS-2-BADSHARE: Bad refcount in datagram_done, ptr=51B36A4, count=0
-Traceback= 11B72C8z 1228388z 1540018z 1B85810z 18660C4z 18B0774z 18B1D00z 1933074z 18DD9E4z 18B32C4z 18B5F64z 13483B4z 132F13Cz
*Jul 23 12:46:26.747: %SYS-2-BADSHARE: Bad refcount in datagram_done, ptr=5167920, count=0
-Traceback= 11B72C8z 1228388z 1540018z 1B85810z 18660C4z 18B0774z 18B1D00z 1933074z 18DD9E4z 18B32C4z 18B5F64z 13483B4z 132F13Cz
*Jul 23 12:46:26.851: %SYS-2-BADSHARE: Bad refcount in datagram_done, ptr=51B36A4, count=0
-Traceback= 11B72C8z 1228388z 1540018z 1B85810z 18660C4z 18B0774z 18B1D00z 1933074z 18DD9E4z 18B32C4z 18B5F64z 13483B4z 132F13Cz
*Jul 23 13:48:46.199: %SYS-3-BADMAGIC: Corrupt block at 9C57234 (magic 00000000)
-Traceback= 11B72C8z 12C007Cz 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:48:46.199: %SYS-6-MTRACE: mallocfree: addr, pc
  AA8BA50,1188D4C   AA8BA50,40000294  AA8D330,1188D24   AA8D330,4000020A
  9C57264,18B1894   AA8D330,18B6A7C   AA8BA50,18B6A7C   7B027BC,14882C4 
*Jul 23 13:48:46.199: %SYS-6-MTRACE: mallocfree: addr, pc
  7B027BC,4000010C  7B02FE4,60000DC0  7B027BC,11F69F8   7B027BC,11F68B0 
  7B027BC,400003FC  7B02FE4,60000DC0  7B027BC,11F69F8   7B027BC,11F68B0 
*Jul 23 13:48:46.199: %SYS-6-BLKINFO: Corrupted magic value in in-use block blk 9C57234, words 25914424, alloc 0, Free, dealloc 0, rfcnt 9C49AB0
-Traceback= 11B72C8z 12A1370z 12C05C0z 12A5A88z 2F09564z 18B1898z 1ADE228z 18FF274z 18B3410z 18AD8BCz 1842048z 1842200z 1828620z 13483B4z 132F13Cz
*Jul 23 13:48:46.199: %SYS-6-MEMDUMP: 0x9C57234: 0x0 0x0 0x0 0x0
*Jul 23 13:48:46.199: %SYS-6-MEMDUMP: 0x9C57244: 0x0 0x15A3C78B 0x1 0x18B6C38
*Jul 23 13:48:46.199: %SYS-6-MEMDUMP: 0x9C57254: 0x9C49AB0 0x0 0x0 0x0

 13:48:46 UTC Thu Jul 23 2020: Unexpected exception to CPU: vector 700, PC = 0x1346F44 , LR = 0x1346EDC 

-Traceback= 0x1346F44z 0x1346EDCz 0x12A5A88z 0x2F09564z 0x18B1898z 0x1ADE228z 0x18FF274z 0x18B3410z 0x18AD8BCz 0x1842048z 0x1842200z 0x1828620z 0x13483B4z 0x132F13Cz 

Seems like there is a bit of variety here. That said, this didn't start until we put the AC modules in. Currently we are on 8.5.160, but this was happening on .151 as well.

At first I thought maybe there was some port-channel saturation going on that was causing the CAPWAP packets to be dropped for too long, but the AP crashes suggest that this isn't the case. Regardless, I moved the native VLAN and removed some VLANs from the trunk, just in case some of our other traffic was bottlenecking it.

Worth mentioning: one of the events, the one at 35 minutes past the hour, did occur at the same time as a slight spike in internet traffic on that AP's port, so it was under load when the reset happened. I am wondering if that is why the other APs went at the same time, all in a similar area; my guess is that the client's download loaded one AP, then the client failed over to another AP and loaded that one, and so on, each time perhaps causing the AP to draw more power than it could get and brown out, hence the memory issues.

Now, if this is the case, shouldn't 20 W on a 3750E switch with enhanced PoE be sufficient to power these APs with the AC modules?
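
For reference, the AP ports are set up roughly along these lines (the interface number and description are illustrative, not a paste from our actual config):

interface GigabitEthernet1/0/10
 description AP3602i + AIR-RM3000AC module
 switchport trunk encapsulation dot1q
 switchport mode trunk
 power inline static max 20000

and the per-port PoE delivery can be checked with:

show power inline gigabitEthernet 1/0/10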

Any tips or advice would be appreciated!

39 Replies
marce1000
VIP Advisor

 

 - Looks strongly related to software bug(s), try the latest or advised release for this wireless platform.

  M.

Do you have any specific reason to lean towards a bug? We already tried going from .151 to .160, which didn't appear to solve anything.

Also, I would tend to think that a software bug would cause at least somewhat identical crashes, instead of random ones. This is why I was leaning towards a brownout-type problem where the power is insufficient. But the cable lengths should be around 30 m - 50 m, and some of the APs that haven't done it yet are on longer cables than the ones that have, so I can't really go by that theory.

Ultimately I am just trying to avoid buying unnecessary PoE+ hardware for a test, and I am not a fan of having to disable the 2.4 GHz radios just to test the power budget; nor is it ideal that APs randomly reboot whenever a client does anything remotely resembling a download.
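
In the meantime I'll try to sanity-check the power state from the WLC side before pulling radios or buying gear; if I have the AireOS syntax right, it is something along the lines of:

show ap config general <AP-NAME>

and then looking at the power-related fields (Power Type/Mode and friends) to confirm the AP believes it has full power with the module present. The exact field names may vary by release.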

Rafael E
Cisco Employee

Hello,

 

Looking at the datasheet for the 3600, you can power the AC module with 15.4 W if the 2.4 GHz radio is down, or with 19.6 W if both the 2.4 GHz and 5 GHz bands are enabled.

I don't think power is the problem here, nor have I seen power issues cause memory problems on the AP.

Now, from the logs you shared, we need to find out which functions are causing the memory problem. With a show tech and the crash files shared with TAC we should be able to do that, so I recommend opening an SR.

Saludos,
Rafael - TAC


@Rafael E wrote:

Hello,

 

Looking at the datasheet for the 3600, you can power the AC module with 15.4 W if the 2.4 GHz radio is down, or with 19.6 W if both the 2.4 GHz and 5 GHz bands are enabled.

I don't think power is the problem here, nor have I seen power issues cause memory problems on the AP.

Now, from the logs you shared, we need to find out which functions are causing the memory problem. With a show tech and the crash files shared with TAC we should be able to do that, so I recommend opening an SR.


Thanks for the quick and insightful reply!

Yeah, I was under the same impression that we were getting enough power at 20 W.

As for the support case, unfortunately the APs/WLC aren't covered by any SmartNet contract, so we wouldn't be able to open a case with TAC, AFAIK. I would be glad to share the crash logs with TAC if they think they would be of use, and I would post them somewhere, but of course I am hesitant to publicly post any crash logs due to the potentially sensitive nature of memory dumps, which could contain important business or network information such as Wi-Fi passwords.

Any idea how to proceed here?

I am sorry, but I do not know of any other way a customer can share their information securely other than via an SR.

If you are still working with your account team, reach out to them. They might give you another solution. 

 

Saludos,
Rafael - TAC

You mentioned trying to find the function from the backtrace. Would this be what you are after?

---- Partial decode of process block ----

Pid 44: Process "Dot11 driver "
stack 0x7936D70  savedsp 0x4E80E04 
Flags: analyze prefers_new wakeup_posted 
Status     0x00000000 Orig_ra   0x01467230 Routine    0x0133A458 Signal 0
Caller_pc  0x00000000 Callee_pc 0x00000000 Dbg_events 0x00000000 State  0
Totmalloc  6640164    Totfree   6531268    Totgetbuf  0       
Totretbuf  0          Edisms    0x0        Eparm 0x0       
Elapsed    0x5298     Ncalls    0x3AA24    Ngiveups 0x8       
Priority_q 4          Ticks_5s  524        Cpu_5sec   245      Cpu_1min 267
Cpu_5min   262        Stacksize 0x2EE0     Lowstack 0x2EE0    
Ttyptr     0x4E56920  Mem_holding 0x3E75C    Thrash_count 0
Wakeup_reasons      0x0FFFFFFF  Default_wakeup_reasons 0x0FFFFFFF
Direct_wakeup_major 0x00000000  Direct_wakeup_minor 0x00000000
Regs R14-R31, CR, PC, MSR at last suspend; R3 from proc creation, PC unused:
     R3 : 00000000  R14: 033C886C  R15: 07939B78  R16: 04A110E8  R17: 033C865C 
     R18: 033C8648  R19: 00024000  R20: 07939B28  R21: 07BEC744  R22: 00000000 
     R23: 00000000  R24: 07D28F58  R25: 07BEBE1C  R26: 00000008  R27: 07C78F58 
     R28: 07B041AC  R29: 07939FE8  R30: 00000000  R31: 03840000  CR: 24000042 
     PC : 012D27FC  MSR: 00029200 
========== Stack Trace========

-Traceback= 0x1346F44z 0x1346EDCz 0x12A5A88z 0x2F09564z 0x18B1898z 0x1ADE228z 0x18FF274z 0x18B3410z 0x18AD8BCz 0x1842048z 0x1842200z 0x1828620z 0x13483B4z 0x132F13Cz 

========= Context ======================

C3600 Software (AP3G2-K9W8-M), Version 15.3(3)JF11, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Compiled Sun 08-Dec-19 00:42 by prod_rel_team
Signal = 23, Vector = 700

MSR: 0x00021200 SRR0 = 0x01466848 SRR1 = 0x00021200 DEAR = 0x00000000

CPU Register Context:
PC  = 0x01346F44  MSR = 0x00029200  CR  = 0x28000082  LR    = 0x01346EDC
CTR = 0x0132C958  XER = 0x00000000  DAR = 0x00000000  DSISR = 0x400F0000
DEC = 0x00000000  TBU = 0x00000045  TBL = 0xD381C7A5  IMMR  = 0x00000000
R0  = 0x01346EDC  R1  = 0x07939930  R2  = 0x00010001  R3    = 0x037FB8B4
R4  = 0x00000000  R5  = 0x00000000  R6  = 0x032F4DE0  R7    = 0x03D30000
R8  = 0x00000004  R9  = 0x03800000  R10 = 0x00029200  R11   = 0x00000000
R12 = 0x24000048  R13 = 0xE43C1631  R14 = 0x030234A8  R15   = 0x033D02F8
R16 = 0x033D02CC  R17 = 0x00000000  R18 = 0x033D02AC  R19   = 0x033D02F4
R20 = 0x07D58F58  R21 = 0x07D695DA  R22 = 0x018AD854  R23   = 0x03831C60
R24 = 0x03830000  R25 = 0x03831C48  R26 = 0x00000000  R27   = 0x00000000
R28 = 0x03D6994C  R29 = 0x03839D00  R30 = 0x09A7990C  R31   = 0x00000000
          Exception memory

Alloc PC        Size     Blocks      Bytes    What

0x122576C 0000001216 0000000007 0000008512    *Free Packet Header*
0x12B8074 0000024420 0000000001 0000024420    (fragment) (Free Blocks)
0x14882C4 0000000532 0000000133 0000070756    Dot11 driver
0x1489FDC 0000020000 0000000001 0000020000    Flashfs Sector 
0x0       0000000000 0000000141 0000099268    Pool Summary
0x0       0000000000 0000000001 0000024420    Pool Summary (Free Blocks)
0x0       0000000052 0000000142 0000007384    Pool Summary(All Block Headers)
------------------ show chunk failures ------------------


When      Cause   Caller     In-use    Chunk      Name 
00:00:00      1  0x18B20A4      90   0x7D98884   chunk_wl_mgmt_r 
00:00:00      1  0x18B20A4      90   0x7D98884   chunk_wl_mgmt_r 
00:00:00      1  0x18B20A4      90   0x7D98884   chunk_wl_mgmt_r 
00:00:00      1  0x18B20A4      90   0x7D98884   chunk_wl_mgmt_r 
00:00:00      1  0x18B20A4      90   0x7D98884   chunk_wl_mgmt_r 
00:00:00      1  0x18B20A4      90   0x7D98884   chunk_wl_mgmt_r 
00:00:00      1  0x18B20A4      90   0x7D98884   chunk_wl_mgmt_r 
00:00:00      1  0x18B20A4      90   0x7D98884   chunk_wl_mgmt_r 

I'm not really able to understand much of it, but it seems like something related to "Dot11", which I assume is radio related? I have been noticing that the radios sometimes come up with only 20 MHz in the 5 GHz band, despite the width being set to "best", or even after forcing it to 40 MHz. I just assumed that DCA was vetoing my decision :)
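
On that channel-width side note, I'll double-check what DCA is actually assigning; if I recall the AireOS commands correctly (going from memory here), it is something like:

show advanced 802.11a channel
show advanced 802.11a summary

where the first should show the DCA configuration, including the channel width setting, and the second the current per-AP channel assignments.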

Scott Fella
Hall of Fame Master

Modules tend to crash access points if they are not tightened down properly. This is a known issue. However, first make sure the connection point between the module and the AP, and the module connectors, are clean of dust and particles. Then tighten the module down pretty hard, but only with your fingers; don't use any tools. Reseating the modules and tightening them down is the fix for this 90% of the time.
-Scott
*** Please rate helpful posts ***


@Scott Fella wrote:
Modules tend to crash access points if they are not tightened down properly. This is a known issue. However, first make sure the connection point between the module and the AP, and the module connectors, are clean of dust and particles. Then tighten the module down pretty hard, but only with your fingers; don't use any tools. Reseating the modules and tightening them down is the fix for this 90% of the time.

Actually, this had crossed my mind, and I already asked my remote hands (in another country across the ocean) to take a peek, as the APs were mounted in the ceiling tiles with the port facing upwards, so dirt and dust were a concern. Also, I had noticed that the first one, which was installed by an electrically minded engineer, has been perfect so far, but the other APs were installed by a coder (tsk!), so they may very well just be gently seated in place instead of fastened tightly!

I have left them a note to take a look, so we will know soon enough. This actually does seem plausible, especially given the somewhat random nature of the crashes. After all, if it were a code or configuration bug, you would think that all the errors would be exactly the same. As well, only some of the APs seemed to do it.

Thanks for the tip, btw! I was curious if you see this often? I never found such a reference, but it does make sense. Anyway, thanks, and I'll let you know how it goes!

We have run into this a lot, and TAC also knows about it. We ended up removing the modules just so we would have stability.
-Scott
*** Please rate helpful posts ***


@Scott Fella wrote:
We have run into this a lot, and TAC also knows about it. We ended up removing the modules just so we would have stability.

Curious, but did you end up removing the modules because you didn't know about this at the time? Or because you were just never able to get them to seat properly and make a good connection, and thus abandoned their use entirely and went with something like a 3700 instead?

If it weren't for the fact that you can pick up 3602i's for the price of a pizza slice and a pop, I would have gladly gone with the 3700's with a WSSI instead of a 3600 with AC, but it is what it is! :)

Well, I knew about the issue way back when the modules were available; I was a consultant back then and knew about the reboots. When I changed to a different company and started reviewing AP uptime, that's when I started noticing this more and sending out techs to replace or reseat the modules. After a while I started noticing the same APs bouncing even after the module was replaced, swapped, or reseated, so I decided to remove them, and the reboots were reduced. So now, when a tech is onsite to troubleshoot, if the AP has a module, it is removed. If the AP has to be RMA'd, the module is not put back. We ended up replacing all the 3602's with 3702's.
-Scott
*** Please rate helpful posts ***

Look at some of these links:

https://quickview.cloudapps.cisco.com/quickview/bug/CSCuo03213

https://www.cisco.com/c/en/us/td/docs/wireless/access_point/3600/module_install/3600_AC_module.html#78118

Caution: The thumbscrews used to mechanically secure the radio module into the AP are an integral part of the radio module detection circuitry within the AP. Therefore, to ensure reliable radio module detection by the AP the thumbscrews must be installed to their maximum travel depth and tightened to the maximum tightness that can be achieved by hand.
If the thumbscrews are not properly installed and appropriately tightened the radio module may not be detected by the AP and/or the AP may reboot unpredictably, for instance, when subjected to mechanical shock or vibration. However, do not use any tool or device intended to provide mechanical advantage, to tighten the thumbscrews. Doing so can result in permanent damage to the thumbscrews, radio module, or AP.


-Scott
*** Please rate helpful posts ***

Interesting that they are part of the detection circuitry. I am guessing they help with the grounding or something. Either way, it still seems a little odd that the module would be detected fine until it is in use, if the screws weren't making contact at all. My guess leans more towards them not being tightened enough, or dust blocking part of the pins. That said, I could have sworn the APs had a sticker over the slots when I left them last, so we'll see.

Thanks for all the tips! We'll see soon enough, but it might take a business day or two for them to get around to it, and then some waiting to see what the results are.

It's something that has been a pain for me for many years. Keep tabs on show ap uptime once a week or so and track which APs bounce. If you see an AP bounce a few times a month, or even a year, it may help determine the culprit, depending on the root cause.
While you are at it, get some canned air and also clean the Ethernet port and the RJ45 end, in case they are dusty or dirty from the install.
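If it helps, the check on the WLC CLI is simply:

show ap uptime

which should list each AP along with its uptime and association time, so an AP that bounced recently stands out right away.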
-Scott
*** Please rate helpful posts ***