
9166I stuck in boot loop

JTRorick
Level 1

I recently installed 117 9166 APs for a client in a flex environment.  Most came up perfectly, but 10 did not.  They are doing the following.

They find and reach the 9800-40 fine, and then this happens:

AP IPv4 Address updated from 0.0.0.0 to 10.232.3.32
[*02/01/2024 02:06:44.7810] dtls_init: Use SUDI certificate
[*02/01/2024 02:06:45.0271]
[*02/01/2024 02:06:45.0271] CAPWAP State: Init
[*02/01/2024 02:06:46.0805] Start: RPC thread 2785309584 created.
[*02/01/2024 02:07:04.1777] Set PnP NTP Server pnpntpserver.crm.doj.gov.
[*02/01/2024 02:07:34.3156] PNP:Server not reachable, Start CAPWAP Discovery
[*02/01/2024 02:07:34.3246]
[*02/01/2024 02:07:34.3246] CAPWAP State: Discovery
[*02/01/2024 02:07:34.3254] Got WLC address 10.232.132.65 from DHCP.
[*02/01/2024 02:07:34.3255] IP DNS query for CISCO-CAPWAP-CONTROLLER.crm.doj.gov
[*02/01/2024 02:07:34.3497] Discovery Request sent to 10.232.132.65, discovery type DHCP(2)
[*02/01/2024 02:07:34.3513] Discovery Request sent to 255.255.255.255, discovery type UNKNOWN(0)
[*02/01/2024 02:07:34.3576] Discovery Response from 10.232.132.65
[*02/01/2024 02:07:34.3595]
[*02/01/2024 02:07:34.3595] CAPWAP State: Discovery
[*02/05/2024 16:07:42.0002] Started wait dtls timer (60 sec)
[*02/05/2024 16:07:42.0106]
[*02/05/2024 16:07:42.0106] CAPWAP State: DTLS Setup
[*02/05/2024 16:07:42.0430] dtls_verify_server_cert: Controller certificate verification successful
[*02/05/2024 16:07:42.3957] sudi99_request_check_and_load: Use HARSA SUDI certificate
[*02/05/2024 16:07:42.7229]
[*02/05/2024 16:07:42.7229] CAPWAP State: Join
[*02/05/2024 16:07:42.8414] shared_setenv PART_BOOTCNT 0 &> /dev/null
[*02/05/2024 16:07:43.0798] DOT11_CFG[0]: Sending TLV_DOT11_RADIO_TXRX_CAPABILITY slotid 0 radioFraEnabled 1, radioSubType 0, serviceType 0, radioType 1, bandId 0
[*02/05/2024 16:07:43.0803] DOT11_CFG[1]: Sending TLV_DOT11_RADIO_TXRX_CAPABILITY slotid 1 radioFraEnabled 1, radioSubType 0, serviceType 0, radioType 2, bandId 1
[*02/05/2024 16:07:43.0806] DOT11_CFG[2]: Sending TLV_DOT11_RADIO_TXRX_CAPABILITY slotid 2 radioFraEnabled 1, radioSubType 4, serviceType 0, radioType 18, bandId 2
[*02/05/2024 16:07:43.0809] OOBImageDnld: OOB Image Download in ap_cap_bitmask(2)
[*02/05/2024 16:07:43.0812] Sending Join request to 10.232.132.65 through port 5256, packet size 1376
[*02/05/2024 16:07:47.6966] DOT11_CFG[0]: Sending TLV_DOT11_RADIO_TXRX_CAPABILITY slotid 0 radioFraEnabled 1, radioSubType 0, serviceType 0, radioType 1, bandId 0
[*02/05/2024 16:07:47.6970] DOT11_CFG[1]: Sending TLV_DOT11_RADIO_TXRX_CAPABILITY slotid 1 radioFraEnabled 1, radioSubType 0, serviceType 0, radioType 2, bandId 1
[*02/05/2024 16:07:47.6974] DOT11_CFG[2]: Sending TLV_DOT11_RADIO_TXRX_CAPABILITY slotid 2 radioFraEnabled 1, radioSubType 4, serviceType 0, radioType 18, bandId 2
[*02/05/2024 16:07:47.6976] OOBImageDnld: OOB Image Download in ap_cap_bitmask(2)
[*02/05/2024 16:07:47.6978] Sending Join request to 10.232.132.65 through port 5256, packet size 1376
[*02/05/2024 16:07:52.4454] DOT11_CFG[0]: Sending TLV_DOT11_RADIO_TXRX_CAPABILITY slotid 0 radioFraEnabled 1, radioSubType 0, serviceType 0, radioType 1, bandId 0
[*02/05/2024 16:07:52.4458] DOT11_CFG[1]: Sending TLV_DOT11_RADIO_TXRX_CAPABILITY slotid 1 radioFraEnabled 1, radioSubType 0, serviceType 0, radioType 2, bandId 1
[*02/05/2024 16:07:52.4461] DOT11_CFG[2]: Sending TLV_DOT11_RADIO_TXRX_CAPABILITY slotid 2 radioFraEnabled 1, radioSubType 4, serviceType 0, radioType 18, bandId 2
[*02/05/2024 16:07:52.4463] OOBImageDnld: OOB Image Download in ap_cap_bitmask(2)
[*02/05/2024 16:07:52.4465] Sending Join request to 10.232.132.65 through port 5256, packet size 896
[*02/05/2024 16:07:52.4567] Join Response from 10.232.132.65, packet size 917
[*02/05/2024 16:07:52.4568] AC accepted previous sent request with result code: 0
[*02/05/2024 16:07:52.4569] Received wlcType 0, timer 30
[*02/05/2024 16:07:52.5627]
[*02/05/2024 16:07:52.5627] CAPWAP State: Image Data
[*02/05/2024 16:07:52.5633] AP image version 17.9.4.27 backup 17.9.1.8, Controller 17.9.4.27
[*02/05/2024 16:07:52.5634] Version is the same, do not need update.
[*02/05/2024 16:07:52.6048] status 'upgrade.sh: Script called with args:[NO_UPGRADE]'
[*02/05/2024 16:07:52.6399] do NO_UPGRADE, part2 is active part
[*02/05/2024 16:07:52.6634]
[*02/05/2024 16:07:52.6634] CAPWAP State: Configure
[*02/05/2024 16:07:52.6704] Telnet is not supported by AP, should not encode this payload
[*02/05/2024 16:07:53.3575] Discarding msg CAPWAP_WTP_EVENT_REQUEST(type 9) in CAPWAP state: Configure(8).
[*02/05/2024 16:07:56.0553] Re-Tx Count=1, Max Re-Tx Value=5, SendSeqNum=1, NumofPendingMsgs=1
[*02/05/2024 16:07:56.0553]
[*02/05/2024 16:07:58.9062] Re-Tx Count=2, Max Re-Tx Value=5, SendSeqNum=1, NumofPendingMsgs=1
[*02/05/2024 16:07:58.9062]
[*02/05/2024 16:08:01.2541] Discarding msg CAPWAP_WTP_EVENT_REQUEST(type 9) in CAPWAP state: Configure(8).
[*02/05/2024 16:08:01.7572] Re-Tx Count=3, Max Re-Tx Value=5, SendSeqNum=1, NumofPendingMsgs=1
[*02/05/2024 16:08:01.7572]
[*02/05/2024 16:08:02.1754] Discarding msg CAPWAP_WTP_EVENT_REQUEST(type 9) in CAPWAP state: Configure(8).
[*02/05/2024 16:08:04.6082] Re-Tx Count=4, Max Re-Tx Value=5, SendSeqNum=1, NumofPendingMsgs=1
[*02/05/2024 16:08:04.6082]
[*02/05/2024 16:08:07.4592] Re-Tx Count=5, Max Re-Tx Value=5, SendSeqNum=1, NumofPendingMsgs=1
[*02/05/2024 16:08:07.4592]
[*02/05/2024 16:08:10.3101] Max retransmission count exceeded, going back to DISCOVER mode.
[*02/05/2024 16:08:10.3101] Dropping msg CAPWAP_CONFIGURATION_STATUS, type = 4, len = 3654, eleLen = 3662, sendSeqNum = 1
[*02/05/2024 16:08:10.3104] GOING BACK TO DISCOVER MODE
[*02/05/2024 16:08:10.3306] OOBImageDnld: OOBImageDownloadTimer expired for image download..
[*02/05/2024 16:08:10.3307] OOBImageDnld: Do common error handler for OOB image download..
[*02/05/2024 16:08:10.3633]
[*02/05/2024 16:08:10.3633] CAPWAP State: DTLS Teardown
[*02/05/2024 16:08:10.3983] OOBImageDnld: Do common error handler for OOB image download..
[*02/05/2024 16:08:10.4912] status 'upgrade.sh: Script called with args:[CANCEL]'
[*02/05/2024 16:08:10.5256] do CANCEL, part2 is active part
[*02/05/2024 16:08:10.5492] status 'upgrade.sh: Cleanup tmp files ...'
[*02/05/2024 16:08:10.5856] Discarding msg CAPWAP_WTP_EVENT_REQUEST(type 9) in CAPWAP state: DTLS Teardown(4).
[*02/05/2024 16:08:10.5857] Discarding msg CAPWAP_WTP_EVENT_REQUEST(type 9) in CAPWAP state: DTLS Teardown(4).
[*02/05/2024 16:08:15.0830] OOBImageDnld: OOBImageDownloadTimer expired for image download..
[*02/05/2024 16:08:15.0830] OOBImageDnld: Do common error handler for OOB image download..
[*02/05/2024 16:08:15.1049] dtls_queue_first: Nothing to extract!
[*02/05/2024 16:08:15.1049]
[*02/05/2024 16:08:25.1190]

It then repeats this forever.  I have tried factory resetting, but it continues to occur.  The main problem is that this has only happened on 10 APs, spread all over the building on different access switches, while another AP works fine on the same port.
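
To make the failure point in a capture like the one above easier to spot, here is a minimal Python sketch that pulls the CAPWAP state transitions and the retransmission counter out of the console log. The regular expressions are based only on the lines shown in this post; the function name and the SAMPLE excerpt are illustrative, and other AP software versions may format these lines differently.

```python
import re

# Patterns derived from the console output in this post (an assumption for other releases).
STATE_RE = re.compile(r"\]\s*CAPWAP State:\s*(.+?)\s*$")
RETX_RE = re.compile(r"Re-Tx Count=(\d+), Max Re-Tx Value=(\d+)")
GIVE_UP = "Max retransmission count exceeded"

# Short excerpt of the log above, used as sample input.
SAMPLE = """\
[*02/05/2024 16:07:42.7229] CAPWAP State: Join
[*02/05/2024 16:07:52.5627] CAPWAP State: Image Data
[*02/05/2024 16:07:52.6634] CAPWAP State: Configure
[*02/05/2024 16:07:53.3575] Discarding msg CAPWAP_WTP_EVENT_REQUEST(type 9) in CAPWAP state: Configure(8).
[*02/05/2024 16:08:07.4592] Re-Tx Count=5, Max Re-Tx Value=5, SendSeqNum=1, NumofPendingMsgs=1
[*02/05/2024 16:08:10.3101] Max retransmission count exceeded, going back to DISCOVER mode.
[*02/05/2024 16:08:10.3633] CAPWAP State: DTLS Teardown
"""


def summarize_join_attempt(log_text):
    """Return (ordered states, highest Re-Tx count seen, state in which the AP gave up or None)."""
    states = []
    max_retx = 0
    failed_in = None
    for line in log_text.splitlines():
        match = STATE_RE.search(line)
        if match:
            # Skip consecutive repeats (the full log prints "Discovery" twice in a row).
            if not states or states[-1] != match.group(1):
                states.append(match.group(1))
            continue
        match = RETX_RE.search(line)
        if match:
            max_retx = max(max_retx, int(match.group(1)))
        elif GIVE_UP in line and states:
            # The AP abandons the join while still in the most recent state.
            failed_in = states[-1]
    return states, max_retx, failed_in


if __name__ == "__main__":
    print(summarize_join_attempt(SAMPLE))
```

Run against the full capture, this should show the AP getting through Join and Image Data and then exhausting its retransmissions in Configure, which is where the CAPWAP_CONFIGURATION_STATUS message is dropped.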

Software version : 17.9.4.27

 

14 Replies

Leo Laohoo
Hall of Fame

How many APs does the controller have right now?

Is the controller in an HA SSO?

JTRorick
Level 1

134 and yes HA RP+RMI

If the APs are in Local mode, can/do they join the WLC?

Mark Elsen
Hall of Fame

 

 - What is the controller model and software version?
 - Does it have sufficient licenses?

 M.



-- Let everything happen to you  
       Beauty and terror
      Just keep going    
       No feeling is final
Reiner Maria Rilke (1899)

9800-40

Software version : 17.9.4.27

Yes plenty of licenses

 

 

 - Check the controller logs too when the AP tries to join.
 - Check the 9800 controller configuration with the CLI command show tech wireless and feed the output into Wireless Config Analyzer.

 M.
                                                                                                            




Mark Elsen
Hall of Fame

 

 - Added reply: if the issue is still ongoing, try https://software.cisco.com/download/home/286316412/type/282046477/release/IOSXE-17.13.1
   The reason is that you mention flex-based APs, possibly reaching the controller over a WAN link:
   17.3.x has improved code to prevent image corruption and problems when downloading over non-intranet networks.

 M.




Yes, I am still having the issue.  I have found the cause: https://www.cisco.com/c/en/us/support/docs/wireless/catalyst-9800-series-wireless-controllers/220443-how-to-avoid-boot-loop-due-to-corrupted.html

But I am still stuck because I cannot break into U-Boot on the 9166.  If you look at the document, it gives a workaround for every 91xx model except the 9166.

 

Workaround (for APs already in boot loop)

For AP Models 1800, 2800, 3800, 4800, 1560, 9117, 9124, 9130, 9136 

  1. Power up the AP and connect to AP via console.
  2. Boot the AP, break to U-BOOT by hitting 'ESC'. This should bring you to (u-boot)> or (BTLDR)# prompt.
  3. Run these commands
(u-boot)> OR (BTLDR)# setenv mtdids nand0=nand0 && setenv mtdparts mtdparts=nand0:0x40000000@0x0(fs) && ubi part fs
(u-boot)> OR (BTLDR)# ubi remove part1  (or part2 if corrupted image is in part2) 
(u-boot)> OR (BTLDR)# ubi create part1  (or part2 if corrupted image is in part2)      
(u-boot)> OR (BTLDR)# boot 

 

For AP Models 9105, 9115, 9120

  1. Power up the AP and connect to AP via console.
  2. Boot the AP, break to U-BOOT by hitting 'ESC'. This should bring you to (u-boot)> prompt.
  3. Run these commands
(u-boot)> ubi part fs 
(u-boot)> ubi remove part1  (or part2 if corrupted image is in part2) 
(u-boot)> ubi create part1  (or part2 if corrupted image is in part2) 
(u-boot)> boot
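
The two procedures quoted above differ only in how the flash partition is attached, so they can be captured in a small lookup helper. This is only a sketch: the model lists and command strings are copied from the quoted Cisco workaround, while the function and constant names are made up for illustration. It deliberately raises an error for any model the document does not list, which is the situation with the 9166.

```python
# Model families and commands as listed in the quoted workaround.
NAND_MODELS = {"1800", "2800", "3800", "4800", "1560", "9117", "9124", "9130", "9136"}
UBI_MODELS = {"9105", "9115", "9120"}

NAND_ATTACH = (
    "setenv mtdids nand0=nand0 && "
    "setenv mtdparts mtdparts=nand0:0x40000000@0x0(fs) && ubi part fs"
)
UBI_ATTACH = "ubi part fs"


def uboot_recovery_commands(model, corrupted_part="part1"):
    """Return the documented U-Boot commands for a model (hypothetical helper)."""
    if corrupted_part not in ("part1", "part2"):
        raise ValueError("corrupted_part must be 'part1' or 'part2'")
    if model in NAND_MODELS:
        attach = NAND_ATTACH
    elif model in UBI_MODELS:
        attach = UBI_ATTACH
    else:
        # The quoted document lists no U-Boot procedure for other models, e.g. the 9166.
        raise LookupError(f"no documented U-Boot procedure for model {model}")
    return [
        attach,
        f"ubi remove {corrupted_part}",
        f"ubi create {corrupted_part}",
        "boot",
    ]


if __name__ == "__main__":
    for command in uboot_recovery_commands("9120", "part2"):
        print(command)
```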

And I cannot figure out how to break into U-Boot on the 9166. The standard 'Esc' doesn't work, and neither does any other key that I have tried.  I also cannot locate any documentation from Cisco on how to do this.

 

  - Indeed, I have been looking into that too. I think the mentioned procedures are not supported for the 9166 as of now; therefore my advice is to test with 17.3.1, where (extra) measures are built in natively to prevent boot loops due to image corruption (e.g.)

 M.




Mark Elsen
Hall of Fame

 

  - Correction to my earlier replies: I meant 17.13.1 (not 17.3) when talking about the included measures to prevent boot loops due to image corruption on APs.

 M.




Rich R
VIP

From Cisco Live Amsterdam 9800 Troubleshooting session a few days ago:
CSCvx32806 : Bootloop of death
CAPWAP image download is, by default, slow and somewhat unreliable (UDP).
Risk of bootloop in versions before 17.9.3, especially when image is downloaded over WAN :
https://www.cisco.com/c/en/us/support/docs/wireless/catalyst-9800-series-wireless-controllers/220443-how-to-avoid-boot-loop-due-to-corrupted.html
Fix : Now image is properly verified during download.
17.13 has a complete corruption verification and prevention system

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvx32806 specifically mentions the 9164 and 9166, saying to follow the "archive download" step from that link, meaning initiate a manual upgrade of the AP image from TFTP/SFTP on the CLI.  You'll want to make sure that the AP cannot join a WLC while you do that; otherwise it will keep reloading, preventing the image download.

If all else fails open a TAC case - they've already had over 200 for this issue!

JTRorick
Level 1

Thank you all for your help.  I have found a workaround.  It is cumbersome.  I was unable to upgrade the code on the production 9800 due to customer requirements, so I did the following:

I set up a secondary controller on the local LAN and pointed only the problematic 9166s (10 of the 117) at it.  I loaded that controller with an older version of code (17.9.3) and loaded that on top of the bugged code in the APs.  Then I loaded another older version of code on the controller (17.9.1) to make sure both the primary and backup versions on the APs were overwritten.  Finally, I loaded 17.9.4a on the local controller and loaded that code onto the APs.  Once they had been preloaded with the code, I pointed them back to the production controller and they worked.

This was a very cumbersome process that could have been avoided if there were a documented way to break into U-Boot on the 9166.

 

Thanks again for your help.


@JTRorick wrote:
 Loaded that controller with an older version of code (17.9.3) and loaded that on top of the bugged code in the APs.  Then loaded another older version of code on the controller (17.9.1) to make sure both the primary and backup version on the APs were overwritten.

Well, this method does not make sense, because the start of this thread talks about the AP going into a boot loop.

If the AP is able to join controllers, it is not boot looping.  

If the objective is to overwrite the flash with different OS versions then there is a much simpler way and one without a controller. 

1.  Go to the Cisco website and download the AP firmware matching two different controller versions.  For instance, firmware versions with a suffix of JPQ (for 17.12.1) and/or JPQ1 (for 17.12.2).  Put those two files on a TFTP server.

2.  Console into the AP and in enable mode enter the following command:  archive download-sw /no-reboot tftp://<TFTP_IP_ADDRESS>/filename.tar

3.  Repeat step #2 for the 2nd firmware.  

4.  Reboot the AP.
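
As a rough sketch of steps 2 and 3, the command lines to paste into the AP console can be generated per firmware file. The command syntax is the one quoted in step 2; the TFTP address (from the documentation range 192.0.2.0/24) and the file names are placeholders, not real image names, and the helper name is made up for illustration.

```python
import ipaddress


def archive_download_commands(tftp_ip, filenames):
    """Build one 'archive download-sw' line per firmware file (hypothetical helper)."""
    # ip_address() raises ValueError if the server address is malformed.
    server = ipaddress.ip_address(tftp_ip)
    return [
        f"archive download-sw /no-reboot tftp://{server}/{name}"
        for name in filenames
    ]


if __name__ == "__main__":
    # Placeholder file names; substitute the two images downloaded in step 1.
    for line in archive_download_commands("192.0.2.10", ["first-image.tar", "second-image.tar"]):
        print(line)
    # Step 4 (rebooting the AP) is done manually afterwards.
```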

@Leo Laohoo @JTRorick explained that they couldn't get into u-boot - that's why it was done that way.

It actually does make sense because the boot loop only happens when it tries to fully load the corrupted install after joining.  It generally gets to the join phase without crashing so this workaround works because it triggers a new download before it gets to the point where it crashes.  By forcing download with 2 different versions you effectively ensure you have overwritten all corrupted images on flash.  This is one of the workarounds for this problem.
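
The reasoning above can be illustrated with a toy model of the AP's two image partitions. This is a sketch under stated assumptions, not Cisco's actual logic: it assumes a new image is always written to the inactive partition, which then becomes active, and that no download happens when the versions match (as in the "Version is the same, do not need update." line in the original log). The class name is hypothetical, and treating both partitions as corrupted is a worst-case assumption.

```python
class ApFlash:
    """Toy model of an AP with two image partitions (assumed behaviour, see above)."""

    def __init__(self, part1, part2, active, corrupted):
        self.version = {"part1": part1, "part2": part2}
        self.active = active
        self.corrupted = set(corrupted)

    def join(self, controller_version):
        """Simulate a join; return True if a new image was downloaded."""
        if self.version[self.active] == controller_version:
            # Versions match, so nothing is rewritten and any corruption stays.
            return False
        target = "part1" if self.active == "part2" else "part2"
        self.version[target] = controller_version
        self.corrupted.discard(target)  # a fresh download replaces the bad image
        self.active = target
        return True


if __name__ == "__main__":
    # Starting point taken from the log: part2 active with 17.9.4.27, backup 17.9.1.8.
    ap = ApFlash("17.9.1.8", "17.9.4.27", active="part2", corrupted={"part1", "part2"})
    for controller in ["17.9.4.27", "17.9.3", "17.9.1", "17.9.4a"]:
        downloaded = ap.join(controller)
        print(controller, downloaded, sorted(ap.corrupted))
```

In this model, joining a controller on the same version changes nothing, one download still leaves the other partition untouched, and only the second download with a different version clears both.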
