Setup: 3504 WLC HA pair running 8.5.140, with 2802i access points.
This is a new setup. After configuring the WLC and adding the APs, a handful of APs did not join because of cabling problems. Once those problems were resolved, the APs tried to join but failed.
These were messages we saw in the WLC logs:
*spamApTask3: Jan 02 16:21:28.462: %CAPWAP-3-ENCODE_ERR: [SA]capwap_ac_sm.c:3333 The system has failed to encode Image data request (Requested Ap Image not found) to AP 70:b3:17:4d:96:c0
*spamApTask3: Jan 02 16:21:28.462: %CAPWAP-3-IMAGE_DOWNLOAD_ERR3: [SA]capwap_ac_platform.c:1525 Refusing image download request from Unsupported AP 70:b3:17:4d:96:c0 - unable to open image file /mnt/images/ap.run/ap3g3
*spamApTask2: Jan 02 16:21:28.225: %CAPWAP-3-DISC_AP_MGR_ERR1: [SA]capwap_ac_sm.c:2109 The system is unable to process Primary discovery request from AP 70:b3:17:3b:94:a0 on interface (8), VLAN (510), could not get IPv6 AP manager
To be clear: IPv6 is not configured anywhere in the network at this location.
Disabling IPv6 did not fix anything.
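When there are many of these messages, it can help to group them by AP before reading them one by one. The following is only a rough sketch in Python: the function name `errors_by_ap` is my own, and it assumes the log lines look like the three pasted above (a `%FACILITY-SEVERITY-MNEMONIC:` token and a colon-separated AP MAC on the same line).

```python
import re
from collections import defaultdict

# Matches the facility-severity-mnemonic token, e.g. %CAPWAP-3-ENCODE_ERR:
MNEMONIC = re.compile(r"%([A-Z0-9_]+)-(\d)-([A-Z0-9_]+):")
# Matches a colon-separated MAC address, e.g. 70:b3:17:4d:96:c0
MAC = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)


def errors_by_ap(lines):
    """Return {ap_mac: [mnemonic, ...]} for lines that carry both tokens.

    Hypothetical helper: the line format is assumed from the messages
    quoted in this thread, not taken from any Cisco documentation.
    """
    result = defaultdict(list)
    for line in lines:
        mnemonic = MNEMONIC.search(line)
        mac = MAC.search(line)
        if mnemonic and mac:
            result[mac.group(0).lower()].append(mnemonic.group(3))
    return dict(result)
```

Feeding it the three lines above would list ENCODE_ERR and IMAGE_DOWNLOAD_ERR3 under 70:b3:17:4d:96:c0 and DISC_AP_MGR_ERR1 under 70:b3:17:3b:94:a0, which makes it easier to see which APs hit the image problem and which only hit the discovery message.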
The AP logs showed this:
[*01/03/2019 09:44:01.4577] CAPWAP State: Discovery
[*01/03/2019 09:44:01.4580] Got WLC address 10.113.10.240 from DHCP.
[*01/03/2019 09:44:01.5140] Discovery Request sent to 10.113.10.240, discovery type DHCP(2)
[*01/03/2019 09:44:01.5149] Discovery Request sent to 255.255.255.255, discovery type UNKNOWN(0)
[*01/03/2019 09:44:01.5150] Discovery Response from 10.113.10.240
[*01/03/2019 09:44:10.0002] Discovery Response from 10.113.10.240
[*01/03/2019 09:44:10.0000] CAPWAP State: DTLS Setup
[*01/03/2019 09:44:10.0312] dtls_load_ca_certs: LSC Root Certificate not present
[*01/03/2019 09:44:10.0345] CAPWAP State: Join
[*01/03/2019 09:44:10.0357] Sending Join request to 10.113.10.240 through port 5248
[*01/03/2019 09:44:10.0397] Join Response from 10.113.10.240
[*01/03/2019 09:44:10.1148] HW CAPWAP tunnel is ADDED
[*01/03/2019 09:44:10.1338] CAPWAP State: Image Data
[*01/03/2019 09:44:10.1644] do PRECHECK, part1 is active part
[*01/03/2019 09:44:10.2981] Image Data Request sent to 10.113.10.240
[*01/03/2019 09:44:27.7254] Discarding msg CAPWAP_WTP_EVENT_REQUEST(type 9) in CAPWAP state: Image Data(10).
[*01/03/2019 09:44:28.2660] Discarding msg CAPWAP_WTP_EVENT_REQUEST(type 9) in CAPWAP state: Image Data(10).
[*01/03/2019 09:44:38.6979] Image download did not start for 30 seconds.
[*01/03/2019 09:44:38.6979] Restarting capwap - image download cannot start.
[*01/03/2019 09:44:38.6980] Lost connection to the controller, going to restart CAPWAP...
[*01/03/2019 09:44:38.6982] Restarting CAPWAP State Machine.
[*01/03/2019 09:44:38.7028] Discarding msg CAPWAP_WTP_EVENT_REQUEST(type 9) in CAPWAP state: Image Data(10).
[*01/03/2019 09:44:38.7040] CAPWAP State: DTLS Teardown
[*01/03/2019 09:44:39.7144] Dropping dtls packet since session is not established. Peer 10.113.10.240-5246, Local 10.113.10.103-5248, conn (nil)
[*01/03/2019 09:44:44.4430] do ABORT, part1 is active part
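The pattern in the AP log above (Discovery, DTLS Setup, Join, Image Data, then a 30-second timeout and a teardown) is easy to miss in a long console capture. Below is a minimal Python sketch for pulling it out; `summarize_ap_log` is a name I made up, and the matched strings are simply copied from the output above, so treat it as an illustration rather than a tool that covers every AP software release.

```python
import re

# State transitions are printed as "CAPWAP State: <name>" (capital S).
# The "Discarding msg ... in CAPWAP state: ..." lines use a lowercase s
# and are deliberately not matched.
STATE = re.compile(r"CAPWAP State: (.+)$")
# Text of the timeout message as it appears in the log in this thread.
TIMEOUT = "Image download did not start"


def summarize_ap_log(lines):
    """Return (list of CAPWAP states in order, True if the image download stalled)."""
    states = []
    stalled = False
    for line in lines:
        line = line.strip()
        match = STATE.search(line)
        if match:
            states.append(match.group(1))
        elif TIMEOUT in line:
            stalled = True
    return states, stalled
```

Run against a saved console capture (for example `summarize_ap_log(open("ap_console.txt"))`, where the file name is just a placeholder), a join that reaches Image Data and then stalls shows up as a state list ending in Image Data and DTLS Teardown with the stalled flag set.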
After finding this bug: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvm69246 I remembered that we had to trigger an SSO failover on the WLC pair to change the power input of the primary unit. Afterwards we had to fail over again so that the primary was active again. What seemed to help was resetting both units once more with a 'reset self', forcing another round of SSO:
pri: reset self -> failover to sec
sec: after pri is back up -> reset self -> failover to pri
After the last reset the remaining APs were able to download the images.
I hope this is useful to anyone who bumps into this problem. Based on the logs alone we weren't able to find anything except the bug in Bug Search.
The WLC is synced via NTP and the APs get the same NTP server via DHCP. Disregard the timestamps in the logs; I included them only as an example of the output. The logs were collected yesterday and today.
With kind regards,
I would suspect a time and date mismatch between the WLC and this AP, so please connect via console to both the AP and the WLC and verify the time with show commands.
If the time is OK, then in your place I would break the SSO pair (for a very short window) and test the AP join against the primary controller while watching the logs. I suspect the AP image on this WLC is corrupted for some reason, so the AP is not able to download it.
When we started we had time issues, which is why we added option 42 to the DHCP scope. That solved the problem at the time. 38 of the 43 APs joined without a problem. The last 5 had layer 1 (cabling) issues, and those were resolved last week.
The remaining APs were unable to join the primary unit. A 'reset self' on the primary forced a failover to the secondary WLC, but that did not fix the issue. Another 'reset self' on the active (secondary) unit did fix it. So seeing the problem on both units seems to rule out firmware corruption.
With kind regards,
Great, this eliminates the firmware issue.
If I were in your place, I would do this:
1- Hard reset the AP manually.
2- Update or change the firmware on these 5 APs manually, by downloading the new firmware and uploading it to each AP via TFTP.
3- If the problem still exists (which I doubt), set up a virtual controller in your lab and join these APs to it. If they join, the issue is with the production controller.