WAP121 hostapd uses all CPU, crashes every 3.5 days

Chris Russell
Level 1

Has anyone else seen anything like this?

We have 4 WAP121s running 1.0.1.10. After a couple of days of operation, they all become unusably slow. The admin web interface takes several minutes to load, and wireless clients time out before RADIUS authentication can complete. Waiting long enough allows the diag info to be downloaded, which shows hostapd using all of the available CPU time:

Top:
Mem: 44256K used, 17272K free, 0K shrd, 4748K buff, 19148K cached
CPU: 99% usr   0% sys   0% nic   0% idle   0% io   0% irq   0% sirq
Load average: 9.36 8.16 7.68 11/40 1404
  PID  PPID USER     STAT   VSZ %MEM CPU %CPU COMMAND
 1137   870 root     R <   2612   4%   0  98% /usr/sbin/hostapd /tmp/hostapd.con
  870     1 root     R <  10068  16%   0   1% /usr/sbin/dman
 1358  1123 root     S <   3404   6%   0   1% admin.cgi
 1403  1375 root     R <   1808   3%   0   1% top -n 1 -b
 1069   870 root     R <  13048  21%   0   0% /usr/sbin/snmpd -f -c /tmp/snmpd.c
 1124   870 root     S <   5420   9%   0   0% /usr/bin/mini_httpd-ssl -S -E /etc
 1123   870 root     S <   5376   9%   0   0% /usr/bin/mini_httpd-ssl -D -p1 80
 1360  1358 root     S <   5376   9%   0   0% /usr/bin/mini_httpd-ssl -D -p1 80
 1373  1372 root     S <   5376   9%   0   0% /usr/bin/mini_httpd-ssl -D -p1 80
  797     1 root     R <   4668   8%   0   0% /usr/sbin/tspec
 1372  1123 root     S <   2468   4%   0   0% download.cgi diag-log
    1     0 root     S <   1812   3%   0   0% init
 1369   870 root     S <   1812   3%   0   0% /sbin/udhcpc -l /tmp/udhcpc.lease.
 1104   870 root     S <   1812   3%   0   0% /sbin/getty -L 115200 ttyS0
 1374  1372 root     S <   1808   3%   0   0% sh -c sh /sbin/show_diagnostics.sh
 1375  1374 root     S <   1804   3%   0   0% sh /sbin/show_diagnostics.sh
  721     1 root     D <   1800   3%   0   0% insmod /lib/modules/2.6.21.5/extra
  783     1 root     R <   1800   3%   0   0% insmod /lib/modules/2.6.21.5/extra
  752     1 root     R <   1800   3%   0   0% insmod /lib/modules/2.6.21.5/extra
  720     1 root     D <   1800   3%   0   0% insmod /lib/modules/2.6.21.5/extra
  768     1 root     D <   1800   3%   0   0% insmod /lib/modules/2.6.21.5/extra
 1393   870 root     R <   1008   2%   0   0% /usr/sbin/sntp -s 3600 0 crdc01.sa
 1371   870 root     S <    848   1%   0   0% /sbin/dhcp6c brtrunk 0 /usr/share/
 1096   870 root     S <    668   1%   0   0% /usr/sbin/syslogd
  882   870 root     R <    660   1%   0   0% /usr/bin/eapd
  668     1 root     SW<      0   0%   0   0% [mtdblockd]
    3     1 root     SW<      0   0%   0   0% [events/0]
    4     1 root     SW<      0   0%   0   0% [khelper]
    2     1 root     SWN      0   0%   0   0% [ksoftirqd/0]
   49     5 root     SW<      0   0%   0   0% [unionfs_siod/0]
  698     1 root     SWN      0   0%   0   0% [jffs2_gcd_mtd2]
   20     5 root     SW<      0   0%   0   0% [kblockd/0]
   21     5 root     SW<      0   0%   0   0% [kseriod]
   45     5 root     DW       0   0%   0   0% [pdflush]
    5     1 root     DW<      0   0%   0   0% [kthread]
   47     5 root     SW<      0   0%   0   0% [kswapd0]
   48     5 root     SW<      0   0%   0   0% [aio/0]
 1399     5 root     SW<      0   0%   0   0% [kthread]
   46     5 root     RW       0   0%   0   0% [pdflush]
 1404     5 root     RW<      0   0%   0   0% [kthread]
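For anyone collecting the same evidence from their own APs, a snapshot like the one above can be checked mechanically. The following is a minimal Python sketch, assuming the BusyBox-style column layout shown (PID, PPID, USER, STAT, VSZ, %MEM, CPU, %CPU, COMMAND), where the STAT field may itself contain spaces. The function names and the trimmed SAMPLE text are illustrative, not part of any Cisco tool.

```python
# Sketch: find the busiest process in a saved `top -b` snapshot.
# Assumption: rows use the layout PID PPID USER STAT VSZ %MEM CPU %CPU COMMAND,
# and STAT may contain spaces (e.g. "R <"), so columns are located via the % tokens.
from __future__ import annotations

from typing import NamedTuple


class ProcRow(NamedTuple):
    pid: int
    cpu_pct: int
    command: str


def parse_top_rows(text: str) -> list[ProcRow]:
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Process rows start with a numeric PID; skip Mem:/CPU:/header/blank lines.
        if not tokens or not tokens[0].isdigit():
            continue
        # The first two "NN%" tokens are %MEM and %CPU, in that order.
        pct_idx = [i for i, t in enumerate(tokens) if t.endswith("%")]
        if len(pct_idx) < 2:
            continue
        cpu_i = pct_idx[1]
        rows.append(ProcRow(
            pid=int(tokens[0]),
            cpu_pct=int(tokens[cpu_i].rstrip("%")),
            command=" ".join(tokens[cpu_i + 1:]),  # everything after %CPU
        ))
    return rows


def busiest(text: str) -> ProcRow:
    """Return the row with the highest %CPU."""
    return max(parse_top_rows(text), key=lambda r: r.cpu_pct)


# A few rows copied from the snapshot above.
SAMPLE = """\
  PID  PPID USER     STAT   VSZ %MEM CPU %CPU COMMAND
 1137   870 root     R <   2612   4%   0  98% /usr/sbin/hostapd /tmp/hostapd.con
  870     1 root     R <  10068  16%   0   1% /usr/sbin/dman
  668     1 root     SW<      0   0%   0   0% [mtdblockd]
"""

top = busiest(SAMPLE)
print(top.pid, top.cpu_pct, top.command)
```

On the sample rows this reports PID 1137 (hostapd) at 98%. Keying on the percent tokens rather than fixed column positions is a deliberate choice, since the STAT column's width varies between "R <", "SW<" and "DW".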

Unless rebooted, the unit will continue in this state until about 3.5 days have passed since the last boot, when it crashes and restarts, logging "CPU Unable to handle kernel paging request for process = swapper and pid = 0". The stack trace is slightly different each time, but the uptimes are pretty consistent:

crashdump1:
Software Version: 1.0.1.10
Crash Log:
Uptime of the AP: 3 Days, 12 Hours, 50 Minutes and 22 Seconds
CPU Unable to handle kernel paging request for process = swapper and pid = 0

crashdump2:
Software Version: 1.0.1.10
Crash Log:
Uptime of the AP: 3 Days, 15 Hours, 21 Minutes and 38 Seconds
CPU Unable to handle kernel paging request for process = swapper and pid = 0

crashdump3:
Software Version: 1.0.1.10
Crash Log:
Uptime of the AP: 3 Days, 13 Hours, 19 Minutes and 18 Seconds
CPU Unable to handle kernel paging request for process = swapper and pid = 0

crashdump4:
Software Version: 1.0.1.10
Crash Log:
Uptime of the AP: 3 Days, 17 Hours, 6 Minutes and 6 Seconds
CPU Unable to handle kernel paging request for process = swapper and pid = 0
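To put a number on "about 3.5 days", the four uptimes can be converted to days. This short Python sketch parses the "Uptime of the AP" strings exactly as they appear in the crash logs; the regex and helper name are illustrative. Worked by hand, the runs last between roughly 3.53 and 3.71 days (mean about 3.61 days), and the longest run outlasts the shortest by about 4.3 hours.

```python
# Sketch: summarize the crash-log uptimes in days.
import re
from statistics import mean

# Matches strings like "3 Days, 12 Hours, 50 Minutes and 22 Seconds".
UPTIME_RE = re.compile(r"(\d+) Days?, (\d+) Hours?, (\d+) Minutes? and (\d+) Seconds?")


def uptime_seconds(line: str) -> int:
    m = UPTIME_RE.search(line)
    if m is None:
        raise ValueError(f"no uptime found in: {line!r}")
    d, h, mins, s = (int(g) for g in m.groups())
    return ((d * 24 + h) * 60 + mins) * 60 + s


CRASH_UPTIMES = [
    "Uptime of the AP: 3 Days, 12 Hours, 50 Minutes and 22 Seconds",
    "Uptime of the AP: 3 Days, 15 Hours, 21 Minutes and 38 Seconds",
    "Uptime of the AP: 3 Days, 13 Hours, 19 Minutes and 18 Seconds",
    "Uptime of the AP: 3 Days, 17 Hours, 6 Minutes and 6 Seconds",
]

secs = [uptime_seconds(u) for u in CRASH_UPTIMES]
days = [s / 86400 for s in secs]
print(f"min {min(days):.2f} d, mean {mean(days):.2f} d, max {max(days):.2f} d")
print(f"spread {(max(secs) - min(secs)) / 3600:.1f} h")
```

This prints "min 3.53 d, mean 3.61 d, max 3.71 d" followed by "spread 4.3 h".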

As a workaround, we used to run a small script that rebooted them every night via SSH, but someone at Cisco made the less-than-brilliant decision to remove this advertised feature.


Tom Watts
VIP Alumni

Hi Chris, this is a pretty interesting topic and finding.

I don't think you will get the attention you need through the forums. Judging by the cover, this looks like a potentially serious problem.

Would it be possible for you to collect a comprehensive data dump that includes:

  • Network topology, including

-ISP, modem model, router, switches, other access points, wireless LAN controllers, etc

-Servers, ACS, MS, UNIX, etc

-Number of connected clients and types, printers, computers (and OS)

-Length of wires from your switches to the APs

  • Config files and data dump

-Access point config files

-Access point logging

-Set up an external syslog and gather those logs

-Packet capture from the access points showing normal authentication

-RADIUS parameters (config file if possible)

-Working wired packet capture on the LAN when the AP seems to fail

-If able, a failing packet capture during the event from the AP

-If able, a pcap on the switchport the AP connects to, taken when it fails

-RADIUS logs

  • Environment details

-Where are the access points installed

-inSSIDer scan for possible RF interference

-Details of the environment around each AP, such as metal objects, walls, windows, elevation, cable distances, PoE or non-PoE
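For the "external syslog" item in the list above, any standard syslog server will do. If none is handy, a stdlib-only Python collector is enough to capture what the APs send. This is a minimal sketch assuming the APs emit plain UDP syslog datagrams; the function names and the log filename are hypothetical, and it does no parsing of syslog priorities or fields, it simply timestamps and stores each message.

```python
# Minimal UDP syslog collector sketch (stdlib only).
# Assumptions: the AP sends plain UDP syslog; names here are hypothetical.
from __future__ import annotations

import socket
from datetime import datetime, timezone
from typing import Optional, TextIO


def open_listener(host: str = "0.0.0.0", port: int = 514) -> socket.socket:
    """Bind a UDP socket for incoming syslog datagrams (514 is the standard port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def format_record(data: bytes, sender_ip: str) -> str:
    """Prefix a raw message with a UTC receive timestamp and the sender's IP."""
    text = data.decode("utf-8", errors="replace").rstrip("\r\n")
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"{stamp} {sender_ip} {text}"


def serve(sock: socket.socket, out: TextIO, max_messages: Optional[int] = None) -> int:
    """Write received messages to `out`; loop forever unless max_messages is set."""
    count = 0
    while max_messages is None or count < max_messages:
        data, (sender_ip, _port) = sock.recvfrom(8192)
        out.write(format_record(data, sender_ip) + "\n")
        out.flush()  # flush each line so nothing is lost if the collector is killed
        count += 1
    return count
```

On a Linux host you might run serve(open_listener(port=514), open("wap121-syslog.log", "a", encoding="utf-8")) as root, since ports below 1024 are privileged, and then point each AP's remote syslog server setting at that host. The receive timestamp is added locally so the log stays ordered even if the APs' own clocks drift.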

Other points to consider: if you are using PoE, what happens when the APs run from power adapters instead?

Can you move the APs closer to the switches (to rule out any variables)?

Do you have any similar deployments elsewhere, and do they show the same issue or no issues at all?

If you can collect most or all of this info, I'd recommend calling the SBSC to open a ticket and trying to push the case up to the L2 team.

-Tom

Hi,

Was this issue ever resolved? I'm having similar issues with my WAP121.

Kind regards

Michael

There's no solution. We threw them all in the trash and put in another vendor's product.

Thanks for the reply, Chris. I thought as much!

I've never had so much trouble with any AP as with this one...

Cheers
