09-06-2011 04:21 AM - edited 07-03-2021 08:40 PM
Hi there
Yesterday I upgraded from K9-6-0-188-0.aes to K9-7-0-116-0.aes. The controller has always been stable (86 x LAP1131AG-E-K9, 2 x ports installed, maximum 250 users). After the upgrade it began to reboot, with a maximum uptime of 5-20 minutes. The exception was the middle of the night, when it stayed up for almost 6 hours until the users were back this morning. I have transferred some of the APs to our secondary controller, so "only" 61 are left for the 4404 to handle, and now it runs a little longer between reboots.
The crash file starts with:
----------------------------------- cut -----------------------------------
************************************************************
* Dumping Registers *
************************************************************
NIP: 1000C0E8 XER: 20000000 LR: 1000C084 SP: 3B03AA10 REGS: 0x3b03a550 TRAP: 0300
MSR: 0002d000 EE: 1 PR: 1 FP: 0 ME: 1 IR/DR: 00
DAR: 00000000, DSISR: 00800000
Stack:
1000c084 0000032b 1125d6b8 1a2eaef0 11f70000 11a90000 00000000 1a2eaef0
10cc8bdc 1a2eaef0 11265f14 1a2eaef0 11257d94 00000100 1a2eaef0 00000000
3113fbd8 00000100 10ccd9b8 11f50000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 706d616c 0000032b 00000006 13bf5584
3113f344 11970000 3113f7d8 3113f2ac 10cce01c 00000003 19e83020 1199e3fc
************************************************************
* Start Cisco Crash Handler *
************************************************************
Sys Name: wlc1
Model: AIR-WLC4404-100-K9
Version: 7.0.116.0
Timestamp: Tue Sep 6 12:18:29 2011
SystemUpTime: 0 days 0 hrs 57 mins 29 secs
signal: 11
pid: 963
TID: 720941
Task Name: sshpmMainTask
Reason: System Crash
si_signo: 11
si_errno: 0
si_code: 1
si_addr: 0x0
timer tcb: 0x4dc
timer cb: 0x101173d0 ('apfTimeGetInStr+88')
timer arg1: 0x18722194
timer arg2: 0x18722194
Long time taken timer call back inforamtion:
Time Stamp: Tue Sep 6 12:09:51 2011
timer cb : 0x10123f84 ('apfAaaUrlRedirectAclGetByMscb+220')
Duration : 18602 usecs, cbCount= 1
------------------------------------------------------------
Analysis of Failure:
Software was stopped for the following reason:
pmalloc detected memory corruption
------------------------------------------------------------
pmalloc memory corruption type: ++PMALLOC_ENTRYMAGIC_0_CORRUPTION
- Corruption detected at pmalloc entry address: (0x3113f2ac)
- Corrupt entry: entryMagic_0(0x0004), entryMagic_1(0xbe91be91),
trailer(0xead0ead0),poison(0x06a0),
entrysize(1024),bytes(992),thread(Unknown task name, task id = (331306184)),
file(sshbuffer.c),line(283), time(9),
previous access file(sshmp-integer-co)-line(220)
----------------------------------- cut -----------------------------------
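For readers who want to interpret the signal fields in the crash handler output above, here is a minimal Python sketch. It assumes the controller reports signals with the usual Linux numbering (signal 11 = SIGSEGV, si_code 1 = SEGV_MAPERR); that mapping is an assumption, not something stated in the crash file. The helper names `parse_fields` and `describe` are hypothetical. The result only says what kind of fault was reported. On its own it does not distinguish faulty RAM from a software defect.

```python
# Excerpt of the crash handler fields quoted in the post above.
CRASH_FIELDS = """\
signal: 11
si_signo: 11
si_errno: 0
si_code: 1
si_addr: 0x0
"""

# Assumption: Linux numbering applies on this platform.
SIGNAL_NAMES = {11: "SIGSEGV"}
SEGV_CODES = {1: "SEGV_MAPERR", 2: "SEGV_ACCERR"}


def parse_fields(text):
    """Turn 'key: value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe(fields):
    """Summarise the fault: signal name, si_code name, and whether address 0 was hit."""
    signo = int(fields["si_signo"])
    code = int(fields["si_code"])
    addr = int(fields["si_addr"], 16)
    return {
        "signal": SIGNAL_NAMES.get(signo, "signal %d" % signo),
        "code": SEGV_CODES.get(code, "si_code %d" % code),
        "null_access": addr == 0,
    }


summary = describe(parse_fields(CRASH_FIELDS))
print(summary)
```

Under those assumptions, the fields describe a segmentation fault on an unmapped address, with the faulting address being 0.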
Does anyone have an idea whether we are talking about faulty memory? Perhaps memory I did not hit earlier because 6-0-188-0 is smaller than 7-0-116-0?
Any comment might be helpful 🙂 Thanks in advance.
Message was edited by: Erik Qvam
09-06-2011 04:55 AM
I think we are hitting the bug below.
WLC crash : pmalloc memory corruption
Symptom: Software was stopped for the following reason: pmalloc detected memory corruption. pmalloc memory corruption type: ++PMALLOC_ENTRYMAGIC_0_CORRUPTION
Conditions: NA
Workaround: rebooted back with backup code after which wlc stays up
Please don't forget to rate useful posts!
Regards
Surendra
09-06-2011 05:08 AM
Thank you. This might be the solution. The symptom is familiar, but I don't understand the workaround.
What does "rebooted back with backup code" mean?
Boot the "Emergency Image Version 7.0.116.0"?
or
Install and boot, for instance, 6.0.188.0?
Regards Erik
09-06-2011 05:12 AM
Yes, it's 6.0.188.
Issue the command
config boot backup
then issue the command "show boot" to see whether the backup image is active, and reboot the WLC. It will boot with the backup image.
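If you want to check the "show boot" result from a saved CLI capture rather than by eye, a small Python sketch could look like this. The sample text is the "show boot" output posted later in this thread. The function names `parse_show_boot` and `active_slot` are hypothetical, and the line format is assumed to match that sample. The sketch only reports the "(default)" and "(active)" markers as printed and does not interpret them further.

```python
import re

# "show boot" output as posted later in this thread.
SHOW_BOOT = """\
Primary Boot Image............................... Code 7.0.98.218
Backup Boot Image................................ Code 7.0.116.0 (default) (active)
"""

# Slot name, dotted filler, the word "Code", the version, then optional markers.
LINE_RE = re.compile(r"^(Primary|Backup) Boot Image\.+\s+Code\s+(\S+)(.*)$")


def parse_show_boot(text):
    """Return {'primary': {...}, 'backup': {...}} with version and marker flags."""
    images = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            slot, version, markers = match.groups()
            images[slot.lower()] = {
                "version": version,
                "default": "(default)" in markers,
                "active": "(active)" in markers,
            }
    return images


def active_slot(images):
    """Return the slot name carrying the (active) marker, or None."""
    for slot, info in images.items():
        if info["active"]:
            return slot
    return None


images = parse_show_boot(SHOW_BOOT)
print(active_slot(images), images)
```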
Let me know if this helps, and please don't forget to rate useful posts!
Regards
Surendra
09-06-2011 05:30 AM
I can do that, Surendra, but then I will not be running 7.0.116.0 on both my controllers, will I? The 4402 upgraded flawlessly; the 4404 did not upgrade that well.
Could there be an upgrade path through multiple versions that avoids the CSCtr17396 bug? I don't think downgrading should be the answer.
Anyway, thank you for pointing me in the right direction. I will probably try 7.0.98.x or 6.0.202.0, since CSCtr17396 was first found in 7.0.116.0.
I will try to remember to post a follow-up if 7.0.98.x turns out to be the right way to go.
Regards
Erik
09-06-2011 05:35 AM
Feel free to open a TAC case to get the RCA done on this. Can you try the following: upload 7.0.98.218 and boot the WLC with 7.0.98.218, then boot the WLC with the backup image (which will be your 7.0.116) and try once more?
Regards
Surendra
09-06-2011 06:26 AM
This sounds like a well-structured approach.
Since we have a group of 24/7 users, I will not disturb them with a double image download to the 1131s now. So I will start by downloading 7.0.98.218 to the 4404 and then run config boot backup, so that I continue to use 7.0.116.0, booted from the "Backup Boot Image" position. If this doesn't make a difference, I will do the config boot primary (7.0.98.218) / config boot backup (7.0.116.0) sequence.
Thank You for all help so far.
09-06-2011 06:30 AM
Let me know how it goes, and feel free to update the post. We will be more than happy to assist you!
Regards
Surendra
09-07-2011 12:23 AM
I'm sorry to say that running 7.0.116.0 from the Backup Boot Image was not a success. It was stable during the night:
(Cisco Controller) >show run-config
System Inventory
NAME: "Chassis" , DESCR: "4400 Series WLAN Controller:100 APs"
PID: AIR-WLC4404-100-K9, VID: V02, SN: FOC1143F0J6
:
Product Version.................................. 7.0.116.0
RTOS Version..................................... 7.0.98.218
Bootloader Version............................... 7.0.116.0
Emergency Image Version.......................... 7.0.116.0
:
System Up Time................................... 0 days 8 hrs 6 mins 37 secs
:
Number of Active Clients......................... 68
(Cisco Controller) >show boot
Primary Boot Image............................... Code 7.0.98.218
Backup Boot Image................................ Code 7.0.116.0 (default) (active)
But the crash files from last evening tell a different story:
************************************************************
* Start Cisco Crash Handler *
************************************************************
Model: AIR-WLC4404-100-K9
Version: 7.0.116.0
Timestamp: Wed Sep 7 00:24:33 2011
SystemUpTime: 0 days 0 hrs 50 mins 33 secs
************************************************************
* Start Cisco Crash Handler *
************************************************************
Model: AIR-WLC4404-100-K9
Version: 7.0.116.0
Timestamp: Tue Sep 6 23:32:38 2011
SystemUpTime: 0 days 0 hrs 6 mins 35 secs
************************************************************
* Start Cisco Crash Handler *
************************************************************
Model: AIR-WLC4404-100-K9
Version: 7.0.116.0
Timestamp: Tue Sep 6 23:24:40 2011
SystemUpTime: 0 days 0 hrs 8 mins 3 secs
************************************************************
* Start Cisco Crash Handler *
************************************************************
Model: AIR-WLC4404-100-K9
Version: 7.0.116.0
Timestamp: Tue Sep 6 23:15:20 2011
SystemUpTime: 0 days 0 hrs 8 mins 6 secs
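To compare stability across images, the uptime at each crash can be pulled out of the crash file headers. The following is a rough Python sketch using the four SystemUpTime lines listed above. The function name `uptimes_in_seconds` is hypothetical, and the regular expression assumes the exact "N days N hrs N mins N secs" layout shown in these crash files.

```python
import re

# SystemUpTime lines copied from the four crash records above, newest first.
CRASH_HEADERS = """\
SystemUpTime: 0 days 0 hrs 50 mins 33 secs
SystemUpTime: 0 days 0 hrs 6 mins 35 secs
SystemUpTime: 0 days 0 hrs 8 mins 3 secs
SystemUpTime: 0 days 0 hrs 8 mins 6 secs
"""

UPTIME_RE = re.compile(
    r"SystemUpTime:\s+(\d+) days (\d+) hrs (\d+) mins (\d+) secs"
)


def uptimes_in_seconds(text):
    """Return the uptime of each crash record in seconds, in file order."""
    result = []
    for days, hours, mins, secs in UPTIME_RE.findall(text):
        total = int(days) * 86400 + int(hours) * 3600 + int(mins) * 60 + int(secs)
        result.append(total)
    return result


uptimes = uptimes_in_seconds(CRASH_HEADERS)
print("crashes:", len(uptimes))
print("longest uptime (s):", max(uptimes))
print("mean uptime (s):", sum(uptimes) / len(uptimes))
```

Running the same function over crash files collected under another image gives a simple before/after comparison.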
The interesting thing is that my 4402 has no problems with 7.0.116.0. Anyway, I will schedule both my controllers for a reboot into 7.0.98.218 tonight. Then I can probably confirm the workaround for http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCtr17396
Regards
Erik
09-07-2011 12:26 AM
Hmm. Let me know how it goes!
Regards
Surendra
09-08-2011 01:32 AM
Downgrading from 7.0.116.0 to 7.0.98.218 solved CSCtr17396. I could of course also have reverted to the backup code (6.0.188.0), as stated in the workaround, but we wanted to make the move from 6 to 7.
Note that only the WLC4404 was affected by the bug (CSCtr17396), WLC4402 was not.
Thank you for all the help.
Regards
Erik