05-22-2018 02:07 PM - edited 03-01-2019 06:36 PM
Hi,
We have 250+ 3560s running C3560-IPBASEK9-M code, release 12.2(55)SE5, RELEASE SOFTWARE (fc1).
Since May 1st we have had switches reload following %SYS-2-MALLOCFAIL messages appearing in the syslog. By the time the MALLOCFAIL messages appear, the switch can no longer accept remote command-line access via SSH.
Initially it was just one switch, but we now have at least 15 that have shown this behaviour. They are in different buildings, on different layer 2 and layer 3 networks, and all run a config that has been stable since they were installed.
A number of different 3560 models are involved. When a switch reloads, it comes back and appears to function correctly again. I suspect the first advice I receive will be to upgrade, which I intend to do, but that requires planning and scheduled outages. Whilst I'm trying to schedule that, I am keen to understand what is causing these memory problems now, after many months or years of trouble-free service.
I've attached the crashinfo file. If anyone can assist, I would be very interested to hear from you.
Regards Jon.
05-22-2018 04:23 PM - edited 05-22-2018 04:29 PM
@joneaton wrote:
Since May 1st we have had switches reload following %SYS-2-MALLOCFAIL messages appearing in the syslog.
Am I correct to assume this statement means that switches may have an uptime of >1 year?
CSCti91268, CSCei18359
My recommendation is to upgrade to the latest 12.2(55)SE train (before the end-of-support date).
05-22-2018 11:40 PM
Yes, in the majority of cases the uptime is likely to be >1 year. I can't confirm exactly, but the network has been stable for the last 2 years with no major problems, only routine engineering work.
I think CSCei18359 sounds the more likely of the two bugs identified. We are working towards deploying upgraded code to all devices we are able to access remotely.
05-23-2018 04:02 AM
Still working on deploying code upgrades. For the record, we had 37 confirmed switch reloads yesterday, all reporting the same errors.
We had another 5 this morning, again with the same error reported.
Most (if not all) of these switches have been functioning fine for ages. All very frustrating.
05-23-2018 04:40 AM - edited 05-23-2018 04:42 AM
@joneaton wrote:
For the record, we had 37 confirmed switch reloads yesterday, all reporting the same errors.
So these 37 have an uptime of >1 year and are running the same (exact) version?
For the record, I used to (past tense) run 12.2(55)SE5 for several years (3 years, I think), but none of my switches ever had an uptime of >9 months. Either site maintenance took care of it or I'd force them all to reboot. I never want to see any of my switches with an uptime of >1 year.
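For anyone wanting to apply the same rule of thumb, here is a rough Python sketch that flags a switch once its uptime passes about 9 months. It assumes you have already collected the uptime line from `show version` by some other means, and that the line follows the usual "uptime is N years, N weeks, ..." wording. The function names and the 270-day limit are my own illustration, not anything from Cisco.

```python
import re

# Approximate length of each unit in days (a year is treated as 365 days).
UNIT_DAYS = {"year": 365, "week": 7, "day": 1, "hour": 1 / 24, "minute": 1 / 1440}

# Assumed wording of the uptime line; adjust if your release prints it differently.
UPTIME_RE = re.compile(r"uptime is (?P<rest>.+)$")
PART_RE = re.compile(r"(\d+) (year|week|day|hour|minute)s?")


def uptime_days(line):
    """Return the uptime in days, or None if the line has no uptime text."""
    match = UPTIME_RE.search(line)
    if not match:
        return None
    total = 0.0
    for count, unit in PART_RE.findall(match.group("rest")):
        total += int(count) * UNIT_DAYS[unit]
    return total


def overdue(line, limit_days=270):
    """True when the uptime exceeds limit_days (hypothetical 9-month limit)."""
    days = uptime_days(line)
    return days is not None and days > limit_days


if __name__ == "__main__":
    sample = "sw1 uptime is 1 year, 2 weeks, 3 days, 0 hours, 0 minutes"
    print(uptime_days(sample), overdue(sample))
```

The limit is a parameter so it can be matched to whatever maintenance window policy is actually in place.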
05-24-2018 12:56 AM
Yes, these did have an uptime over 1 year.
Over the years, I have often come across switches with uptimes over 1 year. I think the record was an old 3524 with an uptime of just over 12 years.
A code upgrade seems to have cured the reboots we were experiencing, none (unplanned) in the last 12 hours.
09-24-2018 02:15 AM
We have now upgraded our estate to run the 12.2(55)SE12 C3560-IPBASEK9-M code. All appeared stable for a short time.
However, we are now experiencing unscheduled reboots after the messages below appear in the logs.
Sep 24 10:11:48 192.168.11.6 BST: %SYS-2-MALLOCFAIL: Memory allocation of 38992 bytes failed from 0x1A096C0,alignment 0
Sep 24 10:11:49 192.168.11.6 Pool: Processor Free: 113128 Cause: Memory fragmentation
Sep 24 10:11:49 192.168.11.6 Alternate Pool: None Free: 0 Cause: No Alternate pool
Sep 24 10:11:49 192.168.11.6 -Process= "HQM Stack Process", ipl= 0, pid= 137
Sep 24 10:11:49 192.168.11.6 -Traceback= 28827E8 2884D08 2884F6C 2B06658 1A096C4 19DD88C 1BA410C 1B9A8E0
The logs from the other affected switches show similar messages. By the time we see them, the switch is unable to grant SSH access, so no further troubleshooting can take place.
I've looked for bugs (and found a few), but none list a fixed code level. Any ideas?
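Since the switches are unreachable by the time the errors show up, the syslog server is the only place left to look for a pattern. Below is a minimal Python sketch that counts MALLOCFAIL events per host and records which process was named in the follow-up line. The regular expressions are modelled only on the lines quoted above, so treat the format as an assumption; the function and variable names are purely illustrative.

```python
import re
from collections import Counter

# Two of the log lines quoted in this thread, used as sample input.
SAMPLE = [
    "Sep 24 10:11:48 192.168.11.6 BST: %SYS-2-MALLOCFAIL: Memory allocation "
    "of 38992 bytes failed from 0x1A096C0,alignment 0",
    'Sep 24 10:11:49 192.168.11.6 -Process= "HQM Stack Process", ipl= 0, pid= 137',
]

# Assumed layout: "<Mon> <day> <time> <host> ... %SYS-2-MALLOCFAIL: ..."
MALLOC_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+.*%SYS-2-MALLOCFAIL: "
    r"Memory allocation of (?P<bytes>\d+) bytes failed"
)
# Assumed layout of the follow-up line naming the failing process.
PROCESS_RE = re.compile(
    r'^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+-Process= "(?P<proc>[^"]+)"'
)


def summarise(lines):
    """Count MALLOCFAIL events per host and (host, process) pairs."""
    failures = Counter()
    processes = Counter()
    for line in lines:
        match = MALLOC_RE.search(line)
        if match:
            failures[match.group("host")] += 1
            continue
        match = PROCESS_RE.search(line)
        if match:
            processes[(match.group("host"), match.group("proc"))] += 1
    return failures, processes


if __name__ == "__main__":
    failures, processes = summarise(SAMPLE)
    print(failures)
    print(processes)
```

If the same process name turns up across most hosts, that narrows the bug search considerably compared with searching on MALLOCFAIL alone.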
09-24-2018 02:32 AM
09-24-2018 02:36 AM