Cisco 3560 %PLATFORM-1-CRASHED following %SYS-2-MALLOCFAIL

joneaton
Level 1

Hi,

We have more than 250 Cisco 3560s running C3560-IPBASEK9-M code, release 12.2(55)SE5, RELEASE SOFTWARE (fc1).

Since the 1st of May we have had switches reload following %SYS-2-MALLOCFAIL messages appearing in the syslog. By the time the MALLOCFAIL messages appear, the switch no longer allows remote command-line access via SSH.

Initially it was just one switch; we now have at least 15 that have demonstrated this behaviour. They are in different buildings, on different Layer 2 and Layer 3 networks, all running configs that have been stable since they were installed.

 

A number of different 3560 switches are involved. When a switch reloads, it comes back and appears to function correctly again. I suspect the first advice I receive will be to upgrade, which I intend to do, but that requires planning and scheduled outages. Whilst I'm trying to schedule that, I am keen to understand what is causing these memory problems now, after many months or years of trouble-free service.
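While a switch is still reachable, the standard IOS memory commands are the only data points we can gather (a minimal sketch; if the sorted keyword isn't supported on a given release, plain show processes memory works):

show memory summary
show processes memory sorted
show buffers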


I've attached the crashinfo file; if anyone can assist, I would be more than interested.

Regards, Jon.


8 Replies

Leo Laohoo
Hall of Fame
(Accepted Solution)

@joneaton wrote:

Since the 1st of May we have had switches reload following %SYS-2-MALLOCFAIL messages appearing in the syslog.


Am I correct to assume this statement means that switches may have an uptime of >1 year?

CSCti91268, CSCei18359

My recommendation is to upgrade to the latest 12.2(55)SE train (before the end-of-support date).
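If it helps, the whole upgrade can be pushed remotely in one step with the archive download-sw command (a sketch only; the TFTP server address below is a placeholder, and the image name must match the tar file you stage):

archive download-sw /overwrite /reload tftp://192.0.2.10/c3560-ipbasek9-tar.122-55.SE12.tar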

Yes, in the majority of cases the uptime is likely to be >1 year. I can't confirm exactly, but the network has been stable for the last two years with no major problems, only routine engineering work.

I think CSCei18359 sounds the most likely of the two bugs identified. We are working towards deploying upgraded code to all devices we can access remotely.

Still working on deploying the code upgrades. For the record, we had 37 confirmed switch reloads yesterday, all with the same errors reported.

We had another 5 this morning, again with the same error reported.

Most (if not all) of these switches had been functioning fine for ages; it's all very frustrating.


@joneaton wrote:

For the record, we had 37 confirmed switch reloads yesterday, all with the same errors reported.


So these 37 have an uptime of >1 year and are running the exact same version?

For the record, I used to (past tense) run 12.2(55)SE5 for several years (3 years, I think), but none of my switches ever had an uptime of >9 months. Either site maintenance took them down, or I'd force them all to reboot. I never want to see any of my switches with an uptime of >1 year.

Yes, these did have an uptime over 1 year.

Over the years, I have often come across switches with uptimes over 1 year. I think the record was an old 3524 with an uptime of just over 12 years.


A code upgrade seems to have cured the reboots we were experiencing; none (unplanned) in the last 12 hours.

We have now upgraded our estate to run the 12.2(55)SE12 C3560-IPBASEK9-M code. All appeared stable for a short time.

However, we are now experiencing unscheduled reboots after the messages below appear in the logs.

Sep 24 10:11:48 192.168.11.6 BST: %SYS-2-MALLOCFAIL: Memory allocation of 38992 bytes failed from 0x1A096C0,alignment 0
Sep 24 10:11:49 192.168.11.6 Pool: Processor Free: 113128 Cause: Memory fragmentation
Sep 24 10:11:49 192.168.11.6 Alternate Pool: None Free: 0 Cause: No Alternate pool
Sep 24 10:11:49 192.168.11.6 -Process= "HQM Stack Process", ipl= 0, pid= 137
Sep 24 10:11:49 192.168.11.6 -Traceback= 28827E8 2884D08 2884F6C 2B06658 1A096C4 19DD88C 1BA410C 1B9A8E0

The logs from the other switches contain similar messages. By the time we see the messages, the switch is unable to grant SSH access, so no further troubleshooting can take place.

I've looked for bugs (and found a few), but none with fixed code levels. Any ideas?
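In the meantime, I'm looking at an EEM applet to snapshot the memory state the moment the error fires, since SSH is dead by the time we see it (a sketch only, assuming EEM applets are supported on this image; the applet and file names are placeholders, and it may not run at all if the processor pool is already exhausted):

event manager applet CATCH-MALLOCFAIL
 event syslog pattern "SYS-2-MALLOCFAIL"
 action 1.0 cli command "enable"
 action 2.0 cli command "show memory summary | append flash:mallocfail.txt"
 action 3.0 cli command "show processes memory sorted | append flash:mallocfail.txt"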

Do you have SNMP monitoring enabled? Take it off.

Please excuse my clarifying statement and question.
They are all polled via SNMP from the management platform using SNMPv2c.

Is that what you are asking, and are you recommending not monitoring the estate of 260+ 3560 switches?
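If that is the recommendation, I assume the interim options look something like the below (a sketch only; the ACL number, management-station address, and community string are placeholders):

! Option 1: disable the SNMP agent entirely while testing
no snmp-server
!
! Option 2: restrict polling to the management station only
access-list 10 permit host 192.0.2.50
snmp-server community MYCOMMUNITY RO 10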
