Troubleshooting MALLOCFAIL Errors and General Memory Problems

Brandon Lynch · 07-24-2010

Purpose

Memory problems can manifest themselves in several ways on switches and routers. In many instances, a device experiencing memory errors will be reloaded before the appropriate data can be gathered. The intent of this document is to discuss MALLOCFAIL errors in general and to list what to check and gather before opening a TAC case or reloading the device, in order to expedite problem resolution. This document is not exhaustive but should serve as a general guideline for troubleshooting memory issues on many routers and switches.

MALLOCFAIL Errors

Memory issues generally show up in the form of MALLOCFAIL errors in the logs of your router or switch. These errors are important because they tell us a couple of things about what is happening and give us clues on where to look. A sample MALLOCFAIL error is given below -

%SYS-2-MALLOCFAIL: Memory allocation of 65536 bytes failed from 0x60103098, alignment 0 
Pool: Processor  Free: 5453728  Cause: Memory fragmentation 
Alternate Pool: None  Free: 0  Cause: No Alternate pool

The first thing to notice is how much memory we're trying to allocate and how much free memory we have. In this example, we're trying to allocate 64KB (65536 bytes) from a pool which has ~5.45MB free. This tells us that, even though the total free memory is sufficient, the largest contiguous free block is smaller than 64KB, so the memory allocation failed. While this is, by definition, memory fragmentation, fragmentation is not usually the root cause. Most often, it's simply a matter of running low on memory in the pool itself.

The second thing to notice is the pool type. In the example above, we're dealing with the 'Processor' pool. This is important because it is the first 'road sign' that directs us to where we need to look and what needs to be checked. The pool specified will usually either be 'Processor' or 'I/O'. An example of an I/O memory error is given below -

%SYS-2-MALLOCFAIL: Memory allocation of 65548 bytes failed from 0x400B8564, alignment 32
Pool: I/O  Free: 39696  Cause: Not enough free memory
Alternate Pool: None  Free: 0  Cause: No Alternate pool

We'll get into further definition of these pools below. Once the pool has been identified, we can then proceed to focus our efforts accordingly in the right spots.
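As a minimal sketch, the fields discussed above (bytes requested, pool, free bytes and cause) can be pulled out of a logged message with a short Python script. The regular expression is based only on the two sample messages in this document and the function name parse_mallocfail is hypothetical; other IOS releases may format the message differently.

```python
import re

# Pattern derived from the two sample messages in this document (an assumption,
# not an official description of the message format).
MALLOCFAIL_RE = re.compile(
    r"MALLOCFAIL: Memory allocation of (?P<requested>\d+) bytes failed "
    r"from (?P<caller>0x[0-9A-Fa-f]+), alignment (?P<alignment>\d+)\s+"
    r"Pool: (?P<pool>.+?)\s+Free: (?P<free>\d+)\s+Cause: (?P<cause>[^\n]+)"
)


def parse_mallocfail(text):
    """Return the key fields of the first MALLOCFAIL message in text, or None."""
    m = MALLOCFAIL_RE.search(text)
    if m is None:
        return None
    requested, free = int(m["requested"]), int(m["free"])
    return {
        "requested": requested,
        "pool": m["pool"],
        "free": free,
        "cause": m["cause"].strip(),
        # True means the pool had enough free bytes in total, so the failure
        # came from the lack of a large enough contiguous block.
        "enough_total_free": free >= requested,
    }


SAMPLE_PROCESSOR = (
    "%SYS-2-MALLOCFAIL: Memory allocation of 65536 bytes failed from 0x60103098, alignment 0 \n"
    "Pool: Processor  Free: 5453728  Cause: Memory fragmentation \n"
    "Alternate Pool: None  Free: 0  Cause: No Alternate pool"
)

info = parse_mallocfail(SAMPLE_PROCESSOR)
print(info["pool"], info["requested"], info["free"], info["enough_total_free"])
```

For the Processor sample, enough_total_free comes back true, which matches the reading above: the pool had enough free memory in total, just not in one contiguous block.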

'Processor' Pool

The 'Processor' pool is used, as the name implies, for the various processes that run on the router or switch. Certain processes that use memory underlie most IOS versions and platforms. For example, *Init* is a process established at boot-up on most devices, so you can expect to see it across various platforms. Other processes that show up will depend on the configuration of the individual device. For example, on platforms where voice is configured and in use, you will see voice-specific processes consuming memory, while in more general configurations without voice, these processes will hold little or no memory.

Certain processes can be expected to hold more memory than others but if there are questions or concerns about a particular process, it's best to open a TAC case to have it checked out.

Causes and What to Collect

1) If an IOS upgrade has been recently done on the device, the first thing to check is the minimum required DRAM for the new image. This should be equal to or less than the amount of DRAM installed on the box itself. The minimum required DRAM will be listed under the image within the Software Download Tool. The amount of DRAM installed can be confirmed from the 'show version' output -

Cisco 2821 (revision 53.51) with 210944K/51200K bytes of memory.

Summing these numbers together, we see that this 2821 has 256MB of DRAM.
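The arithmetic can be sketched in Python as follows. The function names and the required_mb parameter are hypothetical; required_mb stands for whatever minimum DRAM figure the Software Download Tool lists for the image.

```python
import re


def installed_dram_kb(show_version_line):
    """Sum the two 'K' figures from the 'bytes of memory' line of 'show version'."""
    m = re.search(r"with (\d+)K/(\d+)K bytes of memory", show_version_line)
    if m is None:
        raise ValueError("memory line not found")
    first_kb, second_kb = int(m.group(1)), int(m.group(2))
    return first_kb + second_kb


def meets_minimum(total_kb, required_mb):
    """Compare installed DRAM (in KB) against an image's minimum DRAM (in MB)."""
    return total_kb >= required_mb * 1024


line = "Cisco 2821 (revision 53.51) with 210944K/51200K bytes of memory."
total_kb = installed_dram_kb(line)
print(total_kb // 1024, "MB installed")  # 210944K + 51200K = 262144K = 256 MB
print(meets_minimum(total_kb, 256))      # required_mb=256 is an illustrative value
```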

2) Another possible cause is a memory leak caused by an IOS bug. In this situation, one process will consume an excessive amount of memory until we run out. The following outputs should be collected while memory is low -

show clock

show mem stat

show proc mem sorted

show mem all totals

show log

'show proc mem sorted' will list all processes in descending order from the highest amount of memory held to the lowest. Excluding *Init*, try to identify the process holding the most memory. Once done, find the PID for that process on the left-hand side of the output and collect the following -

show proc mem <PID #>

If the highest process is *Dead*, collect this instead -

show mem dead totals

show mem dead

Certain processes require more in-depth troubleshooting but for simplicity, they will be excluded from this document.

3) Another potential cause of memory issues is running out of memory due to the processes and configuration on the box. One example of this is 'BGP Router'. In some instances, BGP will hold a large amount of memory because of the number of routes that it's taking in; this is expected behavior and not an IOS bug. This would need to be corrected by altering the configuration to achieve optimal routing and reduce memory consumption.

If you are unsure, collect the outputs listed above (excluding 'show mem dead totals' and 'show mem dead') in conjunction with opening a TAC case as this will likely need to be confirmed further.

'I/O' Pool

The I/O pool refers to the I/O buffers seen with 'show buffers'. These buffers are used for process-switched traffic, among other things, such as routing updates or broadcasts. I/O memory is broken down into 'pools' as you'll see from a 'show buffers' output. These pools are based on packet size so that we can more efficiently allocate memory based on what is needed.

Causes and What to Collect

1) The first thing to check with I/O memory issues is a potential buffer leak caused by an IOS bug. This will often, but not always, manifest itself as a particular pool increasing its number of buffers without releasing them back into the I/O pool once they are no longer needed. An example of this is given below -

--------- show buffers --------

Buffer elements:
     500 in free list (500 max allowed)
     3220350364 hits, 0 misses, 0 created

Public buffer pools:
Small buffers, 104 bytes (total 6144, permanent 6144):
     3867 in free list (2048 min, 8192 max allowed)
     248913132 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Medium buffers, 256 bytes (total 86401, permanent 3000, peak 86401 @ 05:18:11):
     0 in free list (64 min, 3000 max allowed)
     9697361 hits, 203293 misses, 2208 trims, 85609 created
     167633 failures (651288 no memory)
Middle buffers, 600 bytes (total 512, permanent 512):
     0 in free list (64 min, 1024 max allowed)
     9284431 hits, 237750 misses, 0 trims, 0 created
     224619 failures (680486 no memory)
Big buffers, 1536 bytes (total 1000, permanent 1000):
     0 in free list (64 min, 1000 max allowed)
     69471745 hits, 895218 misses, 0 trims, 0 created
     842142 failures (1821074 no memory)
VeryBig buffers, 4520 bytes (total 10, permanent 10, peak 122 @ 1w3d):
     0 in free list (0 min, 100 max allowed)
     2120517 hits, 1632477 misses, 112 trims, 112 created
     1632421 failures (3272987 no memory)
Large buffers, 9240 bytes (total 8, permanent 8, peak 18 @ 1w3d):
     0 in free list (0 min, 10 max allowed)
     9593 hits, 832217 misses, 44 trims, 44 created
     832195 failures (1651309 no memory)
Huge buffers, 18024 bytes (total 2, permanent 2):
     0 in free list (0 min, 4 max allowed)
     1325 hits, 831497 misses, 0 trims, 0 created
     831494 failures (1649904 no memory)

From the output above, we can clearly see that the problem is with the 'Medium' pool. Its 'total' value is much higher than the 'permanent' amount set for that pool. We see that, even with over 86K buffers in the pool, we have 0 in the 'free list'. Finally, we see that the number of 'trims' is much lower than the number 'created', which tells us these buffers haven't been released back into the I/O pool for further consumption. For further explanation of these fields, see the 'Definitions for Buffer Pool Fields' link in the 'Additional References' section.
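The three signs described above ('total' far above 'permanent', nothing in the free list, and 'trims' far below 'created') can be checked mechanically. The following Python sketch does this for the public pools in a 'show buffers' capture. The regular expression is modelled only on the output shown in this document, the function name suspect_pools is hypothetical, and the growth_factor threshold of 2 is an arbitrary assumption standing in for "much higher" and "much lower".

```python
import re

# Matches one public pool: header line, free-list line, hits/misses line.
# Layout assumed from the 'show buffers' output shown in this document.
POOL_RE = re.compile(
    r"^(?P<name>\w+) buffers, (?P<size>\d+) bytes "
    r"\(total (?P<total>\d+), permanent (?P<permanent>\d+)[^)]*\):\s*\n"
    r"\s*(?P<free>\d+) in free list[^\n]*\n"
    r"\s*\d+ hits, \d+ misses, (?P<trims>\d+) trims, (?P<created>\d+) created",
    re.MULTILINE,
)


def suspect_pools(show_buffers_text, growth_factor=2):
    """Return names of pools that show all three leak signs."""
    suspects = []
    for m in POOL_RE.finditer(show_buffers_text):
        total, permanent = int(m["total"]), int(m["permanent"])
        free, trims, created = int(m["free"]), int(m["trims"]), int(m["created"])
        grown = total > growth_factor * permanent
        not_returned = created > 0 and trims < created / growth_factor
        if grown and free == 0 and not_returned:
            suspects.append(m["name"])
    return suspects


# Two pools copied from the output above.
SAMPLE_OUTPUT = """\
Small buffers, 104 bytes (total 6144, permanent 6144):
     3867 in free list (2048 min, 8192 max allowed)
     248913132 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Medium buffers, 256 bytes (total 86401, permanent 3000, peak 86401 @ 05:18:11):
     0 in free list (64 min, 3000 max allowed)
     9697361 hits, 203293 misses, 2208 trims, 85609 created
     167633 failures (651288 no memory)
"""

print(suspect_pools(SAMPLE_OUTPUT))  # ['Medium']
```

A pool flagged this way is only a candidate; as noted above, a leak will not always present like this, so the result should be treated as a starting point rather than a diagnosis.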

For this scenario, the following outputs should first be collected -

show clock

show mem stat

show buffers

show log

Once the problematic pool or pools are narrowed down, we can then focus in on that pool with this output -

show buffer pool <pool name> packet

This may provide extensive output and usually a few pages of it are enough to give an idea of what the packets are that are residing in these buffers and who allocated them.

2) Another possible cause is a network/traffic event. This will often manifest itself as excessive utilization in multiple pools. It is recommended that the above outputs be collected, along with 'show buffer pool <pool name> packet' for the pools which show this utilization, when opening a TAC case. This can often be caused by an abnormal or unexpected traffic flow which must be process-switched by the device. Because the flow may be bursty and quick, we can run out of I/O memory in a relatively short period of time. Troubleshooting this type of problem usually involves identifying the source of the traffic to determine whether it was abnormal and, if so, eliminating or blocking it.

3) Another, rarer, possibility is that a specific pool is more heavily utilized because of certain traffic that is needed in the network environment. This traffic may, for some reason, need to be process-switched, with no way to avoid this at the current time. This would need to be confirmed further, and appropriate action could be taken at that time. The same outputs from step 1 would apply here.

Things to Look For

On most routers, the MALLOCFAIL error examples given above will be standard. On 6500s and 7600s with SUPs or RSPs, these errors may vary. For example, the following error was taken from the RP logs on a 6500 switch -

%SYS-SP-2-MALLOCFAIL: Memory allocation of 820 bytes failed from 0x40C83B60, alignment 32 
Pool: I/O  Free: 48  Cause: Not enough free memory 
Alternate Pool: None  Free: 0  Cause: No Alternate pool

Within the MALLOCFAIL error, we see that the SP of the SUP is reporting the problem, not the RP. If the problem were associated with the RP, the 'SP' designation in the error would be missing. For this reason, the outputs above would need to be taken from the SP, which can be accomplished by preceding the commands with -

remote command switch

The error message may also sometimes refer to the standby SUP/RSP's RP or SP as denoted by 'STDBY' and would need to be collected accordingly.
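To tie the sections together, here is a minimal Python sketch that turns a logged MALLOCFAIL message into the list of first-pass commands from this document, adding the 'remote command switch' prefix when the message carries the 'SP' designation. The function name commands_to_collect is hypothetical and the message matching is assumed from the samples shown here. Standby messages are deliberately rejected, since this document does not give the exact command form for the standby SUP/RSP.

```python
import re

# First-pass command lists taken from the 'Processor' and 'I/O' sections above.
PROCESSOR_CMDS = ["show clock", "show mem stat", "show proc mem sorted",
                  "show mem all totals", "show log"]
IO_CMDS = ["show clock", "show mem stat", "show buffers", "show log"]


def commands_to_collect(message):
    """Return the commands to run for a given MALLOCFAIL log message."""
    head = re.search(r"%SYS-(\S*?)2-MALLOCFAIL", message)
    pool = re.search(r"Pool: (Processor|I/O)", message)
    if head is None or pool is None:
        raise ValueError("not a recognizable MALLOCFAIL message")
    origin = head.group(1)  # "" for the RP, "SP-" for the switch processor
    if "STDBY" in origin:
        # Not covered here: the document does not state the standby syntax.
        raise ValueError("standby SUP/RSP: collect from the standby manually")
    cmds = PROCESSOR_CMDS if pool.group(1) == "Processor" else IO_CMDS
    prefix = "remote command switch " if origin.startswith("SP") else ""
    return [prefix + c for c in cmds]


sp_message = (
    "%SYS-SP-2-MALLOCFAIL: Memory allocation of 820 bytes failed from 0x40C83B60, alignment 32 \n"
    "Pool: I/O  Free: 48  Cause: Not enough free memory \n"
    "Alternate Pool: None  Free: 0  Cause: No Alternate pool"
)
for cmd in commands_to_collect(sp_message):
    print(cmd)
```

The follow-up commands ('show proc mem <PID #>', 'show mem dead totals', 'show buffer pool <pool name> packet') depend on what the first pass shows, so they are left out of the sketch.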

Summary

Collecting the outputs given above may help to speed up case resolution and bring stability to your device more quickly. As is always the case, if any questions arise or if there is uncertainty about how the memory on a device is performing, it's best to open up a TAC case to have it checked out.

Additional References

Troubleshooting Memory Problems

Definitions for Buffer Pool Fields

Arumugam Muthaiah · 09-21-2012

Excellent doc!!!! It helps to understand and basic memory troubleshooting