07-29-2021 01:51 AM
Dear experts,
I’d appreciate your comments and advice on a situation we encountered recently.
Our customer’s ASR1001-X router worked just fine until it required rebooting due to maintenance activity.
The router refused to boot (we guess because of the heavy configuration file), and our customer finally managed to bring it up only after several hours by loading the configuration manually in small portions.
Here are some details:
While the ASR is running, “sh cpu” and “sh memory” look fine (25% CPU, 5 GB of free memory) under a moderate traffic load (mostly telemetry).
After a reboot the router apparently runs short of resources, and we see messages of this type:
%SYS-2-MALLOCFAIL: Memory allocation
Pool: Processor Free: Cause: Memory fragmentation
Alternate Pool: Cause: No Alternate pool
%SYS-2-CHUNKEXPANDFAIL: Could not expand chunk pool for Packet Elements. No memory available -Process= "Chunk Manager"
%SYS-2-CFORKMEM: Process creation of BGP Open failed (no memory). -Process= "BGP Router",
etc.
We know that we are exceeding the datasheet limits of the ASR1001-X, since according to the datasheet “Up to 4,000 tunnels GRE are supported”, but it works fine under load; the problem occurs only during boot.
The questions are:
Unfortunately we can’t take this question to TAC because our service package has expired.
Also, I’m under NDA and can’t upload the full detailed config, logs, etc.
An upgrade to the latest software, asr1001x-universalk9.16.12.05.SPA, has not helped.
Adding a second ASR1001-X is an obvious option.
Thank you!
Mikhail
07-29-2021 09:52 AM
Hello kozharov,
The memory log refers to Chunk Manager; this book explains in detail what Chunk Manager is:
For simplicity's sake, let's just keep in mind that it is a memory manager.
This means that we first need to identify which memory it is managing. This matters because the ASR1k has multiple views of the available memory:
A detailed document on how to check the memory is:
Chunk Manager works within IOSd, so we need to run the command show processes memory sorted to identify how much memory each process was holding at the time the command was run. The output looks like this:
Router# show processes memory sorted
Processor Pool Total: 1821391588 Used: 218319000 Free: 1603072588
lsmpi_io Pool Total: 6295088 Used: 6294116 Free: 972
PID TTY Allocated Freed Holding Getbufs Retbufs Process
0 0 174405308 8586260 134742552 811 137870 *Init*
0 0 21603272 48285960 274932 3 1 *Dead*
0 0 0 0 406304 0 0 *MallocLite*
1 0 431576 0 448716 0 0 Chunk Manager
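As a rough illustration of how the two pool summary lines in the output above can be read, here is a minimal Python sketch that extracts Total/Used/Free per pool and computes the percentage used. The function name parse_pools and the regular expression are my own and only assume the line layout shown in this sample; other IOS XE releases may format the lines differently.

```python
import re

# Matches pool summary lines such as:
#   Processor Pool Total: 1821391588 Used: 218319000 Free: 1603072588
# The layout is assumed from the sample output in this thread.
POOL_RE = re.compile(
    r"^\s*(?P<name>\S+)\s+Pool\s+Total:\s*(?P<total>\d+)\s+"
    r"Used:\s*(?P<used>\d+)\s+Free:\s*(?P<free>\d+)"
)


def parse_pools(output):
    """Return {pool_name: {"total", "used", "free", "used_pct"}} from CLI text."""
    pools = {}
    for line in output.splitlines():
        m = POOL_RE.match(line)
        if not m:
            continue  # not a pool summary line (e.g. the per-process table)
        total = int(m.group("total"))
        used = int(m.group("used"))
        free = int(m.group("free"))
        pools[m.group("name")] = {
            "total": total,
            "used": used,
            "free": free,
            "used_pct": 100.0 * used / total if total else 0.0,
        }
    return pools


# The two pool lines from the sample output above.
SAMPLE = """\
Processor Pool Total: 1821391588 Used: 218319000 Free: 1603072588
lsmpi_io Pool Total: 6295088 Used: 6294116 Free: 972
"""

if __name__ == "__main__":
    for name, p in parse_pools(SAMPLE).items():
        print(f"{name}: {p['used_pct']:.1f}% used, {p['free']} bytes free")
```

Run against a capture taken while the MALLOCFAIL messages are being logged, the same parsing would show how close the Processor pool is to exhaustion at that moment.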
This might be related to the configuration, but there is also a chance that it is not related to the configuration at all. So I suggest skipping any theories for now; it would be better to collect the outputs and share them in a comment on this post.
The next step in narrowing down the usage is to run the command show memory allocating-process totals. Its output is large, but it includes an allocation summary at the end for all the features on the router. Here is an example:
Router# show memory allocating-process totals
<output omitted>
Allocator PC Summary for: Processor
Displayed first 2048 Allocator PCs only
PC Total Count Name
0x243ACE40 137551540 29053 List Headers
0x24EE3FF8 19457616 319 PA FO
0x2425788C 12249176 365 CFT Data Path F
This output includes a Count column, which is useful for identifying the number of memory blocks used by a specific feature.
With this information we should have a good idea of what is consuming the memory. You can also run the show tech command and attach only the Top 100 allocator pc summary section of the file.
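To make the Total and Count columns easier to compare, here is a small Python sketch that parses the Allocator PC summary rows and derives the average block size per allocator. The helper name parse_allocator_summary and the avg_block field are illustrative only, and the parsing assumes the four-column layout shown in the example above.

```python
def parse_allocator_summary(output):
    """Parse 'PC Total Count Name' rows and sort them by total bytes, largest first."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 3)  # the Name column may contain spaces
        if len(parts) < 4 or not parts[0].lower().startswith("0x"):
            continue  # skip headers and anything that is not a data row
        pc, total, count, name = parts
        if not (total.isdigit() and count.isdigit()):
            continue
        total_bytes, blocks = int(total), int(count)
        rows.append({
            "pc": pc,
            "total": total_bytes,
            "count": blocks,
            "name": name.strip(),
            # Average bytes per allocated block for this allocator PC.
            "avg_block": total_bytes / blocks if blocks else 0.0,
        })
    rows.sort(key=lambda r: r["total"], reverse=True)
    return rows


# Rows copied from the example output above.
SAMPLE = """\
Allocator PC Summary for: Processor
Displayed first 2048 Allocator PCs only
PC Total Count Name
0x243ACE40 137551540 29053 List Headers
0x24EE3FF8 19457616 319 PA FO
0x2425788C 12249176 365 CFT Data Path F
"""

if __name__ == "__main__":
    for r in parse_allocator_summary(SAMPLE):
        print(f"{r['name']:<18} {r['total']:>12} bytes in {r['count']:>6} blocks "
              f"(avg {r['avg_block']:.0f} bytes/block)")
```

An allocator with a very high Count, or one whose Total keeps growing between two captures, is usually the first candidate to look at.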
07-30-2021 02:25 AM
07-30-2021 03:02 AM - edited 07-30-2021 03:03 AM
Hello kozharov,
Thanks for attaching the outputs. I saw that many participants in this post asked for:
However, this command shows the platform memory from the Linux kernel perspective, so the output is not detailed, which makes it hard to troubleshoot IOSd memory depletion. It only shows how much memory IOSd consumes, with no detail on what is using it:
ASR_RED# show processes memory platform sorted location r0
System memory: 16303308K total, 3855356K used, 12447952K free,
Lowest: 12447952K
Pid Text Data Stack Dynamic RSS Total Name
--------------------------------------------------------------------------------
24958 352186 1833484 0 80 1833484 9204740 linux_iosd-imag
I want to clear up that misconception, hoping that other readers of this post see this as well. Better commands to narrow down the problem are:
The memory output that you shared indicates that the processor memory has enough free memory:
ASR_RED# show memory allocating-process totals
Head Total(b) Used(b) Free(b) Lowest(b) Largest(b)
Processor 7FDA2AB7D010 6963791024 1038143276 5925647748 5867158488 5702243900
The Processor pool is the pool of memory that you reported as affected:
%SYS-2-MALLOCFAIL: Memory allocation
Pool: Processor Free: Cause: Memory fragmentation
There is a chance that when the device boots and the whole BGP routing table is populated, the memory is fully utilized, and as a result there is not enough memory for a short period of time, which results in that log message.
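Since the log names "Memory fragmentation" as the cause, one simple way to read the Free and Largest columns of the Processor line quoted above is to check how much of the free memory sits outside the single largest free block. The sketch below is only a heuristic of my own, not a Cisco-defined metric, and the function names are hypothetical. Keep in mind that the quoted output was captured during normal operation, so it says nothing about the state during boot.

```python
def parse_pool_line(line):
    """Parse a 'show memory' pool line: name, head, then five byte counters."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("expected: name head total used free lowest largest")
    total, used, free, lowest, largest = (int(v) for v in fields[2:7])
    return {
        "name": fields[0],
        "head": fields[1],
        "total": total,
        "used": used,
        "free": free,
        "lowest": lowest,    # lowest free value recorded for the pool
        "largest": largest,  # largest contiguous free block
    }


def scattered_free_fraction(stats):
    """Fraction of free memory that is NOT part of the largest free block.

    0.0 means all free memory is one contiguous block; values near 1.0 mean
    the free memory is split into many small pieces (heuristic only).
    """
    if stats["free"] == 0:
        return 0.0
    return 1.0 - stats["largest"] / stats["free"]


# Processor line from the output quoted above.
PROCESSOR_LINE = (
    "Processor 7FDA2AB7D010 6963791024 1038143276 "
    "5925647748 5867158488 5702243900"
)

if __name__ == "__main__":
    stats = parse_pool_line(PROCESSOR_LINE)
    print(f"free: {stats['free']} bytes, largest block: {stats['largest']} bytes")
    print(f"free memory outside the largest block: "
          f"{scattered_free_fraction(stats):.1%}")
```

For the quoted snapshot the largest block covers most of the free memory, which is consistent with the observation that the router is healthy once it is up. The same calculation on a capture taken at the moment of the MALLOCFAIL message would be far more informative.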
My suggestion to prove that theory is to enable a script that triggers when the MALLOCFAIL log message is displayed and captures the commands needed to identify what is depleting the memory.
You can use this script:
event manager applet MEMORYTSHOOT authorization bypass
 event syslog pattern "MALLOCFAIL" maxrun 60
 action 000 info type routername
 action 001 cli command "enable"
 action 003 cli command "terminal length 0"
 action 100 set filename "flash:memory-$_event_pub_sec"
 action 105 syslog msg "Memory allocation failure detected, logging data in $filename"
 action 199 file open FD $filename a+
 action 300 foreach cmd "show version,show process memory sorted,show log,show memory allocating-process totals" ","
 action 301 cli command $cmd
 action 302 file puts FD "------------------ $cmd ------------------"
 action 303 file puts FD "$_cli_result"
 action 399 end
 action 900 file close FD
 action 998 syslog msg "File '$filename' created with outputs to troubleshoot the memory allocation failure event"
 action 999 cli command "end"
end
If you prefer not to run the script, then just run these commands manually when the log message appears:
07-30-2021 04:33 AM
Dear David,
Thank you for your attempt to help.
“There is a chance that when the device boots and all the BGP routing table is populated the memory is fully utilized, and due to that event there is not enough memory for a short period of time which results into that log message.” – yes, we explain the situation the same way.
BTW, while booting with the full configuration the console becomes unresponsive.
Also, I’ve got some more details: the router managed to boot with a partial config of 6000 tunnels. With the full configuration (7000+ tunnels) it tried to boot for more than an hour without success.
During normal operation there is no SYS-2-MALLOCFAIL message; it appeared only during the boot process… During normal operation the ASR works just perfectly.
Another problem is that our customer doesn’t want to run any experiments on the live network just to study the situation and requires solid grounds for any further action, so there is no way to apply a script and try another reboot, sorry…
07-30-2021 04:58 AM
@kozharov wrote:
BTW, while booting with the full configuration the console becomes unresponsive.
Is this the same behaviour if, say, the WAN ports are disabled?
07-30-2021 05:29 AM
Dear Leo,
Here is more detailed information from the customer about the real incident and two later experiments with the router:
Real situation:
start booting up with full config
no response from router for 40 min
unplugging interfaces
console becomes responsive in several minutes
int range 6001-7600 shutdown
reboot + plugging interfaces back -> router comes up
int range 6001-7600 no shutdown
router with full config comes up
Experiment 1
start booting up with full config and unplugged interfaces
router is up with full config with unplugged interfaces
plugging interfaces back -> console becomes unresponsive
router never comes up
Experiment 2
start booting up with 6000 tunnels and unplugged interfaces
router is up with 6000 tunnels with unplugged interfaces
plugging interfaces back -> router finally comes up
int range 6001-7600 no shutdown
router with full config is up
Mikhail
07-30-2021 05:51 AM
I think the router is being DDoS-ed.
07-30-2021 06:00 AM
Dear Leo,
No, no and no! The router is DoS-ed by its own overwhelming configuration! And it seems nothing can be done to help the router swallow and process the full configuration.
07-30-2021 06:07 AM
Ok, try this way:
07-30-2021 06:22 AM
Dear Leo,
I can't perform exactly this procedure, as our customer refuses any further activity on the live network, but I know the result from the earlier experiments. With 6000 or fewer GRE tunnels in the configuration the router comes up; with more than 6000 GRE tunnels the router becomes unresponsive and never comes up.
Mikhail
07-30-2021 06:37 AM
¯\_(ツ)_/¯
07-30-2021 05:12 AM
Thanks for the details kozharov.
With all the information and evidence so far, we can only conclude that the whole setup at startup is too much for the memory. Nevertheless, if the customer agrees at any point to gather further details, I would suggest running the commands and opening a case with TAC, since they can help you narrow down the memory usage and identify a solution or workaround.
Every feature configured on the device requires some amount of memory, and features eventually compete with each other to obtain it. Since this only occurs during the boot process, you might want to sync with the customer's network architects and discuss the possibility of applying the configuration one section at a time through automation.
One example could be a script that starts by applying QoS, moves on to the routing section when that is done, and brings the tunnels up at the end.
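To make the staged idea a bit more concrete, here is a minimal Python sketch that only builds the order of operations: configuration sections first, then the tunnels in fixed-size batches. Everything in it is an assumption for illustration: the section names, the tunnel numbering 1 to 7600 (taken loosely from the ranges mentioned in this thread), and the batch size of 500. How each stage is actually pushed to the router, and how long to wait for BGP to settle between stages, is deliberately left out.

```python
def batch_ranges(first, last, batch_size):
    """Split the inclusive ID range [first, last] into consecutive (start, end) batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ranges = []
    start = first
    while start <= last:
        end = min(start + batch_size - 1, last)
        ranges.append((start, end))
        start = end + 1
    return ranges


def staged_plan(sections, tunnel_first, tunnel_last, batch_size):
    """Return an ordered list of stages: sections first, then tunnel batches."""
    plan = [("section", name) for name in sections]
    plan += [("tunnels", r) for r in batch_ranges(tunnel_first, tunnel_last, batch_size)]
    return plan


if __name__ == "__main__":
    # Hypothetical values; adjust to the real configuration layout.
    plan = staged_plan(["qos", "routing"], tunnel_first=1, tunnel_last=7600,
                       batch_size=500)
    for step, (kind, detail) in enumerate(plan, start=1):
        if kind == "section":
            print(f"step {step}: apply section '{detail}'")
        else:
            print(f"step {step}: bring up tunnels {detail[0]}-{detail[1]}, "
                  f"then wait for the router to settle")
```

Smaller batches mean more steps but a lower memory peak per step. A reasonable batch size would have to be found in a lab or a maintenance window, not guessed.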
Cisco Professional Services can assist you and the customer to find the best way to approach this, or design the network based on other equipment with enough capabilities to handle the load.
I wish you a good weekend.
07-30-2021 05:36 AM
Dear David, thank you for all your attention!
It seems nothing can be done to resolve the situation.
So either the total amount of tunnels should be shared between two routers, or a more powerful router should come into place (but it is still not clear which one can help).
Thank you!
Mikhail
07-30-2021 07:26 AM
Hello @kozharov ,
>> So either total amount of tunnels should be shared between two routers
I would go this way; in any case, you have no redundancy at node level at the moment. But you need a network that can recover from a reboot of one node without human intervention.
What makes the load heavy is that each GRE tunnel has its own BGP session associated with it.
@David Spindola has provided a lot of useful information but you know you have gone beyond the platform limits.
Hope to help
Giuseppe
07-30-2021 07:41 AM
Dear Giuseppe, thank you for your attention to the problem and for your valuable comments.
Mikhail