ASR1001-X booting up problem due to heavy configuration

kozharov
Level 1

Dear experts,

 

I would appreciate your comments and advice on a situation we encountered recently.

 

Our customer’s ASR1001-X router worked fine until it had to be rebooted for maintenance.

 

The router refused to boot (we suspect because of the heavy configuration file). The customer finally got it running only after several hours, by loading the configuration manually in small portions.

 

Here are some details:

 

  • cisco ASR1001-X (1NG) processor (revision 1NG), 16G of physical memory, asr1001x-universalk9.16.03.06.SPA.bin
  • 7000+ active GRE tunnels and 7000+ BGP peers over these tunnels (mainly advertising a default route and receiving a small number of specifics).

 

 

While the ASR is running, “sh cpu” and “sh memory” look fine (25% CPU, 5 GB of free memory) under a moderate traffic load (mostly telemetry).

 

 

After a reboot the router apparently runs out of resources, and we see messages of this type:

 

%SYS-2-MALLOCFAIL: Memory allocation

Pool: Processor  Free:  Cause: Memory fragmentation

Alternate Pool: Cause: No Alternate pool

%SYS-2-CHUNKEXPANDFAIL: Could not expand chunk pool for Packet Elements. No memory available -Process= "Chunk Manager"

%SYS-2-CFORKMEM: Process creation of BGP Open failed (no memory). -Process= "BGP Router",

etc.

 

 

We know that we are exceeding the datasheet limits of the ASR1001-X (according to the datasheet, “Up to 4,000 tunnels GRE are supported”), but it works fine under load; the problem occurs only during boot.

 

 

The questions are:

 

  • Is there any workaround to help the ASR1001-X boot with this heavy configuration?

 

  • What would be the recommended upgrade for the current ASR1001-X? Is a move to a more powerful platform needed?

 

Unfortunately we can’t take this question to TAC because our service package has expired.

 

 

Also, I’m under NDA and can’t upload the full detailed config, logs, etc.

 

 

An upgrade to the latest software, asr1001x-universalk9.16.12.05.SPA, has not helped.

 

Adding a second ASR1001-X is an obvious option.

 

 

Thank you!

 

Mikhail

30 Replies

David Spindola
Cisco Employee

Hello kozharov,

 

The memory log message refers to Chunk Manager. This book explains in detail what Chunk Manager is:

  • Inside Cisco IOS Software Architecture (CCIE Professional Development Series)

For simplicity's sake, let's just keep in mind that it is a memory manager.

 

This means that we first need to identify which memory it is managing. This matters because the ASR1k has multiple views of the available memory:

  • Physical memory installed
  • Memory assigned to the Linux Kernel
  • Memory assigned per process daemon (like IOSd)
  • Swap memory
  • QFP memory

A detailed document on how to check the memory is:

Chunk Manager works within IOSd, so we need to run the command show process memory sorted to see how much memory each process is holding at the time the command is run. The output looks like this:

Router# show processes memory sorted
Processor Pool Total: 1821391588 Used:  218319000 Free: 1603072588
 lsmpi_io Pool Total:    6295088 Used:    6294116 Free:        972 

 PID TTY  Allocated      Freed    Holding    Getbufs    Retbufs Process
   0   0  174405308    8586260  134742552        811     137870 *Init*
   0   0   21603272   48285960     274932          3          1 *Dead*
   0   0          0          0     406304          0          0 *MallocLite*
   1   0     431576          0     448716          0          0 Chunk Manager

 

This might be related to the configuration, but there is also a chance that it is not related to the configuration at all. So I suggest skipping any theories for now; it would be better to collect the outputs and share them in a comment on this post.

 

The next step to narrow down the usage is to run the command show memory allocating-process totals. The output is large, but it includes an allocation summary at the end covering all the features on the router. Here is an example:

Router# show memory allocating-process totals  

<output omitted>

Allocator PC Summary for: Processor
Displayed first 2048 Allocator PCs only

    PC          Total   Count  Name
0x243ACE40  137551540   29053  List Headers
0x24EE3FF8   19457616     319  PA FO
0x2425788C   12249176     365  CFT Data Path F

 

This output includes a Count column, which shows the number of memory blocks used by a specific feature.

With this information we should have a good idea of what is consuming the memory. You can also run the show tech command and attach only the Top 100 allocator pc summary section of the file.
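As a rough illustration of how the Allocator PC Summary can be ranked offline, here is a minimal Python sketch. It assumes the summary rows have the four-column layout shown in the example above (PC, Total, Count, Name); the function names and the regular expression are hypothetical and not part of any Cisco tool.

```python
import re

# Sample rows copied from the example output in this post.
SAMPLE = """\
    PC          Total   Count  Name
0x243ACE40  137551540   29053  List Headers
0x24EE3FF8   19457616     319  PA FO
0x2425788C   12249176     365  CFT Data Path F
"""

# Assumed row layout: hex PC, total bytes, block count, free-text name.
ROW = re.compile(r"^(0x[0-9A-Fa-f]+)\s+(\d+)\s+(\d+)\s+(.+?)\s*$")


def parse_allocator_summary(text):
    """Return one dict per allocator row; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            pc, total, count, name = m.groups()
            rows.append({"pc": pc, "total": int(total),
                         "count": int(count), "name": name})
    return rows


def top_allocators(rows, n=10):
    """Sort allocators by total bytes held, largest first."""
    return sorted(rows, key=lambda r: r["total"], reverse=True)[:n]


rows = parse_allocator_summary(SAMPLE)
for r in top_allocators(rows):
    avg = r["total"] // r["count"]  # average block size in bytes
    print(f'{r["name"]:<18} {r["total"]:>12} bytes in {r["count"]:>6} blocks '
          f'(avg {avg} B/block)')
```

Comparing two captures taken this way (one in steady state, one at the time of the failure) would show which allocator grows, which is the kind of evidence TAC usually asks for.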

Hello David,

 

Thank you for your detailed comment.

 

These outputs are attached for reference:

 

- show platform software status control-processor brief
- show processes memory platform sorted location r0
- show memory allocating-process totals

Hello kozharov,

 

Thanks for attaching the outputs. I saw that many participants in this post asked for:

  • show processes memory platform sorted location r0

 

However, this command shows platform memory from the Linux kernel's perspective, so it is not a detailed output, which makes it hard to troubleshoot IOSd memory depletion. It shows only how much memory IOSd consumes, with no detail on what is using it:

ASR_RED# show processes memory platform sorted location r0 
System memory: 16303308K total, 3855356K used, 12447952K free,
Lowest: 12447952K
   Pid    Text      Data   Stack   Dynamic       RSS     Total              Name  
--------------------------------------------------------------------------------
 24958  352186   1833484       0        80   1833484   9204740   linux_iosd-imag 

 

I want to clear up that misconception, hoping that the audience of this post also reads this. Better commands to narrow down the problem are:

  • show memory allocating-process totals
  • show process memory sorted

 

The memory output that you shared indicates that the Processor pool has enough free memory:

ASR_RED#  show memory allocating-process totals 
                Head    Total(b)     Used(b)     Free(b)   Lowest(b)  Largest(b)
Processor  7FDA2AB7D010   6963791024   1038143276   5925647748   5867158488   5702243900
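
To make the "enough free memory" reading concrete, the arithmetic on the Processor pool line can be sketched in Python. The figures are the ones from the output above; the function name is hypothetical, and treating the ratio of Largest to Free as a fragmentation hint is a rule of thumb assumed here, not an official Cisco metric.

```python
def pool_summary(total, used, free, largest):
    """Derive simple ratios from one 'show memory' pool line (values in bytes)."""
    return {
        "used_pct": 100.0 * used / total,
        "free_pct": 100.0 * free / total,
        # Share of the free memory that sits in the single largest block.
        # A value near 100 suggests little fragmentation at capture time.
        "largest_vs_free_pct": 100.0 * largest / free,
    }


# Values taken from the Processor line quoted in this post.
s = pool_summary(total=6963791024, used=1038143276,
                 free=5925647748, largest=5702243900)
for key, value in s.items():
    print(f"{key}: {value:.1f}")
```

On these numbers roughly 15% of the pool is in use and most of the free memory is one contiguous block, so this steady-state snapshot does not show the fragmentation reported in the MALLOCFAIL message. That fits the observation that the failure appears only during boot.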

 

The Processor pool is the pool of memory that you reported as affected:

%SYS-2-MALLOCFAIL: Memory allocation
Pool: Processor  Free:  Cause: Memory fragmentation

 

There is a chance that when the device boots and the whole BGP routing table is populated, memory is fully utilized, so for a short period there is not enough memory, which results in that log message.

 

To prove that theory, I suggest enabling a script that triggers when the MALLOCFAIL log message appears and captures the commands needed to identify what is depleting the memory.

 

You can use this script:

event manager applet MEMORYTSHOOT authorization bypass

 event syslog pattern "MALLOCFAIL" maxrun 60

 action 000 info type routername
 action 001 cli command "enable"
 action 003 cli command "terminal length 0"

 action 100 set filename "flash:memory-$_event_pub_sec"
 action 105 syslog msg "Memory allocation failure detected, logging data in $filename"

 action 199 file open FD $filename a+

 action 300 foreach cmd "show version,show process memory sorted,show log,show memory allocating-process totals" ","

 action 301 cli command $cmd
 action 302 file puts FD "------------------ $cmd ------------------"
 action 303 file puts FD "$_cli_result"
 action 399 end

 action 900 file close FD
 action 998 syslog msg "File '$filename' created with outputs to troubleshoot the memory allocation failure event"
 action 999 cli command "end"

end

 

If you prefer not to run the script, run these commands manually when the log message appears:

  • show version
  • show process memory sorted
  • show log
  • show memory allocating-process totals

 

Dear David,

 

Thank you for your attempt to help.

 

“There is a chance that when the device boots and all the BGP routing table is populated the memory is fully utilized, and due to that event there is not enough memory for a short period of time which results into that log message.” – yes, we explain the situation the same way.

 

BTW, during boot with the full configuration the console becomes unresponsive.

 

Also, I’ve got some more details: the router managed to boot with a partial config of 6000 tunnels. With the full configuration (7000+ tunnels) it kept trying to boot for more than an hour, without success.

 

 

 

During normal operation there is no SYS-2-MALLOCFAIL message; it appeared only during the boot process. During normal operation the ASR works perfectly.

 

Another problem is that our customer doesn’t want to run any experiments on the live network just to study the situation, and demands solid grounds for any further action. So there is no way to apply any scripting and try another reboot, sorry…


@kozharov wrote:

BTW, during boot with the full configuration the console becomes unresponsive.


Is this the same behaviour if, say, the WAN ports are disabled?

Dear Leo,

 

Here is more detailed information from the customer about the real incident and two later experiments with the router:

 

 

Real situation:

start booting up with full config

no response from router for 40 min

unplugging interfaces

console becomes responsive in several minutes

int range 6001-7600 shutdown

reboot + plugging interfaces back -> router comes up

int range 6001-7600 no shutdown

router with full config comes up

 

Experiment 1

start booting up with full config and unplugged interfaces

router is up with full config with unplugged interfaces

plugging interfaces back -> console becomes unresponsive

router never comes up

 

Experiment 2

start booting up with 6000 tunnels and unplugged interfaces

router is up with 6000 tunnels with unplugged interfaces

plugging interfaces back -> router finally comes up

int range 6001-7600 no shutdown

router with full config is up

 

 

Mikhail

I think the router is being DDoS-ed.

Dear Leo,

 

no, no and no! The router is DoSed by its own overwhelming configuration! And it seems nothing can be done to help the router swallow and process the full configuration.

Ok, try this way: 

  1. Disable all the WAN links and Tunnels. 
  2. Reboot the router.  The router should boot up with all WAN links and Tunnels disabled. 
  3. Enable the WAN link ONLY.
  4. Watch the CPU, memory and the interface counters.  
  5. Say, if nothing happens after 30 minutes, enable the Tunnels 100 at a time -- Do not enable all the tunnels.  
  6. Again, watch the CPU, memory and interface counters. 
  7. At which step does the router become unresponsive?
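
The "100 at a time" idea in the steps above could be prepared offline with a small script. This is only a sketch: the tunnel numbering 1 to 7600 is an assumption based on the interface range mentioned earlier in the thread, the helper names are hypothetical, and how each batch is pushed to the router (console paste, SSH, a management tool) is left out on purpose.

```python
def tunnel_batches(first, last, size=100):
    """Yield consecutive ranges of tunnel numbers, at most `size` per batch."""
    start = first
    while start <= last:
        end = min(start + size - 1, last)
        yield range(start, end + 1)
        start = end + 1


def render_batch(batch):
    """Render one batch as plain per-interface 'no shutdown' config lines."""
    lines = []
    for n in batch:
        lines.append(f"interface Tunnel{n}")
        lines.append(" no shutdown")
    return "\n".join(lines)


# Assumed numbering: Tunnel1 through Tunnel7600.
batches = list(tunnel_batches(1, 7600, size=100))
print(len(batches), "batches")
print(render_batch(batches[0]).splitlines()[:4])
```

Between batches, the CPU, memory and interface counters would be checked as in steps 4 and 6 before moving on, so the batch at which the router stops responding can be identified.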

Dear Leo,

 

We can't perform exactly this procedure because our customer refuses any further activity on the live network, but I know the result from earlier experiments. If there are 6000 or fewer GRE tunnels in the configuration, the router comes up; if there are more than 6000, the router becomes unresponsive and never comes up.

 

Mikhail

¯\_(ツ)_/¯

Thanks for the details kozharov.

 

With all the information and evidence so far, we can only conclude that the whole setup at startup is too much for the memory. Nevertheless, if the customer at any point agrees to gather further details, I would suggest running the commands and opening a case with TAC, since they can help you narrow down the memory usage and identify a solution or workaround.

 

Every feature configured on the device requires some amount of memory, and the features eventually compete with each other to obtain it. Since this only occurs during the boot process, you might want to sync with this customer's network architects and discuss applying the configuration one section at a time through automation.

 

One example could be a script that starts by applying QoS, moves on to the routing section when that is done, and brings the tunnels up at the end.
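
As a hedged illustration of that staging idea, the Python sketch below splits a saved configuration into top-level blocks and groups them into stages in the order QoS, routing, tunnels. The classification prefixes, the stage names and the sample config are all assumptions made for the example; a real configuration has more block types (banners, route-maps, and so on) that this sketch simply puts in the first stage or does not handle.

```python
# Stages are applied in this order; "other" holds everything unclassified.
STAGES = ["other", "qos", "routing", "tunnels"]


def classify(header):
    """Map the first line of a top-level block to a stage (assumed prefixes)."""
    if header.startswith(("class-map", "policy-map")):
        return "qos"
    if header.startswith("router "):
        return "routing"
    if header.startswith("interface Tunnel"):
        return "tunnels"
    return "other"


def split_blocks(config_text):
    """Group lines into blocks: an unindented line starts a new block."""
    blocks = []
    for line in config_text.splitlines():
        if not line.strip() or line.strip() == "!":
            continue  # skip blank lines and '!' separators
        if line[0].isspace() and blocks:
            blocks[-1].append(line)
        else:
            blocks.append([line])
    return blocks


def stage_config(config_text):
    """Return a dict of stage name -> config lines, keyed in apply order."""
    staged = {name: [] for name in STAGES}
    for block in split_blocks(config_text):
        staged[classify(block[0])].extend(block)
    return staged


# Hypothetical sample, not taken from the customer's configuration.
SAMPLE = """\
hostname R1
!
interface Tunnel1
 ip address 10.0.0.1 255.255.255.252
!
policy-map PM
 class class-default
!
router bgp 65000
 neighbor 10.0.0.2 remote-as 65001
"""

staged = stage_config(SAMPLE)
for name in STAGES:
    print(f"--- {name} ---")
    print("\n".join(staged[name]))
```

Each stage would then be applied separately, waiting for CPU and memory to settle before the next one, which mirrors what the customer did by hand when loading the configuration in small portions.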

 

Cisco Professional Services can assist you and the customer to find the best way to approach this, or design the network based on other equipment with enough capabilities to handle the load.

 

I wish you a good weekend.

Dear David, thank you for all your attention!

 

It seems nothing can be done to resolve the situation.

 

So either the total number of tunnels should be shared between two routers, or a more powerful router should be put in place (but it is still not clear which one would help).

 

 

Thank you!

 

Mikhail

Hello @kozharov ,

 

>> So either total amount of tunnels should be shared between two routers 

 

I would go this way; in any case, you have no redundancy at node level at the moment. But you need a network that can recover from the reboot of one node with no human action.

 

What makes it heavy is that each GRE tunnel has its own BGP session associated with it.

@David Spindola has provided a lot of useful information but you know you have gone beyond the platform limits.

 

Hope to help

Giuseppe

Dear Giuseppe, thank you for your attention to the problem and for your valuable comments.

 

Mikhail
