Solved: Re: 2811, High CPU load and occasional crash. HELP.

Utair Corporation · ‎04-25-2011

2811 router

Cisco IOS Software, 2800 Software (C2800NM-ADVENTERPRISEK9_IVS_LI-M), Version 12.4(24)T4, RELEASE SOFTWARE (fc2)

CPU utilization for five seconds: 82%/73%; one minute: 74%; five minutes: 77%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
116      222084       57773       3844 1.83% 1.69% 1.66%   0 IP Input
183       85148      309479        275 1.03% 1.29% 1.20%   0 HQF Shaper Backg
363       93796       27445       3417 0.23% 0.88% 1.00%   0 IP SNMP
365       83772       13486       6211 0.15% 0.75% 0.87%   0 SNMP ENGINE
381       24640        7660       3216 0.07% 0.45% 0.25%   0 IP-EIGRP: PDM
371       49196       38413       1280 0.07% 0.41% 0.40%   0 MGCP Application
348       34816       63287        550 0.31% 0.37% 0.36%   0 PPP manager
239       29164      116614        250 0.39% 0.35% 0.33%   0 MGCP App STW Tic
382       33772       11170       3023 0.31% 0.33% 0.32%   0 IP-EIGRP: HELLO
   5       18124         832      21783 1.19% 0.27% 0.19%   0 Check heaps
325       24264        9214       2633 0.31% 0.25% 0.20%   0 VOIP_RTCP
323       23748        8350       2844 0.15% 0.18% 0.16%   0 DSMP

What does it mean when CPU load is above 70% and no visible process accounts for it?

The router runs EIGRP.

I'm using CEF. No policy routing.

There is a 5 Mbit/s Internet connection with NAT, and 8 IPsec tunnels.

Two IPVPN 2 Mbit/s connections with GRE tunnels over them.

There is also an MGCP gateway with one E1 PRI trunk to the PBX.

About a month ago it started to reboot occasionally, leaving this crashinfo:

16:19:39 SUR Tue Apr 19 2011: Data Bus Error exception, CPU signal 10, PC = 0x40CD7014
--------------------------------------------------------------------
Possible software fault. Upon reccurence, please collect
crashinfo, "show tech" and contact Cisco Technical Support.
--------------------------------------------------------------------
-Traceback= 0x40CD1014z 0x40CCD80Cz 0x40CCDB64z 0x40CD1ABCz 0x435E1CA0z 0x435E1C84z
$0 : 00000000, AT : 47260000, v0 : 4B8E2289, v1 : 00000000
a0 : 4B8A2BE0, a1 : 0000007D, a2 : 4B8E1C84, a3 : 0000000F
t0 : 0000003C, t1 : 0000003C, t2 : 45438512, t3 : 00FF0000
t4 : 468D0000, t5 : 476A72E8, t6 : 00000173, t7 : 0000001C
s0 : 477A0000, s1 : 4A4E9480, s2 : 477A5660, s3 : 48DDF0D4
s4 : 00000001, s5 : 4B8E228A, s6 : 477A0000, s7 : 00000000
t8 : 47650000, t9 : 45440000, k0 : 4A518040, k1 : 435F793C
gp : 4726D8A0, sp : 4A4E9470, s8 : 00000000, ra : 95DCB70A
EPC : 40CD7014, ErrorEPC : BFC00E8C, SREG : 3400FF03
MDLO : 00113000, MDHI : 00000000, BadVaddr : A5408844
TEXT_START : 0x40015900
DATA_START : 0x444C6000
Cause 8000001C (Code 0x7): Data Bus Error exception
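
For readers following along: the "(Code 0x7)" on the Cause line can be reproduced from the raw register value. The register names in the dump look MIPS-style, so this sketch assumes the MIPS32 Cause layout (BD flag in bit 31, exception code in bits 6..2). That layout is an assumption about this platform, and decode_cause is a made-up helper name, not an IOS command.

```python
# Assumed MIPS32 Cause register layout: BD = bit 31, ExcCode = bits 6..2.
def decode_cause(cause):
    """Return (exception_code, in_branch_delay_slot) for a raw Cause value."""
    exc_code = (cause >> 2) & 0x1F          # 5-bit exception code field
    in_branch_delay = bool((cause >> 31) & 1)
    return exc_code, in_branch_delay

exc_code, in_branch_delay = decode_cause(0x8000001C)
print(hex(exc_code), in_branch_delay)  # 0x7 True
```

Under that assumption, code 7 is the data bus error case, which agrees with the text the router printed.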

David Aicher · ‎04-25-2011

The show process cpu command is a bit cryptic.

http://www.cisco.com/en/US/products/sw/iosswrel/ps1828/products_tech_note09186a00800a65d0.shtml#showproccpu

The five-second value is split as total/interrupt; what is left over is attributed to processes. In your case, with "82%/73%", less than 10% is attributed to processes. Interrupt is a somewhat vague label, but it is where most of the work is normally done: packet forwarding, including voice and encryption, is all handled at interrupt level. Here, 73% under interrupt may be normal for this router given the traffic and features you have.
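
As a quick illustration of that split, here is a minimal Python sketch that pulls the total/interrupt pair out of the header line and subtracts. The function name split_cpu is made up for this example; only the header line itself comes from the output posted above.

```python
import re

def split_cpu(line):
    """Return (total, interrupt, process) percentages from the five-second field."""
    m = re.search(r"five seconds:\s*(\d+)%/(\d+)%", line)
    if m is None:
        raise ValueError("no five-second total/interrupt pair found")
    total, interrupt = int(m.group(1)), int(m.group(2))
    # Whatever is not spent at interrupt level is process-level load.
    return total, interrupt, total - interrupt

header = "CPU utilization for five seconds: 82%/73%; one minute: 74%; five minutes: 77%"
total, interrupt, process = split_cpu(header)
print(f"total={total}% interrupt={interrupt}% process={process}%")
# total=82% interrupt=73% process=9%
```

The 9% left over is what the per-process rows have to add up to, which is why no single process in the list looks busy.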

The 2811 is rated as a 2 T1/E1 box with features enabled. This is the independent test report from Miercom:

http://www.miercom.com/dl.php?fid=20061201&type=report

The CPU load is explainable, but the crashes are a different story. I would suggest opening a TAC case to have the crashinfo files analyzed. Even with high CPU the router should not crash. It may be slow to respond, but crashing is not normal under any circumstances.

Regards

Dave

Utair Corporation · ‎04-25-2011

We have no SmartNet contract for this router.

Assuming the crashes started only a couple of weeks ago and there were no IOS changes and no significant config changes, could it be a hardware issue?

Michael Simon · ‎04-28-2011

The crash is a bus error crash.

These can be hardware but are typically software. There are three ways you get a bus error crash: the system attempts to read from or write to a memory address at which there is no actual memory; the system attempts to write to read-only memory; or the system attempts a valid action at an address where there actually is memory, and the access fails because of a hardware fault.

The first is the most common cause.

The way to identify a hardware failure bus error is to compare the address the system tried to use to the 'show region' command.

The 'show region' command lists the start and end of all valid memory areas.

The address is listed in the crashinfo file as the BadVaddr.

In your crash the bad address is: BadVaddr : A5408844

See if this address is in a valid area of memory. If it is, you have a hardware failure on that memory.

........Mike

Utair Corporation · ‎04-28-2011

#sho region
Region Manager:

      Start         End     Size(b) Class Media Name
0x0F200000 0x0FFFFFFF    14680064 Iomem R/W    iomem:(uncached_iomem_region)
0x3F200000 0x3FFFFFFF    14680064 Iomem R/W    iomem
0x40000000 0x4F1FFFFF   253755392 Local R/W    main
0x400152A0 0x444BFFFF    72002912 IText R/O    main:text
0x444C59A0 0x47265C1F    47841920 IData R/W    main:data
0x47265C20 0x47D8C6DF    11692736 IBss   R/W    main:bss
0x47D8C6E0 0x4F1FFFFF   122108192 Local R/W    main:heap
0x80000000 0x8F1FFFFF   253755392 Local R/W    main:(main_k0)
0xA0000000 0xAF1FFFFF   253755392 Local R/W    main:(main_k1)

Cisco 2811 (revision 53.51) with 247808K/14336K bytes of memory.

How could that be? There is 256M in the router, yet show region shows two more blocks of 253755392 bytes each.

Maybe they are blocks for some devices?

BadVaddr : A5408844 falls into the last region.
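
For anyone repeating this check, the lookup and the size arithmetic can be done with a short Python sketch. The region list is copied from the output above; regions_containing is a made-up helper name. Reading main_k0 and main_k1 as two extra address windows onto the same RAM (like MIPS kseg0/kseg1), not as additional memory, is an assumption based on the region names and their identical sizes.

```python
# Regions copied from the 'show region' output: (start, end, name).
REGIONS = [
    (0x0F200000, 0x0FFFFFFF, "iomem:(uncached_iomem_region)"),
    (0x3F200000, 0x3FFFFFFF, "iomem"),
    (0x40000000, 0x4F1FFFFF, "main"),
    (0x400152A0, 0x444BFFFF, "main:text"),
    (0x444C59A0, 0x47265C1F, "main:data"),
    (0x47265C20, 0x47D8C6DF, "main:bss"),
    (0x47D8C6E0, 0x4F1FFFFF, "main:heap"),
    (0x80000000, 0x8F1FFFFF, "main:(main_k0)"),
    (0xA0000000, 0xAF1FFFFF, "main:(main_k1)"),
]

def regions_containing(addr):
    """Return the names of all listed regions whose range includes addr."""
    return [name for start, end, name in REGIONS if start <= addr <= end]

bad_vaddr = 0xA5408844  # BadVaddr from the crashinfo
print(regions_containing(bad_vaddr))  # ['main:(main_k1)']

# Size check against "247808K/14336K bytes of memory".
main_kb = 253755392 // 1024
iomem_kb = 14680064 // 1024
print(main_kb, iomem_kb, (main_kb + iomem_kb) // 1024)  # 247808 14336 256
```

So main plus iomem already accounts for the full 256M, and the bad address sits inside a listed region.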

Michael Simon · ‎04-28-2011

We reallocate memory through a memory controller into virtual memory addresses.

As a result, the actual addresses used will be different from what they would be if we simply had a block of physically addressed memory.

You have a physical memory failure and should get an RMA or, if you do not have a support contract, buy replacement memory.

........Mike

Utair Corporation · ‎04-28-2011

Thank you very much, Mike!

I've already replaced the 2811 with another one. I will try to test the memory in a PC if it fits.

Is there a memory testing tool on the router? Maybe a special image?

Michael Simon · ‎05-03-2011

There is no special image to test memory.

As the system boots it does a quick scan through the memory, but this is not an exhaustive failure analysis.

There is an old memory command I have not tried in a very long time:

router# show memory scan

This command basically just runs a parity error check.

We treat these events as an issue requiring an RMA because the failure that results in a bus error is not a parity error. The error condition is that the memory failed to respond to the read or write request, not that it returned an inaccurate result.

This is likely a failure of the hardware supporting access to the memory rather than the memory itself. That makes this a more severe event than a simple parity error. This is why we replace the router and memory.

The memory itself does not necessarily have any issue or failure.

If you find my reply answered your question please mark it as answered.

........Mike

Utair Corporation · ‎05-03-2011

A memory test on a PC did not reveal any errors.

What is an RMA, and what are its terms? We bought these routers several years ago without any service contracts.

Michael Simon · ‎05-04-2011

Return Material Authorization, or RMA, is how we replace broken equipment that is either in warranty or under a support contract.

If that were an option for you, you would open a TAC case to have the RMA prepared. Typically the replacement would be delivered the next day, and it would come with the return packaging and a label for the returned hardware.

Without a support contract, and with the hardware past its warranty, you can buy single-event support. Call +1 800-553-2447 option 3 and ask for Service Relations. You will need to be able to tell them the chassis serial number and describe the failure.

The person you speak with initially will open a TAC case and pass you to a service relations person who can explain the costs and how it works. They would accept payment and pass you to a TAC engineer, who would basically repeat the investigation I did and then prepare the RMA.