ROM Error crashes two routers at the same time

Ricky S · ‎10-31-2012

Hi folks,

We are a medium-to-large company with approx. 100 offices located across North America. Every office connects to the others and to the data center via a DMVPN overlay network. The DMVPN hub router (Cisco 2951), R-Q9-1, is located at our data center and is the "workhorse" of the company. We also have a redundant hub router, R-Q9-2, at the data center with the exact same hardware specs.

Each office builds an EIGRP Tunnel0 and Tunnel1 to R-Q9-1 and R-Q9-2 respectively. All data traffic flows over Tunnel0 to R-Q9-1 until it fails, at which point traffic starts to flow over Tunnel1 to R-Q9-2. This has been working seamlessly for the past year since we implemented this design, until yesterday, when both R-Q9-1 and R-Q9-2 suddenly rebooted at the same time, right in the middle of a production day. I confirmed there was no power failure at the data center. A show version on both routers gives me this:

R-Q9-1 uptime is 1 day, 1 hour, 24 minutes

System returned to ROM by address error at PC 0x5C92E28, address 0x5DF36FE9 at 11:52:23 EDT Tue Sep 21 2010

System image file is "flash0:c2951-universalk9-mz.SPA.150-1.M3.bin"

Last reload type: Normal Reload

R-Q9-2 uptime is 1 day, 1 hour, 31 minutes

System returned to ROM by address error at PC 0x5C92E28, address 0x4582626D at 09:20:07 EST Tue Jan 10 2012

System image file is "flash0:c2951-universalk9-mz.SPA.150-1.M3.bin"

Last reload type: Normal Reload

I Googled that error but couldn't find anything specific, other than that it is some kind of bus error.

What I also found a bit off is the time it's showing on both routers' show version output (Sep 21, 2010 and Jan 10, 2012)

Here are the current clock settings on both routers at the time of this writing.

R-Q9-1#sh clock

*16:53:31.216 EDT Wed Oct 31 2012

R-Q9-2#sh clock

*16:59:44.220 EDT Wed Oct 31 2012
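As a rough cross-check that both routers really went down together, the reported uptime can be subtracted from the current clock. This is only a sketch: it assumes the show version and show clock outputs above were captured at about the same moment, and since the uptime is only reported to the minute, the result is approximate.

```python
from datetime import datetime, timedelta

# Clock and uptime values copied from the outputs in this post.
# Assumption: both commands were run at roughly the same time on each router.
OBSERVATIONS = {
    "R-Q9-1": (datetime(2012, 10, 31, 16, 53, 31), timedelta(days=1, hours=1, minutes=24)),
    "R-Q9-2": (datetime(2012, 10, 31, 16, 59, 44), timedelta(days=1, hours=1, minutes=31)),
}


def estimated_boot_time(clock: datetime, uptime: timedelta) -> datetime:
    """Estimate when the router came back up: current clock minus uptime."""
    return clock - uptime


for router, (clock, uptime) in OBSERVATIONS.items():
    print(router, estimated_boot_time(clock, uptime))
# R-Q9-1 2012-10-30 15:29:31
# R-Q9-2 2012-10-30 15:28:44
```

Under that assumption both routers came back within about a minute of each other on Oct 30, which matches what we saw.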

I checked my syslog server and did not find anything specific around the time of the crash; however, the syslog was filled with errors similar to this one:

2469: * Tunnel0: NHRP Encap Error for Resolution Request , Reason: protocol generic error (7) on (Tunnel: 10.10.200.1 NBMA: IP address omitted)

2468: * Tunnel0: NHRP Encap Error for Resolution Request , Reason: protocol generic error (7) on (Tunnel: 10.10.200.1 NBMA: IP address omitted)

2467: * Tunnel0: NHRP Encap Error for Resolution Request , Reason: protocol generic error (7) on (Tunnel: 10.10.200.1 NBMA: IP address omitted)

I have never seen these errors before and all of a sudden they seem to have stopped since this morning.

Please let me know if you guys can figure this one out because I'm completely lost. I'm trying to find out why it happened and if it will happen again.

Ivan Shirshin · ‎10-31-2012

Hi Ricky,

The "show version" indicates that the routers crashed due to an address error.

System returned to ROM by address error at PC 0x5C92E28, address 0x4582626D

Address errors happen when the software tries to access data on incorrectly aligned boundaries; 2-byte and 4-byte accesses are allowed only on even addresses. Such an error usually indicates a software bug.
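To illustrate the alignment rule, here is a minimal Python sketch that tests the two faulting addresses from the "show version" outputs. The helper name is_aligned is made up for this example, and the check simply treats an address as aligned when it is a multiple of the access size.

```python
# Faulting addresses copied from the "show version" outputs in this thread.
FAULT_ADDRESSES = {
    "R-Q9-1": 0x5DF36FE9,
    "R-Q9-2": 0x4582626D,
}


def is_aligned(address: int, access_size: int) -> bool:
    """Return True if address is a multiple of access_size (in bytes)."""
    return address % access_size == 0


for router, address in FAULT_ADDRESSES.items():
    for size in (2, 4):
        print(f"{router}: 0x{address:08X} aligned for {size}-byte access: "
              f"{is_aligned(address, size)}")
```

Both addresses are odd, so neither passes even the 2-byte check.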

Such a crash should generate a crashinfo file on the router's flash file system (e.g. "flash" or "bootflash", depending on the platform). Could you upload it to the thread?

- check the available file systems with "show file systems"

- list the files in the file system, e.g. "dir flash:"

- transfer the file via FTP, or collect the terminal output of the command "more flash:"

Also, "show region" output is needed to check the address space.

Kind Regards,
Ivan Shirshin

Ricky S · ‎10-31-2012

Hi Ivan, thanks for your response. I have attached the requested information. Also below is sh region output from both routers.

R-Q9-1#sh region
Region Manager:

      Start         End     Size(b) Class Media Name
0x00000000 0x1DBFFFFF   499122176 Local R/W    main
0x01000000 0x03FFFFFF    50331648 Local R/W    main:heap
0x040001AC 0x089F1A7B    77535440 IText R/O    main:text
0x09000000 0x0BFFFFFF    50331648 Local R/W    main:heap
0x0C000000 0x10E851C3    82334148 IData R/W    main:data
0x10E851C4 0x11A4B8E3    12347168 IBss   R/W    main:bss
0x11A4B8E4 0x1DBFFFFF   203114268 Local R/W    main:heap
0x1DC00000 0x1FFFFFFF    37748736 Iomem R/W    iomem

Free Region Manager:

Start End Size(b) Class Media Name

R-Q9-2#sh region
Region Manager:

      Start         End     Size(b) Class Media Name
0x00000000 0x1DBFFFFF   499122176 Local R/W    main
0x01000000 0x03FFFFFF    50331648 Local R/W    main:heap
0x040001AC 0x089F1A7B    77535440 IText R/O    main:text
0x09000000 0x0BFFFFFF    50331648 Local R/W    main:heap
0x0C000000 0x10E851C3    82334148 IData R/W    main:data
0x10E851C4 0x11A4B8E3    12347168 IBss   R/W    main:bss
0x11A4B8E4 0x1DBFFFFF   203114268 Local R/W    main:heap
0x1DC00000 0x1FFFFFFF    37748736 Iomem R/W    iomem

Free Region Manager:

Start End Size(b) Class Media Name
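For anyone who wants to check the crash addresses against the region table above (it is identical on both routers), here is a minimal Python sketch. It assumes the PC and faulting addresses from show version are in the same address space that show region reports; the function name regions_containing is made up for this example.

```python
# (start, end, name) rows copied from the "sh region" output above.
REGIONS = [
    (0x00000000, 0x1DBFFFFF, "main"),
    (0x01000000, 0x03FFFFFF, "main:heap"),
    (0x040001AC, 0x089F1A7B, "main:text"),
    (0x09000000, 0x0BFFFFFF, "main:heap"),
    (0x0C000000, 0x10E851C3, "main:data"),
    (0x10E851C4, 0x11A4B8E3, "main:bss"),
    (0x11A4B8E4, 0x1DBFFFFF, "main:heap"),
    (0x1DC00000, 0x1FFFFFFF, "iomem"),
]


def regions_containing(address: int) -> list[str]:
    """Return the names of all regions whose [start, end] range holds address."""
    return [name for start, end, name in REGIONS if start <= address <= end]


print(regions_containing(0x5C92E28))   # PC on both routers -> ['main', 'main:text']
print(regions_containing(0x5DF36FE9))  # faulting address, R-Q9-1 -> []
print(regions_containing(0x4582626D))  # faulting address, R-Q9-2 -> []
```

Under that assumption the PC lands in the code (text) region, while both faulting addresses fall outside every listed region.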

Ivan Shirshin · ‎10-31-2012

Hi Ricky,

I have checked the crashinfo files and the stack indicates that you are likely hitting the following DDTS:

CSCua45206 Hub crashed while removing Stale Cache entry

Symptom:

Hub router crashes while removing Stale Cache entry

Conditions:

Crash occurs when 2 spokes are translated to same NAT address.

Workaround:

Spokes behind the same NAT box must be translated to different post-NAT Addresses

CSCua45206 is a new DDTS and will be fixed in 15.1(4)M6 (release planned for 03/08/2013). In the meantime, you should use the workaround to avoid the unsupported network design.

Note the conditions of the DDTS and check whether they apply to your network configuration. DMVPN spoke routers behind the same NAT box must be NATed to unique outside (post-NAT) IP addresses. Having two spokes present the same IP address to the DMVPN hub is not supported. Even though NAT-T can handle this case (PAT), NHRP cannot.
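One way to audit this condition is to group the spokes by their post-NAT (NBMA) address and flag any address used by more than one spoke. The Python sketch below only illustrates the idea: the SPOKES list is hypothetical sample data (documentation address ranges, not taken from this network), and in practice the pairs would have to come from your own inventory or from the hub's NHRP registrations.

```python
from collections import defaultdict

# Hypothetical (tunnel IP, post-NAT NBMA IP) pairs for illustration only.
SPOKES = [
    ("10.10.200.11", "198.51.100.10"),
    ("10.10.200.12", "203.0.113.25"),
    ("10.10.200.13", "198.51.100.10"),
]


def find_shared_nbma(spokes):
    """Map each NBMA address used by more than one spoke to its tunnel IPs."""
    by_nbma = defaultdict(list)
    for tunnel_ip, nbma_ip in spokes:
        by_nbma[nbma_ip].append(tunnel_ip)
    return {nbma: tunnels for nbma, tunnels in by_nbma.items() if len(tunnels) > 1}


print(find_shared_nbma(SPOKES))
# {'198.51.100.10': ['10.10.200.11', '10.10.200.13']}
```

An empty result would mean no two spokes in the list present the same post-NAT address to the hub.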

Kind Regards,
Ivan Shirshin

Ricky S · ‎11-01-2012

Hi Ivan,

We have approx. 100 spokes, which I have set up by hand, and I can guarantee that no two spokes share the same post-NAT IP address.

Is there any way of finding out from the crash logs etc. which IP address had two spokes behind it?

Could this be some kind of an attack?

rojesara.prashant · ‎02-28-2018

We have observed the same crash.

System returned to ROM by address error at PC 0x60304F0, address 0x908300B4

Router Model : CISCO2951/K9

IOS Version : "flash0:c2951-universalk9-mz.SPA.154-1.T1.bin"

The crash happened after running the commands below, but I have been unable to find a matching software bug.

Please let me know if you can find a matching bug for this.

CMD: 'en' 01:11:58 UTC Thu Mar 1 2018

CMD: 'show ip flow top-talkers 20' 01:12:08 UTC Thu Mar 1 2018

CMD: 'show ip flow top-talkers 20 from-cache main aggregate destination-address ' 01:12:34 UTC Thu Mar 1 2018

01:12:34 UTC Thu Mar 1 2018: Unexpected exception to CPU: vector 1400, PC = 0x60304F0 , LR = 0x6037DEC