Solved: Weird DMVPN error

Xavier Lloyd · ‎04-12-2012

Hey folks,

I've configured a DMVPN recently but the hub router has locked up twice since. After the first time, I decided to capture some logs so that I'd be able to see if anything went wrong right before. Here are the logs that I gathered right before it hung:

2012-04-12 07:38:38 Local7.Warning 10.x.x.4 59: 000055: *Apr 12 02:45:12.999 GMT: %CRYPTO-4-RECVD_PKT_MAC_ERR: decrypt: mac verify failed for connection id=2271 local=81.xx.xx.68 remote=64.xx.xx.66 spi=C4D24C6F seqno=00001C99

2012-04-12 07:39:10 Local7.Notice 10.x.x.4 60: 000056: *Apr 12 02:45:44.903 GMT: %OSPF-5-ADJCHG: Process 1, Nbr 10.x.x.9 on Tunnel0 from FULL to DOWN, Neighbor Down: Dead timer expired

2012-04-12 07:39:11 Local7.Notice 10.x.x.4 61: 000057: *Apr 12 02:45:46.211 GMT: %OSPF-5-ADJCHG: Process 1, Nbr 10.x.y.14 on Tunnel0 from FULL to DOWN, Neighbor Down: Dead timer expired

2012-04-12 07:39:15 Local7.Notice 10.x.x.4 62: 000058: *Apr 12 02:45:50.339 GMT: %OSPF-5-ADJCHG: Process 1, Nbr 10.x.y.13 on Tunnel0 from FULL to DOWN, Neighbor Down: Dead timer expired

2012-04-12 07:40:17 Local7.Notice 10.x.x.4 63: 000059: *Apr 12 02:46:52.299 GMT: %TRACKING-5-STATE: 4 ip sla 2 reachability Up->Down

It looks like the router received an invalid VPN packet, and then the public interface suddenly failed. I'm not sure whether the private one eventually failed too, but I'm wondering why this could have happened. I did some research on the MAC_ERR message and saw that it's caused by corruption or modification of the packet. Could a malformed packet have made the router hang? The router didn't crash; it just became unresponsive and needed to be rebooted.

Any ideas are appreciated!

Thanks :-)

~ Xavier

Mohammad Abazeed · ‎04-16-2012

Hmm ... this sounds interesting ... this can sometimes be caused by a wrong pre-shared key ... corrupted encryption can cause this as well ... or fast-path switching ... or a bug

Double check the pre-shared keys again ...

Try to disable the fast-path switching (no ip route-cache) ...

Did the tunnel come up after the reload? And do you see these msgs often?

If yes, what is the frequency of these msgs? Is it 60 sec by chance?

Can you post your router's config as well as the IOS version?

/Mo.

Xavier Lloyd · ‎04-16-2012

These are the timestamps on the most recent messages:

Apr 12 14:15:23.006

Apr 15 00:06:48.159

Apr 15 00:17:23.567
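To answer the question about whether the messages repeat every 60 seconds, the gaps between these timestamps can be worked out with a short script. This is only a sketch: the helper names are made up for illustration, and since the log lines carry no year, 2012 (the year of this thread) is assumed.

```python
from datetime import datetime

# Timestamps copied from the post above. The log lines carry no year,
# so 2012 (the year of the thread) is an assumption.
STAMPS = [
    "Apr 12 14:15:23.006",
    "Apr 15 00:06:48.159",
    "Apr 15 00:17:23.567",
]

def parse(stamp, year=2012):
    """Parse an IOS-style 'Mon DD HH:MM:SS.mmm' timestamp."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")

def gaps_in_seconds(stamps):
    """Return the gap in seconds between each pair of consecutive timestamps."""
    times = [parse(s) for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

if __name__ == "__main__":
    for gap in gaps_in_seconds(STAMPS):
        print(f"{gap:.3f} s")
```

For these three timestamps the gaps come to roughly 2.4 days and about 10.6 minutes, so there is no sign of a 60-second pattern.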

The router hasn't locked up since, but I'm seeing these messages that I don't normally see:

000056: *Apr 16 13:34:05.323 GMT: %IP_VFR-4-FRAG_TABLE_OVERFLOW: FastEthernet0/0/1: the fragment table has reached its maximum threshold 16

000057: *Apr 16 14:38:21.220 GMT: %IP_VFR-4-FRAG_TABLE_OVERFLOW: FastEthernet0/0/1: the fragment table has reached its maximum threshold 16

I know for sure that the PSKs are the same. The no ip route-cache command doesn't work (I can't enter it).

Output of show ver:

Cisco IOS Software, C1900 Software (C1900-UNIVERSALK9-M), Version 15.0(1)M3, RELEASE SOFTWARE (fc2)

Technical Support: http://www.cisco.com/techsupport

Compiled Sun 18-Jul-10 01:47 by prod_rel_team

ROM: System Bootstrap, Version 15.0(1r)M6, RELEASE SOFTWARE (fc1)

LD-1941 uptime is 4 days, 5 hours, 19 minutes

System returned to ROM by power-on

System image file is "flash0:c1900-universalk9-mz.SPA.150-1.M3.bin"

Last reload type: Normal Reload

This product contains cryptographic features and is subject to United

States and local country laws governing import, export, transfer and

use. Delivery of Cisco cryptographic products does not imply

third-party authority to import, export, distribute or use encryption.

Importers, exporters, distributors and users are responsible for

compliance with U.S. and local country laws. By using this product you

agree to comply with applicable laws and regulations. If you are unable

to comply with U.S. and local laws, return this product immediately.

A summary of U.S. laws governing Cisco cryptographic products may be found at:

http://www.cisco.com/wwl/export/crypto/tool/stqrg.html

If you require further assistance please contact us by sending email to

export@cisco.com.

Cisco CISCO1941/K9 (revision 1.0) with 487424K/36864K bytes of memory.

Processor board ID FTX1436825Q

2 FastEthernet interfaces

2 Gigabit Ethernet interfaces

1 Virtual Private Network (VPN) Module

DRAM configuration is 64 bits wide with parity disabled.

255K bytes of non-volatile configuration memory.

254464K bytes of ATA System CompactFlash 0 (Read/Write)

License Info:

License UDI:

-------------------------------------------------
Device#   PID                SN
-------------------------------------------------
*0        CISCO1941/K9       FTX_______

Technology Package License Information for Module:'c1900'

----------------------------------------------------------------
Technology    Technology-package           Technology-package
              Current       Type           Next reboot
-----------------------------------------------------------------
ipbase        ipbasek9      Permanent      ipbasek9
security      securityk9    Permanent      securityk9
data          datak9        Permanent      datak9

I have to redact the config which will take a long time because they are production routers...I'll try to work on it in a bit.

I've disabled hardware encryption and I'm doing encryption in software now, and things seem to be fine. I still receive the messages (as shown in the timestamps above) but the router hasn't locked up.

Will update with any further findings! Thanks so much!

Mohammad Abazeed · ‎04-16-2012

Ok, this is a different error now ... I saw that you have the HW acceleration disabled ... that was the next step for testing the encryption, so kudos on that ... It might be its fault ... not sure though ... But this will add processing overhead on the router, as you know, so I wouldn't leave it like that if you have heavy traffic on that router.

Now, for the other error msg: this is related to the number of fragments you get on the interface. VFR stands for "virtual fragmentation reassembly". It was originally introduced to detect and scan fragments at the interface level before passing them through. So they are reassembled first, checked against whatever checks the router is configured for (if any), then passed.

When the maximum number of datagrams that can be reassembled at any given time is reached, all subsequent fragments are dropped, and an alert message such as the following is logged to the syslog server: "%IP_VFR-4-FRAG_TABLE_OVERFLOW."

I would say that you are receiving a large number of fragmented packets on that interface which caused this msg to appear.

It could be due to a device sending fragments that it shouldn't be sending, or it might be normal traffic that is fragmented because of its size.

You can increase the maximum with the command "ip virtual-reassembly max-reassemblies <number>" under the respective interface. I would set it to 64 or more if applicable; some releases allow a maximum of 1024. If the error persists, however, you should check your network for sources of fragmentation.
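Before raising the limit, it can help to see which interface and which threshold the overflow messages refer to, and how often they occur. A minimal sketch, assuming the syslog lines have been saved as text; the function name is hypothetical and the sample lines are the two messages quoted earlier in this thread:

```python
import re
from collections import Counter

# Matches the interface name and threshold in an IP_VFR overflow message.
PATTERN = re.compile(
    r"%IP_VFR-4-FRAG_TABLE_OVERFLOW: (?P<intf>\S+): "
    r"the fragment table has reached its maximum threshold (?P<limit>\d+)"
)

def overflow_counts(lines):
    """Count overflow messages per (interface, threshold) pair."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[(m.group("intf"), int(m.group("limit")))] += 1
    return counts

# The two messages posted earlier in the thread.
SAMPLE = [
    "000056: *Apr 16 13:34:05.323 GMT: %IP_VFR-4-FRAG_TABLE_OVERFLOW: "
    "FastEthernet0/0/1: the fragment table has reached its maximum threshold 16",
    "000057: *Apr 16 14:38:21.220 GMT: %IP_VFR-4-FRAG_TABLE_OVERFLOW: "
    "FastEthernet0/0/1: the fragment table has reached its maximum threshold 16",
]

if __name__ == "__main__":
    for (intf, limit), n in overflow_counts(SAMPLE).items():
        print(f"{intf}: {n} overflow message(s) at threshold {limit}")
```

Both sample messages point at FastEthernet0/0/1 with a threshold of 16, which is the value the max-reassemblies setting would raise.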

Xavier Lloyd · ‎04-16-2012

Thanks much Mohammad!

There's not too much traffic on the routers, so I don't think software encryption should slow them down. The CPU is actually under 5% at the moment, and I'm monitoring it throughout the day.

Is there anything further you can tell me on disabling fast switching? There doesn't seem to be a "no ip route-cache" command. Is this the same thing as CEF?

Mohammad Abazeed · ‎04-16-2012

It's a per-interface command, so it will need to be applied under the interface rather than globally ...

Check this link, it will tell you more about its options:

- http://www.cisco.com/en/US/docs/ios/12_3/switch/command/reference/swi_i1.html#wp1110844

HTH,

Mo.

Xavier Lloyd · ‎04-17-2012

Thanks Mo, that explains a lot. I disabled route-cache on both the hub and spoke. However, the hub locked up again last night, and this time I didn't get the %CRYPTO-4-RECVD_PKT_MAC_ERR message. Instead, it just looks like OSPF goes down on my tunnels and IP SLA fails on the physical interface. After this, I can't get to the router directly or through the tunnel, but the inside interface seems to be up since I'm still receiving syslogs... I'm thinking it could be an interface problem... or maybe the router is trying to process a packet on that interface, has problems, and locks up?

I really don't know, but here are the logs. I have a case going with a TAC engineer at the moment, so hopefully I can find something, but until then I'd like to get as much help as possible :-)

2012-04-17 03:24:02 Local7.Notice 10.x.x.x 65: 000060: *Apr 16 22:30:38.701 GMT: %OSPF-5-ADJCHG: Process 1, Nbr 10.x.x.x on Tunnel0 from FULL to DOWN, Neighbor Down: Dead timer expired

2012-04-17 03:24:04 Local7.Notice 10.x.x.x 66: 000061: *Apr 16 22:30:40.453 GMT: %OSPF-5-ADJCHG: Process 1, Nbr 10.x.x.x on Tunnel0 from FULL to DOWN, Neighbor Down: Dead timer expired

2012-04-17 03:24:10 Local7.Notice 10.x.x.x 67: 000062: *Apr 16 22:30:46.001 GMT: %OSPF-5-ADJCHG: Process 1, Nbr 10.x.x.x on Tunnel0 from FULL to DOWN, Neighbor Down: Dead timer expired

2012-04-17 03:25:00 Local7.Notice 10.x.x.x 68: 000063: *Apr 16 22:31:36.673 GMT: %TRACKING-5-STATE: 4 ip sla 2 reachability Up->Down
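The order and spacing of these events can be pulled out of the log lines with a short script. This is a sketch under stated assumptions: the function names are hypothetical, the year 2012 is assumed because the device timestamps omit it, and the sample lines are the four messages above.

```python
import re
from datetime import datetime

# Captures the device timestamp (after the '*') and the %FACILITY-SEV-MNEMONIC tag.
LINE_RE = re.compile(
    r"\*(?P<stamp>[A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2}\.\d{3}) \w+: "
    r"%(?P<mnemonic>[A-Z_]+-\d-[A-Z_]+):"
)

def events(lines, year=2012):
    """Return (time, mnemonic) pairs sorted by device timestamp."""
    out = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            stamp = " ".join(m.group("stamp").split())  # normalize spacing
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
            out.append((when, m.group("mnemonic")))
    return sorted(out)

def seconds_between(evts, first, second):
    """Seconds from the first `first` event to the next `second` event."""
    start = next(t for t, name in evts if name == first)
    end = next(t for t, name in evts if name == second and t >= start)
    return (end - start).total_seconds()

SAMPLE = [
    "2012-04-17 03:24:02 Local7.Notice 10.x.x.x 65: 000060: *Apr 16 22:30:38.701 GMT: %OSPF-5-ADJCHG: Process 1, Nbr 10.x.x.x on Tunnel0 from FULL to DOWN, Neighbor Down: Dead timer expired",
    "2012-04-17 03:24:04 Local7.Notice 10.x.x.x 66: 000061: *Apr 16 22:30:40.453 GMT: %OSPF-5-ADJCHG: Process 1, Nbr 10.x.x.x on Tunnel0 from FULL to DOWN, Neighbor Down: Dead timer expired",
    "2012-04-17 03:24:10 Local7.Notice 10.x.x.x 67: 000062: *Apr 16 22:30:46.001 GMT: %OSPF-5-ADJCHG: Process 1, Nbr 10.x.x.x on Tunnel0 from FULL to DOWN, Neighbor Down: Dead timer expired",
    "2012-04-17 03:25:00 Local7.Notice 10.x.x.x 68: 000063: *Apr 16 22:31:36.673 GMT: %TRACKING-5-STATE: 4 ip sla 2 reachability Up->Down",
]

if __name__ == "__main__":
    evts = events(SAMPLE)
    gap = seconds_between(evts, "OSPF-5-ADJCHG", "TRACKING-5-STATE")
    print(f"First OSPF drop to IP SLA down: {gap:.3f} s")
```

For this incident, the IP SLA track goes down about 58 seconds after the first OSPF neighbor is declared dead.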

Mohammad Abazeed · ‎04-20-2012

Sorry for the delay, it has been busy around here the past few days ....

So, now things are changing for us here. OSPF is now dropping while the tunnel stays up ... yay for the tunnel part ...

I might not be the right person to give you routing advice here, so I would say, if you haven't already, wait for the TAC answer on this. I hope you have a routing case rather than a VPN case though, because from what I can see this is a routing issue now.

I'm not sure if you still need any additional information about the tunnel behavior, or if you want to add anything to this post. If you do, please feel free, and if I can help out I will.

Wishes,

Mo.

Xavier Lloyd · ‎04-20-2012

Boy...I know what you mean. I'm up to my ears in work too.

In any case, I found the problem. The interface was actually negotiating half duplex, so there were a whole lot of drops and errors. I'm not sure whether all of that was caused by the half duplex or by the interface itself, but what we ended up doing was switching the ports. The Internet-facing interface was configured on a Fast Ethernet card, so we moved it to an onboard Gigabit Ethernet port, set it to full duplex/100, and everything was fine after that.

I had to re-enable the route-cache commands to turn on NetFlow and, surprise surprise, the old errors are back. Anyway, TAC suggested that they are just cosmetic and shouldn't interfere with anything.

Thanks again!