27017 Views · 11 Helpful · 16 Replies

DMVPN intermittent state changes

les_davis
Level 1

We are running a dual-hub DMVPN hub-and-spoke configuration, with ASR routers as the hubs and 2811 routers as the spokes.  We recently passed 3000 remote locations and have discovered an issue we are struggling with.  On some spoke routers (we don't know for sure how many), "show dmvpn" reports a state of IKE or NHRP with one of the hub peers (see below).

ro1-13349#sho dmvpn
Legend: Attrb --> S - Static, D - Dynamic, I - Incomplete
        N - NATed, L - Local, X - No Socket
        # Ent --> Number of NHRP entries with same NBMA peer
        NHS Status: E --> Expecting Replies, R --> Responding
        UpDn Time --> Up or Down Time for a Tunnel
==========================================================================

Interface: Tunnel1, IPv4 NHRP Details
IPv4 Registration Timer: 30 seconds

IPv4 NHS: 10.1.0.1 RE
Type:Spoke, Total NBMA Peers (v4/v6): 1

# Ent  Peer NBMA Addr Peer Tunnel Add State  UpDn Tm Attrb    Target Network
----- --------------- --------------- ----- -------- ----- -----------------
    1 A.B.C.D         10.1.0.1           UP    6d14h     S      10.1.0.1/32


Interface: Tunnel2, IPv4 NHRP Details
IPv4 Registration Timer: 30 seconds

IPv4 NHS: 10.2.0.1  E
Type:Spoke, Total NBMA Peers (v4/v6): 1

# Ent  Peer NBMA Addr Peer Tunnel Add State  UpDn Tm Attrb    Target Network
----- --------------- --------------- ----- -------- ----- -----------------
    1 A.B.C.D         10.2.0.1          IKE     3w6d     S      10.2.0.1/32

The state cycles among IKE, NHRP, and UP.  We have captured this data 3 times across our 3000+ connections; each capture showed roughly 15 to 20 affected spokes, with 1 location appearing on every list.

Is there any additional logging that could help determine the cause?  We recently enabled DMVPN logging on 32 branches, and the typical message we see is the following:

Apr  4 10:34:29.619 CDT: %DMVPN-5-NHRP_NHS: Tunnel2 10.2.0.1 is DOWN
Apr  4 10:35:53.048 CDT: %DMVPN-3-NHRP_ERROR: Registration Request failed for 10.2.0.1 on Tunnel2

In some cases we get the following:

Apr  4 14:28:40.558 CDT: %DMVPN-7-CRYPTO_SS: Tunnel2-A.B.C.D socket is DOWN

Clearing crypto sessions or doing a shut/no shut on the tunnel rarely fixes the problem, and when it does, the issue comes back.  We are using a mix of pre-shared key and CA crypto authentication.  We are running IOS Version 12.4(24)T1 because of other issues.

Please provide any insight you may have on this type of issue.  I will add more as we uncover more information or have any pertinent data to add.


16 Replies

Marcin Latosiewicz
Cisco Employee

Les,

If a device fails to register with NHRP, it will tear down IKE and restart after a moment.

The code entity that binds NHRP and crypto (when using tunnel protection) is called ... the crypto socket.

So when there is a problem like this collect:

- show crypto socket

- show ip nhrp brief

- show crypto isa sa

- show crypto ipsec sa.

Since only a few spokes are affected, I would say it's most likely a problem with the spoke software (but we'd need debugs to confirm).

On ASR check datapath during failure:

show platform hardware qfp active feature ipsec datapath drops
show platform hardware qfp active statistics drop | exclude _0_

(taken a few times)

I would also suggest running NHRP debugs.

To confirm whether this is a problem with the crypto socket, you can do one neat trick.

Remove ALL of the tunnel interface configuration (no interface tunnel X), wait a moment, then paste it back in.

If it's a crypto socket problem, this will cause a new socket to be added.

That being said, I think you might want to open a TAC case; it's hard to troubleshoot these without live access.

Marcin

Our hub environment also consists of a 6509 with an ACE module that distributes the DMVPN connections from our branches.  Each ASR is configured for 1000 connections, and we have 6 ASRs per data center.

We do have a TAC case open at this time.  I have a few questions concerning the technology and the show commands.

1.  Is there any ordering between NHRP establishing a connection and IPsec negotiating the crypto session?

From what the show commands are telling me, I think we have NHRP but we don't have crypto (see below).

sho ip nhrp detail
10.1.0.1/32 via 10.1.0.1
   Tunnel1 created 4w0d, never expire
   Type: static, Flags:
   NBMA address: A.B.C.D
10.2.0.1/32 via 10.2.0.1
   Tunnel2 created 17:21:34, never expire
   Type: static, Flags:
   NBMA address: A.B.C.D

ro1-13349#sho crypto sockets

Number of Crypto Socket connections 2

   Tu1 Peers (local/remote): E.F.G.H/A.B.C.D

       Local Ident  (addr/mask/port/prot): (E.F.G.H/255.255.255.255/0/47)
       Remote Ident (addr/mask/port/prot): (A.B.C.D/255.255.255.255/0/47)
       IPSec Profile: "iGBN"
       Socket State: Open
       Client: "TUNNEL SEC" (Client State: Active)
   Tu2 Peers (local/remote): E.F.G.H/A.B.C.D
       Local Ident  (addr/mask/port/prot): (E.F.G.H/255.255.255.255/0/47)
       Remote Ident (addr/mask/port/prot): (A.B.C.D/255.255.255.255/0/47)
       IPSec Profile: "iGBN"
       Socket State: Closed
       Client: "TUNNEL SEC" (Client State: Active)

Les,

"show ip nhrp" shows you NHRP mapping which is static from spoke to hub (via commands applied to tunnel interface).

Now in order for NHRP to register to hub, you need to establish IPsec session.

On top of the IPsec session you will be able to run GRE packets.

Now from here you send NHRP registration.

If NHRP registration fails, we will teardown IPsec and try again.
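The ordering described above can be sketched as a toy state machine (an illustration of the dependency chain, not Cisco code); it also shows why a registration failure produces the IKE/NHRP/UP cycling from the first post:

```python
# Toy model of DMVPN spoke bring-up: IPsec first, then GRE/NHRP on top of it,
# and a failed NHRP registration tears IPsec down so the cycle restarts.

def bring_up_tunnel(ipsec_ok, nhrp_registration_ok):
    """Return the sequence of states a spoke walks through."""
    states = ["IKE"]                    # IKE/IPsec negotiation happens first
    if not ipsec_ok:
        return states                   # stuck in IKE: no crypto session
    states.append("NHRP")               # GRE is usable; send NHRP registration
    if not nhrp_registration_ok:
        states.append("IKE")            # registration failed: tear down, retry
        return states
    states.append("UP")                 # registration acknowledged
    return states

print(bring_up_tunnel(True, True))      # healthy spoke ends in UP
print(bring_up_tunnel(True, False))     # the observed cycling back to IKE
```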

I see you're pointing out "Socket State: Closed".  I find it a bit strange that both tunnels you indicate have the same pair of source and destination:

       Local Ident  (addr/mask/port/prot): (E.F.G.H//255.255.255.255/0/47)
       Remote Ident (addr/mask/port/prot): (A.B.C.D/255.255.255.255/0/47)

I see you're also using the same profile on both tunnels.  Are you using the "shared" keyword on tunnel protection?

Marcin

P.S.

Can you share the TAC SR number?  If I have the time I'll have a look.

The case is SR 617330875.

We are not using the shared keyword.  However, we have been using this basic tunnel configuration and protection for over a year; this issue only started happening in the past 3 weeks or so.

Les,

I've had a look at what Jay did in the SR.  He pointed out two problems, including one where ESP packets from the hub will never make it to the spoke.

If this is the case, registration will indeed fail.

BTW, regarding the "shared" keyword, I was referring to this:

http://www.cisco.com/en/US/docs/ios/sec_secure_connectivity/configuration/guide/share_ipsec_w_tun_protect_ps6441_TSD_Products_Configuration_Guide_Chapter.html

Although you're running phase 1 DMVPN, in which case it should not, strictly speaking, be required.

Not sure if I can add more value to the analysis Jay has already done (and he seemed to have been quite thorough).

Marcin

Les,

Just for the sake of clarity: if you have questions, feel free to ask.  I'm not sure I will be able to add new value based on the information you provided to Jay.

If you have questions about how things should work in theory, feel free ;-)

Marcin

Marcin,

We currently have IOS-based IPS and zone-based firewall enabled (split tunnelling).  We are seeing lots of log messages from the firewall and a lesser amount from the IPS.  We are looking for any logging that would help identify a DMVPN, NHRP, or crypto issue, without filling up the log with duplicate data.

Can you provide any insight on which logging is most relevant?  We have enabled logging for the IPS and the firewall, but we are also interested in logging significant NHRP, crypto, and routing (EIGRP) events.  Currently, crypto and EIGRP logging are enabled.  We recently found DMVPN logging and wonder if it would provide any additional insight.

Please provide any documentation on crypto logging and DMVPN logging.  Is one better than the other?

Les,

DMVPN logging could bring some more info, but I'm afraid nothing will be better than debugging in this case.

Both NHRP and crypto support conditional debugging, so if a particular spoke is affected more often than others, or if you want to narrow down to a particular interface, there is that possibility.

DMVPN logging is a mix of crypto and NHRP information, much like "show dmvpn", which shows a bit of everything; "show ip nhrp" + "show crypto session" etc. should be more precise.

"show cef interface ..." (on newer interfaces) will show you input and output features applied to interface (if you want to check what is done step by step).

Not sure if I can be of more help without going into troubleshooting particular features.  I don't have a router handy, nor do I believe there's an updated DMVPN troubleshooting document.

Marcin

I understand that debugging is the best method for troubleshooting this type of issue.

We have since discovered that the dmvpn "outages" have been rolling through all of our remote locations.  This is looking more like a hub issue.  We have taken steps to help isolate the cause.

My question about logging was more about long-term support.  There is lots of logging we can enable, but what is best without duplicating log information?

Also, you mentioned conditional debugging on the hub router.  I do not know how to accomplish this.  Can you provide some documentation on conditional debugging based on connection, on a hub router in a hub-and-spoke topology?

Les,

I'm afraid my expertise lies in troubleshooting rather than monitoring.

Is SNMP an option?  (I don't believe there's much targeted at DMVPN.)

I've been thinking of something similar to this:

http://www.cisco.com/en/US/docs/ios/sec_secure_connectivity/configuration/guide/sec_dmvpn_tun_mon.html#wp1055877

(although I'm not sure how well the ASR supports this)

Regarding conditional debugging, and debugging in general:

There's one debug you can usually safely enable, "debug crypto isa err", which only shows the error parts of IKE negotiation.

For conditional debugging: we can narrow debugging down to particular peers, VRFs, interfaces, or even particular connections.  This would require, however, that we already know if/which particular spokes are affected more than others.

PINGER#debug nhrp condition ?
  interface  based on the interface
  peer       based on the peer
  vrf        based on the vrf

and

PINGER#debug crypto condi ?
  connid     IKE/IPsec connection-id filter
  fvrf       Front-door VRF filter
  isakmp     Isakmp profile filter
  ivrf       Inside VRF filter
  local      IKE local address filter
  peer       IKE peer filter
  reset      Delete all debug filters and turn off conditional debug
  spi        SPI (Security Policy Index) filter
  unmatched  Output debugs even if no context available
  username   Xauth or Pki-aaa username filter

I mostly rely on "debug crypto condition peer ipv4"
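Putting those filters together, a hub-side conditional-debug session might look like the following (an illustrative sequence built from the options listed above; A.B.C.D stands for the suspect spoke's address, and exact syntax may vary by IOS release):

```
! Narrow crypto debugs to one suspect spoke
debug crypto condition peer ipv4 A.B.C.D
debug crypto isakmp
show crypto debug-condition

! Narrow NHRP debugs the same way
debug nhrp condition peer nbma A.B.C.D
debug nhrp packet

! When finished, clear the filters and stop debugging
debug crypto condition reset
undebug all
```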

Marcin

Thanks for your insight Marcin.  Really appreciated.

Les,

You said you were taking steps to isolate this issue; did you ever find a definitive resolution?  I have an almost identical situation to the one you described above and still have not found an answer from TAC or through my own troubleshooting and debugging.

Thanks,

Ed

I will have to brush off the cobwebs and give you a more thorough answer.  We discovered that we were seeing these rolling outages across several ASR devices.  I'm not sure, but I think it was an IOS issue related to the ASR.  Another group is responsible for the hub devices; I will check with them and see what was done.

Les / all

I am posting what we found as our resolution in order to hopefully help others.  Les, I am not sure if you can mark multiple correct answers, but this has completely stabilized our issues.

The culprit for us was an NHRP default setting on the tunnel interfaces versus our customized NHRP holdtime.  If you run a large DMVPN environment with many spokes, keep this setting in mind.  We run approximately 1100 spokes in a dual-cloud setup.  Based on recommendations from Cisco, we tuned our NHRP holdtime to 300 seconds.  In itself this is not a bad setting, as it means each spoke sends an NHRP registration once every 100 seconds (1/3 of the holdtime).

However, the default setting of the 'ip nhrp max-send' command is 100 packets / 10 seconds.  This is meant as a soft cap to prevent NHRP from starving router resources in the case of a spoke gone wild, a large reconvergence event, etc.  Our issue was that with the tuned holdtime we needed to process ~1100 NHRP packets every 100 seconds.  That works out to 110 packets every 10 seconds, 10% more than the limit, causing some of the traffic to be dropped.  Eventually this resulted in spokes having an occasional tunnel reset due to hitting a timeout.

So keep this setting in mind if you have, or are approaching, upwards of 1000 spokes and have a holdtime of 300 or less; it is a per-tunnel-interface setting.  You can see what is happening in your environment by using the 'show ip nhrp traffic' command.

[default settings]

rtr#show ip nhrp traffic interface tunnel 1

Tunnel1: Max-send limit:100Pkts/10Sec, Usage:0%

After increasing this to a more appropriate value, things have been perfectly stable.
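The arithmetic above can be checked with a quick back-of-the-envelope calculation (a sketch using the numbers from this thread; 100 packets per 10 seconds is the IOS default max-send):

```python
def nhrp_registration_load(spokes, holdtime_s,
                           max_send_pkts=100, max_send_interval_s=10):
    """Compare NHRP registration traffic against the max-send soft cap.

    Spokes re-register every holdtime/3 seconds, so the hub-facing NHRP
    rate is spokes / (holdtime/3) packets per second.
    Returns (packets per max-send interval, fraction of the cap used).
    """
    reregister_interval = holdtime_s / 3                 # 1/3 of holdtime
    pkts_per_interval = spokes / reregister_interval * max_send_interval_s
    limit_usage = pkts_per_interval / max_send_pkts      # > 1.0 means drops
    return pkts_per_interval, limit_usage

pkts, usage = nhrp_registration_load(spokes=1100, holdtime_s=300)
print(pkts)   # 110.0 packets per 10-second window
print(usage)  # 1.1 -> 10% over the default 100-packet cap
```

With 1100 spokes and a 300-second holdtime this reproduces the 110-packets-per-10-seconds figure from the resolution, which is exactly the 10% overshoot that was causing drops.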

Thanks!

Ed