
EIGRP Flapping

WILLIAM STEGMAN
Level 4

We've deployed DMVPN across our MPLS network; each branch has an EIGRP peering with a hub router at each of 2 data centers. Each hub is an ASR 1002-X running an identical IOS image. One of the 2 hub ASRs consistently flaps with all the branches. Not all branches go down at the same time; it staggers throughout the day. I've adjusted the timers on the tunnel interface to as high as 60 seconds for hello and 180 for hold time, and have raised the EIGRP bandwidth percentage to 500%.

I created a 2nd tunnel interface, GRE only, on the problem hub and one of the branches, added that network to EIGRP, and the link stayed up. I then added the DMVPN profile to the tunnel interface, and it continued to stay up. I also checked with the service provider for any QoS drops on CS6 that might have accounted for the flapping, but found nothing. Just in case, I began marking the EIGRP traffic into our critical data class (a CoS we have with the SP that I can monitor for drops), and there were no drops for that class either. I also tried swapping the router with an RMA, but it continues to flap. So as best I can tell, this is not transport or hardware related.

The other data center ASR hub is working fine, so either there is a unique combination of properties on that ASR, or there is something wrong in the EIGRP configuration behind it (the data center network, for example) that presents itself on that particular ASR and not the other. I'm out of ideas and was hoping someone might have been through something similar and has a lead on what I might try next.

MTA-DMVPN-MPLS# sh run int tu61
Building configuration...

Current configuration : 737 bytes
!
interface Tunnel61
description MTA GRE/IPSEC Tunnel via MPLS to Remotes
bandwidth 102400
ip address 10.10.4.2 255.255.252.0
no ip redirects
ip mtu 1400
ip nbar protocol-discovery
no ip next-hop-self eigrp 10
no ip split-horizon eigrp 10
ip pim nbma-mode
ip pim sparse-mode
ip nhrp authentication 123456
ip nhrp map multicast dynamic
ip nhrp network-id 1234
ip nhrp holdtime 600
ip nhrp redirect
ip tcp adjust-mss 1360
tunnel source GigabitEthernet0/0/2.3233
tunnel mode gre multipoint
tunnel key 1234
tunnel vrf IWAN_MPLS
tunnel protection ipsec profile DMVPN-PROFILE-IWAN_MPLS
end
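For reference, since the thread uses EIGRP named mode (the neighbor output later shows "EIGRP-IPv4 VR(IWAN) ... AS(10)"), the timer and bandwidth-percent adjustments described above would be applied under the af-interface rather than on the tunnel interface itself. A rough sketch, assuming the virtual-router name IWAN and AS 10 from the outputs in this thread:

```
router eigrp IWAN
 address-family ipv4 unicast autonomous-system 10
  af-interface Tunnel61
   hello-interval 60
   hold-time 180
   bandwidth-percent 500
```

In named mode, the classic `ip hello-interval eigrp` interface commands are not used; timers configured under the af-interface take effect instead.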

We block networks with tag 60, 61, 160, or 161 from entering any DMVPN router.
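A tag filter like that is typically implemented with a route-map applied as an inbound distribute-list. A minimal sketch (the route-map name is illustrative; in named mode the distribute-list sits under the topology):

```
route-map DENY-DMVPN-TAGS deny 10
 match tag 60 61 160 161
route-map DENY-DMVPN-TAGS permit 20
!
router eigrp IWAN
 address-family ipv4 unicast autonomous-system 10
  topology base
   distribute-list route-map DENY-DMVPN-TAGS in
```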

thank you

29 Replies

Hi!

I once had this same issue; the problem turned out to be a DSLAM in the last-mile network of the ISP (which is why we saw drops in only 1 branch). For some reason it was dropping this packet and some others, which is why I asked you to try pinging branch to branch.

Is this a T1 line, or what are you actually using for the WAN link?

Regards!

JC

Could you run these commands on the ASR:

show platform software status control-processor brief

monitor platform software process rp active

OK. I've just reviewed all the notes, but you haven't attached a diagram. Could you add one so we have a better understanding, including the MPLS connections (and route redistribution, if you have any)?

Correct me if I'm wrong: you said these issues occur on only 1 hub, right? The other hub is working fine without any disconnections.

Are you able to run a debug on the EIGRP and NHRP side on 1 spoke and on this hub?

These 2 hubs are connected over different physical links, right? Is that link working well?

By physical link I mean the ISP circuit.
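The kind of debugs being asked for here might look like the following; run them with care on a production hub, since EIGRP packet debugging can be very chatty on a router with hundreds of peers:

```
debug nhrp packet
debug nhrp error
debug eigrp packets hello
debug ip eigrp notifications
```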


Thanks
Francesco
PS: Please don't forget to rate and select as validated answer if this answered your question

MTA-DMVPN-MPLS#sh platform software status control-processor brief
Load Average
Slot Status 1-Min 5-Min 15-Min
RP0 Healthy 0.00 0.00 0.07

Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
RP0 Healthy 3969320 3324764 (84%) 644556 (16%) 2494392 (63%)

CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
RP0 0 3.89 6.28 0.00 89.62 0.00 0.19 0.00
1 0.99 0.79 0.00 98.10 0.00 0.09 0.00
2 0.40 0.50 0.00 99.10 0.00 0.00 0.00
3 0.20 0.10 0.00 99.70 0.00 0.00 0.00

There is a basic diagram in the initial post. I'm not using any redistribution; it's EIGRP all the way through from the WAN to the data center.

Yes, only 1 hub router is having the issue, the other is working fine. 

I have run some debugs; they just reveal that the spoke missed 3 consecutive hellos from 1 hub and sent a peer termination. From the hub, I'm not able to run the debug because of the overhead, since there is no way to filter an EIGRP packet debug to 1 spoke. I don't believe the issue is the link itself: I have BGP running on the FVRF so we can reach Service Provider MPLS services, such as SIP trunking, and that link never bounces.

Yes, the 2 hubs are at different locations, different MPLS connections.  

I've modified the timers to hello 5 and hold time 60, and the flapping appears to have stopped. I'm not sure this is a long-term solution, however.

Sorry about the diagram; I've seen it now.

Could you also do sh platform software status control-processor brief on the other hub, the one that works?

Do you have the output of :

sh ip eigrp neigh details

Sometimes EIGRP can flap if IPsec encryption is not clean. Could you do sh crypto ipsec sa in the current situation and compare it with the previous one, before the hello and hold timer change? They should be the same, but I just want to ensure this.

In the previous state, could you try pinging the multicast address 224.0.0.10?

Did you export the show run all config from the 2 hubs and compare them, to be absolutely sure you have the exact same config (DMVPN, routing, IPsec, ...)?
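For context, an EIGRP multicast ping from the hub should elicit a unicast reply from each live neighbor on the tunnel, so it is a quick way to see who is actually receiving multicast hellos. Output would look roughly like this (the neighbor addresses are borrowed from this thread; the timings are purely illustrative):

```
MTA-DMVPN-MPLS# ping 224.0.0.10
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 224.0.0.10, timeout is 2 seconds:

Reply to request 0 from 10.10.5.2, 4 ms
Reply to request 0 from 10.10.6.4, 12 ms
```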


Thanks
Francesco
PS: Please don't forget to rate and select as validated answer if this answered your question

ASC-DMVPN-MPLS#sh platform software status control-processor brief
Load Average
Slot Status 1-Min 5-Min 15-Min
RP0 Healthy 0.02 0.02 0.00

Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
RP0 Healthy 16337056 3627576 (22%) 12709480 (78%) 4892156 (30%)

CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
RP0 0 5.99 6.59 0.00 87.21 0.00 0.19 0.00
1 1.10 0.50 0.00 98.39 0.00 0.00 0.00
2 1.40 0.30 0.00 98.30 0.00 0.00 0.00
3 0.60 0.30 0.00 99.10 0.00 0.00 0.00

EIGRP-IPv4 VR(IWAN) Address-Family Neighbors for AS(10)
H Address Interface Hold Uptime SRTT RTO Q Seq
(sec) (ms) Cnt Num
126 10.10.5.2 Tu61 47 02:01:47 40 240 0 26135
Version 20.0/2.0, Retrans: 0, Retries: 0, Prefixes: 5
Topology-ids from peer - 0
Topologies advertised to peer: base

Stub Peer Advertising (CONNECTED ) Routes
Suppressing queries

I had looked at IPSec previously, and there were no errors, just a steady count of encaps and decaps.  

Yes, I could ping the multicast address. I also saw evidence of the multicast occurring through our NetFlow collector.

Yeah, I actually ran them through a file diff checker and went line by line. They are the same.

So far it hasn't flapped with the adjusted timers.  I'm happy to see that, but of course don't want to mask the real issue.  

OK, the issue seems to be related to hello/hold, as it's working fine today.

Just to recap:

Before, you had a hello every 20s and a hold time of 60s --> adjacency flapping.

Following our discussion, you modified it to:

hello every 5s and hold time of 60s.

- Is this recap correct?

On the other hub and the spokes, are hellos every 20s?

The hello interval has to be equal on all neighbors.

In your case, it looks like your link is not receiving hellos:

1. Before the change, the hold time allowed only 3 missed hellos (60s / 20s).

2. Now you are sending hellos 4 times more often (12 hellos per hold interval).

The question that comes to mind is: what type of link do you have on this hub (I know it's MPLS)? (bandwidth, ...)

Are you sure the link isn't congested?

Could you test by increasing the hello timer (something between 12 and 15 seconds)?

Also:

In comparison, the hub that worked was the primary (if I understood correctly), and its memory usage is low. The one that was flapping has 84+% of its memory used.

Which processes are running? Why is its memory usage higher than on the other hub?

Thanks.


Thanks
Francesco
PS: Please don't forget to rate and select as validated answer if this answered your question

-Is this recap correct? Yes, current hellos are 5 and hold time is 60.

Yes, other hub and all spokes are all 20 and 60.

The type of link is Ethernet, 70 Mbit. I've looked at usage reports and it peaks at around 20%.


GigabitEthernet0/0/2 is up, line protocol is up
Hardware is 6XGE-BUILT-IN, address is fc5b.3940.8302 (bia fc5b.3940.8302)
Description: MPLS ATT
MTU 1500 bytes, BW 102400 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 33/255, rxload 23/255
Encapsulation 802.1Q Virtual LAN, Vlan ID 1., loopback not set
Keepalive not supported
Full Duplex, 1000Mbps, link type is force-up, media type is LX
output flow-control is on, input flow-control is on
ARP type: ARPA, ARP Timeout 04:00:00
Last input 02:22:09, output 00:03:16, output hang never
Last clearing of "show interface" counters 3d21h
Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 9257000 bits/sec, 8147 packets/sec
5 minute output rate 13296000 bits/sec, 3865 packets/sec
1167336956 packets input, 176490258605 bytes, 0 no buffer
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 0 multicast, 0 pause input
609793578 packets output, 286094675411 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

I've increased the hello to 15. No flaps. With a 20-second hello, I would have expected a bunch of flaps by now.

You're right, the memory utilization is excessive. I just realized that it's due to the RMA Cisco sent me: they sent me a router with considerably less memory than the one we initially purchased.

OK. Let's check what's going on with this router and why more of its memory is utilized.

If it's the backup and no traffic should pass through it, could you deactivate some features and keep only DMVPN with EIGRP? Otherwise, you need to add memory and test again with the 20-second hello.

While you are experiencing flapping with the 20-second hello, are you able to take a Wireshark trace (directly on the router, though you are short on memory, or by spanning the port to a Wireshark machine)? I would like to see whether hellos are received (the answer should be yes) and whether the router replies back (sometimes not, since there are disconnections; due to the memory shortage?).

Could you set up an EEM script to capture CPU and memory utilization? The EEM trigger would be the EIGRP disconnection.

Are you comfortable with that?
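Both of these asks can be sketched in IOS-XE terms: Embedded Packet Capture for the on-box trace, and an EEM applet triggered by the DUAL neighbor-down syslog. The capture name, ACL name, and file paths below are illustrative, not taken from the thread:

```
! On-box capture of EIGRP packets on the tunnel (instead of a SPAN/Wireshark setup)
ip access-list extended EIGRP-ONLY
 permit eigrp any any
!
! exec-mode commands to run the capture:
!   monitor capture EIGRPCAP interface Tunnel61 both
!   monitor capture EIGRPCAP access-list EIGRP-ONLY
!   monitor capture EIGRPCAP start
!   monitor capture EIGRPCAP stop
!   monitor capture EIGRPCAP export bootflash:eigrpcap.pcap
!
! EEM applet: snapshot CPU/memory whenever an EIGRP neighbor on Tu61 goes down
event manager applet EIGRP-FLAP-SNAPSHOT
 event syslog pattern "%DUAL-5-NBRCHANGE.*Tunnel61.*is down"
 action 1.0 cli command "enable"
 action 2.0 cli command "show processes cpu sorted | append bootflash:eigrp-flap.txt"
 action 3.0 cli command "show processes memory sorted | append bootflash:eigrp-flap.txt"
 action 4.0 syslog msg "EIGRP flap on Tu61: CPU/memory snapshot appended to eigrp-flap.txt"
```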


Thanks
Francesco
PS: Please don't forget to rate and select as validated answer if this answered your question

Have you checked the routing table? If you have routing loops, you can have EIGRP flapping issues.

Could you verify that there are no routing loops from that particular hub?

Do all spokes attached to this hub have their EIGRP flapping?


Thanks
Francesco
PS: Please don't forget to rate and select as validated answer if this answered your question

Routing looks fine. No excessive CPU or memory usage. I have delay set up so that networks prefer the MPLS tunnel interface. I can see the alternative routes in the topology, but they are marked as FS (feasible successors).

TESTBRANCH2911#sh ip eigrp topology 10.8.32.0/19
EIGRP-IPv4 VR(IWAN) Topology Entry for AS(10)/ID(10.250.254.252) for 10.8.32.0/19
State is Passive, Query origin flag is 1, 1 Successor(s), FD is 490073878, RIB is 3828702
Descriptor Blocks:
10.10.4.1 (Tunnel61), from 10.10.4.1, Send flag is 0x0
Composite metric is (490073878/163840), route is Internal
Vector metric:
Minimum bandwidth is 1544 Kbit
Total delay is 1001250000 picoseconds
Reliability is 255/255
Load is 1/255
Minimum MTU is 1400
Hop count is 1
Originating router is 10.8.62.3
Internal tag is 61
10.10.0.1 (Tunnel60), from 10.118.0.1, Send flag is 0x0
Composite metric is (3342417920/163840), route is Internal
Vector metric:
Minimum bandwidth is 10000 Kbit
Total delay is 50001250000 picoseconds
Reliability is 255/255
Load is 1/255
Minimum MTU is 1400
Hop count is 1
Originating router is 10.8.62.2
Internal tag is 60

The networks hosted at each data center are unique. All the spokes are set to stub. Yes, all spokes' connections bounce, though not all at the same time; it's rolling.
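The stub setup described here matches the "Stub Peer Advertising (CONNECTED) Routes" line in the neighbor detail earlier in the thread. On a spoke, in named mode, it would look something like this (a sketch, assuming the same IWAN virtual-router name and AS 10):

```
router eigrp IWAN
 address-family ipv4 unicast autonomous-system 10
  eigrp stub connected
```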

Hello

Can you post the EIGRP logging errors and also the config of a spoke (NHC)?

Regards

Paul


Please rate and mark as an accepted solution if you have found any of the information provided useful.
This then could assist others on these forums to find a valuable answer and broadens the community’s global network.

Kind Regards
Paul

At the hub (note that this all happens within the same second):

May 23 17:24:43.024: EIGRP: Received Peer Termination TLV from 10.118.6.4

May 23 17:24:43.024: EIGRP: Lost Peer: Total 211 (5149/0/210/0/0)
May 23 17:24:43.025: EIGRP: Received QUERY on Tu61 - paklen 50 nbr 10.10.6.4

May 23 17:24:43.025: EIGRP: Neighbor(10.118.6.4) not yet found

May 23 17:24:43.311: EIGRP: Received HELLO on Tu61 - paklen 36 nbr 10.10.6.4
May 23 17:24:43.311: AS 10, Flags 0x0:(NULL), Seq 0/0 interfaceQ 0/0
May 23 17:24:43.311: EIGRP: Add Peer: Total 240 (5181/0/239/0/0)
May 23 17:24:43.311: EIGRP: Received Peer Info from AS 655361(10.10.6.4), new peer
May 23 17:24:43.311: EIGRP: Adding stub (240 Peers, 238 Stubs)
May 23 17:24:43.311: EIGRP: Add Peer: Total 240 (5181/0/240/0/0)
May 23 13:24:43.311 edt: %DUAL-5-NBRCHANGE: EIGRP-IPv4 10: Neighbor 10.10.6.4 (Tunnel61) is up: new adjacency
May 23 17:24:43.311: EIGRP: Enqueueing UPDATE on Tu61 - paklen 0 nbr 10.10.6.4 tid 0 iidbQ un/rely 0/1 peerQ un/rely 0/0
May 23 17:24:43.312: EIGRP: Received UPDATE on Tu61 - paklen 0 nbr 10.10.6.4

The branch just indicates that 3 hellos were missed and that it's sending a peer termination.

356250: May 23 2016 13:24:42.908 EDT: %DUAL-5-NBRCHANGE: EIGRP-IPv4 10: Neighbor 10.10.4.2 (Tunnel61) is down: holding time expired
356251: May 23 2016 13:24:42.908 EDT: EIGRP: Lost Peer: Total 2 (104/0/1/0/0)

356252: May 23 2016 13:24:43.008 EDT: EIGRP: Build goodbye tlv for 10.10.4.2

356255: May 23 2016 13:24:43.260 EDT: EIGRP: Received HELLO on Tu61 - paklen 532 nbr 10.10.4.2

Carlos Villagran
Cisco Employee

Hi!

Have you tried changing the IPsec profile transform set to transport mode?

R1(config)#crypto ipsec transform-set cisco ah-md5-hmac esp-aes
R1(cfg-crypto-trans)#mode transport

Best regards!

JC

Hi Carlos, it's already set to mode transport.

I am curious to know whether you ever found a solution to this problem. I just experienced this issue on my DMVPN, where EIGRP at 3/4 of my sites started flapping. NHRP stayed up and worked fine, but EIGRP would not.

A reboot fixed it, but I am curious to know whether this is a bug.
