EIGRP neighbor flap over IPSec/GRE - but no recursive routing case.

pgasparovic
Level 1

Hello to anybody well familiar with EIGRP/GRE, and QoS too!

Note: below I describe a problem scenario and ask just one introductory question at the end of the text. In parallel I will do test-lab research, with tests focused especially on QoS handling of EIGRP packets over a specific serial link combined with IPSec/GRE.

Problem: On links between routers in our production network we see EIGRP neighbor flaps. They occur daily, ranging from continuous bursts of the error messages shown below to sporadic occurrences. The link is a 2 Mbps serial WIC-1T (with no explicit bandwidth command setting it to 2 Mbps, and no "max-res-bw 100" command either). I took over troubleshooting of this configuration, which was set up previously by somebody else, and I see it lacks careful handling of EIGRP traffic. Besides the two missing commands, there is (not shown yet) a local policy map that marks router-originated EIGRP traffic with a certain IP precedence, and a policy-map on the serial interface that makes EIGRP routing traffic share its allocated space with other traffic according to its IP precedence (the total bandwidth the policy-map can allocate is 75% of the interface bandwidth by default). I will show the policy-map later if needed.

The link is somewhat error-prone too, but not to the same extent as the EIGRP flap messages (which appear daily), as seen from the buffered log.

Flap messages (they can be seen on the opposite box too):

Bardejov#sh logging | begin Feb 10

Feb 10 04:04:39.550: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial1/0, changed state to down

Feb 10 04:04:49.550: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial1/0, changed state to up

Feb 10 04:04:49.550: %LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel701, changed state to down

Feb 10 04:04:49.566: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.206 (Tunnel701) is down: interface down

Feb 10 04:04:49.566: destroy peer: 172.16.7.206

Feb 10 04:04:58.786: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.206 (Tunnel701) is up: new adjacency

Feb 10 04:04:59.550: %LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel701, changed state to up

Feb 10 04:08:23.657: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.206 (Tunnel701) is down: holding time expired

Feb 10 04:08:23.657: destroy peer: 172.16.7.206

Feb 10 04:08:39.941: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.206 (Tunnel701) is up: new adjacency

Feb 10 04:08:39.961: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.206 (Tunnel701) is down: K-value mismatch

Feb 10 04:08:39.965: destroy peer: 172.16.7.206

Feb 10 04:08:44.653: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.206 (Tunnel701) is up: new adjacency

Feb 10 04:16:41.587: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.206 (Tunnel701) is down: holding time expired

Feb 10 04:16:41.587: destroy peer: 172.16.7.206

Feb 10 04:16:45.771: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.206 (Tunnel701) is up: new adjacency

CAUSE ASSUMPTION: I think that when the link is fully loaded for long enough (at least 15 sec, the default EIGRP hold time), poor priority or bandwidth treatment of EIGRP traffic delays the hello packets and causes the neighbor flap.

"K-value mismatch" message: During the first two days of inspecting the problem I mainly investigated the reason for this message. Reports found on the web (not on CCO) point to a K-value configuration mismatch between the EIGRP processes of the two routers when different "metric weights" commands are used. I rule out that cause here, and regard this message as the result of some bad code in IOS that presents it wrongly. We (or I) never configured that command!

Config excerpt (the opposite box has the same config, but with other IP addresses):

interface Tunnel701

ip address 172.16.7.205 255.255.255.252

ip mtu 1600

ip tcp adjust-mss 1370

tunnel source 10.107.0.205

tunnel destination 10.107.0.206

tunnel path-mtu-discovery

crypto map CM2

!

interface Serial1/0

description Bardejov-RLAN,BJ-BJ_NP_16

ip address 10.107.0.205 255.255.255.252

service-policy output POLICY_RLAN

crypto map CM2

crypto ipsec df-bit copy

crypto ipsec fragmentation before-encryption

!

router eigrp 1

passive-interface Serial1/0

passive-interface Loopback207

passive-interface FastEthernet0/0

network 10.207.207.75 0.0.0.0

network 172.16.7.0 0.0.0.255

no auto-summary

!

THIS IS IMPORTANT NOW! :

Bardejov#sh interfaces serial 1/0

Serial1/0 is up, line protocol is up

Hardware is PowerQUICC Serial

Description: Bardejov-RLAN,BJ-BJ_NP_16

Internet address is 10.107.0.205/30

MTU 1500 bytes, BW 1544 Kbit, DLY 20000 usec,

reliability 255/255, txload 1/255, rxload 1/255

Encapsulation HDLC, loopback not set

Bardejov#sh interfaces tunnel 701

Tunnel701 is up, line protocol is up

Hardware is Tunnel

Internet address is 172.16.7.205/30

MTU 1514 bytes, BW 9 Kbit, DLY 500000 usec,

reliability 255/255, txload 28/255, rxload 28/255

Encapsulation TUNNEL, loopback not set

Keepalive not set

Tunnel source 10.107.0.205, destination 10.107.0.206

Tunnel protocol/transport GRE/IP, key disabled, sequencing disabled

Tunnel TTL 255

Checksumming of packets disabled, fast tunneling enabled

Path MTU Discovery, ager 10 mins, MTU 0, expires never

Last input 00:00:04, output 00:00:04, output hang never

Last clearing of "show interface" counters never

Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 19010

Queueing strategy: fifo

Output queue: 0/0 (size/max)
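
Note on the output above: Serial1/0 reports BW 1544 Kbit, the default for a serial interface, although the link is described as 2 Mbps. A minimal sketch of the two interface commands mentioned as missing earlier in this post follows; the value 2048 is an assumption based on the stated 2 Mbps line rate and should be checked against the real circuit speed.

```
interface Serial1/0
 ! assumed value: 2 Mbps circuit (the interface currently shows the 1544 Kbit default)
 bandwidth 2048
 ! allow the policy-map to allocate up to 100% instead of the default 75%
 max-reserved-bandwidth 100
```

If POLICY_RLAN uses percentage-based allocations, they are computed from this configured bandwidth figure, not from the physical clock rate.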

QUESTION (another may come later):

By default EIGRP uses at most 50% of the interface bandwidth for its traffic. Which interface is the "first enter" interface in this scenario? I think it is the GRE interface, and its bandwidth of 9 kbps, that matters for the routing behaviour.

Thanks to anybody for giving me hints, and possibly also analyzing this with me in some depth.

4 Replies 4

ruwhite
Level 7

First, on the k value mismatch:

Feb 10 04:08:39.961: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.206 (Tunnel701) is down: K-value mismatch

This could be because of a new feature in EIGRP, committed just recently, called the "goodbye message." With this feature, EIGRP sends a "goodbye" to a neighbor if it is being deconfigured or shut down, to keep its neighbors from waiting on their hold timer to take the neighbor down (it makes the network converge more quickly around a neighbor known to be going down). If a router times out a neighbor due to its hold timer expiring, I think we will also send a goodbye message to the neighbor we are timing out.

We just redid the documentation on CCO to insert a note about this, but I don't see the change out there yet.

"By default EIGRP uses at most 50% of the interface bandwidth for its traffic. Which interface is the "first enter" interface in this scenario? I think it is the GRE interface, and its bandwidth of 9 kbps, that matters for the routing behaviour."

EIGRP is going to pull its bandwidth (from which to calculate the 50%) from the interface descriptor, which is the tunnel in this case. Since the tunnel is set to 9kb, EIGRP is only going to use 4.5kb, which, if there are a good number of routes here, may not be enough.
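
As a rough sketch of what that means in configuration terms: the bandwidth EIGRP works from can be set on the tunnel interface itself. The value 2048 below is an assumption taken from the 2 Mbps figure in the original post; GRE and IPSec overhead make the usable rate somewhat lower, and the interface bandwidth also feeds the EIGRP metric, so the same value would normally go on both ends.

```
interface Tunnel701
 ! assumed value; "show interfaces tunnel 701" currently reports the 9 Kbit default
 bandwidth 2048
```

With a bandwidth of 2048 Kbit, the default 50% share gives EIGRP about 1024 Kbit to pace its packets against, instead of 4.5 Kbit.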

The next step is to look at what the logs on the other router are saying about the neighbor reset. There are several possible cases here:

-- If the logs are showing a stuck in active, then you probably need to increase the bandwidth on this link a bit. I doubt this is the case, but it is possible.

-- If the logs indicate that you are taking the other neighbor down because of a hold timer expiration, then you could be seeing a problem with the link dropping too many packets.

Can you get to the other router, the other end of the tunnel? If you post the logs from that router relating to this EIGRP neighbor, we can probably do a little more analysis, and help figure it out more.

:-)

Russ.W

Hi Russ,

thank you for getting involved, answering my question (I knew that! :-) and providing the first clues.

Here is the log from the opposite router for that day; the timestamps are reliable, as both boxes are NTP-fed from a common source.

Feb 10 04:04:49.534: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.205 (Tunnel701) is down: holding time expired

Feb 10 04:05:02.338: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.205 (Tunnel701) is up: new adjacency

Feb 10 04:08:39.957: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.205 (Tunnel701) is down: peer restarted

Feb 10 04:08:43.469: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.205 (Tunnel701) is up: new adjacency

Feb 10 04:16:45.788: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.205 (Tunnel701) is down: peer restarted

Feb 10 04:16:49.256: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.205 (Tunnel701) is up: new adjacency

Feb 10 04:20:19.263: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.205 (Tunnel701) is down: holding time expired

Feb 10 04:20:31.703: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.205 (Tunnel701) is up: new adjacency

Feb 10 04:25:24.690: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.205 (Tunnel701) is down: peer restarted

Feb 10 04:25:25.150: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.205 (Tunnel701) is up: new adjacency

Feb 10 14:00:20.574: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.16.7.205 (Tunnel701) is down: peer restarted

Today I have just completed my lab with all the cabling and tunnel transport backbone connections, and tomorrow I will run the hard overload tests with customer traffic and EIGRP configured. EIGRP is a great protocol, but I think it is also very sensitive to bandwidth and traffic treatment. I know that, so I will start the test as a smart guy, then I will introduce it into the hell. :))

Bye now, I'm looking forward to your next comment.

Peter.

Okay, so one end is reporting k value mismatches, and some hold timer expirations, and the other end is reporting hold timer expirations. I'd say you're having a problem getting packets across this link. :-( You could wind your hello timers out, so you can drop more packets without killing the neighbor, but I'm not certain how much of a help this will be. Perhaps setting the hello interval down much lower, but leaving the hold timer up much higher, so there's a 4 or 5 to 1 ratio, rather than a 3 to 1.
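
A minimal sketch of that timer change, assuming AS 1 and Tunnel701 from the posted config. The values 3 and 15 are illustrative only (a 5 to 1 ratio), not tested recommendations:

```
interface Tunnel701
 ! send hellos every 3 seconds (illustrative value)
 ip hello-interval eigrp 1 3
 ! advertise a 15 second hold time, so five hellos can be lost before the neighbor drops
 ip hold-time eigrp 1 15
```

The hold time configured here is the value this router advertises to its neighbor, so the same pair of commands would be needed on the other end of the tunnel for both directions to get the wider ratio.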

Of course, if you're losing this many packets across the link, it doesn't tend to make me think it's going to work well for data, either. At least you aren't getting retransmission timeout exceeds, which would be a much harder problem to diagnose and try to fix....

Anyway, my next step would be to look at the interface counters, and see what's up there. Are we really losing that much traffic on the link? If you ping across the link, are you seeing a lot of dropped packets, or does it look like it's just EIGRP having a problem on this link? If it's just EIGRP, I would increase the bandwidth percent, and play with the hello and hold timers (above), and see if I could get it to stabilize. I would reduce the number of routes being transmitted across the link to the minimum possible (which you may have already done).
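
The checks and the bandwidth-percent change could look roughly like this. The addresses and interface names come from the posted config; the repeat count and the 200 percent figure are assumptions chosen for illustration (values above 100 are accepted, which helps when the interface bandwidth is set artificially low):

```
! exec mode: check for loss across the tunnel and on the physical link
ping 172.16.7.206 repeat 1000
show interfaces Serial1/0
show ip eigrp neighbors detail
!
! config mode: let EIGRP use more than the default 50% of the tunnel bandwidth
interface Tunnel701
 ip bandwidth-percent eigrp 1 200
```

With the tunnel still at its 9 Kbit default, 200 percent corresponds to 18 Kbit for EIGRP, so correcting the tunnel bandwidth itself is likely to have a larger effect.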

I would possibly try unicast neighbors--it shouldn't matter on a point-to-point link like a tunnel, but it might. I would also make certain I test a full query range across the link. There's no point in stable steady state neighbors if a pull of a cable, causing a full set of queries to be sent across the link, is going to reset the neighbors....
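
A sketch of the unicast neighbor setup, assuming AS 1 and the tunnel addresses from the posted config:

```
router eigrp 1
 ! on Bardejov (172.16.7.205): peer with the far end by unicast
 neighbor 172.16.7.206 Tunnel701
!
! on the opposite router the mirror image is needed:
!  router eigrp 1
!   neighbor 172.16.7.205 Tunnel701
```

Once a static neighbor is configured on an interface, EIGRP stops exchanging multicast packets on that interface, so the statement has to be present on both ends or the adjacency will not form.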

Anyway, this might be getting long'ish on the forum (?). While I don't mind continuing here, you can also email me off line with more information, if you want, and I might be able to offer suggestions or help.

:-)

Russ.W

riw@cisco.com