OSPF Adjacency stuck in EXCHANGE/EX-START states

bobbydazzler
Level 1

Hello!

 

I am facing a problem with an OSPF link between an ASR and a 6509. It had been working fine for months when all of a sudden the process went down, and it is now stuck in the EXCHANGE/EXSTART states. I have another OSPF link (for redundancy purposes) between these two devices that is still working fine.

 

Interface on the ASR:

interface GigabitEthernet0/0/2
 description -- 6509 Gi3/1
 ip address xxx.xxx.247.243 255.255.255.254
 no ip redirects
 no ip unreachables
 no ip proxy-arp
 ip nat outside
 ip portbundle outside
 ip ospf authentication message-digest
 ip ospf message-digest-key 1 md5 7 0607032E444F071C11
 ip ospf network point-to-point
 load-interval 30
 negotiation auto
end

Interface on the 6509:

interface GigabitEthernet3/1
 description -- ASR1002 Gi0/0/2
 ip address xxx.xxx.247.242 255.255.255.254
 no ip redirects
 no ip unreachables
 no ip proxy-arp
 ip ospf authentication message-digest
 ip ospf message-digest-key 1 md5 7 0607032E444F071C11
 ip ospf network point-to-point
 ip ospf priority 128
end

I can ping between the two without problem (with different packet sizes and the DF bit set).

 

From the debug log it seems that the ASR receives the DBD from the 6509 and sends its own DBD back, but the 6509 doesn't receive it and keeps re-sending.

 

ASR log:

Dec 29 12:07:17.840: OSPF-65535 ADJ   Gi0/0/2: Rcv DBD from xxx.xxx.247.255 seq 0x159E opt 0x50 flag 0x7 len 32  mtu 1500 state EXCHANGE
Dec 29 12:07:17.840: OSPF-65535 ADJ   Gi0/0/2: Send DBD to xxx.xxx.247.255 seq 0x159E opt 0x50 flag 0x2 len 192
Dec 29 12:07:17.840: OSPF-65535 ADJ   Gi0/0/2: Send with youngest Key 1

6509 log:

Dec 29 12:06:54.202: OSPF: Send DBD to xxx.xxx.247.254 on GigabitEthernet3/1 seq 0x159E opt 0x50 flag 0x7 len 32
Dec 29 12:06:54.202: OSPF: Send with youngest Key 1
Dec 29 12:06:54.202: OSPF: Retransmitting DBD to xxx.xxx.247.254 on GigabitEthernet3/1 [12]
Dec 29 12:06:56.410: OSPF: Send with youngest Key 1
Dec 29 12:06:58.714: OSPF: Send DBD to xxx.xxx.247.254 on GigabitEthernet3/1 seq 0x159E opt 0x50 flag 0x7 len 32
Dec 29 12:06:58.714: OSPF: Send with youngest Key 1
Dec 29 12:06:58.714: OSPF: Retransmitting DBD to xxx.xxx.247.254 on GigabitEthernet3/1 [13]
Dec 29 12:06:59.306: OSPF: Send with youngest Key 1

Any suggestions or pointers on what I should check are more than welcome.

 

Best!


Hello,

 

Traditionally this is caused by an MTU mismatch between the interfaces that form the OSPF neighborship. Try configuring 'ip ospf mtu-ignore' on the interfaces...
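A minimal sketch of that suggestion, assuming the interface names from your post (the command only suppresses the MTU check in the DBD exchange; if the MTUs really do differ, matching them is the cleaner fix):

interface GigabitEthernet0/0/2
 ip ospf mtu-ignore
!
interface GigabitEthernet3/1
 ip ospf mtu-ignore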

Another option would be to reenter the MD5 key on both sides...

Hi

It looks like an authentication problem; remove the line: ip ospf message-digest-key 1 md5 7 0607032E444F071C11

 

And verify again; you could have an invisible space at the end. If it works, check the password and copy and paste it from a notepad.
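A minimal sketch of removing and re-entering the key on one side (<plain-text-key> is a placeholder for your actual key; repeat the same on the ASR interface so both sides match):

interface GigabitEthernet3/1
 no ip ospf message-digest-key 1
 ip ospf message-digest-key 1 md5 <plain-text-key>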

 

Hope it is useful

:-)




>> Mark as helpful or answered if the reply resolved your question; this helps other community members with future queries. <<

Hi Guys,

 

Thank you for your replies. An important point I forgot to mention is that when I reboot the 6509, the OSPF link runs fine for some minutes before going down again. Once it goes down, I cannot do anything to bring it back up except a reboot. Therefore I don't think it is an MTU problem (the debug log would say so) or a passphrase issue.

 

It also seems to affect the BGP routing, as I have a BGP link that goes down as well with the same behaviour (L3 connectivity is fine but the TCP connection to the remote router fails).

 

I believe we are running the latest version of the 12.2 train: Version 12.2(33)SXJ9

 

 

Hi Bob, Georg, Julio,

Please allow me to join.

Indeed, this is unlikely to be an MTU or an authentication issue, especially if the reload of the Cat6K resolves the issue for some time.

As for the BGP problem, let's keep that in mind but at this point we know too little about it to decide if it can be related.

Bob, would you mind answering the following questions for me?

  1. After you reload the Cat6K, the OSPF adjacency comes up initially. Does it truly reach the FULL state on both neighbors?
  2. I understand that this is difficult to test if the devices are in production, but would the reload on the ASR also temporarily resolve the OSPF adjacency problem?
  3. What exactly makes the OSPF session between the ASR and the Cat6K go down? What logging messages are logged on both peers when the OSPF session goes down?
  4. The problem apparently affects unicast communication between the ASR and the Cat6K. Is it possible to confirm that during the time these neighbors are stuck in ExStart/Exchange state, they can still ping each other using their IP addresses on the Gi0/0/2 - Gi3/1 interconnection?
  5. Tying into the previous question, I have sometimes seen issues with NAT configuration that inadvertently allowed unicast IGP packets (EIGRP, OSPF) to be NATted to a different destination, thereby seemingly getting "lost" and breaking the IGP synchronization. This would indeed affect the DBD packets as well. Therefore, while the ASR and the Cat6K are stuck in the ExStart/Exchange state, would it be possible to temporarily remove the ip nat outside command from the ASR Gi0/0/2 interface to see if that resolves the issue? If so, we'd know that the problem is likely NAT-related, and we would need to inspect the NAT configuration closer.
  6. If the suggestion with temporarily disabling the NAT does not resolve the problem, here is another suggestion: While the neighbors are stuck in the ExStart/Exchange state, would it be possible to configure a SPAN session both on the Cat6K Gi3/1 and the ASR Gi0/0/2, and record the traffic passed over this link (a minimal SPAN sketch follows the debug commands below)? We need hard evidence that the Cat6K is sending the DBD packets out Gi3/1 and that they are arriving at the ASR Gi0/0/2, and vice versa - and we need to understand precisely what those packets look like. It would also be helpful to record the following debug outputs on both the ASR and the Cat6K during the same time the SPAN session is captured:

    debug ip ospf adj
    debug ip ospf packet
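For item 6, a rough sketch of a local SPAN session on the Cat6K could look like the following (Gi3/48 is just a hypothetical free port with a capture station attached; adjust to whatever port you have available):

monitor session 1 source interface GigabitEthernet3/1 both
monitor session 1 destination interface GigabitEthernet3/48

A corresponding capture would be needed on the ASR side for Gi0/0/2.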

Thank you! Please keep us posted.

Best regards,
Peter

Hello Peter,

Thank you very much for your reply, it truly was a pleasure to read it this morning! You've raised very interesting questions and suggestions, I will reply with the info I have so far.

Actually my immediate concern is that the BGP session to one of our upstreams connected on the 6509 goes down as well, but I think this is due to the OSPF problem. That is why I am investigating this issue first.

 

To give more details of our setup: for redundancy purposes we have two direct (no equipment in between) physical fiber links connecting the ASR and the 6509, with the exact same config (except the interface IPs). One of the two links stays FULL (Gi4/1-Gi0/0/0) whilst the Gi3/1-Gi0/0/2 link goes down (OSPF-wise). Last night I deactivated Gi3/1-Gi0/0/2 and rebooted the switch to have clean OSPF relationships on the devices, so as of now I cannot check in more detail. One thing is that the neighbor seems to be stuck in the memory of the 6509 once it goes down: even if I remove all the config regarding this link, it still appears in sh ip ospf nei (until I reboot).

 

Today I configured a new OSPF session between the 6509 and a 7206VXR with the same config, and it comes up to the FULL state without problem. I have also connected the BGP upstream to the 7206VXR and the BGP session is up and running fine as well. When I try the same on the 6509, it cannot establish the BGP session with the upstream router (telnetting from the 6509 to the upstream router on port 179 doesn't work).


1. What do you mean by "truly reach"? I just checked the states with sh ip ospf nei and saw that the state is FULL on both the ASR and the 6509, and it stays like that for some time. While the adjacency is up I don't see any problem in the debug logs, until it goes down. Is there a way to check the OSPF relationship between the two devices in more detail?


2. Good point! I have been wondering whether the problem comes from the 6509 or the ASR. Indeed the devices are in production, and it is not easy to reboot the ASR as it has a crucial role in our network: we use it extensively with ISG for handling our clients' sessions.


3. Unfortunately I don't have a debug OSPF log from when the OSPF goes down; I will collect it next time I reload the switch. I just have the basic logs:

Dec 29 07:20:26.696: %OSPF-5-ADJCHG: Process 65535, Nbr xxx.xxx.247.254 on GigabitEthernet3/1 from FULL to DOWN, Neighbor Down: Too many retransmissions
Dec 29 07:21:26.697: %OSPF-5-ADJCHG: Process 65535, Nbr xxx.xxx.247.254 on GigabitEthernet3/1 from DOWN to DOWN, Neighbor Down: Ignore timer expired
CMD: 'sh run ' 07:22:33 GFT Fri Dec 29 2017
Dec 29 07:23:35.690: %OSPF-5-ADJCHG: Process 65535, Nbr xxx.xxx.247.254 on GigabitEthernet3/1 from EXSTART to DOWN, Neighbor Down: Too many retra 



4. Yes, the ICMP is fine, with different packet sizes and/or the DF bit set.

5/6. Very good points. I will try to disable the NAT and do a SPAN to try to find out more about why the OSPF session goes down.

 

Again, thanks a lot for your insights and for taking the time to reply so clearly to my post. I have been banging my head on this issue for a couple of days, so that brightened my day :-)

 

I will keep you posted.

 

Kind regards,

Bob

Hi Bob,

You are very much welcome!

One thing is that the neighbor seems to be stuck in the memory of the 6509 once it goes down: even if I remove all the config regarding this link, it still appears in sh ip ospf nei (until I reboot).

I believe I need more clarification. Are you saying that the OSPF neighborship over Gi3/1 remains displayed in the show ip ospf neighbor output on the 6509 even after you remove all "config regarding this link"? What exactly do you mean by the "config regarding this link"?

In any case, checking whether a particular entry in the show ip ospf neighbor is stale is easy: If the "Dead Time" column is being refreshed back to 00:00:40 every 10 seconds (assuming that you haven't changed the default OSPF timers) then the displayed entry is not a stale entry - it is very much alive.

1. What do you mean by "truly reach"?

I wanted to make sure that both the 6509 and ASR report each other in the FULL state. OSPF states are only loosely coupled - while neighbor A can treat neighbor B as being in the FULL state, neighbor B can still treat neighbor A as being in the LOADING state. What I wanted to clarify is that both the 6509 and the ASR have been verified to consider each other in the FULL state.

3. Unfortunately I don't have a debug OSPF log from when the OSPF goes down; I will collect it next time I reload the switch. I just have the basic logs:

That is perfectly okay for now.

The "Too many retransmissions" message on the 6509 both in FULL and in ExStart state suggests that the unicast communication between the 6509 and the ASR gets broken for some reason. The multicast must still be working bidirectionally, otherwise the neighbors would never proceed beyond the INIT state.

Let me ask you: Is it by any chance possible that either the 6509 or the ASR learns a route to xxx.xxx.247.242/32 or xxx.xxx.247.243/32 that points through some other interface, not through Gi3/1 and Gi0/0/2? We have to account for the possibility that the unicast reachability between the 6509 and the ASR gets impaired because these devices actually learn a more specific route (which can only be a /32 host route in your case) to the neighbor that points out an unexpected interface.
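A quick sketch of how to check this would be to look up the exact route each device uses toward the other's interface address, for example on the Cat6K:

show ip route xxx.xxx.247.243
show ip cef xxx.xxx.247.243

and the corresponding lookups for xxx.xxx.247.242 on the ASR. Anything other than the connected /31 on Gi3/1 / Gi0/0/2 (for example a /32 learned from elsewhere) would explain a broken unicast path.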

Looking forward to hearing from you!

Best regards,
Peter

Hi Peter,

 

First of all, I wish you all the best for this new year.

 

Regarding the issue, I have been working on a workaround to get all our BGP up and running. I am now back to the OSPF adjacency problem. Whilst I was setting up a new router to establish the BGP session (which doesn't work on the 6509), I set up a new OSPF point-to-point session with the 6509 and it worked straight away; however, when I tried to upgrade the firmware of the new router, it went down in a similar way as it does with the ASR. Could that give any clue regarding the problem?

 

So I still have a problem establishing the OSPF session between the ASR and the 6509. I have taken the debug log and a packet capture (but only on the 6509 at the moment, as I have no free port on the ASR to do a SPAN). Please find them attached.

 

Here is the OSPF config on the 6509:

router ospf 65535
 router-id xxx.xxx.247.255
 log-adjacency-changes
 auto-cost reference-bandwidth 10000
 area 0 authentication message-digest
 area 65500 authentication message-digest
 area 65500 stub
 passive-interface default
 no passive-interface GigabitEthernet3/1
 no passive-interface GigabitEthernet3/3
 no passive-interface GigabitEthernet3/9
 no passive-interface GigabitEthernet3/14
 no passive-interface Vlan999
 network xxx.xxx.247.224 0.0.0.1 area 0
 network xxx.xxx.247.240 0.0.0.3 area 65500
 network xxx.xxx.247.244 0.0.0.1 area 0
 network xxx.xxx.247.246 0.0.0.1 area 0
 network xxx.xxx.247.255 0.0.0.0 area 0

ASR:

 

router ospf 65535
 router-id xxx.xxx.247.254
 interface-id snmp-if-index
 auto-cost reference-bandwidth 10000
 area 0 authentication message-digest
 area 65500 authentication message-digest
 area 65500 stub
 passive-interface default
 no passive-interface GigabitEthernet0/0/0
 no passive-interface GigabitEthernet0/0/2
 network xxx.xxx.247.239 0.0.0.0 area 65500
 network xxx.xxx.247.240 0.0.0.15 area 65500

I've tried deactivating the ip nat outside and NetFlow but the problem remains.

 

Best regards

 

 

Hi Bob,

From the outputs and packet captures, I can see that the Cat6509 (x.x.247.242) sends DBD packets to the ASR (x.x.247.243), and the ASR is responding back - this is nicely seen in the snapshot from Wireshark, and also in the debug output from the ASR (output filtered):

Jan  2 15:19:26.159: OSPF-65535 ADJ   Gi0/0/2: Rcv DBD from xxx.xxx.247.255 seq 0x2AC opt 0x50 flag 0x7 len 32  mtu 1500 state EXSTART
Jan  2 15:19:26.159: OSPF-65535 ADJ   Gi0/0/2: Send DBD to xxx.xxx.247.255 seq 0x2AC opt 0x50 flag 0x2 len 232
Jan  2 15:19:30.734: OSPF-65535 ADJ   Gi0/0/2: Rcv DBD from xxx.xxx.247.255 seq 0x2AC opt 0x50 flag 0x7 len 32  mtu 1500 state EXCHANGE
Jan  2 15:19:30.734: OSPF-65535 ADJ   Gi0/0/2: Send DBD to xxx.xxx.247.255 seq 0x2AC opt 0x50 flag 0x2 len 232
Jan  2 15:19:35.496: OSPF-65535 ADJ   Gi0/0/2: Rcv DBD from xxx.xxx.247.255 seq 0x2AC opt 0x50 flag 0x7 len 32  mtu 1500 state EXCHANGE
Jan  2 15:19:35.496: OSPF-65535 ADJ   Gi0/0/2: Send DBD to xxx.xxx.247.255 seq 0x2AC opt 0x50 flag 0x2 len 232
Jan  2 15:19:40.136: OSPF-65535 ADJ   Gi0/0/2: Rcv DBD from xxx.xxx.247.255 seq 0x2AC opt 0x50 flag 0x7 len 32  mtu 1500 state EXCHANGE
Jan  2 15:19:40.136: OSPF-65535 ADJ   Gi0/0/2: Send DBD to xxx.xxx.247.255 seq 0x2AC opt 0x50 flag 0x2 len 232
[ ... cut ... ]

Note that the sequence number of the DBD packets received from the Cat6509 is constantly set to 0x2AC and never increases even though the Cat6509 is the Master in this DBD exchange, and so it should be increasing the DBD sequence number after receiving a response from the ASR. This is not happening, and the debug from the Cat6509 shows only retransmissions of the initial DBD packet:

Jan  2 15:19:26.173: OSPF: Send DBD to xxx.xxx.247.254 on GigabitEthernet3/1 seq 0x2AC opt 0x50 flag 0x7 len 32
Jan  2 15:19:30.749: OSPF: Send DBD to xxx.xxx.247.254 on GigabitEthernet3/1 seq 0x2AC opt 0x50 flag 0x7 len 32
Jan  2 15:19:30.749: OSPF: Retransmitting DBD to xxx.xxx.247.254 on GigabitEthernet3/1 [1]
Jan  2 15:19:35.509: OSPF: Send DBD to xxx.xxx.247.254 on GigabitEthernet3/1 seq 0x2AC opt 0x50 flag 0x7 len 32
Jan  2 15:19:35.509: OSPF: Retransmitting DBD to xxx.xxx.247.254 on GigabitEthernet3/1 [2]
Jan  2 15:19:40.153: OSPF: Send DBD to xxx.xxx.247.254 on GigabitEthernet3/1 seq 0x2AC opt 0x50 flag 0x7 len 32
Jan  2 15:19:40.153: OSPF: Retransmitting DBD to xxx.xxx.247.254 on GigabitEthernet3/1 [3]
Jan  2 15:19:44.805: OSPF: Send DBD to xxx.xxx.247.254 on GigabitEthernet3/1 seq 0x2AC opt 0x50 flag 0x7 len 32
Jan  2 15:19:44.805: OSPF: Retransmitting DBD to xxx.xxx.247.254 on GigabitEthernet3/1 [4]
Jan  2 15:19:49.781: OSPF: Send DBD to xxx.xxx.247.254 on GigabitEthernet3/1 seq 0x2AC opt 0x50 flag 0x7 len 32
Jan  2 15:19:49.781: OSPF: Retransmitting DBD to xxx.xxx.247.254 on GigabitEthernet3/1 [5]
[ ... cut ... ]

The Cat6509 does not hear the DBD replies coming from the ASR - however, the Wireshark capture from the SPAN session you have done on the Cat6509 clearly shows that the DBD packets from the ASR are seen by the Cat6509. That means that they are actually making it all the way to the Cat6509, just - for some reason - they are not being seen by the OSPF process.

From this, I can see four likely reasons:

  • The packets returning from the ASR are encapsulated into Ethernet frames whose addressing is somehow corrupt - their source MAC is not exactly the ASR's MAC on Gi0/0/2, or their destination MAC is not exactly the Cat6509's MAC on Gi3/1
  • The Cat6509 believes that the entire frames coming from ASR are corrupt, and drops them on Ethernet level
  • The IP driver on the Cat6509 is dropping the IP packets, believing they are corrupt for whatever reason. If the IP driver dropped the DBD packets from ASR, OSPF would never see them, so their lack in the debugs would be expected.
  • The hardware forwarding data is misprogrammed, causing the packets to be forwarded somewhere else instead of being punted to the supervisor CPU.

The first hypothesis about corrupt addressing of the frames would be trivially visible in the Wireshark capture - please be so kind as to cross-check the source and destination MAC addresses of the DBD messages sent from the ASR toward the Cat6509 with the MAC addresses of the appropriate interfaces (seen in, for example, show interface ...)

The second hypothesis would be verifiable by checking the output of show interface gi3/1 on the Cat6509, and observing any input error counters increasing.
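As a sketch, something like the following run a couple of times (comparing the counters between runs) would quickly show whether the input errors are moving:

show interface GigabitEthernet3/1 | include errors|CRC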

To verify the third and fourth hypotheses, the easiest way would be to run debug ip packet 100 detail with the ACL 100 defined as follows:

access-list 100 permit ospf host x.x.247.243 host x.x.247.242

We would then be looking for any clues telling us that the Cat6509 thinks the IP packets are wrong (supporting the hypothesis about suspected IP corruption), or there would be no output in the debug at all, supporting the hypothesis that the packets, while arriving at the Cat6509, are not hitting the supervisor after all.
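A small usage sketch, for completeness (keep in mind that on the Cat6509, debug ip packet only shows packets that are actually punted to the route processor, which is exactly the behaviour we want to test):

debug ip packet 100 detail

Let it run while the neighbors are stuck, then use undebug all and no access-list 100 to clean up.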

Best regards,
Peter

Hello Peter,

 

Thanks again for this very interesting reply and for following up on my case. Running the debug command with the ACL you mentioned, I actually don't see any unicast OSPF packets, only the multicast ones, even for the working OSPF neighbors with a catch-all OSPF ACL:

# access-list 100 permit ospf any any 

When I run the same command on the working neighbors, I see all the packets, in and out. Any idea why the packets would not hit the supervisor, or what could prevent them from reaching the sup?

 

BTW, today I had to reboot one of the routers that had a working OSPF relationship with the 6509, and after the reboot I see the same behavior as with the ASR :-(

 

Kind regards

Hi Bob,

For stable OSPF adjacencies in the FULL state, in the absence of topology changes or updates, you won't see unicast OSPF packets - the routers only exchange Hellos. We should, however, be seeing the unicast packets between the ASR and the Cat6509, assuming that these two are still stuck in the ExStart/Exchange state, as the database exchange is unicast at that stage. The fact we don't see them is suspicious indeed.

Have you been able to exclude the first two hypotheses - is the MAC addressing correct? Are there no input errors on the interfaces of the Cat6509?

We need to verify whether the received OSPF unicast packets are being internally forwarded to the supervisor. What I would suggest is using the ELAM tool, which allows you to capture a single packet based on certain matching criteria and display very detailed information about the forwarding decision made by the hardware. Performing an ELAM is safe, but it is a truly low-level tool and is not straightforward to use. I suggest going over the following three documents to get an overview of what ELAM is and how to perform it on Catalyst 6500 series switches with either a Sup720 or a Sup2T:

https://www.cisco.com/c/en/us/support/docs/switches/nexus-7000-series-switches/116648-technote-product-00.html
https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/116643-technote-product-00.html
https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/116644-technote-product-00.html

The goal would be to ELAM-capture the OSPF unicast packets from the ASR to the Cat6509, and check the destination LTL index which ultimately identifies the interface where the packet is forwarded. If it does not point toward the supervisor, we have a hardware misprogramming.
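As a rough sketch only (the exact syntax is covered in the linked documents and differs between Sup720 and Sup2T), a Sup720 capture of the DBD packets from the ASR toward the Cat6K might look something like this, with the slot adjusted to your active supervisor:

show platform capture elam asic superman slot 5
show platform capture elam trigger dbus ipv4 ingress if ip_sa=xxx.xxx.247.243 ip_da=xxx.xxx.247.242
show platform capture elam start
show platform capture elam status
show platform capture elam data

The DEST_INDEX field in the resulting RBUS data is the destination LTL index mentioned above.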

Best regards,
Peter

Okay, I didn't know about ELAM - interesting! I will go through the docs and try this out.

Yes, the MAC addressing is fine and there are no errors on the 6509/ASR interfaces.

Initially I thought it could be a hardware problem, so I did try swapping the 6509, including the line cards.

Best regards

Hello Peter,

 

I have run a couple of ELAM captures with the following SUPERMAN trigger (the IPs are from the 6509-6509 OSPF relationship that went down as well):

 

SCHT01#show platform capture elam status                      
Active ELAM info:
Slot Cpu   Asic   Inst Ver PB Elam
---- --- -------- ---- --- -- ----
5    0   SUPERMAN 0    1.3    Y    
DBUS trigger: FORMAT=IP L3_PROTOCOL=IPV4 IP_SA=xxx.xxx.2.247.247 IP_DA=xxx.xxx.247.246 
Elam capture completed

The input INDEX is correct (VLAN 999), but I believe we are interested in the DEST_INDEX of the RBUS data, which always points to [19] = 0x7FFF, which seems wrong:

SCHT01# remote command switch test mcast ltl-info index 7fff

index 0x7FFF contain ports * empty *

I tried several times, as I believe it captures only one packet at a time. The DEST_INDEX is always the same, 0x7FFF, and L3_PT [8] = 89, which is OSPF, thus it seems that it catches the packets we are interested in. Attached is one of the captures.

 

Best regards,

Bob

 

 

 

Hi Bob,

The destination LTL of 0x7FFF is a drop index - meaning the packets will be silently discarded. You can check well-known LTL values and ranges using the show platform software ltl well-known-index command.

The trick is now in understanding why the device decided to drop the packets. We could try digging some more into the ELAM, but that is becoming more and more a specialized TAC work.

A blind shot here: Do you by any chance utilize any kind of ingress policing, CoPP (Control Plane Policing), hardware limiters, or anything of the sort? Is it possible that the OSPF packets are getting killed by any kind of ingress CoPP/QoS policy?
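As a starting point for that check, a sketch of what I would look at on the Cat6K:

show policy-map control-plane input
show mls rate-limit
show mls qos

Any drops counting up in a class that matches this traffic, or a hardware rate limiter covering punted packets, would be worth correlating with the DBD retransmissions.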

Best regards,
Peter

Hello Peter,

You have pinpointed the reason why the OSPF is failing to establish the adjacencies; thanks again for your help and detailed replies. I have learnt a lot about debugging OSPF problems in the process.
I now need to find out why it is happening. It must be something silly, as usual.
I am not doing any policing of any kind on the 6509, only switching and routing. Policing is done on the ASR.
I will get back to this thread when I find out what is causing the 6509 to drop the packets.

Best regards