Re: Looking for an explanation of weird OSPF behavior

thomas-ravail · ‎11-24-2024

Hi Everyone,

A weird routing behavior occurred between two routers, and I'm looking for an explanation. We configured OSPF (broadcast network type) between two routers (R1 and R2), with a default route redistributed from BGP into OSPF on R1. R2 was also advertising some routes.

For a project, we applied an access-list on R1, on the interface facing R2. By mistake, this ACL covered the interconnection IP subnet used between R1 and R2, meaning unicast packets from R2 to R1 were dropped. This interconnection subnet should have been excluded from the ACL. The OSPF neighborship didn't go down because the HELLO packets are sent via multicast to 224.0.0.5. After 35 hours, the neighborship was still UP but none of the OSPF routes were announced anymore (causing an incident).

We performed a shut/no shut on one interface and OSPF stayed DOWN (due to the ACL dropping the unicast packets). It came UP immediately after we removed the ACL, and the routes were exchanged.

--> We're now trying to understand why the OSPF routes disappeared from the routing table after this huge delay while the OSPF neighborship was UP. When I reproduce it in GNS3, all HELLO packets as well as LS_update and LS_Acknowledge packets (when doing a new advertisement) are sent via multicast and not unicast.

Thanks for your help.

MHM Cisco World · ‎11-24-2024

It depends on your network type.

thomas-ravail · ‎11-25-2024

Hi,

We are in broadcast mode.

MHM Cisco World · ‎11-25-2024

Do you use the neighbor command under OSPF?

Can you share

show ip ospf inter brief <<- from the real network

MHM

balaji.bandi · ‎11-24-2024

Can we see the ACL applied?

BB

thomas-ravail · ‎11-25-2024

Hi,

The applied ACL is as follows:

deny ip any 192.168.0.0 0.0.255.255
permit ip any any

For security reasons, I changed the subnet IP, but the situation is the same: the /16 covered by the deny statement includes the /24 interconnection subnet configured between the two routers.
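To make the effect of this ACL concrete, here is a minimal Python sketch of Cisco-style wildcard matching with first-match evaluation. It only models the destination field (the source is `any` in both entries). The interconnect subnet 192.168.10.0/24, the address 192.168.10.1, and the "fixed" ACL ordering are hypothetical values chosen for illustration, not taken from the real network.

```python
import ipaddress

def wildcard_match(addr, base, wildcard):
    """Cisco-style wildcard match: bits set to 1 in the wildcard are ignored."""
    a = int(ipaddress.IPv4Address(addr))
    b = int(ipaddress.IPv4Address(base))
    w = int(ipaddress.IPv4Address(wildcard))
    care = ~w & 0xFFFFFFFF  # bits that must be equal
    return (a & care) == (b & care)

ANY = ("0.0.0.0", "255.255.255.255")

def evaluate(acl, dst):
    """Return the action of the first entry whose destination matches.

    Falls back to the implicit deny when nothing matches.
    """
    for action, (base, wildcard) in acl:
        if wildcard_match(dst, base, wildcard):
            return action
    return "deny"

# The two entries posted above (destination part only).
ORIGINAL_ACL = [
    ("deny", ("192.168.0.0", "0.0.255.255")),
    ("permit", ANY),
]

# Hypothetical fix: permit the interconnect /24 before the /16 deny.
FIXED_ACL = [("permit", ("192.168.10.0", "0.0.0.255"))] + ORIGINAL_ACL

ALL_SPF_ROUTERS = "224.0.0.5"        # multicast destination of the hellos
R1_INTERCONNECT_IP = "192.168.10.1"  # hypothetical unicast destination

print(evaluate(ORIGINAL_ACL, ALL_SPF_ROUTERS))     # permit
print(evaluate(ORIGINAL_ACL, R1_INTERCONNECT_IP))  # deny
print(evaluate(FIXED_ACL, R1_INTERCONNECT_IP))     # permit
```

With the original entries, multicast hellos to 224.0.0.5 pass while any unicast packet addressed to the interconnect subnet is dropped, which matches the symptom described in the first post.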

paul driver · ‎11-25-2024

Hello
I believe what you are seeing is correct. The OSPF adjacency was formed before any ACL was applied, so the initial IP unicast connectivity required for the DR/BDR election and for the adjacency to form was not blocked. When you applied the ACL, and depending on how it was applied (egress, ingress, or both) and on which side of the peering (DR/BDR), the multicast hellos kept the OSPF adjacency alive. I envisage the routes were eventually withdrawn when the LSAs reached MaxAge, because of the ACL restriction.

When you manually tore down the adjacency with the ACL still applied, the new adjacency would be expected to fail, as the routers would not be able to complete the DR/BDR election and would become stuck in the exchange/exstart state due to the ACL blocking unicast.
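The reasoning above can be sketched as a small lookup of how OSPF packets are typically addressed on a broadcast segment. This is a simplification based on standard RFC 2328 behavior as I understand it, not a capture from the OP's network; the class and function names are made up for the illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OspfPacket:
    name: str
    destination: str  # "multicast" or "unicast"
    note: str

# Simplified view of a broadcast segment (assumption: standard RFC 2328 behavior).
BROADCAST_SEGMENT_PACKETS = [
    OspfPacket("Hello", "multicast", "sent to 224.0.0.5"),
    OspfPacket("Database Description", "unicast", "ExStart/Exchange"),
    OspfPacket("Link State Request", "unicast", "Loading"),
    OspfPacket("Link State Update (flooding)", "multicast", "224.0.0.5 or 224.0.0.6"),
    OspfPacket("Link State Update (retransmission)", "unicast", "sent directly to the neighbor"),
    OspfPacket("Link State Ack (delayed)", "multicast", "224.0.0.5 or 224.0.0.6"),
    OspfPacket("Link State Ack (direct)", "unicast", "reply to one neighbor"),
]

def dropped_by_unicast_filter(packets):
    """Names of the packet types lost if only unicast is filtered."""
    return [p.name for p in packets if p.destination == "unicast"]

for name in dropped_by_unicast_filter(BROADCAST_SEGMENT_PACKETS):
    print(name)
```

Under these assumptions, an established adjacency can keep running on multicast hellos and floods, while forming a new adjacency needs the unicast Database Description exchange and so cannot get past exstart.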

Please rate and mark as an accepted solution if you have found any of the information provided useful.
This then could assist others on these forums to find a valuable answer and broadens the community’s global network.

Kind Regards
Paul

thomas-ravail · ‎11-25-2024

Hi Paul,

Yes, I couldn't agree more with your analysis, but the delay between the time we applied the ACL and the incident is huge (35 hours).
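To put that delay in numbers, here is a rough sketch using the standard RFC 2328 constants (LSRefreshTime of 30 minutes and MaxAge of 1 hour). It assumes, purely for illustration, that no refresh at all gets through once the ACL is applied, which may not be what happened in the real network.

```python
LS_REFRESH_TIME_S = 30 * 60  # originator re-floods its LSAs at this age
MAX_AGE_S = 60 * 60          # an LSA is flushed when it reaches this age

def seconds_until_max_age(age_when_blocked_s):
    """Time left before an LSA expires if it is never refreshed again."""
    if not 0 <= age_when_blocked_s <= MAX_AGE_S:
        raise ValueError("LSA age must be between 0 and MaxAge")
    return MAX_AGE_S - age_when_blocked_s

observed_delay_s = 35 * 3600  # delay reported in this thread

# Worst case: the LSA had just been refreshed when the ACL was applied.
print(seconds_until_max_age(0))      # 3600
print(observed_delay_s / MAX_AGE_S)  # 35.0
```

If pure LSA aging were the cause, the routes would be expected to disappear within about an hour, so a 35-hour delay suggests refreshes were still getting through for most of that time.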

Joseph W. Doherty · ‎11-25-2024

Since you describe the connecting network as a /24, and the OSPF routers were not in p2p mode, I'm wondering if the "different" behavior you describe might also have something to do with which router was DR and which was BDR, which might have swapped between the two instances. (What I'm thinking is that, possibly, the BDR is "happy" hearing from a DR, but the DR doesn't need to hear from a BDR. I.e., if the ACL blocks replies from BDR to DR, all is okay, but if the BDR cannot hear from the DR, it should become DR.)

As for the routes disappearing, I agree with Paul that this should be due to LSAs aging out (30 minutes?).

In the big scheme of things, whatever the "odd" behavior you observed was, blocking some OSPF routing protocol traffic between routers is very likely to cause issues, and you may have just stumbled across a case OSPF doesn't handle well (as, without a misconfigured ACL, this would be rather unusual in a "normally" working network - abnormal here meaning only some kinds of traffic being blocked in just one direction, a sort of selective bidirectional traffic issue).

In other words, other than curiosity about why you observed the different behaviors, you "know" you had a misapplied ACL, so identifying the specific "cause" of the observed behavior doesn't really help us much. I mean, how would you work around, fix, or avoid this problem beyond not misapplying an ACL?

I'm guessing your interest in the "cause" lies mostly in "After 35 hours, the neighboring was still UP but none of the OSPF routes were announced anymore (causing an incident)", i.e. it took some time to identify this issue because the OSPF neighbor still showed an adjacency, correct?

thomas-ravail · ‎11-25-2024

Hi Joseph,

Thanks for your answer. You're right, we know how to fix this issue (just exclude the interconnection subnet from the ACL), but not being able to fully explain the incident is frustrating.

An important piece of information received today: during the incident, the OSPF status was UP on both R1 and R2, but R1 was showing these logs:

Nov 21 06:14:04.115 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Nov 21 06:15:04.115 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from DOWN to DOWN, Neighbor Down: Ignore timer expired
Nov 21 06:17:16.242 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Nov 21 06:18:16.243 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from DOWN to DOWN, Neighbor Down: Ignore timer expired
Nov 21 06:20:22.467 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Nov 21 06:21:22.467 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from DOWN to DOWN, Neighbor Down: Ignore timer expired
Nov 21 06:23:35.654 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions

Based on these log outputs, the OSPF neighbors were trying to establish the adjacency, yet on both sides the OSPF status was UP (this was confirmed many times during the incident).
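The timing pattern in these log lines can be extracted with a short sketch like the one below. The timestamps and transitions are copied from the log above; the function name is hypothetical, and only the time of day is parsed since all entries are from Nov 21.

```python
from datetime import datetime

# (time of day, state transition) copied from the R1 log above.
LOG_EVENTS = [
    ("06:14:04.115", "EXSTART to DOWN"),
    ("06:15:04.115", "DOWN to DOWN"),
    ("06:17:16.242", "EXSTART to DOWN"),
    ("06:18:16.243", "DOWN to DOWN"),
    ("06:20:22.467", "EXSTART to DOWN"),
    ("06:21:22.467", "DOWN to DOWN"),
    ("06:23:35.654", "EXSTART to DOWN"),
]

def gaps(events):
    """Seconds elapsed before each event, measured from the previous one."""
    times = [datetime.strptime(t, "%H:%M:%S.%f") for t, _ in events]
    return [
        (events[i + 1][1], round((times[i + 1] - times[i]).total_seconds(), 3))
        for i in range(len(times) - 1)
    ]

for transition, seconds in gaps(LOG_EVENTS):
    print(f"{transition}: {seconds} s after the previous entry")
```

The "Ignore timer expired" entries follow each failure by almost exactly 60 seconds, and each new EXSTART attempt fails roughly 126 to 133 seconds after that, so the adjacency was cycling about every three minutes rather than staying stable.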

Joseph W. Doherty · ‎11-25-2024

As Paul also mentioned, he expected adjacency to get stuck in exchange/exstart, which appears to be confirmed by your posted log entries.

What commands were you using that showed adjacency fully established while those log entries were generated?

BTW, possibly the point I was trying to make in my prior reply was unclear. Let's assume the behavior is exactly as you describe and you confirm it was "incorrect"; since the triggering cause was also incorrect, what is the goal? Again, is the issue that the adjacency appeared to be fully established when it wasn't? If so, it sounds like a potential TAC bug case.

You were unable to recreate the issue in GNS3?

MHM Cisco World · ‎11-26-2024

In this lab all traffic is sent as multicast, as you also saw in your lab,
and hence even with the ACL applied outbound the OSPF adjacency does not go down.

But in your real network the peers use unicast for ACKs, and these are dropped by the outbound ACL.
This makes OSPF retransmit the update; it takes 5 seconds to retransmit and the retry count can be 1-255.
After these retries, OSPF knows there is a problem and brings the neighbor down.
And I think it was not 35 hr but 0.35 hr.
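The arithmetic behind that estimate can be sketched as follows. The 5-second interval and the 1-255 range are the figures quoted above; the value 25 is only a hypothetical example of a limit inside that range, not a confirmed default for the OP's platform.

```python
def seconds_until_give_up(retransmit_interval_s, retry_limit):
    """Total time spent retransmitting before the neighbor is declared down."""
    if retransmit_interval_s <= 0 or retry_limit < 1:
        raise ValueError("interval must be positive and limit at least 1")
    return retransmit_interval_s * retry_limit

RETRANSMIT_INTERVAL_S = 5  # figure quoted in the post above

for limit in (1, 25, 255):  # 25 is a hypothetical example value
    total = seconds_until_give_up(RETRANSMIT_INTERVAL_S, limit)
    print(f"limit={limit}: {total} s = {total / 3600:.3f} h")
```

At the top of the quoted range (255 retries), the total is 1275 seconds, or about 0.35 hours, which may be where the 0.35 hr figure comes from. Even then, retransmissions alone account for minutes, not 35 hours.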

MHM

paul driver · ‎11-26-2024

Hello @MHM Cisco World
This doesn't show the true issue of the OP.

I don't have access to test this myself; however, I envisage you need to:

Allow the OSPF adjacency to form, so that the LSDB is created and the RIB is populated with some routes.
At that point, check the LSDB (LSA age etc.), THEN apply the ACL as per the OP and check the LSDB and the OSPF adjacency again. At that point it should NOT fail or drop.

Lastly, clear the OSPF process or drop an interface. This should tear down the OSPF adjacency, and at this point the new OSPF adjacency should fail.

Kind Regards
Paul

MHM Cisco World · ‎11-26-2024

Apologies,
your ack about the retransmit need to refresh

It brings the OSPF adjacency down after a specific number of retries.

Troubleshoot the "OSPF Neighbor Down: Too Many retransmissions" Error Message

MHM

paul driver · ‎11-26-2024

Hello @MHM Cisco World

@MHM Cisco World wrote:

your ack about the retransmit need to refresh

I'm not sure what you mean by the above comment. I think you are referring to the OP's log entry:

Nov 21 06:14:04.115 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions

And then to my comment about the applied ACL not being so restrictive, and so still allowing the LSAs to be refreshed up to the point the OSPF adjacency was torn down. If so, my statement referred to an already stable OSPF adjacency with the ACL applied afterwards, but BEFORE any OSPF process was cleared; it is at that point (the clearing) that I envisaged you would see the above error being logged.

Kind Regards
Paul