11-24-2024 09:15 AM
Hi Everyone,
We saw some odd routing behavior between two routers and I'm looking for an explanation. We configured OSPF (broadcast network type) between two routers (R1 and R2), with a default route redistributed from BGP into OSPF on R1. R2 was also advertising some routes.
For a project, we applied an access list on R1, on the interface facing R2. By mistake, this ACL included the interconnection subnet used between R1 and R2, so unicast packets from R2 to R1 were dropped. The interconnection subnet should have been excluded from the ACL. The OSPF neighborship didn't go down because the hello packets are sent via multicast to 224.0.0.5. After 35 hours, the neighborship was still UP but none of the OSPF routes were being announced anymore (causing an incident).
We performed a shut/no shut on one interface and OSPF stayed DOWN (due to the ACL dropping the unicast packets). It came UP immediately after we removed the ACL, and the routes were exchanged.
--> We're now trying to understand why the OSPF routes disappeared from the routing table after such a long delay while the OSPF neighborship was UP. When I reproduce it in GNS3, all hello packets, as well as LS Update and LS Acknowledge packets (when making new advertisements), are sent as multicast and not as unicast.
Thanks for your help.
11-24-2024 09:37 AM
It depends on your network type.
11-25-2024 05:58 AM
Hi,
We are in broadcast mode.
11-25-2024 06:07 AM - edited 11-25-2024 06:15 AM
Do you use the neighbor command under OSPF?
Can you share:
show ip ospf interface brief <<- from the real network
MHM
11-24-2024 10:37 AM
Can we see the ACL applied?
11-25-2024 06:01 AM
Hi,
The ACL applied is as follows:
deny ip any 192.168.0.0 0.0.255.255
permit ip any any
For security reasons, I changed the subnet IP, but the situation is the same. The interconnection subnet, a /24 configured between both routers, falls inside this global /16 deny.
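To see why the hellos survive this ACL while unicast OSPF packets do not, here is a minimal Python sketch of first-match ACL evaluation with Cisco-style wildcard masks. It only models the destination match (the source is `any` in both entries). The address 192.168.12.1 for R1 is a made-up placeholder, since the real interconnection subnet was not posted.

```python
from ipaddress import IPv4Address

def wildcard_match(addr, base, wildcard):
    """Cisco-style match: bits set in the wildcard mask are 'don't care'."""
    care = ~int(IPv4Address(wildcard)) & 0xFFFFFFFF
    return (int(IPv4Address(addr)) & care) == (int(IPv4Address(base)) & care)

# (action, destination base, destination wildcard), in the order posted above.
ACL = [
    ("deny", "192.168.0.0", "0.0.255.255"),
    ("permit", "0.0.0.0", "255.255.255.255"),  # "permit ip any any"
]

def evaluate(dst):
    """Return the action of the first entry whose destination matches."""
    for action, base, wildcard in ACL:
        if wildcard_match(dst, base, wildcard):
            return action
    return "deny"  # implicit deny that ends every ACL

R1_ADDR = "192.168.12.1"  # hypothetical address on the interconnection /24

print("hello to 224.0.0.5:", evaluate("224.0.0.5"))
print("unicast to R1:     ", evaluate(R1_ADDR))
```

In this model the multicast destinations 224.0.0.5 and 224.0.0.6 fall through to the permit, while anything addressed to R1's interface address hits the deny.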
11-25-2024 09:19 AM
Hello
I believe what you are seeing is correct. The OSPF adjacency was formed before any ACL was applied, so the initial IP unicast connectivity required for the DR/BDR election and for the adjacency to form was not blocked. Once you applied the ACL, and depending on how it was applied (egress, ingress, or both) and on which side of the peering (DR/BDR), the multicast hellos kept the OSPF adjacency alive, and I envisage the routes were eventually withdrawn when the LSAs reached MaxAge because of the ACL restriction.
When you manually tore down the adjacency with the ACL still applied, the new adjacency should fail, as the routers would not be able to complete the DR/BDR election and would become stuck in the EXCHANGE/EXSTART state due to the ACL blocking unicast.
11-25-2024 02:57 PM
Hi Paul,
Yes, I couldn't agree more with your analysis, but the delay between the time we applied the ACL and the incident is huge (35 hours).
11-25-2024 10:17 AM
Since you describe connecting network was a /24, and OSPF routers were not in p2p mode, I'm wondering if the "different" behavior you describe, might also have something to do with which router was DR and which was BDR, which might have changed/swapped between the two instances. (What I'm thinking of, is possibly, BDR is "happy" hearing from a DR, but DR doesn't need to hear from a BDR. I.e. if ACL blocks replies from BDR to DR, all okay, but if BDR cannot hear from DR, it should become DR.)
As for the routes disappearing, I agree with Paul that this should be due to the LSAs aging out (30 minutes?).
In the big scheme of things, whatever the "odd" behavior you observed was, blocking some OSPF routing protocol traffic between routers is very likely to cause issues, and you may have just stumbled across a case OSPF doesn't handle well (without a misconfigured ACL, this would be rather unusual in a "normally" working network; the abnormal part being only some kinds of traffic blocked in only one direction, a sort of selective bidirectional traffic issue).
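For reference, a small Python sketch of the aging arithmetic. The two constants are the RFC 2328 values (LSRefreshTime is 30 minutes, MaxAge is 1 hour); the function name is just illustrative. It assumes refreshes stop arriving completely, which may not be what happened here.

```python
# Architectural constants from RFC 2328 (OSPFv2), in seconds.
LS_REFRESH_TIME = 1800  # an originator re-floods its own LSAs every 30 minutes
MAX_AGE = 3600          # an LSA that reaches this age is flushed from the LSDB

def seconds_until_max_age(ls_age_when_refresh_stops):
    """Time left before an LSA that is no longer refreshed reaches MaxAge."""
    return max(0, MAX_AGE - ls_age_when_refresh_stops)

# Slowest case: the LSA had just been refreshed (age 0) when refreshes stopped.
worst_case = seconds_until_max_age(0)
observed_delay = 35 * 3600  # the 35 hours reported in this thread

print("minutes until age-out, slowest case:", worst_case / 60)
print("MaxAge periods that fit in 35 hours:", observed_delay / worst_case)
```

Under that assumption an LSA would age out within an hour, so aging alone does not account for a 35-hour delay.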
In other words, other than being curious, about why you had the different observed behaviors, you "know" you had a misapplied ACL, so identifying the specific "cause" of the observed behavior doesn't really help us much. I mean, how would you work around "fixing" or avoiding this problem beyond not misapplying an ACL?
I'm guessing your interest in the "cause" lies largely in "After 35hours, the neighboring was still UP but none of the OSPF routes were announced anymore (causing an incident)..", i.e. it took some time to identify this issue because the OSPF neighbor still showed an adjacency, correct?
11-25-2024 02:56 PM
Hi Joseph,
Thanks for your answer. You're right, we know how to fix this issue (just exclude the interconnection subnet from the ACL), but not being able to fully explain the incident is frustrating.
An important piece of information received today: during the incident, the OSPF status was UP on R1 and R2, but R1 was showing these logs:
Nov 21 06:14:04.115 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Nov 21 06:15:04.115 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from DOWN to DOWN, Neighbor Down: Ignore timer expired
Nov 21 06:17:16.242 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Nov 21 06:18:16.243 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from DOWN to DOWN, Neighbor Down: Ignore timer expired
Nov 21 06:20:22.467 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Nov 21 06:21:22.467 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from DOWN to DOWN, Neighbor Down: Ignore timer expired
Nov 21 06:23:35.654 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Based on these log outputs, the OSPF neighbors were trying to establish the adjacency, yet on both sides the OSPF status was UP (this was confirmed many times during the incident).
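The timing in these entries can be checked with a short Python sketch. It assumes all entries fall on the same day and only looks at the time of day and the reason text; the function names are illustrative.

```python
import re

LOG = """\
Nov 21 06:14:04.115 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Nov 21 06:15:04.115 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from DOWN to DOWN, Neighbor Down: Ignore timer expired
Nov 21 06:17:16.242 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Nov 21 06:18:16.243 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from DOWN to DOWN, Neighbor Down: Ignore timer expired
Nov 21 06:20:22.467 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
Nov 21 06:21:22.467 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from DOWN to DOWN, Neighbor Down: Ignore timer expired
Nov 21 06:23:35.654 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
"""

# Captures hour, minute, second, millisecond and the "Neighbor Down" reason.
PATTERN = re.compile(r"(\d\d):(\d\d):(\d\d)\.(\d{3}) GMT: .*Neighbor Down: (.+)")

def parse(log):
    """Return a list of (milliseconds since midnight, reason) per log entry."""
    events = []
    for line in log.splitlines():
        m = PATTERN.search(line)
        if m:
            h, mi, s, ms = (int(g) for g in m.groups()[:4])
            events.append((((h * 60 + mi) * 60 + s) * 1000 + ms, m.group(5)))
    return events

def gaps(events):
    """Return (reason of the later entry, milliseconds since the previous entry)."""
    return [(r2, t2 - t1) for (t1, _), (t2, r2) in zip(events, events[1:])]

for reason, delta_ms in gaps(parse(LOG)):
    print(f"{delta_ms / 1000:8.3f} s before '{reason}'")
```

On these entries, each "Ignore timer expired" message follows the failure by about 60 seconds, and the next "Too many retransmissions" comes roughly 126 to 133 seconds after that, so the adjacency attempt was cycling about every three minutes.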
11-25-2024 06:37 PM
As Paul also mentioned, he expected adjacency to get stuck in exchange/exstart, which appears to be confirmed by your posted log entries.
What commands were you using that showed adjacency fully established while those log entries were generated?
BTW, possibly the point I was trying to make in my prior reply was unclear. Let's assume your reported behavior is exactly as you describe, and you confirm the behavior was "incorrect", but as the triggering cause was also incorrect, what is the goal? Again, the issue was adjacency appeared to be fully established when it wasn't? If so, it sounds like a potential TAC bug case.
Were you unable to recreate the issue in GNS3?
11-26-2024 01:25 AM - edited 11-26-2024 01:56 AM
In this lab all traffic is sent as multicast, as you also saw in your lab,
and hence even with the ACL applied outbound, OSPF does not go down.
But in your real network the peers use unicast for the ACKs, and these are dropped by the outbound ACL.
This makes OSPF retransmit the update; it retransmits every 5 seconds, and the retry count can be 1-255.
After these retries, OSPF knows there is a problem and brings the neighbor down.
And I think it was not 35 hr but 0.35 hr.
MHM
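A quick check of the numbers above, as a Python sketch. It assumes the retransmit window is simply the interval times the retry count, using the 5-second interval and the 1-255 retry range quoted in this post; the function name is illustrative.

```python
def retransmit_window(interval_s, retries):
    """Seconds spent retransmitting before the neighbor is declared down."""
    return interval_s * retries

# Lower and upper bound of the retry range quoted above, at a 5-second interval.
for retries in (1, 255):
    window = retransmit_window(5, retries)
    print(f"{retries:3d} retries: {window:4d} s = {window / 3600:.2f} hr")
```

At the upper bound of 255 retries the window is 1275 seconds, about 0.35 hours, which may be where the 0.35 hr figure comes from; that is an inference rather than something stated in the thread.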
11-26-2024 01:38 AM
Hello @MHM Cisco World
This doesn't show the true issue of the OP.
I don't have access to test this myself, but I envisage you need to:
1. Allow the OSPF adjacency to form so that the LSDB is created and the RIB is populated with some routes.
2. At that point, check the LSDB (LSA age etc.), THEN apply the ACL as per the OP and check the LSDB and the OSPF adjacency again. At that point it should NOT fail or drop.
3. Lastly, clear the OSPF process or drop an interface. This should tear down the OSPF adjacency, and at this point the OSPF adjacency should then fail to re-form.
11-26-2024 01:40 AM
Apologies,
your ack about the retransmit need to refresh
It brings OSPF down after a specific number of retries.
Troubleshoot the "OSPF Neighbor Down: Too Many retransmissions" Error Message
MHM
11-26-2024 01:53 AM - edited 11-26-2024 01:54 AM
Hello @MHM Cisco World
@MHM Cisco World wrote:
your ack about the retransmit need to refresh
I'm not sure what you mean by the above comment. I think you are referring to the OP showing:
Nov 21 06:14:04.115 GMT: %OSPF-5-ADJCHG: Process 100, Nbr X.X.X.X on GigabitEthernet0/0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
And then my comment about the applied ACL not being so restrictive, so still allowing the LSAs to be refreshed up to the point the OSPF adjacency was torn down. If so, my statement refers to an already stable OSPF adjacency that then has the ACL applied, but BEFORE any OSPF process is cleared; at that point I envisaged you would see the above error being logged.