07-26-2023 12:15 AM - last edited on 10-11-2023 02:24 AM by Translator
Hello Everyone,
Our customer has periodic outages caused by OSPF flapping. The flap occurs 2-3 times per day at random times. The topology is the following:
Checkpoint Firewall A -------------- Vlan7 -------------- Checkpoint Firewall B
          |                   VRRP VIP: X.X.X.155                   |
          |                                                         |
          | Vlan7                                             Vlan7 |
          |                                                         |
          | X.X.X.156                                     X.X.X.157 |
Cisco Catalyst 4500 L3 Switch A ---- Vlan7 ---- Cisco Catalyst 4500 L3 Switch B
OSPF flaps on Vlan7 only, and only between the switches and the FWs. The adjacency between the two switches is stable. This is the log message:
*Jul 25 04:40:18.391 CET: %OSPF-5-ADJCHG: Process 65138, Nbr X.X.X.155 on Vlan7 from FULL to DOWN, Neighbor Down: Too many retransmissions
*Jul 25 04:41:18.391 CET: %OSPF-5-ADJCHG: Process 65138, Nbr X.X.X.155 on Vlan7 from DOWN to DOWN, Neighbor Down: Ignore timer expired
*Jul 25 04:41:32.016 CET: %OSPF-5-ADJCHG: Process 65138, Nbr X.X.X.155 on Vlan7 from LOADING to FULL, Loading Done
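To put a number on each outage, the ADJCHG lines can be parsed and the time between FULL to DOWN and the return to FULL computed. The following is only a minimal sketch: the function names `parse_adjchg` and `outage_durations` are made up for illustration, the regular expression is written against the three lines quoted above, and the year is an assumption because the syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# The three syslog lines quoted above, copied verbatim.
LOG = """\
*Jul 25 04:40:18.391 CET: %OSPF-5-ADJCHG: Process 65138, Nbr X.X.X.155 on Vlan7 from FULL to DOWN, Neighbor Down: Too many retransmissions
*Jul 25 04:41:18.391 CET: %OSPF-5-ADJCHG: Process 65138, Nbr X.X.X.155 on Vlan7 from DOWN to DOWN, Neighbor Down: Ignore timer expired
*Jul 25 04:41:32.016 CET: %OSPF-5-ADJCHG: Process 65138, Nbr X.X.X.155 on Vlan7 from LOADING to FULL, Loading Done
"""

# Pattern written for the line layout shown above; other timestamp formats need adjusting.
ADJCHG = re.compile(
    r"^\*?(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}\.\d{3}) \S+: %OSPF-5-ADJCHG: "
    r"Process \d+, Nbr (?P<nbr>\S+) on (?P<intf>\S+) "
    r"from (?P<old>\w+) to (?P<new>\w+), (?P<reason>.+)$"
)


def parse_adjchg(text, year=2023):
    """Return (timestamp, neighbor, old_state, new_state, reason) tuples.

    The year is an assumption: the log lines themselves do not include it.
    """
    events = []
    for line in text.splitlines():
        m = ADJCHG.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
        events.append((ts, m["nbr"], m["old"], m["new"], m["reason"]))
    return events


def outage_durations(events):
    """Seconds between each FULL -> DOWN and the next transition back to FULL."""
    down_since = {}
    outages = []
    for ts, nbr, old, new, _reason in events:
        if old == "FULL" and new == "DOWN":
            down_since[nbr] = ts
        elif new == "FULL" and nbr in down_since:
            outages.append((nbr, (ts - down_since.pop(nbr)).total_seconds()))
    return outages


for nbr, seconds in outage_durations(parse_adjchg(LOG)):
    print(f"{nbr}: adjacency down for {seconds:.3f} s")
```

Run over a full day of logs, the same functions would list every flap with its duration, which makes it easier to line the flaps up with capture timestamps.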
The FWs use their VRRP VIP (X.X.X.155) for OSPF.
Switch A is the OSPF DR, Switch B is the OSPF BDR.
I only have access to the switches.
The problem has been present for at least 2 months.
I did several packet captures and OSPF debugs. I noticed the following:
- Either the A or the B switch sends out a multicast LSU containing X, Y, Z LSA.
- The FW and the neighbor switch each send an LSAck which acknowledges the X, Y, Z LSAs.
- One of the switches (regardless of which one sent the original multicast LSU) starts to send unicast LSUs containing the X, Y, Z LSAs to the FW.
- The FW does not ACK these unwanted / unnecessary unicast LSUs. My guess is that it sees them as malicious traffic.
- After the switch has sent 25 unicast LSUs, and missed 25 LSAcks, it deletes the OSPF neighbor towards the FW.
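The sequence above can be checked mechanically once a capture has been exported into simple records. The sketch below is only an illustration under stated assumptions: the record layout (`src`, `dst`, `type`, `lsas`), the host labels and the function name `find_retransmit_storms` are all hypothetical, and the limit of 25 is the count observed in the captures, not a value taken from vendor documentation. It does not read pcap files itself.

```python
from collections import defaultdict

RETRANSMIT_LIMIT = 25  # count observed in the captures above (assumption, not a documented constant)
OSPF_MULTICAST = {"224.0.0.5", "224.0.0.6"}  # AllSPFRouters / AllDRouters


def find_retransmit_storms(packets, limit=RETRANSMIT_LIMIT):
    """Return (sender, neighbor, lsa) keys whose unicast LSUs went unacked `limit` times.

    `packets` is a time-ordered list of dicts in the hypothetical layout
    {"src": ..., "dst": ..., "type": "LSU" | "LSAck", "lsas": (...)}.
    """
    pending = defaultdict(int)  # (sender, neighbor, lsa) -> unacked unicast LSUs so far
    storms = []
    for pkt in packets:
        if pkt["type"] == "LSU" and pkt["dst"] not in OSPF_MULTICAST:
            for lsa in pkt["lsas"]:
                key = (pkt["src"], pkt["dst"], lsa)
                pending[key] += 1
                if pending[key] == limit:
                    storms.append(key)
        elif pkt["type"] == "LSAck":
            # An ack clears what is pending towards the acking neighbor, provided
            # the ack was multicast or addressed to the retransmitting sender.
            for key in list(pending):
                sender, neighbor, lsa = key
                acked = neighbor == pkt["src"] and lsa in pkt["lsas"]
                addressed = pkt["dst"] in OSPF_MULTICAST or pkt["dst"] == sender
                if acked and addressed:
                    del pending[key]
    return storms


# Replay of the observed pattern with placeholder host labels.
capture = [
    {"src": "switch-b", "dst": "224.0.0.5", "type": "LSU", "lsas": ("X", "Y", "Z")},
    {"src": "fw", "dst": "224.0.0.6", "type": "LSAck", "lsas": ("X", "Y", "Z")},
]
capture += [
    {"src": "switch-a", "dst": "fw", "type": "LSU", "lsas": ("X", "Y", "Z")}
    for _ in range(RETRANSMIT_LIMIT)
]
print(find_retransmit_storms(capture))
```

In this replay the FW's LSAck arrives before the unicast LSUs start, so nothing is pending when it is seen and the later retransmissions run up to the limit, which matches what the captures show.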
config:
Switch-A#sh run int vl7
Building configuration...
Current configuration : 469 bytes
!
interface Vlan7
description ** VLAN 7 **
ip vrf forwarding INTERNAL
ip address X.X.X.156 255.255.255.224
no ip redirects
no ip proxy-arp
standby 37 ip X.X.X.158
standby 37 priority 105
standby 37 preempt
standby 37 authentication md5 (...)
ip ospf authentication message-digest
ip ospf message-digest-key 5 md5 7 (...)
ip ospf priority 255
ip ospf lls disable
ip ospf bfd disable
load-interval 30
end
Switch-A#
Switch-A#sh run | s router ospf
router ospf 65138 vrf INTERNAL
router-id X.X.X.35
auto-cost reference-bandwidth 10000
redistribute connected subnets route-map CONNECTED-TO-OSPF-INTERNAL
redistribute static subnets route-map STATIC-TO-OSPF-INTERNAL
redistribute bgp (...) subnets route-map BGP-TO-OSPF-INTERNAL
passive-interface default
no passive-interface Vlan7
network X.X.X.128 0.0.0.31 area 0
distribute-list route-map (...) in
(...)
Switch-A#
Switch-B#sh run int vl7
Building configuration...
Current configuration : 444 bytes
!
interface Vlan7
description ** VLAN 7 **
ip vrf forwarding INTERNAL
ip address X.X.X.157 255.255.255.224
no ip redirects
no ip proxy-arp
standby 37 ip X.X.X.158
standby 37 preempt
standby 37 authentication md5 (...)
ip ospf authentication message-digest
ip ospf message-digest-key 5 md5 7 (...)
ip ospf priority 254
ip ospf lls disable
ip ospf bfd disable
load-interval 30
end
Switch-B#sh run | s router ospf
router ospf 65138 vrf INTERNAL
router-id X.X.X.36
auto-cost reference-bandwidth 10000
redistribute connected subnets route-map CONNECTED-TO-OSPF-INTERNAL
redistribute static subnets route-map STATIC-TO-OSPF-INTERNAL
redistribute bgp (...) subnets route-map BGP-TO-OSPF-INTERNAL
passive-interface default
no passive-interface Vlan7
network X.X.X.128 0.0.0.31 area 0
distribute-list route-map (...) in
(...)
Switch-B#
What I already ruled out:
- there is no mac-move happening on the switches
- STP topology changes are happening much less frequently (1x per 2-3 weeks)
- there are no OSPF checksum errors in the show ip traffic output
- we disabled LLS
- we disabled BFD (on the FWs, too)
- CPU utilization is normal
- no errors on the interfaces
- there is no MTU mismatch between the switches and the FWs
- The switches and FWs were already rebooted, and we did OS upgrade on every device
So it seems to me that when the issue happens, one of the switches is unable to process the LSAck for some reason.
Do you have any ideas or suggestions? If needed, I can share the pcap and OSPF debug outputs, too.
Thanks in advance.
08-30-2023 04:23 PM
@David Samuel Penaloza Seijas ,
Bráško... No words needed (nor coming readily to me) here. But you know what I'd want to say.
Thank you!
Best regards,
Peter