cancel
Showing results for 
Search instead for 
Did you mean: 
cancel
5352
Views
12
Helpful
30
Replies

OSPF Flap

Gergely Racz
Level 1
Level 1

Hello Everyone,

Our customer has periodical outages because of OSPF flapping. The flap occures 2-3 times per day randomly. The topology is the following:



Checkpoint Firewall A - - - - - - - - - - - - - - - Vlan7 - - -- - - - - - - -- - Checkpoint Firewall B
|                                                               VRRP VIP:  X.X.X.155                                          |
|                                                                                                                                          |
|                                                                                                                                          |
| Vlan7                                                                                                                                 |Vlan7
|                                                                                                                                          |
|                                                                                                                                          | 
|              X.X.X.156                                                                           X.X.X.157                  |
Cisco Catalyst 4500 L3 Switch A - - - - -- - - Vlan7 - - - - - - --- - - Cisco Catalyst 4500 L3 Switch B

The OSPF flaps via Vlan7 only, and only between the Switches and the FWs. It is stable between the two switches. This is the log message:


*Jul 25 04:40:18.391 CET: %OSPF-5-ADJCHG: Process 65138, Nbr X.X.X.155 on Vlan7 from FULL to DOWN, Neighbor Down: Too many retransmissions
*Jul 25 04:41:18.391 CET: %OSPF-5-ADJCHG: Process 65138, Nbr X.X.X.155 on Vlan7 from DOWN to DOWN, Neighbor Down: Ignore timer expired
*Jul 25 04:41:32.016 CET: %OSPF-5-ADJCHG: Process 65138, Nbr X.X.X.155 on Vlan7 from LOADING to FULL, Loading Done


The FWs are using their VRRP VIP for OSPF. (X.X.X.155)
The Switch A is OSPF DR, Switch B is OSPF BDR.
I only have access to the switches.
The problem is there at least for 2 months
I did several packet captures and OSPF debugs. I noticed the following:

- Either the A or the B switch sends out a multicast LSU containing X, Y, Z LSA.
- The FW and the neighbor switch send an LSAck which contains X, Y, Z subnet
- One of the switches (regardless of which one has sent the original multicast LSU) starts to send unicast LSU containing X, Y, Z LSA to the FW.
- The FW is not ACKing these unwanted / unnecessary unicast LSUs, I guess they see it as malicious traffic
- After the switch has sent 25 unicast LSUs, and missed 25 LSAcks, it deletes the OSPF neighbor towards the FW.


config:

Switch-A#sh run int vl7
Building configuration...

Current configuration : 469 bytes
!
interface Vlan7
description ** VLAN 7 **
ip vrf forwarding INTERNAL
ip address X.X.X.156 255.255.255.224
no ip redirects
no ip proxy-arp
standby 37 ip X.X.X.158
standby 37 priority 105
standby 37 preempt
standby 37 authentication md5 (...)
ip ospf authentication message-digest
ip ospf message-digest-key 5 md5 7 (...)
ip ospf priority 255
ip ospf lls disable
ip ospf bfd disable
load-interval 30
end

Switch-A#
Switch-A#sh run | s router ospf
router ospf 65138 vrf INTERNAL
router-id X.X.X.35
auto-cost reference-bandwidth 10000
redistribute connected subnets route-map CONNECTED-TO-OSPF-INTERNAL
redistribute static subnets route-map STATIC-TO-OSPF-INTERNAL
redistribute bgp (...) subnets route-map BGP-TO-OSPF-INTERNAL
passive-interface default
no passive-interface Vlan7
network X.X.X.128 0.0.0.31 area 0
distribute-list route-map (...) in
(...)
Switch-A#










Switch-B#sh run int vl7
Building configuration...

Current configuration : 444 bytes
!
interface Vlan7
description ** VLAN 7 **
ip vrf forwarding INTERNAL
ip address X.X.X.157 255.255.255.224
no ip redirects
no ip proxy-arp
standby 37 ip X.X.X.158
standby 37 preempt
standby 37 authentication md5 (...)
ip ospf authentication message-digest
ip ospf message-digest-key 5 md5 7 (...)
ip ospf priority 254
ip ospf lls disable
ip ospf bfd disable
load-interval 30
end

Switch-B#sh run | s router ospf
router ospf 65138 vrf INTERNAL
router-id X.X.X.36
auto-cost reference-bandwidth 10000
redistribute connected subnets route-map CONNECTED-TO-OSPF-INTERNAL
redistribute static subnets route-map STATIC-TO-OSPF-INTERNAL
redistribute bgp (...) subnets route-map BGP-TO-OSPF-INTERNAL
passive-interface default
no passive-interface Vlan7
network X.X.X.128 0.0.0.31 area 0
distribute-list route-map (...) in
(...)
Switch-B#

 

What I already ruled out:
- there is no mac-move happening on the switches
- STP topology changes are happening much less frequently (1x per 2-3 weeks)
- there is no OSPF checksum error in the

show IP

traffic output
- we disabled LLS
- we disabled BFD (on the FWs, too)
- CPU utilization is normal
- no errors on the interfaces
- there is no MTU mismatch between the switches - FWs
- The switches and FWs were already rebooted, and we did OS upgrade on every device


So it seems like to me that when the issue happens, one the switches is unable to process the LSAck for some reason.

Do you have any idea / suggestion? If you need I can share the pcap / OSPF debug outputs, too.


Thanks in advance.

1 Accepted Solution

Accepted Solutions

Peter Paluch
Cisco Employee
Cisco Employee

Hi everyone,

With Gergely, we've been communicating directly and I got also looped into the TAC case that was raised for this issue, leading the investigation. Following Gergely's permission, I'd like to share the explanation we arrived to.

The root cause of the problem was the fact that the input traffic rules on the Checkpoint firewall were misconfigured - they did not allow unicast OSPF LSU packets (retransmissions) to be accepted. Hence, when either of the Catalyst switches started sending unicast LSU retransmissions to the firewall, it would silently drop them, causing the switch to give up after 25 retransmissions and drop the adjacency.

However, the reason why the Catalyst switches engaged into the unicast retransmissions was much more involved. Analyzing the packet captures performed on the Catalyst switch supervisors (using the "monitor capture ... control-plane both"), we could see that the firewall successfully acknowledged all LSAs flooded from the switches as multicasts, and the corresponding LSAck packets from the firewall were all received by both switches properly. There was no apparent reason for the switches to start the unicast retransmissions because all LSAs flooded from the switches were acknowledged by the firewall and no LSAck packets were lost. The majority of our investigation focused on explaining these apparently unjustified retransmissions initiated at random moments by either of the Catalyst switches.

Eventually, the behavior of the switches was identified as expected. In 2011, tracked in CSCtu25818 for OSPFv2 and CSCtx15575 for OSPFv3, Cisco implemented a scalability enhancement to the IOS OSPF implementation to prevent retransmissions caused by late processing of LSAck packets. The gist of this enhancement is to have the IOS OSPF router process put the incoming unprocessed non-hello OSPF packets into two software queues: LSAcks go into one queue (called high priority queue) while DBDs, LSRs and LSUs go into the other queue (called low priority queue). Whenever the IOS OSPF process dequeues packets, it always first dequeues packets from the high priority queue that exclusively holds LSAcks. Only if the high priority queue is empty, the IOS OSPF process dequeues the low priority queue. Essentially, for the non-hello packets, the IOS OSPF router process always processes incoming LSAcks first, and only processes other incoming packets if there are no more LSAcks to process. This is not just a proprietary quirk: RFC 4222 specifically endorses prioritizing the processing of incoming Hello and LSAck packets over the rest of OSPF packets.

This behavior can lead to an interesting scenario which was indeed occurring in Gergely's customer's network:

- The switch A (DR or BDR) bursts out a large amount of LSUs to 224.0.0.5

- Both the switch B (BDR or DR) and the firewall receive these LSUs.

- The firewall promptly sends back a number of LSAcks to 224.0.0.6, acknowledging the LSAs received in the LSUs.

- Switch B may have processed a few of the received LSUs in the meantime, but after receiving the burst of LSAcks from the firewall, it will process these first as they are placed into the high priority queue. Only after there are no more LSAcks to process, it will continue processing the remaining LSUs from the low priority queue.

- As a result, some of the LSAs received and acknowledged by the firewall are not yet installed in the LSDB of switch B and put on the retransmit queues for OSPF neighbors when the LSAcks from the firewall are processed; these LSAs are still waiting in the low priority queue on switch B.

- Consequently, when switch B processes LSAcks sent by the firewall, LSAs that already are in the LSDB on switch B will be removed from the retransmit list for the firewall if they are on the list; for the rest of the acknowledged LSAs that are not in the switch B's LSDB, switch B does nothing.

- After switch B processes all received LSAcks, it returns to the low priority queue and processes the remaining received LSUs. The LSAs in these LSUs will be installed into the LSDB and put on the retransmit list for the firewall as usual (but not flooded to the firewall again because switch A as the DR/BDR did the flooding already). As a result of putting these LSAs on the retransmit list, switch B will wait for acknowledgements from the firewall – but the firewall already sent them and so has no reason to send them again; it does not even know that switch B waits for the acknowledgements.

- Consequently, switch B will eventually start the unicast retransmissions toward the firewall – and since the firewall drops the unicast OSPF retransmissions, switch B will eventually give up and drop the adjacency.

This completely explains the reason why either of the Catalyst switches - based on event timing - started the unicast retransmissions despite the firewall acknowledging all received LSAs. Correcting the input rules on the firewall to permit incoming unicast OSPF LSU packets resolved the issue, and we know that the retransmissions from the Catalyst switches are justified and not a result of a misbehavior.

As a side note, as we had to deal with large amounts of debug outputs, the following configuration was very useful on the Catalyst switches as it

1) prepends every logging message with a monotonously increasing sequence number allowing to identify any lost or rate-limited logging message

2) enables persistent logging of system messages into files in the bootflash:

3) increases the size of the internal memory buffer to hold logging messages before they are sent to their individual logging destinations to prevent losing logging messages due to overfilling this internal buffer

4) increases the rate limit for the logging messages to prevent losing logging messages due to rate limiting

configure terminal
service sequence-numbers
logging persistent filesize 1048576 immediate url bootflash:/logs
logging queue-limit 100000
logging rate-limit 7000
end

In fact, at the beginning of our investigation, we were losing debugging messages in the logs and that has misled me into believing that the Catalyst switches were incompletely processing received LSAcks. In reality, the received LSAcks were always processed completely but the debug messages were rate-limited which made the impression that the processing of certain LSAcks was only partial. That was a red herring.

Hopefully this will be of interest and help to anyone who might be facing similar issues or spurious apparently inexplicable retransmissions in IOS/IOS XE OSPF.

Best regards,
Peter

 

View solution in original post

30 Replies 30

Hi @Gergely Racz 

 Did you run

debug ip ospf adj  and debug ip ospf packets

?

"- there is no MTU mismatch between the switches - FWs"

Can you confirm which is the MTU size?

"- The FW is not ACKing these unwanted / unnecessary unicast LSUs, I guess they see it as malicious traffic"

I know you dont have access to the firewall but consider the possibility of look at the firewall because there is the possibility of the problem be the firewall as the switch is complaning of too many retransmission.

 

Hi @Flavio Miranda 


Yes, I was running those debugs. I did not see any reason why did the switch start to send unicast LSUs. There was no error messages like MTU mismatch, area issue, duplicate router ID, auth mismatch, etc. But let me copy what I saw starting with the original multicast LSU:

 

*Jul 18 18:52:38.273 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1000 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.274 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.274 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.274 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.274 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.274 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.274 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.274 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.274 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.274 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.274 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.274 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.275 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:38.275 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:744 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.067 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.101 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.134 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.166 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.197 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.227 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.262 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.273 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:1 len:52 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.292 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1000 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.519 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.520 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.520 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.520 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.520 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.520 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.520 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.520 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.521 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.521 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.522 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.522 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.719 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.719 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.719 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.720 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.720 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.720 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.720 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.720 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.721 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.721 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.721 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.722 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:39.919 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:39.919 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:984 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.089 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.123 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.159 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.187 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.222 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.252 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.285 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.316 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1000 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.583 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:40.583 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:1 len:52 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.609 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.609 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.609 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.609 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.609 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.609 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.609 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.609 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:1464 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:40.609 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.6: ver:2 type:5 len:504 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:41.115 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:41.149 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:41.188 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:41.221 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:41.256 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:41.291 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:41.325 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:41.356 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1000 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.139 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.176 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.207 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.241 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.277 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.308 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:1432 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.342 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:4 len:352 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.590 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.590 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.590 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.590 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.590 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.591 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.591 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.591 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.592 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.592 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.592 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.593 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.788 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.788 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.788 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.788 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.788 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.788 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:42.788 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.788 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.789 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.789 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.789 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1444 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:42.789 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:5 len:1264 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:52:43.252 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:52:43.252 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->X.X.X.155: ver:2 type:4 len:1000 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
(...)
25 such unicast LSUs have been sent in the next 2 minutes
(...)
*Jul 18 18:54:32.062 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->X.X.X.155: ver:2 type:4 len:1000 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:54:34.288 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.157->224.0.0.5: ver:2 type:1 len:52 rid:X.X.X.36 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:54:34.907 CET: OSPF-65138 PAK : Vl7: IN: X.X.X.155->224.0.0.5: ver:2 type:1 len:52 rid:X.X.X.155 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:54:35.319 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:54:35.319 CET: OSPF-65138 PAK : Vl7: OUT: X.X.X.156->224.0.0.5: ver:2 type:1 len:52 rid:X.X.X.35 area:0.0.0.0 chksum:0 auth:2 keyid:5 seq:0x64B6
*Jul 18 18:54:36.813 CET: OSPF-65138 ADJ Vl7: Send with youngest Key 5
*Jul 18 18:54:36.813 CET: OSPF-65138 ADJ Vl7: Killing nbr X.X.X.155 due to excessive (25) retransmissions
*Jul 18 18:54:36.813 CET: OSPF-65138 ADJ Vl7: X.X.X.155 address X.X.X.155 is dead, state DOWN
*Jul 18 18:54:36.813 CET: %OSPF-5-ADJCHG: Process 65138, Nbr X.X.X.155 on Vlan7 from FULL to DOWN, Neighbor Down: Too many retransmissions
*Jul 18 18:54:36.813 CET: OSPF-65138 ADJ Vl7: Neighbor change event

 

Of course here we cannot check the content of the LSUs. But in the packet captures I saw that both the neighbor switch and the FW ACKed the original multicas LSU. So the ACK packet contained all the LSAs of the original multicast LSU.

This debug output rather proves that there was nothing particular going on in the OSPF process. (no errors, no new neighbors, no DR / BDR election, etc)

However it is important to mention that there was a big load of new subnets advertised into OSPF, this is why there are so many type 4 LSUs in the debug output. Whenever this OSPF flap happened, I saw a big number of LSU in the OSPF debug / pcap. But still, that should not be a problem, since the FW ACKed those. And I think that is reasonable that there are many LSUs visible from time to time, as we are redistributing BGP into the OSPF process.

 

The MTU is 1500 B on every device.

 

"I know you dont have access to the firewall but consider the possibility of look at the firewall because there is the possibility of the problem be the firewall as the switch is complaning of too many retransmission."

Checkpoint TAC was already opened, they say that there is no problem with the FW of course. They shared with me OSPF debugs when the OSPF flap occurred, we saw that the original multicast LSU arrived, and the FW sent out the LSAck. This is what we see in the switch pcap / debug, too. Then the unicast LSUs sent by the switch were missing. I think the FW discarded them before those packets could have been arrived into the OSPF process. I guess there is something like uRPF configured on the FW.

It is also important to mention that after the OSPF flap, when the OSPF neighborship is building up again, there are many OSPF DB Description packets getting exchanged by the swithces and FW, which are also unicast. Those packets are accepted, and processed by the FW. This is why i think those unwanted type 4 LSUs are seen as malicious packets by the FW.

 

But the main question here: why is the switch sending unicast LSUs to the FW, when the FW has ACKed those LSAs already?

 

 

 

 

Gergely, Flavio,

Please allow me to join.

It will be interesting to see those debugs as Flavio suggested.

Gergely, do you think you could share some PCAPs too? Here's what bothers me about this: If one of the switches starts sending unicast LSUs to the firewall, it is an indication of a retransmission so this would mean that indeed, the switch did not hear back a proper LSAck to the initial multicast LSU. While the firewall should accept unicast LSUs without any issues, let's keep this fact aside for a while, and let's try to find out why the LSAck from the firewall for the initial multicast LSU was not accepted.

Oh, one more thing: Have you checked for any CoPP drops? Not just OSPF but also others, particularly ARP.

Best regards,
Peter

 

show ip ospf neighbor 

share this for both SW 

Gergely Racz
Level 1
Level 1

Hello Everyone,

 

Let me try to answer your questions one by one:

CoPP:
There is no CoPP config on the switches. Should I try to configure something similar like this?
https://study-ccnp.com/cisco-copp-control-plane-policing-configuration/ 

Pcap:
(i hope you will be able to open them, right click, open image in new tab, then zoom)

Original multicast LSU containing 39 LSAs:

 

multicast_LSU.png

 

Two LSAcks from the FW containing all the 39 LSAs:

 

LSAck.png

 

And finally the unicast LSU containing the original 39 LSAs:

multicast_LSU.png

 

All of these pcaps were taken on the

Switch-B. Switch-A

pcap is correlating with it.

This behaviour is also strange, because the original multicast LSU was sent by the

Switch-A (X.X.X.156)

But the unicast was sent by the

Switch-B (X.X.X.157)

It is even stranger because the

Switch-A

is the DR.

Once again, it is random which switch starts to send the unicast LSUs. 
The frequency of the OSPF flap is also random. Sometimes it is 20 hours, sometimes it is 20 minutes.

 



OSPF neighbors:
(sorry the output gets compressed here)

Switch-A#sh ip ospf nei

Neighbor ID Pri State Dead Time Address Interface
(omitted)
(omitted)
X.X.X.36 254 FULL/BDR 00:00:37 X.X.X.157 Vlan7                  --> neighbor switch
X.X.X.155 0 FULL/DROTHER 00:00:30 X.X.X.155 Vlan7           --> FW
Switch-A#
Switch-A#sh ip ospf int brie
Interface PID Area IP Address/Mask Cost State Nbrs F/C
(omitted)
Vl7 65138 0 X.X.X.156/27 10 DR 2/2
Switch-A#




Switch-B#sh ip ospf nei

Neighbor ID Pri State Dead Time Address Interface
(omitted)
(omitted)
X.X.X.35 255 FULL/DR 00:00:37 X.X.X.156 Vlan7                    --> neighbor switch
X.X.X.155 0 FULL/DROTHER 00:00:30 X.X.X.155 Vlan7           --> FW
Switch-B#
Switch-B#sh ip ospf int brie
Interface PID Area IP Address/Mask Cost State Nbrs F/C
(omitted)
Vl7 65138 0 X.X.X.157/27 10 BDR 2/2
Switch-B#



Thanks for your help, I really appreciate it!

 

 

@Gergely Racz  maybe I am going way too far but did you consider to use another  routing protocol just to make sure you get different result?

No we did not think about it, but im 99,9% sure our customer would not approve. Also we would have to change the config of a lot of device, not just the switches / FWs.

How two FW have same IP? If it active/standby only active will participate in ospf.

Are you sure this check point HA config correct?

in OSFP using same IP (even if it VIP) for two different L3 device can make this issue. 
solution is change the OSPF and using static route toward VIP of FW 
or 
using direct neighbor between SW and FW (this solution I dont try before) and use interface real IP not VIP of of VRRP 

The FWs are using their VRRP IP only for OSPF. Most of the time this setup is working. Even when they share DB Description unicast packets.
Also we have other locations with similar setup, where we have two FWs, and they use their VRRP VIP for OSPF, too.

Then it easy to you check other locations' 

See how fw config for vrrp and ospf.

Yes, I've checked the other locations, we have similar setup there, two 4500 Catalysts with two Checkpoint firewalls, same OSPF config, FWs are using their VRRP VIP, etc.

 

These locations are regional HUB sites of our customer. Every HUB site follows this logic:
1. traffic comes into the global VRF through static / BGP
2. traffic gets redistributed into global OSPF
3. traffic goes to the Checkpoint FW via OSPF via Vlan 16 (which is in global)
4. traffic comes back to the switch from FW but now in Vlan 7 (which is configured under VRF INTERNAL)
5. traffic gets routed further within VRF INTERNAL

This means we have two OSPF processes, one in the global, another in VRF INTERNAL. These processes contain the same LSAs.

Now I've checked the number of LSAs in two other HUB site:


Problematic HUB site:
OSPF global            --> 7600 LSAs
OSPF INTERNAL     --> 7600 LSAs


Working HUB sites:
1. site:
OSPF global            --> 6100 LSAs
OSPF INTERNAL     --> 6100 LSAs

2. site:
OSPF global            --> 4200 LSAs
OSPF INTERNAL     --> 4200 LSAs



What do you think is it possible that in the problematic HUB site the OSPF protocol is overwhelmed by the huge number of LSAs and it simply unable to process an LSAck from time to time?

Now I check cluster vrrp of checkpoint' as I mention and I am right only the master have full database of ospf' other members of group not have.

Checkpoint have op. Ospf graceful restart which allow mater to sync ospf db to other members. Check if this op. Enable in other site.

Now why there is many lsu between SW and two checkpoint? Here check if mater backup is stable or it flapping which make backup ask las update from SW's

I asked our FW team to check this already, they said that the FW cluster is stable, there was no failover since reboot:

 

FW-A> show vrrp summary

VRRP State
VRRP Router State: Up
Flags: On,MonitorFirewall
Coldstart delay (240 seconds): completed
Interface enabled: 4
Virtual routers configured: 4
In Init state 0
In Backup state 0
In Master state 4
FW-A> show uptime
16 days 1 hour 5 minutes
FW-A> show cluster failover

Cluster failover count:
Failover counter: 0
Time of counter reset: Sun Jul 2 14:01:47 2023 (reboot)


Cluster failover history (last 20 failovers since reboot/reset on Sun Jul 2 14:01:47 2023):

No. Time: Transition: CPU: Reason:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

No failover was detected since last reboot/reset

 

 

To the question "why do we have so many LSA"? 

Because we redistribute them into OSPF