
Does DR/BDR role affect OSPF convergence time?

lionell01
Level 1

I've attached an LLD of my design. The ASRs have eBGP neighborships on the WAN side. On the LAN, both ASRs and the firewall are in the same OSPF broadcast network. The L2 switch in between is a 6800 in VSS. Both ASRs learn the same customer routes from eBGP and redistribute them into OSPF, but ASR2 redistributes with a higher metric.

Our OSPF timers are hello - 3 and dead - 9

Update: we have OSPF priorities set as FW 50, ASR1 30, ASR2 20.

We had a planned failover activity in which both LAN cables on ASR1 were pulled. We expected failover to happen in 9 seconds. However, it took 13 seconds for the route on the firewall to move from ASR1 to ASR2. Did the firewall being the DR add to the convergence time? Would it have helped if ASR1 was the DR? I would really like to understand in detail how convergence works in a broadcast OSPF network.

 


25 Replies

balaji.bandi
Hall of Fame

We need to see the configuration also. You mentioned you pulled the cables from ASR1 (which one is ASR1 in your config?).

How about the external side, is the eBGP still up (since you pulled only the LAN side, right)?

Does the VSS switch act as Layer 2? You have an OSPF neighborship with the FW, right? What FW? What kind of timers do you have on the FW?

 

BB

***** Rate All Helpful Responses *****

How to Ask The Cisco Community for Help

lionell01
Level 1

Hi Balaji,

"we need to see the configuration also": please mention exactly which outputs. I will modify and share them.

"which one is ASR1 in your config?": the upper router.

"how about external side - is the eBGP still up? (since you pulled only LAN side, right?)": Yes, eBGP was still up. We're only dealing with LAN-side convergence here, though.

"VSS switch act as Layer 2?": That's not its only purpose, of course. It's Layer 2 for this setup specifically.

"you have OSPF neighbourship with FW right? what FW? what kind of timers you have on FW?": Yes, it's a PA-3250. Which timers specifically?

The timers should match in order to detect the neighbor as dead. See the Palo Alto settings:

https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClsKCAS

BB


lionell01
Level 1

Hi Balaji, Of course. The timers match or else there would be neighborship issues.

 

Joseph W. Doherty
Hall of Fame

I cannot say for sure why your convergence took so long. I doubt the DR/BDR role generally plays a huge part in convergence time, but in this case perhaps it does, in addition to some other possible issues.

First, you're running OSPF on your FW and it's the DR.  Logically, that's fine, but Cisco routers (and L3 switches) are optimized for routing, and I don't know how "good" your FW's OSPF implementation is.  (NB: When you get deep into routing implementations, Cisco and some other vendors [e.g. brand "J"] do "things" to improve the working of routing and/or routing protocols that are not done by many other vendors.  [E.g. with Cisco's OSPF, "things" like LSA pacing, recalculation back-off timers, the recent iSPF feature, etc.])

In your LLD, you show each ASR has a port channel with a link to each VSS member, which is good.  Possibly not so good for the FW pair, if it too needs to transfer data between its members.  If the FW is "sensitive" to "internal" data transfer between its primary and secondary, it might play poorly in some situations with VSS, which always uses directly connected egress paths to avoid internal data transfer between its own VSS pair members.  (I.e. I know how VSS ideally works, but I don't know your FW's ideal setup, if there is one.)  Logically, does the VSS pair "see" the links to the FW as a port-channel?

You say you're redistributing BGP customer routes into OSPF?  What kind of volume are we discussing?

Are the two ASRs iBGP peers?

Is routing configured on the ASRs such that if somehow an outbound packet is received on the bottom ASR, it is sent to the top ASR to go to the WAN, or will the bottom ASR forward the packet to the WAN itself?

Hi Joseph,

"Logically, does the VSS pair "see" the links to the FW as a port-channel?": No, each switch has a single link to the firewall, i.e. switch 1 is only connected to fw1 (the upper switch and fw in the diagram) and switch 2 to fw2.

"You say you're redistributing BGP customer routes into OSPF?  What kind of volume are we discussing?": Just 3 prefixes.

"Are the two ASRs iBGP peers?": Nope. Both ASRs and the FW are in a single broadcast OSPF network.

"Is routing configured on the ASRs such that if some how an outbound packet is received on the bottom ASR, is it sent to the top ASR to go to the WAN, or will the bottom ASR forward the packet to the WAN?": Routing is configured such that the lower ASR, i.e. ASR2, redistributes the routes with a higher metric, so traffic from the firewall to the customer will only go to ASR1 as long as ASR1 is redistributing.

 

Is the FW seen as a single logical device?  If so, are its two links seen as two host IPs on the same network?

Hmm, I'm wondering if what appears to be about 3 seconds of additional delay might be due to the bottom ASR taking over as BDR (it's configured to allow this?).

With OSPF, when you have unequal paths, the lesser path isn't kept as a ready/standby route (as EIGRP might).  With loss of the current path, a new path has to be calculated.  I cannot see that taking 3 seconds, but again, I wonder about the impact of losing the BDR.
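
To illustrate the point about the lesser path not being held in standby, here is a minimal sketch of the selection logic. It is not a real SPF implementation, and the metric values are hypothetical (the thread only says ASR2 redistributes with a higher metric). The second call stands in for the recalculation that can only happen after ASR1's adjacency has been declared down.

```python
# Hypothetical metrics: the thread only states that ASR2's metric is higher.
external_routes = {
    "ASR1": 100,
    "ASR2": 200,
}

def best_next_hop(routes, reachable):
    """Pick the reachable advertising router with the lowest metric."""
    candidates = {r: m for r, m in routes.items() if r in reachable}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)

# Both ASRs up: only the ASR1 path is selected; ASR2 is not pre-installed.
before = best_next_hop(external_routes, {"ASR1", "ASR2"})

# After ASR1 is declared dead, the selection must be run again.
after = best_next_hop(external_routes, {"ASR2"})

print(before, "->", after)
```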

It would be interesting if you could redo the test with your bottom ASR being the BDR.  (Assuming re-convergence time improved, you would still have potential pitfalls, as the bottom ASR could fail, which would migrate BDR to the top ASR, and the role wouldn't revert to the bottom ASR when it came back on-line.)

As all your traffic flow is between the ASRs and FW, perhaps consider moving to p2p rather than using DR/BDR.

Hello Joseph,

As per your initial reply, it seems that being a DR or BDR does not affect convergence time. Tested this out in GNS3 with my same timers of 3 and 9. It does take 14-15 seconds for full convergence irrespective of whether firewall is DR and ASR1 BDR or vice versa.

The 2 firewalls are in Active/Passive state. Only the MAC of the Active firewall is learnt on the switch.

Yes, bottom ASR is allowed to take over as BDR. It has the lowest ospf priority (20 not 0).

I also tried -

1. Setting ASR1 to DROTHER and firewall as BDR and ASR2 as DR

2. Setting ASR1 to DROTHER and firewall as DR and ASR2 as BDR

The convergence time is always about 14-15 seconds.

Attached the debug that I captured. It seems that receiving an ACK from ASR2 and then running the SPF calculations are what take up time. 

That debug information is very interesting.

Indeed, there is a "big" delay waiting for the ACK from ASR2 (about 2.502 seconds).  However, there is not really much delay in running the SPF calculations (which appear to have been done in about a millisecond), but there's another "big" multi-second delay (2.503 seconds) before the recalculation starts.

I don't know about the FW's OSPF implementation, but Cisco's OSPF implementations include LSA delay timers and SPF throttling timers to avoid causing OSPF meltdowns due to too frequent OSPF calculations.  (NB: Cisco documents these OSPF [and EIGRP] timers, and how to adjust them when you want to decrease convergence time.)  Perhaps the FW has something similar in its OSPF implementation?

BTW, it's sort of interesting that both delays were just about 2.5 seconds.  The two to three "extra" milliseconds might have been actual processing delays.  The combined 5 seconds also appears to account for the bulk of your unexpected delay.
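
The arithmetic behind that last point can be sketched as below. The figures are the ones quoted in this thread; the assumptions (mine, not confirmed by the debug) are that the delays are strictly sequential and that the full dead interval elapsed before the neighbor was declared down.

```python
# Delay budget for the failover, using figures quoted in this thread.
DEAD_INTERVAL_S = 9.0      # OSPF dead interval from the original post
ACK_WAIT_S = 2.502         # delay waiting for the ACK from ASR2 (from the debug)
SPF_START_DELAY_S = 2.503  # delay before the SPF recalculation started (from the debug)

def total_convergence(dead, ack_wait, spf_delay):
    """Sum the delays, assuming they occur one after another."""
    return dead + ack_wait + spf_delay

total = total_convergence(DEAD_INTERVAL_S, ACK_WAIT_S, SPF_START_DELAY_S)
extra = total - DEAD_INTERVAL_S

print(f"extra delay after detection: {extra:.3f} s")
print(f"estimated total: {total:.3f} s")
```

Under those assumptions the estimate comes to roughly 14 seconds, which is in line with the 14-15 seconds reported from the GNS3 tests.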

Hello,

 

Along with what others are saying, I believe that's about the right time for failover. You have 9 seconds to detect the neighbor down plus a couple of seconds for convergence. When you unplugged the LAN cables from the DR, since it's not a direct link failure for the BDR, the BDR didn't know about the failure until just after 9 seconds (maybe add a second for processing and sending notification). Then the network had to reconverge with the BDR (and the loss of its GW, I am assuming), which is now the new DR. That would probably take about 3 seconds or so, depending on how big your routing table is, as Joseph mentioned.

It won't be right at 9 seconds, since that is only the time you configured before the neighbor is declared dead. You still need a couple of seconds to recover. You could try two things.

1) Move the BDR to the primary firewall so that when the DR fails the link outage is detected immediately.

2) You could configure BFD for OSPF. It's less processor-intensive and can detect a neighbor down within 1 second.
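
As a rough illustration of why BFD detects a failure so much faster than the dead timer, detection time is approximately the packet interval multiplied by the detect multiplier. The 300 ms interval and multiplier of 3 below are hypothetical values, not settings from this thread, and the sketch ignores negotiation between the two ends.

```python
def bfd_detection_time_ms(interval_ms, multiplier):
    """Approximate detection time: packet interval times the detect multiplier."""
    return interval_ms * multiplier

# Hypothetical BFD settings, not taken from the thread.
bfd_ms = bfd_detection_time_ms(interval_ms=300, multiplier=3)

# Dead interval from the original post, converted to milliseconds.
ospf_dead_ms = 9 * 1000

print("BFD detection:", bfd_ms, "ms")
print("OSPF dead interval:", ospf_dead_ms, "ms")
```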

 

Hope that helps

 

-David

Hi David,

Now that you mention it, it does make sense that 9 seconds should not be considered total convergence time but in fact, convergence will begin after 9 seconds once neighborship is declared dead.
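
A small sketch of the detection window, for what it's worth. It assumes the dead timer is restarted by each hello received, that no hellos were lost before the failure, and that hellos are sent exactly on the interval. Under those assumptions the neighbor is declared dead somewhere between (dead - hello) and dead seconds after the failure, depending on when the last hello arrived.

```python
def detection_window(hello_s, dead_s):
    """Return (shortest, longest) time from failure to dead-timer expiry.

    Assumes the neighbor's last hello arrived at some point within one
    hello interval before the failure.
    """
    return dead_s - hello_s, dead_s

shortest, longest = detection_window(hello_s=3, dead_s=9)
print(f"neighbor declared dead {shortest} to {longest} s after the failure")
```

Convergence (flooding, SPF, route install) then starts on top of whatever the detection time turned out to be.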

We had set the firewall as DR since both ASRs are only redistributing about 3 prefixes into OSPF while on the LAN side there are hundreds of prefixes.

BTW, an update: we have OSPF priorities set as FW 50, ASR1 30, ASR2 20.

1. Why would setting ASR1 as DR improve the overall time? This has been suggested to us as well, but we need to understand it better, i.e. what exactly happens when those LAN cables are pulled. And what about when the LAN is back up, since the election is non-preemptive?

For example: LAN cables are pulled --> Firewall changes state to DR --> ASR2 continues to be DROTHER

LAN cables re-inserted --> FW continues to be DR --> ASR1 comes back as BDR?
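
The sequence above can be walked through with a simplified sketch of the non-preemptive election, using the priorities given in this thread. It is a simplification of standard OSPF behavior as I understand it: router-ID tie-breaking and the wait timer are omitted, all three routers are assumed to have come up together, and the firewall's implementation may differ. The `elect` function is illustrative only.

```python
def elect(routers, dr=None, bdr=None):
    """Simplified, non-preemptive DR/BDR election.

    routers: dict of name -> priority for the routers currently up.
    dr, bdr: current role holders (ignored if they are no longer up).
    Ties on priority would be broken by router ID, which is omitted here.
    """
    eligible = {n: p for n, p in routers.items() if p > 0}
    if dr not in eligible:
        dr = None
    if bdr not in eligible:
        bdr = None
    if dr is None and bdr is not None:
        dr, bdr = bdr, None  # the BDR is promoted to DR
    if dr is None:
        dr = max(eligible, key=eligible.get, default=None)
    if bdr is None:
        rest = {n: p for n, p in eligible.items() if n != dr}
        bdr = max(rest, key=rest.get, default=None)
    return dr, bdr

priorities = {"FW": 50, "ASR1": 30, "ASR2": 20}

# All three come up together.
initial = elect(priorities)

# ASR1 LAN cables pulled: ASR1 drops out of the segment.
without_asr1 = {n: p for n, p in priorities.items() if n != "ASR1"}
after_pull = elect(without_asr1, *initial)

# Cables re-inserted: ASR1 rejoins but does not preempt.
after_return = elect(priorities, *after_pull)

print("initial:     ", initial)
print("after pull:  ", after_pull)
print("after return:", after_return)
```

In this sketch ASR2 takes over the vacant BDR role when ASR1 drops out, and ASR1 comes back as DROTHER despite its higher priority, because existing roles are not preempted.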

2. Yes, we have considered BFD, but a caveat with Palo Alto firewalls is that BFD forms only between the DR and BDR. When ASR1 is down there is only a DR (FW) and a DROTHER (ASR2), so there would be no BFD session. Perhaps OSPF would then be shut down by BFD?

I don't know, but there is one FW HA pair and two ASRs.
Why use OSPF instead of HSRP, with the FW HA pair pointing to the HSRP VIP shared between the two ASRs?

Hi, this is because we have another site as well with eBGP neighborships to customer and we need auto-failovers in case this entire site is down. So we have OSPF rather than static with HSRP.

ospf-DR-BDR-design (1).PNG
The PO connects from the ASR to the L2 SW. Does the L2 SW support vPC, VSS or stacking?

If not, then one PO is suspended, which makes this an L2 recovery delay rather than an L3 recovery delay.
