08-12-2022 09:00 AM - edited 08-16-2022 04:46 AM
I've attached an LLD of my design. The ASRs have eBGP neighborships with the WAN side. On the LAN side, both ASRs and the firewall are in the same OSPF broadcast network. The L2 switch in between is a 6800 in VSS. Both ASRs learn the same customer routes from eBGP and redistribute them into OSPF, but ASR2 redistributes with a higher metric.
Our OSPF timers are hello 3 seconds and dead 9 seconds.
Update: we have OSPF priorities set as FW 50, ASR1 30, ASR2 20.
We had a planned failover activity in which both LAN cables on ASR1 were pulled. We expected failover to happen in 9 seconds. However, it took 13 seconds for the route on the firewall to move from ASR1 to ASR2. Did the firewall being the DR result in a longer convergence time? Would it have helped if ASR1 had been the DR? I would really like to understand in detail how convergence works in a broadcast OSPF network.
08-12-2022 09:20 AM
We need to see the configuration as well. You mentioned you pulled the cables from ASR1; which one is ASR1 in your config?
How about the external side: is eBGP still up (since you pulled only the LAN side, right)?
Does the VSS switch act as Layer 2? You have an OSPF neighborship with the FW, right? Which FW? What kind of timers do you have on the FW?
08-12-2022 09:31 AM
Hi Balaji,
we need to see the configuration also - please mention which outputs exactly? will modify and share
which one is ASR1 in your config ? the upper router
how about external side - is the eBGP still up ? (since you pulled only Lan side right ?) Yes, eBGP was still up. We're only dealing with LAN-side convergence here, though.
VSS switch act as Layer 2 ? That's not its only purpose, of course. It's Layer 2 for this setup specifically.
you have OSPF neighbourship with FW right ? what FW ? what kind of timers you have on FW ? Yes, it's a PA-3250. Which timers specifically?
08-12-2022 09:35 AM
The timers should match on both sides for a neighbor to be detected as dead. See the Palo Alto settings:
https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClsKCAS
08-12-2022 10:15 AM
Hi Balaji, Of course. The timers match or else there would be neighborship issues.
08-12-2022 01:50 PM
Cannot say for sure why your convergence took so long. I doubt the DR/BDR role generally plays a huge role in convergence time, but in this case perhaps it does, in addition to some other possible issues.
First, you're running OSPF on your FW and it's the DR. Logically, that's fine, but Cisco routers (and L3 switches) are optimized for routing; I don't know how "good" your FW's OSPF implementation is. (NB: When you get deep into routing implementations, Cisco and some other vendors [e.g. brand "J"] do "things" to improve the working of routing and/or routing protocols that many other vendors do not. [E.g. with Cisco's OSPF, "things" like LSA pacing, recalculation back-off timers, the recent iSPF feature, etc.])
In your LLD, you show each ASR has a port channel with a link to each VSS member, which is good. Possibly not so good for the FW pair, if it too needs to transfer data between the two FWs. If the FW is "sensitive" to "internal" data transfer between its primary and secondary, it might play poorly in some situations with VSS, which always uses directly connected egress paths to avoid its own internal data transfer between its VSS pair members. (I.e. I know how VSS ideally works; I don't know your FW's ideal setup, if there is one.) Logically, does the VSS pair "see" the links to the FW as a port-channel?
You say you're redistributing BGP customer routes into OSPF? What kind of volume are we discussing?
Are the two ASRs iBGP peers?
Is routing configured on the ASRs such that, if an outbound packet is somehow received on the bottom ASR, it is sent to the top ASR to go to the WAN, or will the bottom ASR forward the packet to the WAN itself?
08-16-2022 04:26 AM
Hi Joseph,
Logically, does the VSS pair "see" the links to the FW as a port-channel? No, each switch has a single link to the firewall i.e. switch 1 is only connected to fw1 (upper switch and fw in the diagram) and switch 2 to fw2
You say you're redistributing BGP customer routes into OSPF? What kind of volume are we discussing? Just 3 prefixes
Are the two ASRs iBGP peers? Nope. Both ASR and FW are in a single broadcast OSPF network
Is routing configured on the ASRs such that if somehow an outbound packet is received on the bottom ASR, is it sent to the top ASR to go to the WAN, or will the bottom ASR forward the packet to the WAN? Routing is configured such that the lower ASR, i.e. ASR2, redistributes the routes with a higher metric, so traffic from the firewall to the customer will only go to ASR1 as long as ASR1 is redistributing.
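The metric-based preference described in this reply can be sketched roughly as follows. This is a simplified illustration only: the prefix and the metric values 100 and 200 are made up, the `ExternalRoute` and `best_routes` names are hypothetical, and real OSPF external route selection also considers route type and forwarding cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalRoute:
    prefix: str
    advertising_router: str
    metric: int  # redistribution metric (hypothetical values below)


def best_routes(lsdb):
    """Return {prefix: route}, keeping the lowest-metric route per prefix."""
    best = {}
    for route in lsdb:
        current = best.get(route.prefix)
        if current is None or route.metric < current.metric:
            best[route.prefix] = route
    return best


# Both ASRs advertise the same (made-up) customer prefix; ASR2 uses a higher metric.
lsdb = [
    ExternalRoute("203.0.113.0/24", "ASR1", 100),
    ExternalRoute("203.0.113.0/24", "ASR2", 200),
]

before = best_routes(lsdb)
# Once ASR1 is declared dead, its routes are no longer usable.
after = best_routes([r for r in lsdb if r.advertising_router != "ASR1"])

print(before["203.0.113.0/24"].advertising_router)  # ASR1
print(after["203.0.113.0/24"].advertising_router)   # ASR2
```

The point of the sketch is that ASR2's route is only selected after ASR1's routes drop out and the best path is recomputed, which is why detection time and recalculation delays both add to the failover time.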
08-16-2022 07:13 AM
The FW is seen as a single logical device? If so, its two links are seen as two host IPs, on the same network?
Hmm, I'm wondering if what appears to be about 3 seconds of additional delay might be due to bottom ASR taking over as BDR (it's configured to allow this?).
With OSPF, when you have unequal paths, the lesser path isn't kept as a ready/standby route (as EIGRP might). With loss of the current path, a new path has to be calculated. I cannot see that taking 3 seconds, but again, I wonder about the impact of losing the BDR.
It would be interesting if you could redo the test with your bottom ASR being the BDR. (Assuming re-convergence time was improved, you would still have potential pitfalls, as the bottom ASR could fail, which would migrate BDR to the top ASR, which wouldn't revert back to the bottom ASR when it came back on-line.)
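The non-preemptive behaviour mentioned here can be illustrated with a small model. This is a rough sketch, not the full election from the OSPF specification: it ignores router-ID tie-breaks and hello timing, the `elect` function is hypothetical, and it simply uses the priorities quoted in this thread (FW 50, ASR1 30, ASR2 20).

```python
PRIORITIES = {"FW": 50, "ASR1": 30, "ASR2": 20}


def elect(up, dr, bdr):
    """Simplified, non-preemptive DR/BDR election.

    up  -- set of router names currently alive on the segment
    dr  -- current DR name, or None
    bdr -- current BDR name, or None
    Returns the new (dr, bdr) pair.
    """
    def best(candidates):
        eligible = [r for r in candidates if PRIORITIES[r] > 0]
        return max(eligible, key=lambda r: PRIORITIES[r], default=None)

    # Existing holders keep their roles for as long as they stay up.
    dr = dr if dr in up else None
    bdr = bdr if bdr in up else None
    if dr is None and bdr is not None:
        dr, bdr = bdr, None  # the BDR is promoted to DR
    if dr is None:
        dr = best(up)
    if bdr is None:
        bdr = best(up - {dr})
    return dr, bdr


everyone = {"FW", "ASR1", "ASR2"}
roles = elect(everyone, None, None)
print(roles)  # ('FW', 'ASR1')

roles = elect({"FW", "ASR2"}, *roles)  # ASR1 LAN cables pulled
print(roles)  # ('FW', 'ASR2')

roles = elect(everyone, *roles)  # ASR1 comes back
print(roles)  # ('FW', 'ASR2'), ASR1 returns as DROTHER
```

In this model a returning router never displaces a DR or BDR that is still up, regardless of its higher priority, which is the "wouldn't revert" pitfall described above.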
As all your traffic flow is between the ASRs and the FW, perhaps consider moving to point-to-point links rather than using DR/BDR.
08-16-2022 08:03 AM - edited 08-16-2022 08:08 AM
Hello Joseph,
As per your initial reply, it seems that being a DR or BDR does not affect convergence time. Tested this out in GNS3 with my same timers of 3 and 9. It does take 14-15 seconds for full convergence irrespective of whether firewall is DR and ASR1 BDR or vice versa.
The 2 firewalls are in Active/Passive state. Only the MAC of the Active firewall is learnt on the switch.
Yes, bottom ASR is allowed to take over as BDR. It has the lowest ospf priority (20 not 0).
I also tried -
1. Setting ASR1 to DROTHER and firewall as BDR and ASR2 as DR
2. Setting ASR1 to DROTHER and firewall as DR and ASR2 as BDR
The convergence time is always about 14-15 seconds.
Attached the debug that I captured. It seems that receiving an ACK from ASR2 and then running the SPF calculations are what take up time.
08-16-2022 09:40 AM
That debug information is very interesting.
Indeed, "big" delay waiting for ACK from ASR2 (about 2.502 seconds). However, not really much delay in running SPF calculations (which appears to have been done in about a millisecond), but there's another "big" multi-second delay (2.503 seconds) in starting their recalculation.
Don't know about the FW's OSPF implementation, but Cisco's OSPF implementations include LSA delay timers and SPF throttling timers to avoid causing OSPF meltdowns due to too-frequent OSPF calculations. (NB: Cisco documents these OSPF [and EIGRP] timers, and how to adjust them, when you want to decrease convergence time.) Perhaps the FW has something similar for its OSPF implementation?
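As a rough illustration of what an SPF throttling timer does, here is a generic exponential back-off sketch. It is not Cisco's or Palo Alto's exact state machine; the `throttle_waits` function is hypothetical, and the start/hold/max values are made-up numbers rather than any platform's defaults.

```python
def throttle_waits(start_ms, hold_ms, max_ms, runs):
    """Delay (ms) before each of `runs` consecutive SPF runs under sustained churn.

    The first run waits `start_ms`; each later run waits a hold time that
    doubles after every run and is capped at `max_ms`.
    """
    if runs <= 0:
        return []
    waits = [start_ms]
    wait = hold_ms
    for _ in range(runs - 1):
        waits.append(min(wait, max_ms))
        wait *= 2
    return waits


# Hypothetical timer values, for illustration only.
print(throttle_waits(start_ms=50, hold_ms=200, max_ms=5000, runs=7))
# [50, 200, 400, 800, 1600, 3200, 5000]
```

A fixed initial delay of this kind before the first recalculation would look, in a debug, like a pause of a constant length between receiving the change and starting SPF.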
BTW, it's sort of interesting that both delays were just about 2.5 seconds. The two to three "extra" milliseconds might have been actual processing delays. The combined 5 seconds also appears to account for the bulk of your unexpected delay.
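Putting the numbers from this thread together gives a rough convergence budget. The sketch below uses the configured 3/9-second timers and the delays read from the debug (2.502 s, 2.503 s, about 1 ms of SPF). The "best case" line is an added assumption: the dead timer restarts at the last hello received, so a failure just before the next hello was due is detected roughly one hello interval sooner.

```python
# All values in milliseconds to avoid floating-point rounding.
HELLO_MS = 3000
DEAD_MS = 9000
ACK_WAIT_MS = 2502         # delay waiting for the ACK from ASR2 (from the debug)
SPF_START_DELAY_MS = 2503  # delay before the recalculation started (from the debug)
SPF_RUN_MS = 1             # SPF itself took about a millisecond

extra_ms = ACK_WAIT_MS + SPF_START_DELAY_MS + SPF_RUN_MS

# Failure right after a hello was received: the full dead interval must expire.
worst_case_ms = DEAD_MS + extra_ms
# Failure just before the next hello was due: about one hello interval less.
best_case_ms = DEAD_MS - HELLO_MS + extra_ms

print(f"extra delay after detection: {extra_ms / 1000:.3f} s")       # 5.006 s
print(f"estimated convergence: {best_case_ms / 1000:.3f} s "
      f"to {worst_case_ms / 1000:.3f} s")                            # 11.006 s to 14.006 s
```

Under these assumptions the 13 seconds seen in the failover test falls inside the estimated range, and the upper bound is close to the 14-15 seconds seen in the lab.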
08-12-2022 02:30 PM - edited 08-12-2022 02:30 PM
Hello,
Along with what others are saying, I believe that's about the right time for failover. You have 9 seconds to detect the neighbor down plus a couple of seconds for convergence. When you unplugged the LAN cables from the DR, since it's not a direct link failure for the BDR, the BDR didn't know about the failure until just after 9 seconds (maybe add a second for processing and sending notification). Then the network had to reconverge with the BDR (and the loss of its GW, I am assuming), which is now the new DR; that would probably take about 3 seconds or so, depending on how big your routing table is, as Joseph mentioned.
It won't be right at 9 seconds, since that is only the time you configured before the neighbor is declared dead. You still need a couple of seconds to recover. You could try two things:
1) Move the BDR to the primary firewall so that when the DR fails, the link outage is detected immediately.
2) You could configure BFD for OSPF; it's less processor-intensive and can detect a neighbor going down within 1 second.
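To put suggestion 2 in numbers, here is a small comparison sketch. BFD declares a session down after a number of consecutive missed packets, so detection time is roughly the packet interval times the multiplier. The 300 ms interval and multiplier of 3 are illustrative assumptions, not values from this thread or any platform default, and the `bfd_detection_ms` helper is hypothetical.

```python
def bfd_detection_ms(interval_ms, multiplier):
    """Approximate BFD detection time: interval times detect multiplier."""
    return interval_ms * multiplier


OSPF_DEAD_MS = 9000                # dead interval configured in this thread
bfd_ms = bfd_detection_ms(300, 3)  # hypothetical BFD settings

print(f"OSPF dead-interval detection: up to {OSPF_DEAD_MS} ms")
print(f"BFD detection (300 ms x 3):   about {bfd_ms} ms")
print(f"detection time saved:         about {OSPF_DEAD_MS - bfd_ms} ms")
```

This only shortens the detection phase; any LSA or SPF delay timers on the devices would still add to the total convergence time.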
Hope that helps
-David
08-16-2022 04:44 AM - edited 08-16-2022 04:47 AM
Hi David,
Now that you mention it, it does make sense that 9 seconds should not be considered the total convergence time; convergence only begins after 9 seconds, once the neighborship is declared dead.
We had set the firewall as DR since both ASRs are only redistributing about 3 prefixes into OSPF while on the LAN side there are hundreds of prefixes.
BTW, an update: we have OSPF priorities set as FW 50, ASR1 30, ASR2 20.
1. Why would setting ASR1 as DR improve the overall time? This has been suggested to us as well, but I need to understand it better: what exactly happens when those LAN cables are pulled? And what about when the LAN is back up, since the election is non-preemptive?
For example: LAN cables are pulled --> Firewall changes state to DR --> ASR2 continues to be DROTHER
LAN cables re-inserted --> FW continues to be DR --> ASR comes back as BDR?
2. Yes, we have considered BFD, but a caveat with Palo Alto firewalls is that BFD forms only between the DR and BDR. When ASR1 is down, there is only a DR (FW) and a DROTHER (ASR2), so there would be no BFD session; perhaps OSPF would then be shut down by BFD?
08-12-2022 03:20 PM
I don't know, but there is one FW HA pair and two ASRs.
Why use OSPF instead of HSRP, with the FW HA pair pointing to the HSRP VIP shared between the two ASRs?
08-16-2022 04:27 AM
Hi, this is because we have another site as well with eBGP neighborships to customer and we need auto-failovers in case this entire site is down. So we have OSPF rather than static with HSRP.
08-16-2022 05:16 AM
The port channel connects from the ASR to the L2 switch. Does the L2 switch support vPC, VSS, or stacking?
If not, then one port-channel member is suspended, and that would make this an L2 recovery delay rather than an L3 recovery delay.