01-04-2016 04:07 AM - edited 03-08-2019 03:16 AM
We recently replaced a routing C3750 switch with an ASR1001X router (running 3.16.01a/15.5(3)S1a).
This device is connected to our 2 core switches (C6500-SUP720). All are configured in the backbone area and run sub-second OSPF timers. The backbone contains roughly 40 routers and carries roughly 2300 prefixes. The core was also upgraded to 15.1(2)SY6 during the same maintenance window.
Since the replacement, I am seeing that the OSPF adjacency between this router and both core switches resets exactly every hour.
On the ASR router I am seeing just "LOADING TO FULL, Loading Done" messages. On the core switches, the events start with a "FULL to DOWN, Neighbor Down: Dead Timer expired" message, immediately followed by an up event: "Loading to FULL, Loading DONE".
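To confirm the "exactly every hour" pattern from collected syslog rather than by eye, the gaps between the down events can be computed. This is only a minimal sketch: the sample lines are hypothetical (only the timestamp style and the "FULL to DOWN" text are taken from this post), and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Matches a leading syslog timestamp such as "Jan 4 10:41:33".
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})")

def down_event_times(lines, year=2016):
    """Return the timestamps of lines that report a FULL to DOWN transition.

    Syslog lines carry no year, so one is assumed (year rollover is ignored).
    """
    times = []
    for line in lines:
        if "FULL to DOWN" not in line:
            continue
        m = STAMP.match(line)
        if m is None:
            continue  # skip lines without a parsable timestamp
        month, day, clock = m.groups()
        times.append(datetime.strptime(f"{year} {month} {day} {clock}",
                                       "%Y %b %d %H:%M:%S"))
    return times

def gaps_seconds(times):
    """Seconds between each pair of consecutive events."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

def looks_periodic(gaps, period=3600, tolerance=5):
    """True if there is at least one gap and every gap is within tolerance of period."""
    return bool(gaps) and all(abs(g - period) <= tolerance for g in gaps)

# Hypothetical sample lines, not copied from the real switches.
sample = [
    "Jan 4 09:41:33: Nbr 10.96.224.120 from FULL to DOWN, Neighbor Down: Dead Timer expired",
    "Jan 4 09:41:34: Nbr 10.96.224.120 from LOADING to FULL, Loading Done",
    "Jan 4 10:41:33: Nbr 10.96.224.120 from FULL to DOWN, Neighbor Down: Dead Timer expired",
    "Jan 4 11:41:34: Nbr 10.96.224.120 from FULL to DOWN, Neighbor Down: Dead Timer expired",
]

gaps = gaps_seconds(down_event_times(sample))
print(gaps)                  # [3600.0, 3601.0]
print(looks_periodic(gaps))  # True
```

If the gaps from the real core switch logs cluster tightly around 3600 seconds, that points toward something scheduled rather than random link or load problems.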
A debug on the ASR gives:
Jan 4 10:41:33: OSPF-1 LSGEN: Scheduling rtr LSA for area 0, build flag 0x41 (from 0x7F3FB4A98842)
Jan 4 10:41:33: OSPF-1 LSGEN: Scheduling rtr LSA for area 0, build flag 0x41 (from 0x7F3FB4B098AE)
Jan 4 10:41:33: OSPF-1 LSGEN: Scheduling rtr LSA for area 0, build flag 0x41 (from 0x7F3FB4A98842)
Jan 4 10:41:33: OSPF-1 LSGEN: Scheduling rtr LSA for area 0, build flag 0x41 (from 0x7F3FB4B098AE)
Jan 4 10:41:33: OSPF-1 SPF : Detect change in LSA type 1, LSID 10.96.0.2 from 10.96.0.2 area 0
Jan 4 10:41:33: OSPF-1 INTRA: Insert LSA to New_LSA list type 1, LSID 10.96.0.2, from 10.96.0.2 area 0
Jan 4 10:41:33: OSPF-1 MON : Schedule Incremental SPF without microloop avoidance in area 0, change in LSA R/10.96.0.2/10.96.0.2
Jan 4 10:41:33: OSPF-1 MON : reset throttling to 10ms next wait-interval 100ms
Jan 4 10:41:33: OSPF-1 MON : Schedule SPF in 10ms: spf_time 1w0d, wait_interval 10ms
Jan 4 10:41:33: OSPF-1 SPF : Detect MAXAGE in LSA type 2, LS ID 10.96.2.89, from 10.96.0.2
Jan 4 10:41:33: OSPF-1 SPF : Detect generic change in LSA type 2, LSID 10.96.2.89, from 10.96.0.2 area 0
Jan 4 10:41:33: OSPF-1 INTRA: Insert LSA to New_LSA list type 2, LSID 10.96.2.89, from 10.96.0.2 area 0
Jan 4 10:41:33: OSPF-1 MON : Schedule Incremental SPF without microloop avoidance in area 0, change in LSA N/10.96.2.89/10.96.0.2
Jan 4 10:41:33: OSPF-1 SPF : Detect change in LSA type 1, LSID 10.96.0.1 from 10.96.0.1 area 0
Jan 4 10:41:33: OSPF-1 INTRA: Insert LSA to New_LSA list type 1, LSID 10.96.0.1, from 10.96.0.1 area 0
Jan 4 10:41:33: OSPF-1 MON : Schedule Incremental SPF without microloop avoidance in area 0, change in LSA R/10.96.0.1/10.96.0.1
Jan 4 10:41:33: OSPF-1 LSGEN: Rate limit LSA generation for 1/10.96.224.120/10.96.224.120
Jan 4 10:41:33: OSPF-1 MON : Begin SPF at 605109.769ms, process time 54998ms
Jan 4 10:41:33: OSPF-1 MON : Last spf_time 1w0d, wait_interval 10ms
Jan 4 10:41:33: OSPF-1 INTRA: Running SPF for area 0, SPF-type Incremental
Jan 4 10:41:33: OSPF-1 INTRA: Initializing to run spf
Jan 4 10:41:33: OSPF-1 INTRA: Running incremental SPF for area 0
-> after which a lot of messages are generated (logging limited).
I have no idea why this happens every hour. The config is identical to before, including the timers, so it must be caused by a difference between IOS and IOS-XE. The ASR is running at very low CPU load, so I can't imagine it is a CPU problem.
regards,
Geert
PS.
OSPF config of ASR-1001X
router ospf 1
router-id 10.96.224.120
ispf
nsf
area 0 authentication message-digest
timers throttle spf 10 100 5000
timers throttle lsa 10 100 5000
timers lsa arrival 80
<cut>
It has 4 OSPF neighbors.
show ip ospf
Routing Process "ospf 1" with ID 10.96.224.120
Start time: 00:01:18.760, Time elapsed: 1w0d
Supports only single TOS(TOS0) routes
Supports opaque LSA
Supports Link-local Signaling (LLS)
Supports area transit capability
Supports NSSA (compatible with RFC 3101)
Supports Database Exchange Summary List Optimization (RFC 5243)
Event-log enabled, Maximum number of events: 1000, Mode: cyclic
It is an autonomous system boundary router
Redistributing External Routes from,
static with metric mapped to 250, includes subnets in redistribution
bgp 65276 with metric mapped to 5, includes subnets in redistribution
Router is not originating router-LSAs with maximum metric
Initial SPF schedule delay 10 msecs
Minimum hold time between two consecutive SPFs 100 msecs
Maximum wait time between two consecutive SPFs 5000 msecs
Incremental-SPF enabled
Initial LSA throttle delay 10 msecs
Minimum hold time for LSA throttle 100 msecs
Maximum wait time for LSA throttle 5000 msecs
Minimum LSA arrival 80 msecs
LSA group pacing timer 240 secs
Interface flood pacing timer 33 msecs
Retransmission pacing timer 66 msecs
EXCHANGE/LOADING adjacency limit: initial 300, process maximum 300
Number of external LSA 2734. Checksum Sum 0x55A59B4
Number of opaque AS LSA 0. Checksum Sum 0x000000
Number of DCbitless external and opaque AS LSA 0
Number of DoNotAge external and opaque AS LSA 0
Number of areas in this router is 1. 1 normal 0 stub 0 nssa
Number of areas transit capable is 0
External flood list length 0
Non-Stop Forwarding enabled
Router is not operating in SSO mode
Global RIB has not converged yet
IETF NSF helper support enabled
Cisco NSF helper support enabled
Reference bandwidth unit is 100 mbps
Area BACKBONE(0)
Number of interfaces in this area is 9 (1 loopback)
Area has message digest authentication
SPF algorithm last executed 00:22:25.078 ago
SPF algorithm executed 407 times
Area ranges are
Number of LSA 243. Checksum Sum 0x86BC58
Number of opaque link LSA 0. Checksum Sum 0x000000
Number of DCbitless LSA 0
Number of indication LSA 0
Number of DoNotAge LSA 0
Flood list length 0
OSPF config of a core switch:
router ospf 1000
router-id 10.96.0.2
max-metric router-lsa on-startup 120
ispf
nsf
area 0.0.0.0 authentication message-digest
area 1.0.0.11 stub no-summary
area 1.0.0.11 range 10.98.0.0 255.255.192.0 cost 10
area 1.0.0.20 nssa
area 1.1.0.20 authentication message-digest
area 1.1.0.20 nssa no-summary
timers throttle spf 10 100 5000
timers throttle lsa 10 100 5000
timers lsa arrival 80
bfd all-interfaces
<cut>
Each core switch has 11 sub-second OSPF neighbors.
Routing Process "ospf 1000" with ID 10.96.0.2
Start time: 00:02:28.336, Time elapsed: 1w0d
Supports only single TOS(TOS0) routes
Supports opaque LSA
Supports Link-local Signaling (LLS)
Supports area transit capability
Supports NSSA (compatible with RFC 3101)
Event-log enabled, Maximum number of events: 1000, Mode: cyclic
It is an area border and autonomous system boundary router
Redistributing External Routes from,
Originating router-LSAs with maximum metric
Condition: on startup for 120 seconds, State: inactive
Unset reason: timer expired, Originated for 120 seconds
Unset time: 00:04:28.344, Time elapsed: 1w0d
Initial SPF schedule delay 10 msecs
Minimum hold time between two consecutive SPFs 100 msecs
Maximum wait time between two consecutive SPFs 5000 msecs
Incremental-SPF enabled
Initial LSA throttle delay 10 msecs
Minimum hold time for LSA throttle 100 msecs
Maximum wait time for LSA throttle 5000 msecs
Minimum LSA arrival 80 msecs
LSA group pacing timer 240 secs
Interface flood pacing timer 33 msecs
Retransmission pacing timer 66 msecs
Number of external LSA 2734. Checksum Sum 0x55A53B7
Number of opaque AS LSA 0. Checksum Sum 0x000000
Number of DCbitless external and opaque AS LSA 0
Number of DoNotAge external and opaque AS LSA 0
Number of areas in this router is 4. 1 normal 1 stub 2 nssa
Number of areas transit capable is 0
External flood list length 0
Non-Stop Forwarding enabled
Global RIB has not converged yet
IETF NSF helper support enabled
Cisco NSF helper support enabled
BFD is enabled
Reference bandwidth unit is 100 mbps
Area BACKBONE(0.0.0.0)
Number of interfaces in this area is 12 (2 loopback)
Area has message digest authentication
SPF algorithm last executed 00:22:43.676 ago
SPF algorithm executed 1204 times
Area ranges are
Number of LSA 243. Checksum Sum 0x86BC58
Number of opaque link LSA 0. Checksum Sum 0x000000
Number of DCbitless LSA 0
Number of indication LSA 0
Number of DoNotAge LSA 0
Flood list length 0
Area 1.0.0.11
Number of interfaces in this area is 0
It is a stub area, no summary LSA in this area
Generates stub default route with cost 1
Area has no authentication
SPF algorithm last executed 1w0d ago
SPF algorithm executed 4 times
Area ranges are
10.98.0.0/18 Passive Advertise
Number of LSA 1. Checksum Sum 0x00D663
Number of opaque link LSA 0. Checksum Sum 0x000000
Number of DCbitless LSA 0
Number of indication LSA 0
Number of DoNotAge LSA 0
Flood list length 0
Area 1.0.0.20
Number of interfaces in this area is 0
It is a NSSA area
Perform type-7/type-5 LSA translation
Area has no authentication
SPF algorithm last executed 1w0d ago
SPF algorithm executed 4 times
Area ranges are
Number of LSA 224. Checksum Sum 0x71161D
Number of opaque link LSA 0. Checksum Sum 0x000000
Number of DCbitless LSA 0
Number of indication LSA 0
Number of DoNotAge LSA 0
Flood list length 0
Area 1.1.0.20
Number of interfaces in this area is 1
It is a NSSA area
Perform type-7/type-5 LSA translation
Area has message digest authentication
SPF algorithm last executed 1w0d ago
SPF algorithm executed 5 times
Area ranges are
Number of LSA 8. Checksum Sum 0x058E4B
Number of opaque link LSA 0. Checksum Sum 0x000000
Number of DCbitless LSA 0
Number of indication LSA 0
Number of DoNotAge LSA 0
Flood list length 0
01-04-2016 04:59 AM
This is going to be tricky because there were two changes at the same time. Since the 6500s have 11 or so other sub-second OSPF routers attached, and assuming they are not having the same issue, you would like to think it is not the core with the issue.
I am most interested in the core reporting 'FULL to DOWN, Neighbor Down: Dead Timer expired'. It thinks it didn't get its hellos. What is the physical connectivity between these two? Any chance of an issue here? Seems unlikely if it is happening on a regular 60-minute schedule.
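To make the dead-timer reasoning concrete: the neighbor restarts its dead timer on every hello it receives and declares the adjacency down once no hello arrives within the dead interval. Below is a minimal sketch of that check applied to hello arrival times (for example, taken from a packet capture). The 1-second dead interval and the 250 ms hello spacing are assumptions, since the interface timers are not shown in this thread, and the function name is hypothetical.

```python
def dead_timer_expiries(hello_times, dead_interval=1.0):
    """Return (last_hello_time, gap) for every gap between consecutive hellos
    that is longer than the dead interval. Times are in seconds."""
    expiries = []
    for prev, cur in zip(hello_times, hello_times[1:]):
        gap = cur - prev
        if gap > dead_interval:
            expiries.append((prev, gap))
    return expiries

# Hypothetical arrivals: hellos every 250 ms, then a 1.5 s stall on the sender.
arrivals = [0.0, 0.25, 0.5, 0.75, 2.25, 2.5]
print(dead_timer_expiries(arrivals))  # [(0.75, 1.5)]
```

With a dead interval this short, a sender that stalls for little more than a second is enough to drop the adjacency, even if the link itself is healthy.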
Is there any QoS on the link between the two? If there is, any chance a flood of higher priority traffic is starving OSPF?
Does network monitoring show any traffic bursts in general happening every 60 minutes? Being sub-second, you don't need a very big burst to potentially upset OSPF. If you are not using QoS, it might be worth an experiment to try making OSPF "priority" traffic.
Is there anything else special about the link between the two? Using BFD or anything else like that?
I wonder if the ASR has miscalculated a timer, or failed to schedule something at the time it was meant to be scheduled. Do you have any other ASR1001-Xs in your environment? If so, are any running sub-second OSPF? If so, what software version are they running?
01-04-2016 06:48 AM
Hello p.dath,
1) I agree: "dead timer expired" seems to indicate the core is no longer receiving hellos from the ASR. Physical connectivity is an LX fiber. I think the ASR is so busy with its recalculation that it misses sending OSPF hellos. This results in the core resetting the adjacency, which generates even more LSA updates.
2) QoS = standard QoS, and it is enabled. It is unlikely that high-priority traffic is starving both core links at the same time, but I will check anyway. Standard network monitoring is not indicating anything special here.
3) Other special things: I noticed that "bfd all-interfaces" is configured on the core. This command is not supported on the ASR. This is the only difference I see in the config.
4) I will try to reload the OSPF process on the ASR anyway, just to be sure.
5) Another option that will improve convergence is the following: the fiber link is currently treated as an OSPF NETWORK type because it is a /29 subnet. Because it is a physical link, we can convert the OSPF network type to point-to-point and eliminate sub-second timers completely. I have done this before on other routers to lighten the CPU load; on the C6500 core, however, this has never been a problem.
6) Other ASRs: no, this is the first one.
Kind regards,
Geert
01-04-2016 06:51 AM
I vote for option 5. It is the best all-round solution.
12-03-2017 06:02 AM
Hi,
Did you find the solution?
I'm facing the same issue.