OSPF resets every hour

gnijs
Level 4

We recently replaced a C3750 switch that was doing routing with an ASR1001X router (running 3.16.01a/15.5(3)S1a).

This device is connected to our 2 core switches (C6500-SUP720). All are configured in the backbone area and run sub-second OSPF timers. The backbone contains roughly 40 routers and carries roughly 2300 prefixes. The core was also upgraded to 15.1(2)SY6 during the same maintenance window.

Since the replacement, the OSPF adjacencies between this router and both core switches reset exactly every hour.

On the ASR I only see "LOADING TO FULL, Loading Done" messages. On the core switches, the events start with a "FULL to DOWN, Neighbor Down: Dead Timer expired" message, immediately followed by the adjacency coming back up: "Loading to FULL, Loading DONE".

A debug on the ASR gives:

Jan  4 10:41:33: OSPF-1 LSGEN: Scheduling rtr LSA for area 0, build flag 0x41 (from 0x7F3FB4A98842)
Jan  4 10:41:33: OSPF-1 LSGEN: Scheduling rtr LSA for area 0, build flag 0x41 (from 0x7F3FB4B098AE)
Jan  4 10:41:33: OSPF-1 LSGEN: Scheduling rtr LSA for area 0, build flag 0x41 (from 0x7F3FB4A98842)
Jan  4 10:41:33: OSPF-1 LSGEN: Scheduling rtr LSA for area 0, build flag 0x41 (from 0x7F3FB4B098AE)
Jan  4 10:41:33: OSPF-1 SPF  : Detect change in LSA type 1, LSID 10.96.0.2 from 10.96.0.2 area 0
Jan  4 10:41:33: OSPF-1 INTRA: Insert LSA to New_LSA list type 1, LSID 10.96.0.2, from 10.96.0.2 area 0
Jan  4 10:41:33: OSPF-1 MON  : Schedule Incremental SPF without microloop avoidance in area 0, change in LSA R/10.96.0.2/10.96.0.2
Jan  4 10:41:33: OSPF-1 MON  : reset throttling to 10ms next wait-interval 100ms
Jan  4 10:41:33: OSPF-1 MON  : Schedule SPF in 10ms: spf_time 1w0d, wait_interval 10ms
Jan  4 10:41:33: OSPF-1 SPF  : Detect MAXAGE in LSA type 2, LS ID 10.96.2.89, from 10.96.0.2
Jan  4 10:41:33: OSPF-1 SPF  : Detect generic change in LSA type 2, LSID 10.96.2.89, from 10.96.0.2 area 0
Jan  4 10:41:33: OSPF-1 INTRA: Insert LSA to New_LSA list type 2, LSID 10.96.2.89, from 10.96.0.2 area 0
Jan  4 10:41:33: OSPF-1 MON  : Schedule Incremental SPF without microloop avoidance in area 0, change in LSA N/10.96.2.89/10.96.0.2
Jan  4 10:41:33: OSPF-1 SPF  : Detect change in LSA type 1, LSID 10.96.0.1 from 10.96.0.1 area 0
Jan  4 10:41:33: OSPF-1 INTRA: Insert LSA to New_LSA list type 1, LSID 10.96.0.1, from 10.96.0.1 area 0
Jan  4 10:41:33: OSPF-1 MON  : Schedule Incremental SPF without microloop avoidance in area 0, change in LSA R/10.96.0.1/10.96.0.1
Jan  4 10:41:33: OSPF-1 LSGEN: Rate limit LSA generation for 1/10.96.224.120/10.96.224.120
Jan  4 10:41:33: OSPF-1 MON  : Begin SPF at 605109.769ms, process time 54998ms
Jan  4 10:41:33: OSPF-1 MON  : Last spf_time 1w0d, wait_interval 10ms
Jan  4 10:41:33: OSPF-1 INTRA: Running SPF for area 0, SPF-type Incremental
Jan  4 10:41:33: OSPF-1 INTRA: Initializing to run spf
Jan  4 10:41:33: OSPF-1 INTRA: Running incremental SPF for area 0 

-> after which a lot of messages are generated (logging limited).

I have no idea why this happens every hour. The config is exactly the same as before, and so are the timers. It must be caused by some difference between IOS and IOS-XE. The ASR is running at very low CPU, so I can't imagine it is a CPU problem.

regards,

Geert

PS.

OSPF config of ASR-1001X

router ospf 1
 router-id 10.96.224.120
 ispf
 nsf
 area 0 authentication message-digest
 timers throttle spf 10 100 5000
 timers throttle lsa 10 100 5000
 timers lsa arrival 80

<cut>

It has 4 OSPF neighbors.

show ip ospf
 Routing Process "ospf 1" with ID 10.96.224.120
 Start time: 00:01:18.760, Time elapsed: 1w0d
 Supports only single TOS(TOS0) routes
 Supports opaque LSA
 Supports Link-local Signaling (LLS)
 Supports area transit capability
 Supports NSSA (compatible with RFC 3101)
 Supports Database Exchange Summary List Optimization (RFC 5243)
 Event-log enabled, Maximum number of events: 1000, Mode: cyclic
 It is an autonomous system boundary router
 Redistributing External Routes from,
    static with metric mapped to 250, includes subnets in redistribution
    bgp 65276 with metric mapped to 5, includes subnets in redistribution
 Router is not originating router-LSAs with maximum metric
 Initial SPF schedule delay 10 msecs
 Minimum hold time between two consecutive SPFs 100 msecs
 Maximum wait time between two consecutive SPFs 5000 msecs
 Incremental-SPF enabled
 Initial LSA throttle delay 10 msecs
 Minimum hold time for LSA throttle 100 msecs
 Maximum wait time for LSA throttle 5000 msecs
 Minimum LSA arrival 80 msecs
 LSA group pacing timer 240 secs
 Interface flood pacing timer 33 msecs
 Retransmission pacing timer 66 msecs
 EXCHANGE/LOADING adjacency limit: initial 300, process maximum 300
 Number of external LSA 2734. Checksum Sum 0x55A59B4
 Number of opaque AS LSA 0. Checksum Sum 0x000000
 Number of DCbitless external and opaque AS LSA 0
 Number of DoNotAge external and opaque AS LSA 0
 Number of areas in this router is 1. 1 normal 0 stub 0 nssa
 Number of areas transit capable is 0
 External flood list length 0
 Non-Stop Forwarding enabled
    Router is not operating in SSO mode
    Global RIB has not converged yet
 IETF NSF helper support enabled
 Cisco NSF helper support enabled
 Reference bandwidth unit is 100 mbps
    Area BACKBONE(0)
        Number of interfaces in this area is 9 (1 loopback)
        Area has message digest authentication
        SPF algorithm last executed 00:22:25.078 ago
        SPF algorithm executed 407 times
        Area ranges are
        Number of LSA 243. Checksum Sum 0x86BC58
        Number of opaque link LSA 0. Checksum Sum 0x000000
        Number of DCbitless LSA 0
        Number of indication LSA 0
        Number of DoNotAge LSA 0
        Flood list length 0

OSPF config of a core switch:

router ospf 1000
 router-id 10.96.0.2
 max-metric router-lsa on-startup 120
 ispf
 nsf
 area 0.0.0.0 authentication message-digest
 area 1.0.0.11 stub no-summary
 area 1.0.0.11 range 10.98.0.0 255.255.192.0 cost 10
 area 1.0.0.20 nssa
 area 1.1.0.20 authentication message-digest
 area 1.1.0.20 nssa no-summary
 timers throttle spf 10 100 5000
 timers throttle lsa 10 100 5000
 timers lsa arrival 80
 bfd all-interfaces

<cut>

 

Each core switch has 11 sub-second OSPF neighbors.

Routing Process "ospf 1000" with ID 10.96.0.2
 Start time: 00:02:28.336, Time elapsed: 1w0d
 Supports only single TOS(TOS0) routes
 Supports opaque LSA
 Supports Link-local Signaling (LLS)
 Supports area transit capability
 Supports NSSA (compatible with RFC 3101)
 Event-log enabled, Maximum number of events: 1000, Mode: cyclic
 It is an area border and autonomous system boundary router
 Redistributing External Routes from,
 Originating router-LSAs with maximum metric
    Condition: on startup for 120 seconds, State: inactive
    Unset reason: timer expired, Originated for 120 seconds
    Unset time: 00:04:28.344, Time elapsed: 1w0d
 Initial SPF schedule delay 10 msecs
 Minimum hold time between two consecutive SPFs 100 msecs
 Maximum wait time between two consecutive SPFs 5000 msecs
 Incremental-SPF enabled
 Initial LSA throttle delay 10 msecs
 Minimum hold time for LSA throttle 100 msecs
 Maximum wait time for LSA throttle 5000 msecs
 Minimum LSA arrival 80 msecs
 LSA group pacing timer 240 secs
 Interface flood pacing timer 33 msecs
 Retransmission pacing timer 66 msecs
 Number of external LSA 2734. Checksum Sum 0x55A53B7
 Number of opaque AS LSA 0. Checksum Sum 0x000000
 Number of DCbitless external and opaque AS LSA 0
 Number of DoNotAge external and opaque AS LSA 0
 Number of areas in this router is 4. 1 normal 1 stub 2 nssa
 Number of areas transit capable is 0
 External flood list length 0
 Non-Stop Forwarding enabled
    Global RIB has not converged yet
 IETF NSF helper support enabled
 Cisco NSF helper support enabled
 BFD is enabled
 Reference bandwidth unit is 100 mbps
    Area BACKBONE(0.0.0.0)
        Number of interfaces in this area is 12 (2 loopback)
        Area has message digest authentication
        SPF algorithm last executed 00:22:43.676 ago
        SPF algorithm executed 1204 times
        Area ranges are
        Number of LSA 243. Checksum Sum 0x86BC58
        Number of opaque link LSA 0. Checksum Sum 0x000000
        Number of DCbitless LSA 0
        Number of indication LSA 0
        Number of DoNotAge LSA 0
        Flood list length 0
    Area 1.0.0.11
        Number of interfaces in this area is 0
        It is a stub area, no summary LSA in this area
        Generates stub default route with cost 1
        Area has no authentication
        SPF algorithm last executed 1w0d ago
        SPF algorithm executed 4 times
        Area ranges are
           10.98.0.0/18 Passive Advertise
        Number of LSA 1. Checksum Sum 0x00D663
        Number of opaque link LSA 0. Checksum Sum 0x000000
        Number of DCbitless LSA 0
        Number of indication LSA 0
        Number of DoNotAge LSA 0
        Flood list length 0
    Area 1.0.0.20
        Number of interfaces in this area is 0
        It is a NSSA area
        Perform type-7/type-5 LSA translation
        Area has no authentication
        SPF algorithm last executed 1w0d ago
        SPF algorithm executed 4 times
        Area ranges are
        Number of LSA 224. Checksum Sum 0x71161D
        Number of opaque link LSA 0. Checksum Sum 0x000000
        Number of DCbitless LSA 0
        Number of indication LSA 0
        Number of DoNotAge LSA 0
        Flood list length 0
    Area 1.1.0.20
        Number of interfaces in this area is 1
        It is a NSSA area
        Perform type-7/type-5 LSA translation
        Area has message digest authentication
        SPF algorithm last executed 1w0d ago
        SPF algorithm executed 5 times
        Area ranges are
        Number of LSA 8. Checksum Sum 0x058E4B
        Number of opaque link LSA 0. Checksum Sum 0x000000
        Number of DCbitless LSA 0
        Number of indication LSA 0
        Number of DoNotAge LSA 0
        Flood list length 0

4 Replies

Philip D'Ath
VIP Alumni

This is going to be tricky because two changes were made at the same time. Since the 6500s have 11 or so other sub-second OSPF routers attached, and assuming those are not having the same issue, you would like to think the core is not where the problem lies.

I am most interested in the core reporting 'FULL to DOWN, Neighbor Down: Dead Timer expired'. It thinks it didn't get its hellos. What is the physical connectivity between these two? Any chance of an issue there? That seems unlikely if it is happening on a regular 60-minute schedule.
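To confirm whether hellos are really being missed, commands along these lines could be run on the core switch (and the equivalent on the ASR) around the time of the hourly reset. This is only a sketch: the interface name is a placeholder, not taken from the posted config, and the debugs are very chatty with sub-second hellos, so they are best enabled briefly.

```
! Interface name below is hypothetical; substitute the link towards the ASR
show ip ospf neighbor detail
show ip ospf interface GigabitEthernet1/1
! The event log is enabled according to the "show ip ospf" output above
show ip ospf events
! Use with care: sub-second hellos generate a lot of debug output
debug ip ospf hello
debug ip ospf adj
```

Comparing the hello and dead intervals reported by `show ip ospf interface` on both ends would also rule out a simple timer mismatch.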

Is there any QoS on the link between the two?  If there is, any chance a flood of higher priority traffic is starving OSPF?

Does network monitoring show any traffic bursts in general happening every 60 minutes? With sub-second timers you don't need a very big burst to potentially upset OSPF. If you are not using QoS, it might be worth an experiment to try making OSPF "priority" traffic.
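As a rough illustration of that experiment, an MQC policy on the ASR side might look like the sketch below. The class-map, policy-map, and interface names are made up for this example, and it assumes OSPF packets leave the router marked CS6 (IP precedence 6), which is the usual IOS marking for locally originated routing protocol traffic. QoS on the 6500 side is configured differently and is not covered here.

```
! Names and interface are hypothetical, for illustration only
class-map match-any ROUTING-CONTROL
 ! Assumes OSPF packets are marked CS6 (IP precedence 6)
 match dscp cs6
!
policy-map CORE-UPLINK-OUT
 class ROUTING-CONTROL
  ! Guarantee a small share of bandwidth to routing control traffic
  bandwidth percent 5
 class class-default
  fair-queue
!
interface TenGigabitEthernet0/0/0
 service-policy output CORE-UPLINK-OUT
```

A `priority` class could be used instead of `bandwidth`, but a priority class is policed under congestion, so a bandwidth guarantee is often the safer choice for control traffic.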

Is there anything else special about the link between the two? Using BFD or anything else like that?

I wonder if the ASR has miscalculated a timer, or failed to schedule something at the time it was meant to be scheduled. Do you have any other ASR1001-Xs in your environment? If so, are any running sub-second OSPF, and what software version are they running?

Hello p.dath,

1) I agree: "dead timer expired" seems to indicate that the core is no longer receiving hellos from the ASR. Physical connectivity is an LX fiber. I think the ASR is so busy with its recalculation that it misses sending OSPF hellos; this results in the core resetting the adjacency, which generates even more LSA updates.

2) QoS: standard QoS is enabled. It is unlikely that high-priority traffic is starving both core links at the same time, but I will check anyway. Standard network monitoring does not show anything special here.

3) Other special things: I noticed that "bfd all-interfaces" is configured on the core. This command is not supported on the ASR. This is the only difference I see in the config.

4) I will try to reload the OSPF process on the ASR anyway, just to be sure.

5) Another option that will improve convergence is the following: the fiber link is currently treated as an OSPF broadcast network because it is a /29 subnet. Because it is a physical link, we can change the OSPF network type to point-to-point and eliminate the sub-second timers completely. I have done this before on other routers to lighten the CPU load; on the C6500 core, however, CPU load has never been a problem.

6) Other ASRs: no, this is the first one.

Kind regards,

Geert

I vote for option 5. It is the best all-round solution.
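For reference, option 5 could be sketched roughly as follows. The interface name is a placeholder, and the sketch assumes the sub-second timers were configured per interface with `ip ospf dead-interval minimal hello-multiplier <n>`, which the posted (cut) configs do not show. The same change would be needed on both ends of the link, and it only makes sense if exactly two routers share the /29.

```
! Interface name is hypothetical; apply the same change on the ASR and on the core
interface TenGigabitEthernet0/0/0
 ! Assumed: removes the sub-second hello setting and returns to default timers
 no ip ospf dead-interval
 ! Treat the link as point-to-point: no DR/BDR election, no type 2 network LSA
 ip ospf network point-to-point
```

With default timers, failure detection then depends on the interface going down (or on BFD, if it is configured on both sides) rather than on the dead timer.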

Hi,

 

Did you find the solution?

I'm facing the same issue.