Re: Core Network OSPF Down

mmendis · ‎06-02-2020

Jun 2 18:33:19.604 GMT: %BGP-3-NOTIFICATION: sent to neighbor 10.10.10.1 4/0 (hold time expired) 0 bytes
Jun 2 18:33:19.604 GMT: %BGP-5-NBR_RESET: Neighbor 10.10.10.1 reset (BGP Notification sent)
Jun 2 18:33:19.606 GMT: %BGP-5-ADJCHANGE: neighbor 10.10.10.1 Down BGP Notification sent
Jun 2 18:33:19.606 GMT: %BGP_SESSION-5-ADJCHANGE: neighbor 10.10.10.1 IPv4 Unicast topology base removed from session BGP Notification sent
Jun 2 18:33:20.493 GMT: %OSPF-5-ADJCHG: Process 10, Nbr 10.10.10.1 on Port-channel1 from FULL to DOWN, Neighbor Down: Dead timer expired
Jun 2 18:33:20.493 GMT: %OSPF-5-ADJCHG: Process 10, Nbr 10.10.10.1 on Port-channel1 from FULL to DOWN, Neighbor Down: Dead timer expired
Jun 2 18:33:20.495 GMT: %OSPF-5-ADJCHG: Process 10, Nbr 10.10.10.1 on Port-channel1 from FULL to DOWN, Neighbor Down: Dead timer expired
Jun 2 18:33:20.608 GMT: %OSPF-5-ADJCHG: Process 10, Nbr 10.10.10.1 on Port-channel1 from FULL to DOWN, Neighbor Down: Dead timer expired
Jun 2 18:33:20.830 GMT: %OSPF-5-ADJCHG: Process 10, Nbr 10.10.10.1 on Port-channel1 from LOADING to FULL, Loading Done
Jun 2 18:33:26.712 GMT: %BGP-3-NOTIFICATION: received from neighbor 10.10.10.1 4/0 (hold time expired) 0 bytes
Jun 2 18:33:26.712 GMT: %BGP-5-NBR_RESET: Neighbor 10.10.10.1 reset (BGP Notification received)
Jun 2 18:33:26.714 GMT: %BGP-5-ADJCHANGE: neighbor 10.10.10.1 Down BGP Notification received
Jun 2 18:33:26.714 GMT: %BGP_SESSION-5-ADJCHANGE: neighbor 10.10.10.1 IPv4 Unicast topology base removed from session BGP Notification received
Jun 2 18:33:27.615 GMT: %OSPF-5-ADJCHG: Process 10, Nbr 10.10.10.1 on Port-channel1 from LOADING to FULL, Loading Done
Jun 2 18:33:27.623 GMT: %OSPF-5-ADJCHG: Process 10, Nbr 10.10.10.1 on Port-channel1 from LOADING to FULL, Loading Done
Jun 2 18:33:27.630 GMT: %OSPF-5-ADJCHG: Process 10, Nbr 10.10.10.1 on Port-channel1 from LOADING to FULL, Loading Done
Jun 2 18:33:32.690 GMT: %BGP-5-ADJCHANGE: neighbor 10.10.10.1 Up
Jun 2 18:33:40.882 GMT: %BGP-5-ADJCHANGE: neighbor 10.10.10.1 Up

I can't find out why OSPF suddenly went down and came back up. I checked the connectivity and found no issue. I need some insight on how to troubleshoot this.

GailLSimpsonLSimpson50667 · ‎06-02-2020

I am also facing a similar issue. Do let me know as well if you find any helpful guide.

pieterh · ‎06-02-2020

Yes, and what is the question we can help you with?

mmendis · ‎06-02-2020

Will do.

mmendis · ‎06-02-2020

We can't find out why OSPF suddenly went down and came back up. We checked the connectivity and found no issue.

Leo Laohoo · ‎06-02-2020

I second what @Giuseppe Larosa said.
Also look at the times BGP and OSPF went down and came back up. They point to a link flap.
The logs do not show whether Po1 or any of its physical member interfaces had problems.
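
To make the timing argument concrete, here is a minimal Python sketch that parses a subset of the syslog lines from the first post and measures how close together the BGP and OSPF events are. The helper names (`direction`, `parse`, `summarize`) are made up for this illustration, and the year 2020 is assumed from the post date because the log timestamps do not carry one.

```python
import re
from datetime import datetime

# Subset of the syslog lines from the first post in this thread.
LOG = """\
Jun 2 18:33:19.604 GMT: %BGP-3-NOTIFICATION: sent to neighbor 10.10.10.1 4/0 (hold time expired) 0 bytes
Jun 2 18:33:19.606 GMT: %BGP-5-ADJCHANGE: neighbor 10.10.10.1 Down BGP Notification sent
Jun 2 18:33:20.493 GMT: %OSPF-5-ADJCHG: Process 10, Nbr 10.10.10.1 on Port-channel1 from FULL to DOWN, Neighbor Down: Dead timer expired
Jun 2 18:33:20.830 GMT: %OSPF-5-ADJCHG: Process 10, Nbr 10.10.10.1 on Port-channel1 from LOADING to FULL, Loading Done
Jun 2 18:33:27.630 GMT: %OSPF-5-ADJCHG: Process 10, Nbr 10.10.10.1 on Port-channel1 from LOADING to FULL, Loading Done
Jun 2 18:33:32.690 GMT: %BGP-5-ADJCHANGE: neighbor 10.10.10.1 Up
Jun 2 18:33:40.882 GMT: %BGP-5-ADJCHANGE: neighbor 10.10.10.1 Up
"""

# timestamp, timezone, %FACILITY-severity-MNEMONIC: message text
LINE_RE = re.compile(r"^(\w{3} +\d+ [\d:.]+) \w+: %([A-Z_]+)-\d-[A-Z_]+: (.*)$")

def direction(text):
    """Classify a message as 'down', 'up', or None (informational)."""
    if "to DOWN" in text or " Down " in text:
        return "down"
    if "to FULL" in text or text.endswith(" Up"):
        return "up"
    return None

def parse(log, year=2020):
    """Return (timestamp, facility, direction) tuples for recognised lines."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            stamp, facility, text = m.groups()
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
            events.append((when, facility, direction(text)))
    return events

def summarize(events):
    """Record the first 'down' and the last 'up' seen for each protocol."""
    summary = {}
    for when, facility, state in events:
        entry = summary.setdefault(facility, {})
        if state == "down":
            entry.setdefault("first_down", when)
        elif state == "up":
            entry["last_up"] = when
    return summary

summary = summarize(parse(LOG))
for proto, entry in summary.items():
    span = (entry["last_up"] - entry["first_down"]).total_seconds()
    print(f"{proto}: first down {entry['first_down'].time()}, "
          f"last up {entry['last_up'].time()}, span {span:.3f}s")
gap = (summary["OSPF"]["first_down"] - summary["BGP"]["first_down"]).total_seconds()
print(f"OSPF went down {gap:.3f}s after BGP")
```

On these lines the first OSPF adjacency drops less than a second after the first BGP session. Two protocols with very different timers failing almost together is the pattern that suggests a common cause on the link or the Layer 2 path rather than a problem in either protocol.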

Giuseppe Larosa · ‎06-02-2020

Hello @mmendis ,

You have both OSPF and BGP running over Port-channel 1.

The log messages show the following:

a) The BGP session is torn down because the BGP hold time has expired (notification sent).

The BGP session is removed and all BGP prefixes learned over it are removed from the IPv4 unicast topology.

b) OSPF also sees the dead interval expire, and the OSPF adjacency is torn down on the same Port-channel 1.

At this point all user traffic is removed from port channel 1 and both OSPF and BGP have a chance to recover.

Note that if there is too much traffic over the port channel and either of the two devices does not use QoS to protect OSPF hellos and BGP keepalives, all of the above can repeat many times: under heavy traffic, the lack of any implicit or explicit protection for OSPF and BGP messages can cause instability.

You need to verify which platforms are in use on the local and remote devices, how many member links are in Port-channel 1, and whether there is any form of QoS protection for OSPF and BGP.

You can use show interface on the member links and look for output drops and input errors.

The show version output from both devices would also be useful.
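
As a rough illustration of what to look for in that output, the Python sketch below pulls the drop and error counters out of `show interfaces` text and reports the non-zero ones. The sample lines are taken from the Port-channel1 output posted later in this thread; the names `PATTERNS`, `counters` and `nonzero` are hypothetical helpers, not part of any Cisco tooling, and the same approach would apply to each member link.

```python
import re

# Counter lines copied from the Port-channel1 output posted in this thread.
SAMPLE = """\
Input queue: 1/1500/0/2 (size/max/drops/flushes); Total output drops: 0
2 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 output errors, 0 collisions, 0 interface resets
80270 unknown protocol drops
"""

# One regular expression per counter of interest; group 1 is the value.
PATTERNS = {
    "input_queue_drops": r"Input queue: \d+/\d+/(\d+)/\d+",
    "output_drops": r"Total output drops: (\d+)",
    "input_errors": r"(\d+) input errors",
    "crc": r"(\d+) CRC",
    "output_errors": r"(\d+) output errors",
    "unknown_protocol_drops": r"(\d+) unknown protocol drops",
}

def counters(text):
    """Extract every counter that appears in the given show output."""
    found = {}
    for name, pattern in PATTERNS.items():
        m = re.search(pattern, text)
        if m:
            found[name] = int(m.group(1))
    return found

def nonzero(found):
    """Keep only the counters that are worth a closer look."""
    return {name: value for name, value in found.items() if value}

print(nonzero(counters(SAMPLE)))
```

Because the counters in that output have never been cleared, a single snapshot says little about when the errors happened. Comparing two snapshots taken some time apart shows whether a counter is still increasing.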

Hope to help

Giuseppe

mmendis · ‎06-02-2020

Hi Giuseppe,

Your input is very helpful, thank you.

There is no QoS enabled for the BGP/OSPF hello packets.

I checked sh int for Port-channel 1; below is the output:

Port-channel1 is up, line protocol is up
Hardware is GEChannel, address is 683b.7884.22c0 (bia 683b.7884.22c0)
Description: Downlink | 0 Range Stack
Internet address is 10.10.10.1/24
MTU 1500 bytes, BW 4000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 16/255, rxload 14/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
ARP type: ARPA, ARP Timeout 04:00:00
No. of active members in this channel: 4
Member 0 : GigabitEthernet0/0/0 , Full-duplex, 1000Mb/s
Member 1 : GigabitEthernet0/0/1 , Full-duplex, 1000Mb/s
Member 2 : GigabitEthernet0/0/2 , Full-duplex, 1000Mb/s
Member 3 : GigabitEthernet0/0/3 , Full-duplex, 1000Mb/s
No. of PF_JUMBO supported members in this channel : 4
Last input 00:00:00, output never, output hang never
Last clearing of "show interface" counters never
Input queue: 1/1500/0/2 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/160 (size/max)
30 second input rate 244395000 bits/sec, 68535 packets/sec
30 second output rate 251064000 bits/sec, 46507 packets/sec
1712265985567 packets input, 686051887172913 bytes, 0 no buffer
Received 321745649 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
2 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 138357633 multicast, 0 pause input
1190291669430 packets output, 894255286980320 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
80270 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out
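
For a quick sanity check of how busy the bundle is, the following Python sketch turns the 30-second rates above into utilization and average packet size. It is only arithmetic on the figures shown in this output; the function names are made up for the example, and it assumes that BW 4000000 Kbit/sec is the aggregate of the four 1 Gb/s members.

```python
# Figures copied from the Port-channel1 output above.
BW_KBIT = 4_000_000        # BW 4000000 Kbit/sec
IN_BPS = 244_395_000       # 30 second input rate, bits/sec
OUT_BPS = 251_064_000      # 30 second output rate, bits/sec
IN_PPS = 68_535            # 30 second input rate, packets/sec
OUT_PPS = 46_507           # 30 second output rate, packets/sec

def utilization(rate_bps, bw_kbit=BW_KBIT):
    """Fraction of the configured bandwidth used by the given bit rate."""
    return rate_bps / (bw_kbit * 1000)

def avg_packet_bytes(rate_bps, rate_pps):
    """Average packet size in bytes implied by the bit and packet rates."""
    return rate_bps / 8 / rate_pps

for label, bps, pps in (("input", IN_BPS, IN_PPS), ("output", OUT_BPS, OUT_PPS)):
    print(f"{label}: {utilization(bps) * 100:.1f}% of bundle bandwidth, "
          f"about {avg_packet_bytes(bps, pps):.0f} bytes per packet")

# The output rate scaled to the x/255 load format matches txload 16/255.
print(f"output load: {round(utilization(OUT_BPS) * 255)}/255")
```

Both directions come out at roughly 6 percent of the bundle, so these averages do not show sustained congestion. A 30-second average can hide short bursts, though, and since a port channel balances traffic per flow, a single 1 Gb/s member can be much busier than the bundle average, which is why the per-member counters matter.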

I am not clear what you mean by platform: the device type or the OS?

There are 25 OSPF members and 32 BGP members on the link.

All the OSPF members went down.

sh ip ospf neighbor detail | s Neighbor is up
Neighbor is up for 04:21:44
Neighbor is up for 04:21:44
Neighbor is up for 04:21:49
Neighbor is up for 04:21:49
Neighbor is up for 04:21:49
Neighbor is up for 04:21:44
Neighbor is up for 04:21:49
Neighbor is up for 04:21:47
Neighbor is up for 04:21:51
Neighbor is up for 04:21:52
Neighbor is up for 04:21:48
Neighbor is up for 04:21:47
Neighbor is up for 03:19:38
Neighbor is up for 03:19:38
Neighbor is up for 03:19:38
Neighbor is up for 04:21:44
Neighbor is up for 03:19:45
Neighbor is up for 04:21:48
Neighbor is up for 04:21:46
Neighbor is up for 04:21:43
Neighbor is up for 04:21:44
Neighbor is up for 04:21:49
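
The uptimes above fall into two visibly different groups. A small Python sketch, using exactly the 22 values listed and a hypothetical `clusters` helper, makes that explicit. The 60-second grouping threshold is an arbitrary assumption chosen for this illustration.

```python
# The 22 "Neighbor is up for" values listed above, in the order shown.
UPTIMES = """
04:21:44 04:21:44 04:21:49 04:21:49 04:21:49 04:21:44 04:21:49 04:21:47
04:21:51 04:21:52 04:21:48 04:21:47 03:19:38 03:19:38 03:19:38 04:21:44
03:19:45 04:21:48 04:21:46 04:21:43 04:21:44 04:21:49
""".split()

def to_seconds(hms):
    """Convert an hh:mm:ss uptime string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

def clusters(uptimes, max_gap=60):
    """Group uptimes whose sorted neighbours differ by at most max_gap seconds."""
    groups = []
    for value in sorted(to_seconds(u) for u in uptimes):
        if groups and value - groups[-1][-1] <= max_gap:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups

groups = clusters(UPTIMES)
for group in groups:
    print(f"{len(group)} neighbors up for about {min(group) // 60} minutes")
```

Eighteen neighbors share an uptime of about 4 hours 21 minutes and four share about 3 hours 19 minutes. Assuming the output was captured at one moment, that suggests those four adjacencies were re-established about an hour after the others, so it may be worth checking the logs for a second event around that time.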

Sh version output

uptime is 29 weeks, 4 days, 19 hours, 49 minutes
Uptime for this control processor is 29 weeks, 4 days, 19 hours, 50 minutes
System returned to ROM by reload
System restarted at 02:04:34 GMT Fri Nov 8 2019
System image file is "bootflash:asr1001x-universalk9.16.06.07.SPA.bin"
Last reload reason: Reload Command

Regards,

Aravinthan

Giuseppe Larosa · ‎06-02-2020

Hello @mmendis ,

You have posted output from the device with IP address 10.10.10.1. It is an ASR 1001-X.

My understanding is that this is the other device, i.e. the neighbor that appears in the logs in your first post.

>> There are 25 OSPF members on the link and There are 32 BGP members on the link.

Well, this means that this is not a point-to-point link between two devices; instead, you have a VLAN broadcast domain implemented over one or more LAN switches.

I think you should use QoS to protect OSPF and BGP messages on every device connecting to this backbone VLAN.

! Classify OSPF and BGP (TCP port 179 as source or destination)
access-list 111 remark BGP and OSPF
access-list 111 permit ospf any any
access-list 111 permit tcp any any eq bgp
access-list 111 permit tcp any eq bgp any
!
class-map match-all RoutingProtocols
 match access-group 111
!
! Reserve 5 percent of the bandwidth for routing protocol traffic
policy-map QOS-SCHEDULER
 class RoutingProtocols
  bandwidth percent 5
 class class-default
!
interface Port-channel1
 service-policy output QOS-SCHEDULER

Hope to help

Giuseppe

mmendis · ‎06-02-2020

Hello Giuseppe,

Your insight was very helpful, thank you.

We have not deployed QoS for BGP keepalives and OSPF hello packets.

I am not sure I understand what you mean by platform: the device type (e.g. ASR/ISR) or the OS version?

Port-channel 1 carries 25 OSPF members and 32 BGP members. We have not applied any QoS protection for OSPF or BGP.

Please see the sh interface po1 output below.

sh interfaces po1
Port-channel1 is up, line protocol is up
Hardware is GEChannel, address is 683b.7884.22c0 (bia 683b.7884.22c0)
Description: Downlink | 0 Range Stack
Internet address is 10.10.10.1/24
MTU 1500 bytes, BW 4000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 16/255, rxload 14/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
ARP type: ARPA, ARP Timeout 04:00:00
No. of active members in this channel: 4
Member 0 : GigabitEthernet0/0/0 , Full-duplex, 1000Mb/s
Member 1 : GigabitEthernet0/0/1 , Full-duplex, 1000Mb/s
Member 2 : GigabitEthernet0/0/2 , Full-duplex, 1000Mb/s
Member 3 : GigabitEthernet0/0/3 , Full-duplex, 1000Mb/s
No. of PF_JUMBO supported members in this channel : 4
Last input 00:00:00, output never, output hang never
Last clearing of "show interface" counters never
Input queue: 1/1500/0/2 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/160 (size/max)
30 second input rate 244395000 bits/sec, 68535 packets/sec
30 second output rate 251064000 bits/sec, 46507 packets/sec
1712265985567 packets input, 686051887172913 bytes, 0 no buffer
Received 321745649 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
2 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 138357633 multicast, 0 pause input
1190291669430 packets output, 894255286980320 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
80270 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

Show version output

uptime is 29 weeks, 4 days, 19 hours, 49 minutes
Uptime for this control processor is 29 weeks, 4 days, 19 hours, 50 minutes
System returned to ROM by reload
System restarted at 02:04:34 GMT Fri Nov 8 2019
System image file is "bootflash:asr1001x-universalk9.16.06.07.SPA.bin"
Last reload reason: Reload Command

All the OSPF neighbors went down and came back up. Out of 32 BGP peers, only two went down.

sh ip ospf neighbor detail | s Neighbor is up
Neighbor is up for 04:21:44
Neighbor is up for 04:21:44
Neighbor is up for 04:21:49
Neighbor is up for 04:21:49
Neighbor is up for 04:21:49
Neighbor is up for 04:21:44
Neighbor is up for 04:21:49
Neighbor is up for 04:21:47
Neighbor is up for 04:21:51
Neighbor is up for 04:21:52
Neighbor is up for 04:21:48
Neighbor is up for 04:21:47
Neighbor is up for 03:19:38
Neighbor is up for 03:19:38
Neighbor is up for 03:19:38
Neighbor is up for 04:21:44
Neighbor is up for 03:19:45
Neighbor is up for 04:21:48
Neighbor is up for 04:21:46
Neighbor is up for 04:21:43
Neighbor is up for 04:21:44
Neighbor is up for 04:21:49

Best Regards,

Aravinthan

Giuseppe Larosa · ‎06-03-2020

Hello @mmendis ,

>> All the Ospf neighbors went down and came up. Out of 32 BGP peers only two peers went down.

If you are using default values, the OSPF dead interval is 40 seconds and the BGP hold time is 180 seconds.

You should check the STP activity on the backbone VLAN, as there are surely LAN switches between this ASR 1000 and the other 25 OSPF neighbors.

If you are using PVST (not Rapid PVST), an STP recalculation can cause an outage of up to 50 seconds. That is long enough for the OSPF dead interval to expire, while BGP sessions using the default hold time would survive it.
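
The 50-second figure and its effect on each protocol can be checked with a little arithmetic. The Python sketch below assumes the classic defaults (STP max age 20 s and forward delay 15 s, OSPF hello 10 s and dead 40 s on a broadcast network, BGP keepalive 60 s and hold 180 s). The timers actually configured on these devices are not shown in this thread, and the function names are hypothetical.

```python
# Default timers assumed for this illustration (not read from the devices).
STP_MAX_AGE = 20          # seconds before stale BPDU information is aged out
STP_FORWARD_DELAY = 15    # seconds spent in each of listening and learning
OSPF_HELLO, OSPF_DEAD = 10, 40
BGP_KEEPALIVE, BGP_HOLD = 60, 180

def stp_worst_case_outage(max_age=STP_MAX_AGE, forward_delay=STP_FORWARD_DELAY):
    """Max age expiry followed by the listening and learning states."""
    return max_age + 2 * forward_delay

def verdict(outage, hello, dead):
    """Say whether a dead/hold timer expires during a forwarding outage.

    The last hello may have arrived up to `hello` seconds before the outage
    began, so an outage longer than dead - hello can already be enough.
    """
    if outage >= dead:
        return "expires"
    if outage > dead - hello:
        return "may expire"
    return "survives"

outage = stp_worst_case_outage()
print(f"worst-case STP outage: {outage}s")
print("OSPF:", verdict(outage, OSPF_HELLO, OSPF_DEAD))
print("BGP: ", verdict(outage, BGP_KEEPALIVE, BGP_HOLD))
```

With these defaults the OSPF dead timer expires and the BGP hold timer does not. That would fit all OSPF adjacencies dropping while most BGP sessions stayed up; the two BGP peers that did reset may use shorter timers, which is worth verifying.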

Modern design recommendations avoid having a backbone VLAN with so many OSPF routers connected to the same broadcast domain.

You should verify whether the LAN switches in use are multilayer and support OSPF. If so, it could be a good idea to enable OSPF on them and to reduce the number of OSPF routers in each VLAN by using multiple VLANs or even routed ports.

Of course, this would be a complex migration that needs to be prepared and would require several maintenance windows to complete.

In the long term, it can be a wise decision.

Hope to help

Giuseppe