OSPF bouncing ... troubleshooting issue on 6807

rikdrt1 · ‎11-03-2021

Hi ,

i have a pair of 6807's in VSL mode/pair with two routers on the upstream in a typical failover config...

ASR1 goes to SW1 and ASR2 goes to SW2 in the core-pair.

not sure exactly when, but about a month ago we started to see drops during the day. they were first noticed by our cisco VoIP users: each OSPF reconvergence drops their connection to the CM, which is at a remote site across the WAN.

after considerable troubleshooting (first at Layer 1, then looking for loops or other L2 problems) we are back at the same spot.

every 30-200 minutes, at random, i get the same OSPF drop, and the adjacency re-establishes after a few seconds.

as a side note, i do notice some HOSTFLAPPING ... on the standard switchports but not on the WAN uplinks to the OSPF neighbor. it might just be coincidence , but just something i noticed and that too i can't seem to pinpoint. but the OSPF drops are definitely causing issue. Any help would be appreciated of course. Thanks.

Nov 3 21:29:45 PDT: %OSPF-SW1-5-ADJCHG: Process 1, Nbr 10.0.0.65 on GigabitEthernet2/3/1 from FULL to DOWN, Neighbor Down: Dead timer expired
Nov 3 21:29:45 PDT: %OSPF-SW1-5-ADJCHG: Process 1, Nbr 10.1.0.65 on GigabitEthernet1/4/1 from FULL to DOWN, Neighbor Down: Dead timer expired
Nov 3 21:29:45 PDT: %OSPF-SW1-5-ADJCHG: Process 1, Nbr 10.2.7.36 on GigabitEthernet1/3/1 from FULL to DOWN, Neighbor Down: Dead timer expired
Nov 3 21:29:46 PDT: %OSPF-SW1-5-ADJCHG: Process 1, Nbr 10.2.9.65 on GigabitEthernet2/3/1 from LOADING to FULL, Loading Done
Nov 3 21:29:46 PDT: %OSPF-SW1-5-ADJCHG: Process 1, Nbr 10.2.7.36 on GigabitEthernet1/3/1 from LOADING to FULL, Loading Done
Nov 3 21:29:46 PDT: %OSPF-SW1-5-ADJCHG: Process 1, Nbr 10.2.7.65 on GigabitEthernet1/4/1 from LOADING to FULL, Loading Done

Georg Pauwen · ‎11-04-2021

Hello,

on the interfaces that are configured, do you see any packet drops (sh interfaces x) ?

You might want to turn on:

debug ip ospf adj

and post the results here...
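
A minimal sketch of that check, using the switch-side interface name that appears later in this thread (Gi1/4/1). The include filter is just one way to trim the output:

```
! Show only the queue-drop and error counters for one OSPF uplink
show interfaces GigabitEthernet1/4/1 | include drops|errors
! Confirm the adjacency state and remaining dead time per neighbor
show ip ospf neighbor
```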

rikdrt1 · ‎11-04-2021

i enabled to see what happens... thanks.

rikdrt1 · ‎11-04-2021

didnt think of that. here is what i have of the 4 OSPF links.

i guess i should look at that one with thousands of drops.

should i clear the counters of each of the interfaces ? is that just clear counters intx/x ?

i checked the other side of 1/4/1, which is the router .. and it doesnt show drops but instead has lots of unknown protocol drops.

i guess i best clear it all out , since that might be months/yrs old now. but there certainly is a pattern on that one link.

--------------

GigabitEthernet0/0/2 is up, line protocol is up
Hardware is 6XGE-BUILT-IN, address is 00d7.8fa5.8702 (bia 00d7.8fa5.8702)
Description: SW2-G1-4-1
Internet address is 10.10.0.157/30
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive not supported
Full Duplex, 1000Mbps, link type is auto, media type is SX
output flow-control is on, input flow-control is on
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:01, output 00:00:13, output hang never
Last clearing of "show interface" counters never
Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
30 second input rate 2523000 bits/sec, 659 packets/sec
30 second output rate 7794000 bits/sec, 1251 packets/sec
4005640361 packets input, 3466425004245 bytes, 0 no buffer
Received 6 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 11116591 multicast, 0 pause input
5841044750 packets output, 4168885168986 bytes, 0 underruns
0 output errors, 0 collisions, 2 interface resets
297304 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

-------------

Neighbor ID     Pri   State      Dead Time   Address        Interface
10.9.7.36         1   FULL/BDR   00:00:03    10.10.0.153    GigabitEthernet2/4/1
10.3.7.65         1   FULL/BDR   00:00:03    10.10.0.149    GigabitEthernet2/3/1
10.2.7.65         1   FULL/BDR   00:00:03    10.10.0.157    GigabitEthernet1/4/1
10.2.3.36         1   FULL/BDR   00:00:03    10.10.0.145    GigabitEthernet1/3/1

#sh int gi2/4/1 | i Input queue
Input queue: 0/75/0/1 (size/max/drops/flushes); Total output drops: 0
#sh int gi2/3/1 | i Input queue
Input queue: 0/75/0/2 (size/max/drops/flushes); Total output drops: 0
#sh int gi1/4/1 | i Input queue
Input queue: 0/75/1025/39 (size/max/drops/flushes); Total output drops: 0
#sh int gi1/3/1 | i Input queue
Input queue: 0/75/0/3 (size/max/drops/flushes); Total output drops: 0

Georg Pauwen · ‎11-04-2021

Hello,

the 'unknown protocol' drops are typically caused by stuff like DTP. How are your trunks set up ?

Actually, if you can post the full configs of all 4 devices, that would make troubleshooting easier...

paul driver · ‎11-04-2021

Hello

I wouldn’t read too much into those unknown protocol drops; they are most probably DTP, which can be disabled with switchport nonegotiate. However, the first thing to do is clear the interface counters: they have never been cleared, so you may be looking at historical information.

Post the output of debug ip ospf adj, and also confirm how you are peering with those routers.
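
A rough sketch of the counter reset, assuming the link in question is Gi1/4/1 on the switch and Gi0/0/2 on the router, as shown in the outputs posted in this thread:

```
! On the switch (privileged EXEC); IOS asks for confirmation
clear counters GigabitEthernet1/4/1
! On the router side of the same link
clear counters GigabitEthernet0/0/2
! After the next OSPF flap, compare what has incremented
show interfaces GigabitEthernet1/4/1 | include Last clearing|Input queue
show interfaces GigabitEthernet0/0/2 | include Last clearing|unknown protocol
```

With freshly cleared counters, any drops that appear can be lined up against the timestamps of the ADJCHG messages.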

Please rate and mark as an accepted solution if you have found any of the information provided useful.
This then could assist others on these forums to find a valuable answer and broadens the community’s global network.

Kind Regards
Paul

rikdrt1 · ‎11-04-2021

not sure why i am not getting any output from the debug OSPF ... ?

sh debug ospf ?

paul driver · ‎11-05-2021

Hello

@rikdrt1 wrote:

not sure why i am not getting any output from the debug OSPF ... ?

sh debug ospf ?

You may not have logging enabled for either the console or the monitor. Try the following:

terminal monitor < remotely connected to device

conf t
logging console < physically connected to device
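
Put together, the sequence might look like the sketch below. The buffer size is an arbitrary example value, and buffering the output locally is an alternative added here for illustration, not something suggested in the thread:

```
! Remote (SSH/Telnet) session: copy log and debug messages to this terminal
terminal monitor
! Alternatively, keep them in the local buffer and read them afterwards
configure terminal
 logging buffered 64000 debugging
 end
! Enable the adjacency debug, then confirm it is active
debug ip ospf adj
show debugging
! Review buffered output, and switch debugging off when finished
show logging
undebug all
```

The adjacency debug typically prints only when an adjacency changes state, so seeing nothing between flaps is not necessarily a sign that it is misconfigured.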

Kind Regards
Paul

paul driver · ‎11-04-2021

Hello

@rikdrt1 wrote:

ASR1 goes to SW1 and ASR2 goes to SW2 in the core-pair.

If those routers are in a vPC pairing, that could be the issue. You should not run vPC at L3 towards routers; vPC is an L2 feature for loop prevention. Connections to routers running an IGP should be via individual links, which then gives you correct ECMP load balancing.

Kind Regards
Paul

balaji.bandi · ‎11-04-2021

At this point it is hard to say. You need to look at all the device logs. Is OSPF going down on all links at the same time? As per the logs, all adjacencies go down and come back up at the same time (since you confirmed there is no physical issue).

Is the VSS configured for NSF with OSPF? Please refer to the document below:

https://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/Campus/VSS30dg/campusVSS_DG/VSS-dg_appa-configs.html

How is your OSPF configured: on Layer 3 physical interfaces or Layer 3 SVIs?

If this is happening after some time, I am sure there is some topology change causing the issue. (Set up a syslog server to capture the logs and identify the issue; before OSPF goes down, you may see some other event that you can correlate with it.)
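
A minimal sketch of that logging setup. The server address 192.0.2.10 is a placeholder, and the detailed adjacency logging line is an optional extra assumed here to be useful, not something taken from the thread:

```
configure terminal
 ! Millisecond timestamps make it easier to order events across devices
 service timestamps log datetime msec localtime show-timezone
 ! 192.0.2.10 is a placeholder; substitute your syslog server
 logging host 192.0.2.10
 logging trap informational
 ! Optional: log every adjacency state transition, not just up/down
 router ospf 1
  log-adjacency-changes detail
 end
```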

BB

***** Rate All Helpful Responses *****

How to Ask The Cisco Community for Help

rikdrt1 · ‎11-04-2021

it is a standard L3 physical interface.. .

heres a representation... what is weird is this was working for about 5yrs.. but lately something happened so i have been chasing ghosts it seems .. i definitely need to learn to use debug better..

rikdrt1 · ‎11-04-2021

thanks for that info. we didnt have any topology change recently, and we do have tight control over what changes and where.

first thing i thought of is maybe someone plugged in an unauthorized switch somewhere but we have been going thru all the access switches to eliminate that also. very strange.

balaji.bandi · ‎11-05-2021

From one of your posts:

#sh int gi1/4/1 | i Input queue
Input queue: 0/75/1025/39 (size/max/drops/flushes); Total output drops: 0

I would suggest making the OSPF interface point-to-point here, if you are not peering in a mesh.

Can you post the running config for OSPF and the interface config from both sides?
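
As a sketch of what the point-to-point change could look like, assuming the Gi1/4/1 (switch) and Gi0/0/2 (router) pair that share the 10.10.0.156/30 link:

```
! Switch side
interface GigabitEthernet1/4/1
 ip ospf network point-to-point
!
! Router side of the same /30
interface GigabitEthernet0/0/2
 ip ospf network point-to-point
```

The network type should be set the same on both ends, and changing it resets the adjacency, so it is best done in a maintenance window. Afterwards the neighbor state shows FULL/ - because no DR/BDR is elected on a point-to-point link.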

BB

rikdrt1 · ‎11-05-2021

SW side..

router ospf 1
 router-id 10.2.15.4
 auto-cost reference-bandwidth 100000
 redistribute static metric-type 1 subnets
 passive-interface default
 no passive-interface GigabitEthernet1/3/1
 no passive-interface GigabitEthernet1/4/1
 no passive-interface GigabitEthernet2/3/1
 no passive-interface GigabitEthernet2/4/1
 network 10.0.0.0 0.255.255.255 area 0

SW Gi1/4/1

interface GigabitEthernet1/4/1
 description ASR2 G2
 no switchport
 ip address 10.10.0.158 255.255.255.252
 ip ospf hello-interval 1
 load-interval 30
 service-policy input LAN-CoS-Ingress
end

Router OSPF

router ospf 1
 router-id 10.2.79.6
 auto-cost reference-bandwidth 100000
 redistribute static metric-type 1 subnets
 passive-interface default
 no passive-interface GigabitEthernet0/0/0
 no passive-interface GigabitEthernet0/0/1
 no passive-interface GigabitEthernet0/0/2
 no passive-interface GigabitEthernet0/0/3

Router-INT

interface GigabitEthernet0/0/2
 description SW2-G1-4-1
 bandwidth 1000000
 ip address 10.10.0.157 255.255.255.252
 ip ospf hello-interval 1
 load-interval 30
 negotiation auto
 cdp enable

ALL THE OTHER 3 interfaces are basically the same setup .. different IP. super simple and like i said i set this up about 6yrs ago and all was working fine until recently and i really can't tell what changed .. which is why this is so confusing . setup is relatively simple and i have this same thing in other buildings .. same 4 uplinks to dual cores'....
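
One detail worth checking in the configs above: on IOS, when only the hello interval is set, the dead interval defaults to four times that value, so hello-interval 1 implies a 4-second dead timer. That is consistent with the 00:00:03 dead times in the neighbor table and the "Dead timer expired" messages posted earlier. A sketch of how this could be verified, along with a look at the ingress policy on the same port. The dead-interval value shown is purely illustrative and is not a recommendation made anywhere in this thread:

```
! Confirm the hello/dead timers actually in effect on the uplink
show ip ospf interface GigabitEthernet1/4/1 | include Timer intervals
! Check whether the ingress QoS policy is dropping or policing traffic
show policy-map interface GigabitEthernet1/4/1 input
! Optional and illustrative only: set the dead interval explicitly
! (hello and dead timers must match on both ends of the link)
configure terminal
 interface GigabitEthernet1/4/1
  ip ospf dead-interval 10
 end
```

With a dead timer this short, only a few consecutive hellos need to be lost (for example to input queue drops) before the neighbor is declared down.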