ASR 9006 | Identifying an issue

blitzperf · 03-25-2024

Hello everyone,

We are seeking assistance in identifying an issue.

We have two ASR 9006 routers located at two different sites. Both routers are running OSPF between them, utilizing multiple sub-interfaces, with each sub-interface assigned to a different VLAN. Each VLAN is routed through a separate private line.

Diagram for reference:

Interface Table for reference:

R1 Bundle1 (2x100GE)	R2 Bundle1 (1x100GE)	Private Line
BE1.100 100.64.0.1/30	BE1.100 100.64.0.2/30	PL1
BE1.200 100.64.0.5/30	BE1.200 100.64.0.6/30	PL2
BE1.300 100.64.0.9/30	BE1.300 100.64.0.10/30	PL3
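
Since a configuration issue is one of the possibilities raised in this thread, the addressing in the table can be sanity-checked quickly. The following is only a minimal sketch using Python's standard `ipaddress` module: the `LINKS` dictionary and the `check_link` helper are hypothetical names, and the addresses are copied from the table above. It only confirms that both ends of each sub-interface sit in the same /30 on distinct usable host addresses; it says nothing about VLAN tags, MTU, or OSPF settings.

```python
import ipaddress

# Addresses copied from the interface table above: (R1 side, R2 side).
LINKS = {
    "BE1.100": ("100.64.0.1/30", "100.64.0.2/30"),
    "BE1.200": ("100.64.0.5/30", "100.64.0.6/30"),
    "BE1.300": ("100.64.0.9/30", "100.64.0.10/30"),
}


def check_link(r1_addr: str, r2_addr: str) -> bool:
    """Return True if both ends share a subnet and use distinct usable hosts."""
    a = ipaddress.ip_interface(r1_addr)
    b = ipaddress.ip_interface(r2_addr)
    usable = set(a.network.hosts())  # excludes network and broadcast addresses
    return (
        a.network == b.network
        and a.ip != b.ip
        and a.ip in usable
        and b.ip in usable
    )


if __name__ == "__main__":
    for name, (r1_addr, r2_addr) in LINKS.items():
        status = "ok" if check_link(r1_addr, r2_addr) else "MISMATCH"
        print(f"{name}: {r1_addr} <-> {r2_addr} {status}")
```

All three pairs in the table pass this check, so the point-to-point addressing itself looks consistent.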

The problem:

OSPF is reported as down, and none of the point-to-point IPs on R2's interfaces are pingable. We checked both switches for logs related to link flaps on the ports facing the private lines and the routers, but found no such logs or any other abnormalities. We then suspected an issue with the third-party private line provider. To test the private lines, we used SVIs that were already configured on both switches (VLAN 5, which passes through PL2). Every time OSPF goes down and the point-to-point IPs at R2 stop responding, we can still ping and SSH into the switch at Site B. We therefore think the private lines are unlikely to be the cause, although we cannot rule them out completely. There were also no link-flap logs on the switch at Site B.

The next suspect is hardware, possibly HGE0/0/0/0 on R2. R2 has two 100GE ports, and BE1 is on HGE0/0/0/0. To test whether HGE0/0/0/0 is at fault, we configured BE2.5 on HGE0/0/0/1. The matching configuration was applied on R1, but on a 10GE port. Neither sub-interface had any OSPF configuration, only point-to-point IPs. However, OSPF went down again, and the newly configured sub-interfaces were not pingable either. In the router logs we found only entries showing the neighbors going from FULL to DOWN:

RP/0/RSP0/CPU0:Mar 22 08:26:20.771 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.100 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000
RP/0/RSP0/CPU0:Mar 22 08:26:20.885 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.200 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000
RP/0/RSP0/CPU0:Mar 22 08:26:20.852 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.300 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000

We have not found any earlier logs that we can correlate with these events.
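
One detail that may help when searching for earlier logs: "dead timer expired" means the adjacency was declared down one dead interval after the last hello was received, so the real loss of connectivity started before the logged timestamp. The sketch below parses the three log lines above, shows how close together the adjacencies dropped, and estimates when the last hello arrived. The names `parse_adjchg`, `ASSUMED_YEAR`, and `ASSUMED_DEAD_INTERVAL` are hypothetical. The 40-second dead interval is an assumption (the common default with a 10-second hello) because the thread does not state the configured timers, and the year is taken from the post date.

```python
import re
from datetime import datetime, timedelta

# The three syslog lines quoted above, unchanged.
LOGS = """\
RP/0/RSP0/CPU0:Mar 22 08:26:20.771 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.100 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000
RP/0/RSP0/CPU0:Mar 22 08:26:20.885 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.200 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000
RP/0/RSP0/CPU0:Mar 22 08:26:20.852 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.300 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000
"""

# Assumptions, not taken from the routers: adjust to the real configuration.
ASSUMED_YEAR = 2024                              # syslog lines carry no year
ASSUMED_DEAD_INTERVAL = timedelta(seconds=40)    # common default, not confirmed

ADJCHG = re.compile(
    r":(\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d+) UTC: .*ADJCHG.* on (\S+) in area "
    r"\S+ from (\w+) to (\w+), Neighbor Down: ([^,]+)"
)


def parse_adjchg(text):
    """Return sorted (timestamp, interface, reason) tuples for neighbor-down lines."""
    events = []
    for line in text.splitlines():
        m = ADJCHG.search(line)
        if m:
            when = datetime.strptime(
                f"{ASSUMED_YEAR} {m.group(1)}", "%Y %b %d %H:%M:%S.%f"
            )
            events.append((when, m.group(2), m.group(5).strip()))
    return sorted(events)


events = parse_adjchg(LOGS)
spread = events[-1][0] - events[0][0]                 # first drop to last drop
last_hello = events[0][0] - ASSUMED_DEAD_INTERVAL     # estimated last hello seen

if __name__ == "__main__":
    for when, interface, reason in events:
        print(when.time(), interface, reason)
    print("spread between adjacencies:", spread)
    print("estimated last hello received:", last_hello.time())
```

For the three lines above, the adjacencies drop within 114 ms of each other, and under the assumed 40-second dead interval the last hello would have arrived at about 08:25:40.771. If the timers are at their defaults, logs and counters from around that time are more likely to show the trigger than those at 08:26:20.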

This issue occurred once before but had not happened again until recently, and it now occurs frequently. The outages happen at random times.

We rarely make changes to the configurations on these routers.

We also checked for CPU and memory spikes, but the readings taken after the event were low and normal. We did not find any related logs either.

Currently, these are our assumptions regarding why this is happening:

It could be a hardware issue with the router.
It could be an issue with the private line.
It could be a configuration issue.

We are currently at a standstill. Do you have any ideas on what to check next? Are there any important commands we should use to diagnose the issue? Any suggestions would be greatly appreciated.

Best regards,

Blitz

paul driver · 03-25-2024

Hello
If you cannot ping the directly connected p2p peer address, then it's a reachability issue, possibly at a lower layer.

How is your peering set up: do you have a single OSPF adjacency, or does each sub-interface have one?
Do you have L1/2 connectivity when this fails (are the physical interfaces up/up)?
When this happened last time, what did you do to rectify the issue?
When the OSPF adjacency fails, what is the exact OSPF state of the neighbour adjacency (or adjacencies)?
Have you tried a debug to capture the teardown?


Kind Regards
Paul

blitzperf · 03-25-2024

Hi @paul driver,

Thank you for your reply.

How is your peering set up: do you have a single OSPF adjacency, or does each sub-interface have one?

Each sub-interface has its own OSPF adjacency. They all go down at the same time.

Do you have L1/2 connectivity when this fails (are the physical interfaces up/up)?

Yes, there are no issues between the routers and the switches. Interfaces are up/up, with no link flaps. Router-to-router pings fail, while switch-to-switch pings succeed.

When this happened last time, what did you do to rectify the issue?

We have neither rectified nor identified the issue yet; we only have speculation. OSPF stays down for almost a minute and then comes back up on its own. On that day, the times at which it happened were completely random.

When the OSPF adjacency fails, what is the exact OSPF state of the neighbour adjacency (or adjacencies)?

According to the logs, the OSPF adjacency state goes from FULL to DOWN.

Have you tried a debug to capture the teardown?

We haven't looked into capturing debugs yet. We are waiting for the issue to happen again so that we can capture debugs from the router. The last event was four days ago.

Best regards,

Blitz

MHM Cisco World · 03-25-2024

This needs an EEM script.

The event will be the OSPF neighbor going down.

Action: show ip interface brief

Action: send a syslog message or email

It could be some drops on the ISP link, or flapping.
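
The event and action logic described above can be illustrated off-box. This is not EEM syntax and not a tested policy for the ASR 9000, only a rough Python sketch of the same idea: match the neighbor-down syslog message and decide which commands to collect. The `TRIGGER` pattern and the `actions_for` helper are hypothetical names, and `show interfaces <name>` is an extra command added here for illustration alongside the `show ip interface brief` action suggested above. Actually running the commands and sending the syslog or email is left out.

```python
import re

# Fires only on an OSPF adjacency change whose new state is DOWN.
TRIGGER = re.compile(
    r"%ROUTING-OSPF-5-ADJCHG .* on (\S+) in area \S+ from \w+ to DOWN"
)


def actions_for(syslog_line: str) -> list[str]:
    """Return the show commands to collect for one syslog line (empty if no match)."""
    m = TRIGGER.search(syslog_line)
    if m is None:
        return []
    interface = m.group(1)
    return [
        "show ip interface brief",          # action suggested in this reply
        f"show interfaces {interface}",     # illustrative addition, an assumption
    ]


if __name__ == "__main__":
    sample = (
        "RP/0/RSP0/CPU0:Mar 22 08:26:20.771 UTC: ospf[1029]: "
        "%ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on "
        "Bundle-Ether1.100 in area 0 from FULL to DOWN, Neighbor Down: "
        "dead timer expired, vrf default vrfid 0x60000000"
    )
    for command in actions_for(sample):
        print(command)
```

Collecting the output at the moment the adjacency drops matters because the outage lasts under a minute and nothing abnormal is visible afterwards.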

MHM

blitzperf · 03-25-2024

Hi @MHM Cisco World

Thank you for your reply.

To be honest, I hadn't heard of EEM until now.

I did look into it, and we may be able to try it if the issue happens again.

Best Regards,

Blitz