03-25-2024 01:44 AM - edited 03-25-2024 01:44 AM
Hello everyone,
We are seeking assistance in identifying an issue.
We have two ASR 9006 routers located at two different sites. Both routers are running OSPF between them, utilizing multiple sub-interfaces, with each sub-interface assigned to a different VLAN. Each VLAN is routed through a separate private line.
Diagram for reference:
Interface Table for reference:
R1 Bundle1 (2x100GE)  | R2 Bundle1 (1x100GE)   | Private Line
BE1.100 100.64.0.1/30 | BE1.100 100.64.0.2/30  | PL1
BE1.200 100.64.0.5/30 | BE1.200 100.64.0.6/30  | PL2
BE1.300 100.64.0.9/30 | BE1.300 100.64.0.10/30 | PL3
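As a quick sanity check on the addressing in the table, the sketch below (Python, run off-box) confirms that each R1/R2 pair sits in the same /30 and that the three links use distinct subnets. The addresses are copied from the table; `LINKS` and `check_link` are illustrative names, not anything from the routers.

```python
import ipaddress

# Addresses copied from the interface table (R1 side, R2 side).
LINKS = {
    "BE1.100": ("100.64.0.1/30", "100.64.0.2/30"),
    "BE1.200": ("100.64.0.5/30", "100.64.0.6/30"),
    "BE1.300": ("100.64.0.9/30", "100.64.0.10/30"),
}


def check_link(r1_addr, r2_addr):
    """Return True if both ends are distinct usable hosts of the same subnet."""
    a = ipaddress.ip_interface(r1_addr)
    b = ipaddress.ip_interface(r2_addr)
    hosts = set(a.network.hosts())
    return (
        a.network == b.network
        and a.ip != b.ip
        and a.ip in hosts
        and b.ip in hosts
    )


def distinct_subnets(links):
    """Return True if every link uses its own subnet."""
    nets = [ipaddress.ip_interface(r1).network for r1, _ in links.values()]
    return len(set(nets)) == len(nets)


if __name__ == "__main__":
    for name, (r1, r2) in LINKS.items():
        print(name, "ok" if check_link(r1, r2) else "MISMATCH")
    print("distinct subnets:", distinct_subnets(LINKS))
```

A mismatch here would point to a configuration error rather than a transport problem. The table as posted passes both checks.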
The problem:
OSPF is reported as down, and none of the point-to-point IPs on R2's interfaces are pingable. We checked both switches for logs related to link flaps on the ports facing the private lines and the routers, but found no logs or abnormalities of any kind. We suspected an issue with the third-party private line provider. To test the private lines, we used the SVIs that both switches fortunately already had configured (VLAN 5, which passes through PL2). Every time OSPF goes down and the point-to-point IPs at R2 are not pingable, we can still ping and SSH into the switch at Site B. It therefore seems reasonable to assume that the private lines are not at fault (though that remains a possibility). Additionally, there were no logs relating to link flaps on the switch at Site B.
The next suspect is hardware. It might involve HGE0/0/0/0 on R2; R2 has two 100GE ports, and BE1 is on HGE0/0/0/0. To test whether HGE0/0/0/0 is at fault, we configured BE2.5 on HGE0/0/0/1. The matching configuration was applied on R1, but on a 10GE port. Neither sub-interface had any OSPF configuration, only point-to-point IPs. However, OSPF went down again, and the newly configured sub-interfaces were not pingable either. Upon checking the logs on the routers, we only found entries showing the neighbors going from FULL to DOWN:
RP/0/RSP0/CPU0:Mar 22 08:26:20.771 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.100 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000
RP/0/RSP0/CPU0:Mar 22 08:26:20.885 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.200 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000
RP/0/RSP0/CPU0:Mar 22 08:26:20.852 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.300 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000
We have not found any logs prior to these events that we can correlate with.
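The three ADJCHG entries above can be lined up programmatically. The sketch below (Python, run off-box) parses them, measures how far apart the drops are, and estimates when the last hello arrived. It assumes the year is 2024 (the syslog lines carry none) and a 40-second dead interval, which is the usual default on point-to-point and broadcast interfaces but is not confirmed anywhere in this thread. `parse_adjchg` and `summarize` are illustrative names.

```python
import re
from datetime import datetime, timedelta

# Log lines copied verbatim from the post.
LOGS = """\
RP/0/RSP0/CPU0:Mar 22 08:26:20.771 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.100 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000
RP/0/RSP0/CPU0:Mar 22 08:26:20.885 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.200 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000
RP/0/RSP0/CPU0:Mar 22 08:26:20.852 UTC: ospf[1029]: %ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on Bundle-Ether1.300 in area 0 from FULL to DOWN, Neighbor Down: dead timer expired, vrf default vrfid 0x60000000
"""

ADJCHG = re.compile(
    r"(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) UTC: "
    r".*%ROUTING-OSPF-5-ADJCHG : .* on (?P<intf>\S+) in area \S+ "
    r"from (?P<old>\w+) to (?P<new>\w+), (?P<reason>[^,]+)"
)


def parse_adjchg(text, year=2024):
    """Extract adjacency-change events; the year is an assumption."""
    events = []
    for line in text.splitlines():
        m = ADJCHG.search(line)
        if not m:
            continue
        when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
        events.append({
            "when": when,
            "intf": m["intf"],
            "old": m["old"],
            "new": m["new"],
            "reason": m["reason"],
        })
    return events


def summarize(events, dead_interval=timedelta(seconds=40)):
    """Summarize DOWN events; dead_interval is an assumed default."""
    downs = [e for e in events if e["new"] == "DOWN"]
    if not downs:
        return None
    times = sorted(e["when"] for e in downs)
    return {
        "interfaces": sorted(e["intf"] for e in downs),
        "spread_s": (times[-1] - times[0]).total_seconds(),
        # A dead-timer expiry means no hello was accepted for dead_interval.
        "last_hello_estimate": times[0] - dead_interval,
    }


if __name__ == "__main__":
    print(summarize(parse_adjchg(LOGS)))
```

For these entries the spread is about 0.114 s, and under the 40-second assumption the last hello would have arrived around 08:25:40.771. In other words, whatever stopped the hellos affected all three sub-interfaces at nearly the same moment, well before the log entries appeared, so that earlier window is the one to search in switch and provider logs.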
This issue occurred once before but had not happened again until recently, and it now appears to be recurring. The downtimes occur at random.
We rarely make changes to the configurations on these routers.
We also checked for CPU and memory spikes, but all returned minimal and normal readings after the event. We did not find any logs either.
Currently, these are our assumptions regarding why this is happening:
We are currently at a standstill. Do you have any ideas on what to check next? Are there any important commands we should use to diagnose the issue? Any suggestions would be greatly appreciated.
Best regards,
Blitz
03-25-2024 02:25 AM
Hello
If you cannot ping the directly connected p2p peer address, then it's a reachability issue, possibly at a lower layer.
03-25-2024 07:08 AM
Hi @paul driver,
Thank you for your reply.
- How is your peering set up: do you have a single OSPF adjacency, or does each sub-interface have one?
Each sub-interface has its own OSPF adjacency. They all go down at the same time.
- Do you have L1/2 connectivity when this fails (are the physical interfaces up/up)?
Yes, there are no issues between router and switch. Interfaces are up/up, with no link flaps. Router-to-router pings fail, while switch-to-switch pings succeed.
- When this happened last time, what did you do to rectify the issue?
We have neither rectified nor identified the issue yet; we only have speculations. OSPF stays down for almost a minute and then comes back up. On that day, the times of the outages were completely random.
- When the OSPF adjacency fails, what is the exact OSPF state of the neighbour adjacency(s)?
The adjacency goes from FULL to DOWN, according to the logs.
- Have you tried a debug to capture the teardown?
We haven't looked into capturing debugs yet. We're waiting for it to happen again so that we can capture debugs from the router. The last event was four days ago.
Best regards,
Blitz
03-25-2024 08:18 AM
This needs an EEM script:
Event: OSPF neighbor down
Action: show ip interface brief
Action: send syslog or email
It could be drops or flapping on the ISP link.
MHM
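The event/action idea above can be illustrated off-box. The sketch below is Python, not EEM syntax: it only shows the logic of "when a neighbor-down message appears, snapshot some show commands". The `run` callback is hypothetical and stands in for whatever actually executes a command on the router (an SSH session, for example). The command list is a suggestion using the IOS XR forms of the commands.

```python
import re

# Matches the neighbor-down syslog pattern seen earlier in this thread.
NBR_DOWN = re.compile(r"%ROUTING-OSPF-5-ADJCHG : .* from FULL to DOWN")

# Show commands to snapshot at the moment of failure (suggested set).
COMMANDS = [
    "show ipv4 interface brief",
    "show ospf neighbor",
    "show bundle",
]

# One of the log lines from the original post, used as a sample event.
SAMPLE = (
    "RP/0/RSP0/CPU0:Mar 22 08:26:20.771 UTC: ospf[1029]: "
    "%ROUTING-OSPF-5-ADJCHG : Process 100, Nbr <R1_Loopback> on "
    "Bundle-Ether1.100 in area 0 from FULL to DOWN, Neighbor Down: "
    "dead timer expired, vrf default vrfid 0x60000000"
)


def on_log_line(line, run):
    """If line is an OSPF neighbor-down event, collect command output.

    run is a caller-supplied function taking a command string and
    returning its output; it is a placeholder, not a real API.
    Returns a dict of command -> output, or None for unrelated lines.
    """
    if not NBR_DOWN.search(line):
        return None
    return {cmd: run(cmd) for cmd in COMMANDS}


if __name__ == "__main__":
    # Dummy runner that just echoes the command it was given.
    snapshot = on_log_line(SAMPLE, lambda cmd: f"<output of {cmd}>")
    for cmd, out in snapshot.items():
        print(cmd, "->", out)
```

Because the outage lasts under a minute, collecting the output automatically at the moment of the event matters more than the exact commands chosen.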
03-25-2024 11:49 PM
Thank you for your reply.
To be honest, I hadn't heard of EEM until now.
I did look into it, and we may be able to try this if it happens again.
Best Regards,
Blitz