03-23-2023 07:02 AM
I have two data centres. One of them connects upstream and receives the full internet routing table, which is then forwarded via iBGP to the other data centre. (Don't worry about whether this is good practice or not; it is a configuration I inherited and can do nothing about for now.)
With no configuration changes and no network changes, the hold timers are suddenly expiring and this connection is constantly going up and down because of the peer resets.
Weirdly, we are able to ping devices connected to this DC but cannot pass any other traffic. Here is what I am now seeing in the error log and was hoping someone could point me in the right direction:
Mar 23 12:31:46.216: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated
Mar 23 12:32:42.903: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:33:03.944: %LDP-SW1-5-SP: 192.168.1.1 :0: session recovery succeeded
Mar 23 12:33:28.085: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up
Mar 23 12:39:44.601: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated
Mar 23 12:40:31.532: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:40:55.721: %LDP-SW1-5-SP: 192.168.1.1:0: session recovery succeeded
Mar 23 12:41:28.127: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up
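To put numbers on the flapping, the MSDP up/down lines above can be paired to show how long each outage lasted. This is only a minimal Python sketch: the `outages` helper is a made-up name, and the year 2023 is an assumption taken from the post date, since the log lines themselves carry no year.

```python
import re
from datetime import datetime

# The eight log lines exactly as posted above.
LOG = """\
Mar 23 12:31:46.216: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated
Mar 23 12:32:42.903: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:33:03.944: %LDP-SW1-5-SP: 192.168.1.1 :0: session recovery succeeded
Mar 23 12:33:28.085: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up
Mar 23 12:39:44.601: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated
Mar 23 12:40:31.532: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:40:55.721: %LDP-SW1-5-SP: 192.168.1.1:0: session recovery succeeded
Mar 23 12:41:28.127: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up
"""

# Matches only the MSDP PEER_UPDOWN lines; the LDP lines are ignored.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}\.\d+): "
    r"%MSDP-\S+-PEER_UPDOWN: Session to peer (\S+) going (up|down)"
)


def outages(log_text, year=2023):
    """Pair each 'going down' with the next 'going up' for the same peer.

    Returns a list of (peer, seconds_down) tuples in log order.
    The year is an assumption because the syslog stamps omit it.
    """
    went_down = {}
    result = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, peer, state = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
        if state == "down":
            went_down[peer] = when
        elif peer in went_down:
            result.append((peer, (when - went_down.pop(peer)).total_seconds()))
    return result


if __name__ == "__main__":
    for peer, seconds in outages(LOG):
        print(f"{peer} was down for {seconds:.1f} s")
```

On the lines posted, this reports two outages of roughly 45 s and 57 s, starting about eight minutes apart.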
I have confirmed the X-Connect is good and have also replaced the SFPs. I am planning on changing out the core switches, as it could be a hardware or IOS issue, but I am hoping I do not have to.
Thanks
03-24-2023 12:01 PM - edited 03-24-2023 12:02 PM
Hello
@CliveG wrote:
Mar 23 12:33:03.944: %LDP-SW1-5-SP: 192.168.1.1 :0: session recovery succeeded
Mar 23 12:33:28.085: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up
Mar 23 12:39:44.601: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated
Mar 23 12:40:31.532: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:40:55.721: %LDP-SW1-5-SP: 192.168.1.1:0: session recovery succeeded
Local host: 192.168.1.1, Local port: 646
Foreign host: 192.168.1.2, Foreign port: 41632
show ip bgp ipv4 unicast neighbors 192.168.1.2 | sec Last reset
Last reset 11:52:26, due to BGP Notification received, no supported AFI/SAFI
The above indicates that the MPLS LDP hold timers are expiring, so the LDP neighbour peering between your router and router 192.168.1.2 (the ISP router?) is being torn down.
Can you post the output of the following, please:
access-list 110 permit udp host 192.168.1.1 eq 646 any
debug ip packet detail 110
sh mpls ldp neighbor
sh mpls ldp discovery
03-25-2023 09:44 AM
show mpls ldp neighbor - This shows the peer address within the table on the correct port-channel (I cannot place the output here as there are far too many addresses of which the majority are public ranges).
show mpls ldp discovery:
C6880-VSS-THE#show mpls ldp discovery
 Local LDP Identifier:
    192.168.1.1:0
    Discovery Sources:
    Interfaces:
        Port-channel511 (ldp): xmit/recv
            LDP Id: 192.168.1.4:0
        Port-channel512 (ldp): xmit/recv
            LDP Id: 192.168.1.3:0
    Targeted Hellos:
        192.168.1.1 -> 192.168.1.3 (ldp): active/passive, xmit/recv
            LDP Id: 192.168.1.3:0
        192.168.1.1 -> 192.168.1.4 (ldp): active/passive, xmit/recv
            LDP Id: 192.168.1.4:0
03-23-2023 06:20 PM
Another possible cause: is there any chance the link between the peers has occasional congestion, such that BGP keepalives are lost? ("No network changes" often means there have been no network configuration changes, but traffic volume changes, such as some recently added data replication between DCs, might go unnoticed.)
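To illustrate how lost messages turn into a reset: BGP tears the session down when no KEEPALIVE or UPDATE arrives within the hold time. A rough Python sketch follows. The function name is made up for illustration, and the 180-second hold time (with 60-second keepalives) is the usual Cisco default, assumed here because the actual timers were not posted.

```python
def first_hold_expiry(arrivals, hold_time=180.0, observe_until=None):
    """Return the time (seconds since session start) at which the hold
    timer first expires, or None if it never expires.

    arrivals      -- times at which a KEEPALIVE or UPDATE was received
    hold_time     -- negotiated hold time; 180 s is an assumed default
    observe_until -- optional end of the observation window
    """
    deadline = hold_time  # the timer starts when the session comes up
    for t in sorted(arrivals):
        if t > deadline:
            return deadline  # nothing arrived in time: session resets
        deadline = t + hold_time  # every received message restarts the timer
    if observe_until is not None and observe_until > deadline:
        return deadline
    return None


if __name__ == "__main__":
    healthy = [60, 120, 180, 240, 300]
    congested = [60, 120, 180, 420, 480]  # keepalives at 240-360 s lost
    print(first_hold_expiry(healthy))
    print(first_hold_expiry(congested))
```

With these defaults a single lost keepalive is harmless; it takes three in a row to expire the timer, which is why short bursts of congestion can go unnoticed until they get worse.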
03-24-2023 12:11 AM
Hi Joeseph,
Checked congestion and all seems okay. There is nothing on the site with the issue and not a great deal on the other site. In fact, the 2nd site has a couple more BGP peers that are connecting fine with no issues.
Thanks
03-24-2023 12:35 AM
Hello,
I have not followed the entire thread, and I hope I am not adding anything redundant, but my best guess is that, with no changes to your configuration, chances are that something on the ISP side has changed. I know that Telehouse is a colocation provider... how are they involved here? Have you had any contact with them?
03-24-2023 02:38 AM - edited 03-24-2023 03:05 AM
03-24-2023 07:29 AM - edited 03-24-2023 09:19 AM
Can you check whether DC-1 and DC-2 are directly connected or whether the traffic passes through DC-3?
Check the IGP route to the BGP neighbor.
Also check with traceroute.
03-24-2023 08:30 AM
Hi MHM
I inherited this network and, as you can see from the diagram, there is only the one upstream connection at DC-2 (North). As we are a streaming company, all of our data goes through this connection, so I have to be extremely careful what I do, from a troubleshooting perspective, on this VSS.
The routes, with the way DC-3 is configured, all go to DC-2 first; even if we want to go to DC-1, the route is always through DC-2 (except for the iBGP peering). Effectively, we only really utilise DC-1 for load-balancing and any other systems that are there (not many, mainly backups for streaming). However, the load-balancing is now also starting to be an issue.
Strangely, as mentioned previously, we can SSH to the load balancer at DC-1 but we do not receive the SSH keys back and we note that other traffic is also not being passed over this VSS at DC-1. The only reason ping and traceroute work correctly is because of the MTU of 707.
I have e-mailed our third-party upstream provider for the DC-3 to DC-2 and DC-3 to DC-1 links, as they had to initiate a re-connection about 3 weeks ago, and this is the only thing that changed between the LAN functioning correctly and the LAN failing.
DC-1 is failing on all iBGP connectivity: to DC-2, to LoNAP and to DC-3.
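If only packets up to about 707 bytes get through, that would fit the symptoms: ping and traceroute use small packets, while an SSH key exchange typically needs larger ones. One way to pin down the exact limit is a binary search over packet sizes using don't-fragment pings. The sketch below is Python with a hypothetical `probe` callback standing in for the real ping. The simulated 707-byte limit comes from the post above; the 64 to 1500 search range is an assumption.

```python
def largest_passing_size(probe, low=64, high=1500):
    """Binary-search the largest packet size for which probe(size) is True.

    probe -- hypothetical callback: returns True if a don't-fragment
             packet of that size gets a reply, False otherwise
    Assumes that if a size passes, every smaller size passes too.
    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid       # mid passes: the limit is mid or higher
        else:
            high = mid - 1  # mid fails: the limit is below mid
    return low


if __name__ == "__main__":
    # Simulated path that silently drops anything larger than 707 bytes.
    def simulated_probe(size):
        return size <= 707

    print(largest_passing_size(simulated_probe))
```

Against a real device the callback would wrap a ping with the don't-fragment bit set; the search itself needs only about a dozen probes to cover the whole range.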
03-24-2023 09:55 AM
Friend, troubleshooting before we knew that the link accepts only 707 bytes is different from troubleshooting after.
This value is very low; it is not even close to the default of 1500 (1460).
A value this low could match the packet size you configured for CoPP on the VSS.
Just check with traceroute whether the next hop is the BGP peer or not.
03-25-2023 09:39 AM
Traceroute seems to show next hop as:
1 192.168.1.2 [MPLS: Label 197 Exp 0] 28 msec 0 msec 52 msec
2 192.168.1.5 4 msec * 0 msec
03-25-2023 10:22 AM
This is not a direct connection; I see additional hops.
What is this hop? Is it the VSS at one of the DCs?
03-24-2023 10:53 AM
I too was wondering, after OP mentioned WAN provider line drop and restoration, whether that might be the root of the new problem, especially if the timings of the two coincide.
03-26-2023 04:43 AM
The IGP is affecting iBGP: the IGP changed when the link went down, and it now selects a path through DC-2.
As I mentioned before, check the IGP; the traceroute you shared and your previous comment confirm my theory.
The issue is that the packets pass through DC-1 or DC-2.
There, the issue is CoPP, which accepts only a specific packet size and rate.
CoPP is dropping the BGP packets, hence the constant BGP flapping.
The solution is to check the IGP. If you fix the IGP, the BGP problem will resolve automatically.
03-27-2023 07:39 AM
I would like to thank you all for your help with this issue. It is greatly appreciated.
As mentioned, an update. I visited Telehouse this morning and completed a hard reboot of the VSS. With the same configs etc., everything is now working as it should.
Everyone helped, but to close this I have to pick someone and accept their resolution. Please accept my apologies everyone else but, again, please note that I am thankful for all of your assistance with this.
03-27-2023 08:16 AM
BTW, FYI, if you so choose, I believe you can select multiple postings as solutions. Further, I believe you don't have to select any posting as a solution.
I also believe you can tag any posting as helpful too, including in threads you didn't start.
I mention this because, to relatively new users, the foregoing might not be obvious.