
Strange BGP issue

CliveG
Level 1

I have two data centres. One of them connects upstream and receives the full internet routing table, which is then forwarded via iBGP to the other data centre (don't worry about whether this is good practice or not; it is a configuration I have inherited and can do nothing about for now).

With no configuration changes and no network changes, the hold-down timers are suddenly expiring, and this connection is constantly going up and down because of the peer resets.

Weirdly, we are able to ping devices connected to this DC but cannot pass any other traffic. Here is what I am now seeing in the error log; I was hoping someone could point me in the right direction:

Mar 23 12:31:46.216: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated
Mar 23 12:32:42.903: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:33:03.944: %LDP-SW1-5-SP: 192.168.1.1 :0: session recovery succeeded
Mar 23 12:33:28.085: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up
Mar 23 12:39:44.601: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated
Mar 23 12:40:31.532: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:40:55.721: %LDP-SW1-5-SP: 192.168.1.1:0: session recovery succeeded
Mar 23 12:41:28.127: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up
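
For anyone wanting to quantify the flapping, here is a minimal Python sketch that parses the log lines above and measures how long each MSDP outage lasted. Note that the excerpt contains only LDP and MSDP messages, not BGP ones, so the sketch can only say something about those sessions. The function names `parse` and `outages`, the regular expression, and the placeholder year are assumptions made for illustration; they are not part of any Cisco tooling.

```python
import re
from datetime import datetime

# Log excerpt copied from the post above.
LOG = """\
Mar 23 12:31:46.216: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated
Mar 23 12:32:42.903: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:33:03.944: %LDP-SW1-5-SP: 192.168.1.1 :0: session recovery succeeded
Mar 23 12:33:28.085: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up
Mar 23 12:39:44.601: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated
Mar 23 12:40:31.532: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:40:55.721: %LDP-SW1-5-SP: 192.168.1.1:0: session recovery succeeded
Mar 23 12:41:28.127: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up
"""

# Captures: timestamp, facility (e.g. LDP or MSDP), free-text message.
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d\.\d+): %(\w+)-\S+: (.*)$")

# Assumption: the log lines carry no year, so an arbitrary one is supplied.
ASSUMED_YEAR = "2000"


def parse(log):
    """Return a list of (timestamp, facility, message) tuples."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a syslog line
        stamp = datetime.strptime(
            f"{ASSUMED_YEAR} {m.group(1)}", "%Y %b %d %H:%M:%S.%f"
        )
        events.append((stamp, m.group(2), m.group(3)))
    return events


def outages(events, facility="MSDP"):
    """Return the length in seconds of each 'going down' to 'going up' gap."""
    durations = []
    went_down = None
    for stamp, fac, message in events:
        if fac != facility:
            continue
        if message.endswith("going down"):
            went_down = stamp
        elif message.endswith("going up") and went_down is not None:
            durations.append((stamp - went_down).total_seconds())
            went_down = None
    return durations


if __name__ == "__main__":
    for seconds in outages(parse(LOG)):
        print(f"MSDP peer down for {seconds:.3f} s")
```

On this excerpt the two MSDP outages come out at roughly 45 and 57 seconds, about eight minutes apart, which is the kind of pattern worth comparing against the BGP "Last reset" times.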

I have confirmed the X-Connect is good and have also replaced the SFPs. I am planning on changing out the core switches, as it could be a hardware or an IOS issue, but I am hoping I do not have to.

Thanks

44 Replies

Hello


@CliveG wrote:

Mar 23 12:33:03.944: %LDP-SW1-5-SP: 192.168.1.1 :0: session recovery succeeded
Mar 23 12:33:28.085: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going up
Mar 23 12:39:44.601: %LDP-SW1-5-SP: 192.168.1.1:0: session hold up initiated

Mar 23 12:40:31.532: %MSDP-SW1-5-PEER_UPDOWN: Session to peer 192.168.1.1 going down
Mar 23 12:40:55.721: %LDP-SW1-5-SP: 192.168.1.1:0: session recovery succeeded

Local host: 192.168.1.1, Local port: 646
Foreign host: 192.168.1.2, Foreign port: 41632

show ip bgp ipv4 unicast neighbors 192.168.1.2 | sec Last reset
Last reset 11:52:26, due to BGP Notification received, no supported AFI/SAFI 


The above indicates that the MPLS LDP hold timers are expiring, and as such the MPLS neighbour peering between your router and router 192.168.1.2 (ISP router?) is being torn down.

Can you post the output of the following, please:

access-list 110 permit udp host 192.168.1.1 eq 646 any
debug ip packet detail 110

sh mpls ldp neighbor
sh mpls ldp discovery


Please rate and mark as an accepted solution if you have found any of the information provided useful.
This then could assist others on these forums to find a valuable answer and broadens the community’s global network.

Kind Regards
Paul

show mpls ldp neighbor - This shows the peer address within the table on the correct port-channel (I cannot place the output here as there are far too many addresses of which the majority are public ranges).

show mpls ldp discovery:

C6880-VSS-THE#show mpls ldp discovery
 Local LDP Identifier:
    192.168.1.1:0
    Discovery Sources:
    Interfaces:
        Port-channel511 (ldp): xmit/recv
            LDP Id: 192.168.1.4:0
        Port-channel512 (ldp): xmit/recv
            LDP Id: 192.168.1.3:0
    Targeted Hellos:
        192.168.1.1 -> 192.168.1.3 (ldp): active/passive, xmit/recv
            LDP Id: 192.168.1.3:0
        192.168.1.1 -> 192.168.1.4 (ldp): active/passive, xmit/recv
            LDP Id: 192.168.1.4:0

Joseph W. Doherty
Hall of Fame

Another possible cause: any chance the link between peers has occasional congestion such that BGP keepalives are lost? ("No network changes" often means there have been no network configuration changes, but traffic volume changes, like some recently added data replication between DCs, might go unnoticed.)
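
To illustrate how lost keepalives would lead to the resets described, here is a rough Python sketch of a hold timer. It assumes the Cisco IOS default of a 60-second keepalive interval and a 180-second hold time; the thread does not state what is actually configured, and `hold_timer_expiry` is a hypothetical helper, not a real API.

```python
def hold_timer_expiry(arrivals, hold_time=180.0):
    """Return when the hold timer would expire, or None if it never does.

    arrivals:  times in seconds (relative to session establishment at t=0)
               at which a keepalive or update actually reached the peer.
    hold_time: assumed hold time in seconds (180 is the IOS default).
    """
    last_heard = 0.0  # the timer starts when the session is established
    for t in sorted(arrivals):
        if t - last_heard > hold_time:
            # Nothing was heard for longer than the hold time:
            # the peer would already have torn the session down.
            return last_heard + hold_time
        last_heard = t  # every received message restarts the timer
    return None


if __name__ == "__main__":
    # One keepalive lost (the one due at t=120): the session survives.
    print(hold_timer_expiry([60, 180, 240]))
    # Keepalives due at t=120, 180 and 240 all lost: the timer expires.
    print(hold_timer_expiry([60, 300]))
```

With these assumed timers a single lost keepalive is harmless; it takes about three in a row before the session resets, so congestion would need to be sustained or repeatedly hit the same session.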

Hi Joseph,

Checked congestion and all seems okay. Nothing on the site with the issue and not a great deal on the other site. In fact, the 2nd site has a couple more BGP peers that are connecting fine with no issues.

Thanks

Hello,

I have not followed the entire thread, and I hope I am not adding anything redundant, but my best guess is that, with no changes to your configuration, chances are that something on the ISP side has changed. I know that Telehouse is a colocation provider. How are they involved here? Have you had any contact with them?

Let me draw a quick diagram and attach here.

As mentioned previously, if the configs are also required then please let me know.

Can you check whether DC-1 and DC-2 are directly connected, or whether the traffic passes through DC-3?
Check the IGP route to the BGP neighbor.
Also check with traceroute.

 

1.jpg

Hi MHM

I inherited this network and, as you can see from the diagram, there is only the one upstream connection at DC-2 (North). As we are a streaming company, all of our data goes through this connection, so I have to be extremely careful what I do, from a troubleshooting perspective, on this VSS.

The routes, with the way DC-3 is configured, all go to DC-2 first; even if we want to go to DC-1, the route is always through DC-2 (except for the iBGP peering). Effectively, we only really utilise DC-1 for load-balancing and any other systems that are there (not many, mainly backups for streaming). However, the load-balancing is now also starting to be an issue.

Strangely, as mentioned previously, we can SSH to the load balancer at DC-1 but we do not receive the SSH keys back and we note that other traffic is also not being passed over this VSS at DC-1. The only reason ping and traceroute work correctly is because of the MTU of 707.
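
The symptom described here (ping works, SSH stalls at key exchange) is what one would expect if only small packets get through. A back-of-the-envelope Python sketch, assuming IPv4 and TCP headers without options and treating 707 bytes as the largest IP packet the path will carry; the constant and function names are illustrative only:

```python
IPV4_HEADER = 20   # bytes, assuming no IP options
ICMP_HEADER = 8    # bytes, echo request/reply header
TCP_HEADER = 20    # bytes, assuming no TCP options
PATH_MTU = 707     # bytes, the value reported in this thread


def ipv4_packet_size(payload, l4_header):
    """Total IPv4 packet size for a given payload and layer-4 header."""
    return IPV4_HEADER + l4_header + payload


def fits(packet_size, mtu=PATH_MTU):
    """True if a packet of this size can cross the path unfragmented."""
    return packet_size <= mtu


def max_tcp_payload(mtu=PATH_MTU):
    """Largest TCP payload per segment that still fits within the MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


if __name__ == "__main__":
    small_ping = ipv4_packet_size(56, ICMP_HEADER)      # a small echo request
    full_segment = ipv4_packet_size(1460, TCP_HEADER)   # MSS negotiated for a 1500 MTU
    print(small_ping, fits(small_ping))
    print(full_segment, fits(full_segment))
    print(max_tcp_payload())
```

Under these assumptions an 84-byte ping passes while a full 1500-byte TCP segment does not, and any TCP payload above 667 bytes per segment would be dropped or need fragmenting. A ping sweep with increasing sizes and the DF bit set is the usual way to confirm where the cut-off really is.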

I have e-mailed our third-party upstream provider for the DC-3 to DC-2 and DC-3 to DC-1 links, as they had to initiate a re-connection about 3 weeks ago, and this is the only thing that has changed between the LAN functioning correctly and the LAN failing.

DC-1 is failing on all iBGP connectivity: to DC-2, to LoNAP and to DC-3.

Friend, troubleshooting before we knew that the link accepts only 707 bytes is different from troubleshooting after it.

This value is very low; it is not even equal to the default of 1500 (1460).

A value this low could be the same packet size you configured for CoPP on the VSS.

Just check with traceroute whether the next hop is the BGP peer or not.

Traceroute seems to show the next hop as:

1 192.168.1.2 [MPLS: Label 197 Exp 0] 28 msec 0 msec 52 msec
2 192.168.1.5 4 msec * 0 msec

This is not a direct connection; I see additional hops.

What is this hop? Is it one of the DC VSS pairs?

I too was wondering, after OP mentioned WAN provider line drop and restoration, whether that might be the root of the new problem, especially if the timings of the two coincide.

BGP-Issue.jpg

The IGP is affecting iBGP. The IGP changed when the link went down, and it now selects a path through DC-2.
As I mentioned before, check the IGP; the traceroute you shared and your previous comment confirm my theory.
The issue is that the packets pass through DC-1 or DC-2.
There, the issue is that CoPP accepts only a specific packet size and rate,
so CoPP drops the BGP packets, and hence BGP keeps flapping.

The solution is to check the IGP. If you fix the IGP, then BGP will recover automatically.

CliveG
Level 1

I would like to thank you all for your help with this issue. It is greatly appreciated.

As mentioned, an update. I visited Telehouse this morning and completed a hard reboot of the VSS. With the same configs etc., everything is now working as it should.

Everyone helped, but to close this I have to pick someone and accept their resolution. Please accept my apologies everyone else but, again, please note that I am thankful for all of your assistance with this.

BTW, FYI, if you so choose, I believe you can select multiple postings as solutions. Further, I believe you don't have to select any posting as a solution.

I also believe you can tag any posting as helpful too, including in threads you didn't start.

I mention this because, to relatively new users, the foregoing might not be obvious.