Connectivity issues inside vlans after HSRP refresh

athomas1 · ‎06-15-2022

So, I removed the fibre cable that carries HSRP between our two Layer 3 stacks in order to untangle some Cat5e cables that were straining the LC plug.

Since then we've had various, seemingly random issues with certain addresses being unable to contact other IP addresses, but within the same VLAN. Between VLANs there appear to be no problems. We were able to fix one issue where computers in a DHCP VLAN couldn't contact a server on a static address in the same VLAN: when those computers were moved to a different DHCP VLAN they were able to connect again. Other computers in the original DHCP VLAN could still contact everything as normal, as though nothing had changed.

We've also had issues with server hosts in another vlan on static addresses becoming segmented where the communication appears to run over this fibre.

One theory we have is that when the fibre was removed HSRP started recalculating, which should take 30 seconds, but the fibre was plugged back in after about 10 seconds. Perhaps the first recalculation was unable to finish before a second one started, and this has caused problems with routing within certain VLANs without any indication in the logs.

They are 3750X switches which are now unsupported and so support direct from cisco is doubtful. I have got the new 9300 stacks ready to be installed however I am waiting for a suitable opportunity/ period of quiet production before swapping them as it will be disconnecting an entire server room from the network.

Any help hugely appreciated with this one!!!

MHM Cisco World · ‎06-15-2022

Within the same VLAN, hmm.
Let me explain something here.
If the VLAN exists on both HSRP peers, then the traffic does not need routing but bridging.
What does that mean?
You need L2 between the two peers!
Yes, you need L2.
For example, PC1 connected to HSRP Peer1 sends an ARP asking for the MAC of PC2, which is in the same VLAN but connected to HSRP Peer2. Peer1 does not route the ARP; it bridges it to Peer2, and from there it reaches PC2.
This explains why you have an issue when you disconnect the fibre link.

athomas1 · ‎06-15-2022

I'm sorry I don't fully understand your response. Can you better explain please?

MHM Cisco World · ‎06-15-2022

Your post:
"Since then we've had various and seemingly random issues with certain addresses not being able to contact other IP addresses- but within the same vlan"

Some IPs (and their MACs) cannot reach each other even though they are in the same VLAN.
Yes, same VLAN, but are they connected to the same access switch?

You need an L2 interconnect between the two aggregation switches (the HSRP peers) to serve the same VLAN when it is connected to two different access switches.

Seb Rupik · ‎06-15-2022

Hi there,

HSRP routers communicate using the multicast address 224.0.0.2 (HSRP version 1; version 2 uses 224.0.0.102), which is not routable, therefore HSRP peers must have a Layer 2 link between them to function. If you have removed the fibre which carries these VLANs between the two L3 switches then you will have created two separate broadcast domains for each VLAN. Devices on the same VLAN connected to a different L3 switch will not be able to communicate with each other as a result.

You can confirm the HSRP status and the lack of an L2 data path between the routers with the command show standby. This will most likely show that the standby router is unknown.
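A minimal sketch of that check, run on both L3 switches. VLAN 10 is only a placeholder for one of the affected SVIs, not a value from this thread:

```
! Summary of every HSRP group: state, active router, standby router
show standby brief
! Full detail for the HSRP group(s) on one SVI (placeholder: Vlan10)
show standby vlan 10
```

If the peers can see each other, one switch should report itself as Active and list the other as Standby, and vice versa.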

cheers,

Seb.

athomas1 · ‎06-15-2022

In our topology we have a Layer 2 ring going around our site, with the Layer 3 switches at either side of the ring, and the fibre I removed goes directly between the two Layer 3 switches. I assume from the syslog messages I encountered that HSRP runs over this fibre rather than around the ring through the L2 switches to the other L3 switch.

When I removed the fibre going directly between the two layer3 switches the whole l2 ring was intact and still able to pass traffic so there should have been no separate broadcast domains. Previously we've had two physical fibres between the l3 switches for redundancy but currently only 1 hence when removing it it obviously caused problems, but not in a way that we expected.

I have done some analysis with my counterparts in our parent company using show standby and checking the root bridge status of the VLANs, and so far everything has come back as it should. Like I said, most things are working fine. The issues we have experienced have been specific to two VLANs, one with static addressing and the other a mix of static and dynamic, and no testing between them has highlighted any common factors.

Seb Rupik · ‎06-15-2022

If you have reachability between the HSRP routers demonstrated via show standby, and both L3 switches are showing the same STP root bridge for the VLANs in question then the L2 domain is indeed intact.

Does the 'ring' between the two L3 switches go via multiple switches? For a pair of hosts which are in the same VLAN and cannot communicate, are you able to find their MAC addresses in the MAC tables on all of the switches? In particular along the path through the switches between them.
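One way this lookup could be done on each switch along the path. The MAC address and VLAN ID below are placeholders, not values from this thread:

```
! Which port has this switch learned the host's MAC on? (placeholder MAC and VLAN)
show mac address-table address 0011.2233.4455 vlan 10
! All dynamically learned entries for the VLAN, for comparison
show mac address-table dynamic vlan 10
```

On every switch the port shown should point towards the host; an entry that is missing, or that points the wrong way round the ring, marks where to look more closely.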

cheers,

Seb.

athomas1 · ‎06-15-2022

All of the VLANs are showing the correct root bridge. Also, if I change the master for the server hosts across to the other server room, the blades which are currently reporting being in a segmented network then work again. So something that was communicating over the direct fibre link now isn't, but everything that has been checked is showing in a correct, working state.

Regarding the ring, there are multiple switches around both sides of the ring. We have done checks on the MAC tables with a faulty computer at the time that it was having the comms issues, and from its local access switch right round to the Layer 3 switch the MAC of the device was present in the table. By way of example, in the situation we've had there could be devices that couldn't communicate and devices that could, on the same access switch and taking the same route round the ring to the Layer 3 switch.

One question I wanted to try and answer was whether reconnecting the fibre link before HSRP has finished its recalculation would cause any issues, and whether removing said link should cause any issues with routing or any unseen issues with HSRP?

I have the new 9300 switch stack ready to be installed which I could try, but we're very reluctant to try swapping over to new switches when the current setup is faulty.

Hope that all makes sense.

Seb Rupik · ‎06-15-2022

Devices connected to the same switch on the same VLAN not being able to communicate sounds like a bug to me. Did the MAC address table on the switch show that it had learnt the device MAC addresses on the correct ports?

Regarding the fibre link, when that was removed, this would have caused STP to reconverge. From the sound of your topology this link would have been a Root port for one of the L3 switches, with a blocked link for the VLANs lurking somewhere else around your network 'ring'. STP can reconverge around links being pulled and reinstated, so I doubt this would have caused a permanent problem. As for HSRP once STP had reconverged around the loss of the fibre link the routers would have been able to exchange Hello messages once again.

Are you able to share the config of a switch which had the problem with two directly attached devices communicating? Please indicate which switchports, switch model and software version.

cheers,

Seb.

athomas1 · ‎06-15-2022

From what I remember of the analysis we did on the MAC tables (it was 2 weeks ago now), the MACs were being detected on the correct ports.

With the STP convergence, after a link is pulled and the convergence starts, do you have to wait until this has finished before reinstating the link and triggering another convergence, or can you plug straight back in again? How does it work exactly: does reinstating a link before STP has finished converging cause problems, or doesn't it matter?

The topology hasn't changed, and nearly everything is working fine again. But we still have specific devices inside two VLANs which cannot communicate properly, with the devices on one VLAN reporting a network segmentation. That suggests to me the VLAN believes the link wasn't reinstated and it needs this specific link to function correctly? I'm just trying to make sense of what could cause such issues...

Would rebooting the Layer 3 switch that the segmented devices connect through help?

Would manually flipping root bridges from primary to secondary and then back again work?

Seb Rupik · ‎06-15-2022

Hi there,

Each time there is a topology change, the bridge which first detected it sends BPDUs with the TC (topology change) bit set for twice the hello time (4 seconds with the default 2-second hello); this is the 'TC While' interval in Rapid STP. Every time a switch receives one of these BPDUs it flushes the MAC addresses learned on its other ports, and it in turn forwards BPDUs with the TC bit set for its own TC While interval. (With classic 802.1D/PVST+, the root bridge instead sets the TC flag for max age plus forward delay, 35 seconds by default, and switches shorten MAC aging to the forward delay rather than flushing at once.) If a switch had a flapping non-edge port then this would keep restarting the TC While timer, sending another ripple of BPDUs with the TC bit set, which would keep the MAC tables from settling across the network. Eventually, if you stop pulling links, the network resumes normal operation!
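Which timers apply depends on the spanning-tree mode the switches run, and the topology-change counters show whether changes are still occurring. A minimal sketch of how that could be checked, assuming VLAN 10 as a placeholder for an affected VLAN:

```
! Reports the spanning-tree mode in use (for example pvst or rapid-pvst)
show spanning-tree summary
! Per-VLAN detail, including the topology change count, how long ago the
! last change occurred and which port it came from (placeholder: VLAN 10)
show spanning-tree vlan 10 detail
```

A topology change count that keeps climbing while nobody is touching cables would point at a flapping non-edge port somewhere on the ring.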

How do your two devices report network segmentation? I suspect they look at their own ARP table, and if they cannot see their peer's MAC address then they assume the network is segmented. I would be tempted to run a packet capture on both devices and see if the ARP requests and replies are being passed between them.

If the output of show spanning-tree vlan xx on both switches where these devices are connected shows the same root bridge ID, then which connecting links are up and down is academic (unless a link with high congestion is being used and traffic is being dropped!): you have a forwarding path between the switches.
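For a quick side-by-side comparison, the root information alone can be pulled on each switch. VLAN 10 is again a placeholder:

```
! Shows the root bridge ID, root path cost, timers and the local root port
! for one VLAN (placeholder: VLAN 10); compare the root ID across switches
show spanning-tree vlan 10 root
```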

cheers,

Seb.

Mark Elsen · ‎06-15-2022

- Use or configure a (central) syslog server for all of the switches involved, then examine the logs on the syslog server afterwards and look for patterns that point to a possible cause.
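A sketch of what that could look like on each switch. The server address 192.0.2.10 and the buffer size are placeholders to adapt, not recommendations from this thread:

```
configure terminal
 ! Timestamp log messages so events can be correlated across switches
 service timestamps log datetime msec localtime
 ! Send messages to the central syslog server (placeholder address)
 logging host 192.0.2.10
 ! Forward severity "informational" and more severe to the server
 logging trap informational
 ! Enlarge the local buffer so events are overwritten less quickly
 logging buffered 65536
end
```

With the messages kept off-box, events such as HSRP state changes and link flaps remain available after the switch's own buffer has wrapped.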

M.


athomas1 · ‎06-15-2022

Unfortunately, at this point I think the log entries from this event have been overwritten. I have tried to access them from the switch console, but the log only gives a limited amount of readback. I have been away for the last week, the problems have not been repaired in that time, and the relevant entries appear to be gone.

MHM Cisco World · ‎06-15-2022

We all agree that L2 is the issue here.

Can you try traceroute mac and find at which switch the trace stops?

From there we can start a deeper investigation.
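For reference, this is the Layer 2 trace run from a switch CLI, not the IP tracert on a PC. A minimal sketch, assuming placeholder MAC addresses, IP addresses and VLAN for two hosts in the same VLAN that cannot reach each other:

```
! Layer 2 path between two hosts in the same VLAN (placeholder MACs and VLAN)
traceroute mac 0011.2233.4455 00aa.bbcc.ddee vlan 10
! Same trace, specified by IP address; both hosts must be in the same subnet
traceroute mac ip 10.10.10.11 10.10.10.12
```

The trace relies on CDP running on the switches in the path, so it stops at the first switch where CDP is disabled or where the MAC address is not in the table.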

athomas1 · ‎06-15-2022

I have attempted to run a tracert from my PC to one of the server blades (on different VLANs) and the tracert failed.

Does the tracert need running in this case from one of the devices that cannot connect to another device in the same vlan, or does it not matter and it just needs to be from one network device to another that isn't connecting properly?