When I arrived onsite, we saw user connectivity issues spread across multiple switches. I asked staff about the last known changes, and they mentioned that a vendor had likely pulled power to the UPS and caused an outage of an entire switch.
I initially thought some unsaved config had been lost, so I started comparing the config of working ports against non-working ports. Nothing out of the ordinary turned up.
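For anyone repeating the working vs non-working port comparison, here is a minimal Python sketch of how it could be scripted. It assumes the two interface stanzas have already been copied out of the running config as text; the function names and the sample config lines are hypothetical illustrations, not the actual config from this site.

```python
def normalize(stanza: str) -> list[str]:
    """Strip whitespace and drop lines that are expected to differ per port."""
    lines = []
    for raw in stanza.splitlines():
        line = raw.strip()
        # Skip blanks, separators, and the interface header (it always differs).
        if not line or line == "!" or line.startswith("interface "):
            continue
        lines.append(line)
    return lines


def port_config_delta(working: str, failing: str) -> tuple[list[str], list[str]]:
    """Return (lines only on the working port, lines only on the failing port)."""
    w, f = normalize(working), normalize(failing)
    only_working = [line for line in w if line not in f]
    only_failing = [line for line in f if line not in w]
    return only_working, only_failing


# Hypothetical sample stanzas, for illustration only.
WORKING_PORT = """
interface GigabitEthernet1/0/19
 switchport mode access
 switchport access vlan 10
"""

FAILING_PORT = """
interface GigabitEthernet1/0/18
 switchport mode access
 switchport access vlan 10
"""

if __name__ == "__main__":
    print(port_config_delta(WORKING_PORT, FAILING_PORT))
```

Two empty lists mean the ports are configured identically apart from the interface name, which matches what we saw by eye.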
We found instances where a user would not get connectivity on port 18 of a given switch but could be moved to port 19 of a switch and work fine, with the same config on both ports. During this time we noticed that the spanning tree root for some VLANs was on the .2 switch while the root for the other half of the VLANs was on .1. All of the STP roots should be in the same location as the L3 VLAN gateways. We corrected this but did not see a change in behavior.
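The root placement check can be sketched the same way. This is only an illustration: it assumes you have already collected, per VLAN, which switch is the STP root and which switch hosts the L3 gateway. The VLAN IDs and the mapping below are made up, and the function name is hypothetical.

```python
def misplaced_stp_roots(
    root_by_vlan: dict[int, str], gateway_by_vlan: dict[int, str]
) -> dict[int, tuple[str, str]]:
    """Map each VLAN whose STP root is not on its gateway switch to (root, gateway)."""
    mismatches = {}
    for vlan in sorted(root_by_vlan):
        gateway = gateway_by_vlan.get(vlan)
        # VLANs without a known gateway are skipped rather than guessed at.
        if gateway is not None and root_by_vlan[vlan] != gateway:
            mismatches[vlan] = (root_by_vlan[vlan], gateway)
    return mismatches


# Hypothetical data: roots split between .1 and .2, gateways assumed all on .1.
SAMPLE_ROOTS = {10: ".1", 20: ".2", 30: ".1", 40: ".2"}
SAMPLE_GATEWAYS = {10: ".1", 20: ".1", 30: ".1", 40: ".1"}

if __name__ == "__main__":
    for vlan, (root, gateway) in misplaced_stp_roots(SAMPLE_ROOTS, SAMPLE_GATEWAYS).items():
        print(f"VLAN {vlan}: root on {root}, gateway on {gateway}")
```

An empty result means every VLAN's root sits on the same switch as its gateway.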
From here, I wanted to confirm that the user's PC settings and hardware were not at fault, so I isolated the user on a “dumb hub” and gave that user a static IP address. This test was successful, and we ruled out the user hardware.
I plugged the user back into the switch and created a local L3 VLAN on the switch that the user was connected to. This new L3 VLAN was reachable from other points in the network, but the user could not be pinged from this local VLAN on the same switch (ping attempted from the switch with the source specified in the ping command).
After this, we provisioned a spare switch that the customer had onsite and started to migrate the non-working users to it. Every user that was moved to the new switch connected immediately, and their network issues ceased at that point. ISE was initially thought to be a culprit, but at this point I do not believe that to be the case. The port config on the temp switch currently lacks ISE configuration, so my next step is to add the ISE config to the unused ports on the temp switch that was added Friday and have IT staff test those ports. I suspect this will work fine and that we may be looking at a hardware failure, but I would like to confirm that users on the temp switch continue to have a good experience after the ISE config is added.
Has anyone ever seen a switch fail and act like this? I am reluctant to call it a switch failure because this behavior was seen on 3 different switches. However, I keep coming back to that explanation, given that every user who was moved to the new switch worked fine.