03-19-2024 02:39 AM
Hello,
for a few weeks now, we have been noticing a lot of timeouts in our network. We figured out that there are unplanned reboots in our stack. This is what the stack is logging:
We are running the stack in a ring topology in hybrid mode with firmware 2.5.5.47 and the following models:
- SG550XG-8F8T
- SG550X-24MP
- SG550X-24
- SG550X-24
Is there any way to figure out why the unit was rebooted? Do you need any other information?
Thanks in advance.
schuster
03-19-2024 06:53 AM - edited 03-19-2024 06:59 AM
1) What has changed in the last few weeks?
2) Your output "server certificate validation failed" may indicate that a certificate installed on one of the switches has expired.
3) The reboot may not be the root cause; the member may have been rebooted as a result of the connection timeout.
4) "connection lost" may indicate that the ring is not closed and the other stack link was disconnected.
5) You may have outgrown the limits of hybrid mode.
03-19-2024 08:09 AM - edited 03-19-2024 08:20 AM
Thanks for your reply, pieterh! I have answered your questions below:
1)
Since we are trying to clean up the network, we changed a lot of things:
- we set the STP priority to 0 on the stack in our server room (which is the affected one and which is connected to the other switches). The other switches have higher priority values. Before this change, all switches had the same priority, so the MAC address decided which one became root
- we disabled EEE, PnP and two links that were creating a loop
- we configured a default gateway on the stack which points to our firewall
- we deployed a Sophos XGS (previously, there was a "stupid" bintec router in place)
- we raised the RAM log level to debug, hoping to get more information about what the stack is doing
- we configured a remote log server (which points to a locally hosted Graylog instance)
- we unplugged and re-plugged all of the stacking cables (apart from that, we didn't change the hardware)
2)
- the "Error: server certificate validation failed" message has been appearing since we enabled TLS interception on our Sophos. We excluded the switches from TLS interception, and since this change I haven't seen the message again. Both certificates are valid until next year. It seems that the switch is using OCSP or something similar?
3 / 4)
- at the time of checking, all links were active:
We will now monitor these links to ensure that they are up 24/7.
5)
I counted the entries in "show mac address-table" and got ~250 results. This is expected since we have ~50 employees. The Sophos shows ~300 MACs in its table (the difference of ~50 should be OK since the stack doesn't see the guest Wi-Fi clients). I think the SG550XG should be fine with this low load, or am I missing something?
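In case someone wants to repeat the count on their own stack, here is a minimal Python sketch that counts the distinct MAC addresses in pasted "show mac address-table" text. The sample lines and the colon-separated MAC notation are assumptions for illustration, not real output from our stack; adjust the pattern to whatever your switch prints.

```python
import re

# Assumption: MACs are printed colon-separated (aa:bb:cc:dd:ee:ff).
# Other notations would need a different pattern.
MAC_RE = re.compile(r"\b(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}\b")


def count_unique_macs(cli_output: str) -> int:
    """Count distinct MAC addresses found anywhere in the pasted text."""
    return len({mac.lower() for mac in MAC_RE.findall(cli_output)})


# Made-up sample lines (hypothetical layout), not real switch output.
sample = """
1  00:11:22:33:44:55  te1/0/1  dynamic
1  00:11:22:33:44:66  gi2/0/3  dynamic
1  00:11:22:33:44:55  te1/0/1  dynamic
"""

print(count_unique_macs(sample))  # prints 2
```

Counting distinct addresses rather than lines avoids counting a MAC twice if it shows up in more than one VLAN.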
I'm open to all suggestions. Let me know if I can provide any more details about our configuration that may help with the analysis.
The next thing we have planned is a reboot of all stack members at the same time. I will do that next Saturday.
Another thing I have noticed is the following. Is it expected that the master doesn't have an uplink port (I executed show stack)?
Thanks in advance.
schuster
03-19-2024 09:03 AM
Take a look at "Configure Stack Settings on a Switch through the CLI - Cisco"
show stack
shows an uplink port on the master.
Maybe
show stack links [details]
gives some more information about what is wrong?
>>> we set the STP-priority to 0 on the stack in our server room<<<
Is this a different stack?
If not, check whether the STP root is also the master of the stack.
03-19-2024 11:40 PM
The output of show stack links [details] looks OK from my perspective (but I will check this view again when the problems recur):
STP: no, this is the same stack where we noticed the reboots. I checked the STP root bridge info on the other switches; they all show the current master of the stack as the root bridge ID.
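For reference, the root election behind this can be sketched in a few lines of Python: the bridge with the lowest bridge ID (priority first, then MAC address as tie-breaker) becomes root. The switch names, MAC addresses and priority values below are hypothetical and only illustrate the rule; they are not taken from our network.

```python
from typing import NamedTuple


class Bridge(NamedTuple):
    name: str
    priority: int
    mac: str


def bridge_id(bridge: Bridge) -> tuple:
    """Bridge ID as a sortable tuple: priority first, then the MAC as an integer."""
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))


def elect_root(bridges: list) -> Bridge:
    """The bridge with the numerically lowest bridge ID wins the election."""
    return min(bridges, key=bridge_id)


# Hypothetical switches: same priority everywhere, so the MAC decides.
stack = Bridge("server-room-stack", 32768, "00:00:00:00:00:30")
others = [
    Bridge("access-1", 32768, "00:00:00:00:00:10"),
    Bridge("access-2", 32768, "00:00:00:00:00:20"),
]
print(elect_root([stack] + others).name)  # prints access-1

# After setting priority 0 on the stack, it wins regardless of its MAC.
stack_prio0 = stack._replace(priority=0)
print(elect_root([stack_prio0] + others).name)  # prints server-room-stack
```

This matches what the other switches report: with priority 0 the stack is root no matter which MAC it has.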
10-09-2024 12:04 AM
Hello,
Have you found a workaround for this issue?
I am facing a similar issue with a stack of 2x C1300 24XT. Unit 2 keeps rebooting, and after unit 2 reboots, unit 1 stops responding to pings and stops forwarding traffic.
I noticed among the things you did that you set a default gateway pointing to your firewall. I also did that on my network, and the issue started happening after setting (and using) the switches as the gateway.
03-21-2024 03:19 AM
I also found a Cisco bug report which describes our situation: CSCvu51887 : Bug Search Tool (cisco.com)
Unfortunately, the bug is "Terminated", and it doesn't match our firmware version. Any suggestions on how to proceed with this error?
11-06-2024 06:06 AM
For anyone interested, I had a very similar issue, but with C1300 switches.
I discovered that it had to do with the gateway being in the same subnet; you may want to check https://quickview.cloudapps.cisco.com/quickview/bug/CSCwe47566.
A new bug should also appear there soon: https://quickview.cloudapps.cisco.com/quickview/bug/CSCwn12314