Solved: Lost connection with unit * reason: connection timeout. Unit will...

schuster · ‎03-19-2024

Hello,

For a few weeks now, we have been noticing a lot of timeouts in our network. We figured out that there are unplanned reboots in our stack. This is what the stack is logging:

We are running the stack in a ring topology in hybrid mode with firmware 2.5.5.47 and the following models:

- SG550XG-8F8T
- SG550X-24MP
- SG550X-24
- SG550X-24

Is there any way to figure out why the unit was rebooted? Do you need any other information?

Thanks in advance.

schuster

pieterh · ‎03-19-2024

1) What has changed "since a few weeks"?
2) Your output "server certificate validation failed" may indicate that a certificate installed on one of the switches has expired.
3) The reboot may not be the root cause; the member may have been rebooted as a result of the connection timeout.
4) "Connection lost" may indicate that the ring is not closed and the other stack link is disconnected.
5) You may have outgrown the limits of Hybrid Mode.

If your environment is a lightly loaded network backbone, Hybrid Mode may be a perfectly acceptable solution for best flexibility and physical changes to the stack.
Hybrid Mode forces internal resources to be allocated using the least common denominator. For example, if one model has a MAC table size of 16K and you have another, higher-performance model with a MAC table size of 64K, a stack using Hybrid Stack Mode will force the operating stack to 16K, thus limiting the availability of additional internal hardware resources.
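
To make the "least common denominator" rule concrete, here is a minimal sketch. The unit names and table sizes are made-up illustration values, not the real specifications of any SG550X or SG550XG model, and the function name is hypothetical.

```python
# Illustrative only: unit names and sizes are assumptions, not datasheet values.
def effective_limit(per_unit_limits):
    """Return the stack-wide limit, i.e. the smallest value among all members."""
    return min(per_unit_limits.values())


mac_table_sizes = {
    "unit1": 64 * 1024,  # hypothetical higher-performance model
    "unit2": 16 * 1024,  # hypothetical smaller model
    "unit3": 16 * 1024,
}

print(effective_limit(mac_table_sizes))  # 16384
```

With roughly 250 learned MAC addresses, as reported later in this thread, a limit of that order would be nowhere near exhausted.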

schuster · ‎03-19-2024

Thanks for your reply pieterh! I replied to your questions below:

1)

Since we are trying to clean up the network, we changed a lot of things:

- we set the STP priority to 0 on the stack in our server room (which is the affected one and which is connected to the other switches). The other switches have higher priorities. Before this change, all switches had the same priority, so the MAC address decided the root bridge

- we disabled EEE, PnP and two links which were creating a loop

- we configured a default gateway on the stack which points to our firewall

- we implemented a Sophos XGS (before, there was a "stupid" bintec router in place)

- we increased the RAM log level to debug in the hope of getting more information about what the stack is doing

- we configured a remote log server (which points to a locally hosted Graylog instance)

- we unplugged and plugged in again all of the stacking cables (apart from that, we didn't change the hardware)

2)

- the "Error: server certificate validation failed" message has appeared since we enabled TLS interception on our Sophos. We excluded the switches from TLS interception, and since this change I haven't seen the message again. Both certificates are valid until next year. It seems that the switch is using OCSP or something similar?

3 / 4)

- at the time of checking, all links are active:

We will now monitor these links to ensure that they are up 24/7.

5)

I counted the entries in "show mac address-table" and got ~250 results. This is expected since we have ~50 employees. The Sophos shows ~300 MACs in its table (the difference of ~50 should be OK since the stack doesn't see the guest Wi-Fi clients). I think the SG550XG should be fine with this low level of load, or am I missing something?

I'm open to all suggestions. Let me know if I can provide any more details about our configuration that may help with the analysis.

The next thing we have planned is a reboot of all stack members at the same time. I will do that next Saturday.

Another thing I have noticed is the following. Is it expected that the master doesn't have an uplink port (I executed show stack)?

Thanks in advance.

schuster

pieterh · ‎03-19-2024

take a look at Configure Stack Settings on a Switch through the CLI - Cisco
show stack
shows an uplink port on the master

maybe
show stack links [details]
gives some more information of what is wrong?

>>> we set the STP-priority to 0 on the stack in our server room<<<
Is this a different stack?
If not, check whether the STP root is also the master of the stack.

schuster · ‎03-19-2024

The result of show stack links [details] looks OK from my perspective (but I will check this view again when the problems recur):

STP - no, this is the same stack where we noticed the reboots. I checked the STP root bridge info on the other switches; they all show the current master of the stack as the root bridge ID.
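
For readers following the STP part of this thread: in standard spanning tree, the bridge with the lowest bridge ID becomes root, compared by priority first and by MAC address second. A minimal sketch of that comparison, with hypothetical switch names and MAC addresses (none of them are taken from this network):

```python
# Sketch of standard STP root election: lowest (priority, MAC) wins.
# Switch names and MAC addresses below are hypothetical.
def bridge_id(priority, mac):
    """Build a comparable bridge ID from a priority and a colon-separated MAC."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the name of the bridge with the lowest bridge ID."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


# Priority 0 on the stack: it wins regardless of its MAC address.
tuned = {
    "server-room-stack": (0, "00:11:22:33:44:55"),
    "access-switch-a": (32768, "00:11:22:33:44:01"),
}
print(elect_root(tuned))  # server-room-stack

# Equal priorities: the lowest MAC address decides.
untuned = {
    "server-room-stack": (32768, "00:11:22:33:44:55"),
    "access-switch-a": (32768, "00:11:22:33:44:01"),
}
print(elect_root(untuned))  # access-switch-a
```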

thomas-o · ‎10-09-2024

Hello,

Have you found a workaround for this issue?

I am also facing a similar issue with a stack of 2x C1300 24XT. Unit 2 keeps rebooting, and after unit 2 reboots, unit 1 stops responding to pings and stops forwarding traffic.

I noticed among the things you did that you set a default gateway pointing to your firewall. I also did that on my network, and the issue started happening after setting (and using) the switches as the gateway.

schuster · ‎03-21-2024

I also found a Cisco Bug Report which describes our situation: CSCvu51887 : Bug Search Tool (cisco.com)

Unfortunately, the bug is "Terminated" and it doesn't match our firmware version. Any suggestions on how to proceed with this error?

thomas-o · ‎11-06-2024

For anyone interested, I had a very similar issue, but with C1300 switches.

I discovered that it had to do with the gateway being in the same subnet; you may want to check https://quickview.cloudapps.cisco.com/quickview/bug/CSCwe47566

A new bug should also appear there soon: https://quickview.cloudapps.cisco.com/quickview/bug/CSCwn12314

schuster · ‎03-06-2025

Thanks for the link; it seems to be the issue here. Since I segmented the network, the traffic gets routed into another VLAN and IP network, the CPU utilization is now low, and there are no more timeouts in the network.
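
For anyone auditing their own setup for the "gateway in the same subnet" condition mentioned above, a rough sketch using Python's standard ipaddress module. The addresses are documentation-range placeholders and the function name is hypothetical; this only checks subnet membership and says nothing about whether a given device is actually affected by the linked bugs.

```python
import ipaddress


def gateway_in_same_subnet(interface_cidr, gateway_ip):
    """Return True if gateway_ip lies inside the subnet of interface_cidr."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(gateway_ip) in network


# Placeholder addresses, not taken from this thread.
print(gateway_in_same_subnet("192.0.2.10/24", "192.0.2.1"))     # True
print(gateway_in_same_subnet("192.0.2.10/24", "198.51.100.1"))  # False
```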