Connection to gateway randomly lost

Martijn de Loos · ‎11-02-2018

Hello,

I am having an odd problem in a client's network and it is causing big issues. Please see the (simple) star topology below:

5x Cisco Small Business Switch SG220-50

1x Fortinet FortiWifi 60D firewall

A whole bunch of desktops and printers and servers

The problem we are having is that at very random times, no consistency whatsoever, internal clients lose connectivity to only the gateway which is at x.x.x.1. When this happens the entire office loses their internet connection. All internal resources such as servers and printers are still available and reachable, except the gateway.

When the problem occurred, I ran a continuous ping (ping -t) to the gateway's IP and saw replies alternating with timeouts. Because only the gateway is affected, I suspected that another machine in the network was claiming the gateway's IP address and causing an IP conflict, but the ARP table on a computer and the MAC address tables on the switches show nothing conflicting. Also, when I disconnect the firewall's internal interface from the network, all pings time out, so no other device in the network is answering for the gateway's IP address.
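A one-off look at the ARP table can miss a conflict that only shows up briefly. As a rough sketch, the check could be repeated over several saved `arp -a` captures and every MAC the gateway IP resolved to collected. This assumes the Windows `arp -a` output layout; the IP address and MAC addresses below are placeholders, not values from this network.

```python
import re

# One entry of Windows `arp -a` output: IPv4 address, dash-separated MAC, entry type.
ARP_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+((?:[0-9a-fA-F]{2}-){5}[0-9a-fA-F]{2})\s+\w+"
)

def parse_arp(output):
    """Return {ip: mac} for every entry in one `arp -a` listing."""
    table = {}
    for line in output.splitlines():
        match = ARP_LINE.match(line)
        if match:
            table[match.group(1)] = match.group(2).lower()
    return table

def macs_seen(snapshots, ip):
    """Collect every distinct MAC that `ip` resolved to across the captures."""
    seen = set()
    for snapshot in snapshots:
        mac = parse_arp(snapshot).get(ip)
        if mac:
            seen.add(mac)
    return seen

# Hypothetical captures taken at two different moments (placeholder values).
CAPTURE_A = """Interface: 192.168.1.50 --- 0xb
  Internet Address      Physical Address      Type
  192.168.1.1           00-09-0f-aa-bb-cc     dynamic
"""
CAPTURE_B = CAPTURE_A.replace("00-09-0f-aa-bb-cc", "d8-eb-97-11-22-33")

gateway_macs = macs_seen([CAPTURE_A, CAPTURE_B], "192.168.1.1")
if len(gateway_macs) > 1:
    print("gateway IP answered from more than one MAC:", sorted(gateway_macs))
```

More than one MAC for the gateway IP would point to a second device answering for that address. A single MAC throughout, as reported here, makes an IP conflict less likely.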

Now here comes the part I cannot explain. While working on this issue I was convinced that a device in the network was causing it. I disconnected cables from the switches one by one, and at some point connectivity to the gateway was restored. After tracing the cable to the specific workstation, I found a computer in sleep mode, so it wasn't even on. I turned it on and ran ipconfig. It had a normal IP address from the DHCP pool. Anyway, connectivity to the gateway was restored and I called it a night.

The next day the office's connection ran perfectly fine until the end of the day. Then the issue started occurring again. To fix it I had to do the exact same thing, but this time the connection was restored after disconnecting different cables on another switch. Again, when I traced the cable to a workstation, there was no IP conflict on the computer. Also, once the connection to the gateway was restored, I reconnected the workstations to the switch and everything kept working fine. However, the connection to the gateway keeps going down at random, and the only way to fix it is by disconnecting cables from the switches. I can't figure out what is going on: it happens at random times, and every time I have to disconnect different cables to fix the problem.

Also, when the problem occurred, I connected my laptop straight into the inside interface of the Fortinet firewall and that worked perfectly fine, so I do not think the firewall is causing the problem.
What can be the issue here?

Any help is greatly appreciated.

Georg Pauwen · ‎11-02-2018

Hello,

a few initial thoughts:

Make sure the DHCP pool excludes the IP address of the default gateway. Also, since the Fortigate seems to be the exit point for the Internet, do you see anything in its logs?

Also, make sure all the SG switches are running the latest firmware, currently release 1.1.4.1.

Martijn de Loos · ‎11-02-2018

Hi Georg,

Thanks for your reply. All the switches run the latest firmware you mentioned. The DHCP pool goes from .10 to .200, so .1 - .9 are outside of the scope. These are for the firewall and the servers.
I checked the fortigate logs and even ran debug logging on it while it happened and it doesn't show me anything. As soon as the connection is restored by disconnecting end station cables, the firewall doesn't tell me anything either to point out what the issue was.

Georg Pauwen · ‎11-02-2018

Hello,

I don't know what you have already tried, so I'll just mention it: check the uptime of the switches and maybe even the Fortigate. Did you reboot all devices?

Georg Pauwen · ‎11-02-2018

You could also try and change the uplink port from the SG220 to the Fortigate...

Martijn de Loos · ‎11-02-2018

The switches were rebooted 2 days ago when I upgraded the firmware. After the reboot, the connection was restored until the next day. It seems that every time the end devices are disconnected from and reconnected to the network, the issue is fixed temporarily...

For the Fortinet, I replaced the entire device with a spare they had on the shelf: I restored the config on the spare and swapped it in for the production firewall. While the issue was occurring, I connected the internal interface of the firewall to multiple ports on multiple switches (one at a time, of course), but that didn't change anything either.

Martijn de Loos · ‎11-05-2018

So far everything has been up and running fine since Friday morning. It happened once more on Friday in the early morning (after upgrading the firmware on all the switches on Thursday night). The guy on-site ran an arp -a command on a workstation to check for incorrect or duplicate entries. There were none, and the MAC address was that of the firewall. After he issued the arp -a command on the workstation, the connection was stable again, and I cannot explain what an arp -a has to do with fixing this issue.

No outages on Friday or over the weekend and still doing fine so far on Monday morning, but I still don't have peace of mind on it as I still don't know what the root cause of the problem is. I feel like it is just a matter of time before it'll happen again.

Georg Pauwen · ‎11-05-2018

Hello Martijn,

I wonder what happens if you set up a continuous ping (ping -t) from one of the workstations to the Fortigate, as a sort of surrogate keepalive (the SG switches don't have that option)?

Martijn de Loos · ‎11-05-2018

I did this on one of the servers. It has been running since Friday morning. Had a little over a million pings with 0 packets lost. I will leave it running.
The SG switches can do pings but only a maximum of 65535 pings and it is done within a browser session to the switch.
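A plain ping -t window shows losses but not when they happened. The sketch below is one possible way to log timestamped state changes and then summarize outage windows. It shells out to the system ping command and assumes Windows or Linux flag syntax; the function names are made up for this example.

```python
import platform
import subprocess
import time
from datetime import datetime

def ping_once(host, timeout_s=1):
    """Send one echo request with the system ping; True if ping exits with 0.

    Caveat: on Windows, ping may exit with 0 for a "Destination host
    unreachable" reply, so treat the result as a hint, not proof.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:  # Linux syntax assumed; other systems interpret -W differently
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def outages(samples):
    """Group consecutive failed samples into (start, end, failed_count) windows.

    `samples` is an iterable of (timestamp, ok) pairs in time order.
    """
    windows = []
    start = last = None
    count = 0
    for ts, ok in samples:
        if not ok:
            if start is None:
                start, count = ts, 0
            last = ts
            count += 1
        elif start is not None:
            windows.append((start, last, count))
            start = None
    if start is not None:
        windows.append((start, last, count))
    return windows

def watch(host, interval_s=1.0, count=None):
    """Ping `host` repeatedly and print a line whenever reachability changes.

    Runs until Ctrl+C when `count` is None; returns the collected samples.
    """
    samples, previous, sent = [], None, 0
    try:
        while count is None or sent < count:
            ok = ping_once(host)
            now = datetime.now()
            samples.append((now, ok))
            if ok != previous:
                print(f"{now:%Y-%m-%d %H:%M:%S} {host} {'reachable' if ok else 'UNREACHABLE'}")
                previous = ok
            sent += 1
            time.sleep(interval_s)
    except KeyboardInterrupt:
        pass
    return samples
```

Calling `outages(watch("x.x.x.1"))` with the real gateway address and stopping it with Ctrl+C would give a list of outage windows. Those could be lined up against the times cables were unplugged or ports were shut down.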

paul driver · ‎11-05-2018

Hello

@Martijn de Loos wrote:

I can't figure out what is going on: it happens at random times, and every time I have to disconnect different cables to fix the problem.

Also, when the problem occurred, I connected my laptop straight into the inside interface of the Fortinet firewall and that worked perfectly fine, so I do not think the firewall is causing the problem.
What can be the issue here?

Any help is greatly appreciated.

1) Is it possible you are exceeding your maximum concurrent registered internet user allocation on the firewall?

2) You may be experiencing an intermittent loop in your network. One possible way to find the source would be to start extended pings from some users and, at the same time, disconnect and reconnect the uplinks to the access closets from the core switch one at a time. If the ping recovers at that moment, you have a starting point for where this possible loop is occurring. Then it is just a matter of repeating the same test downstream until you find the switch or host port that is causing the problem.

Please rate and mark as an accepted solution if you have found any of the information provided useful.
This then could assist others on these forums to find a valuable answer and broadens the community’s global network.

Kind Regards
Paul

Martijn de Loos · ‎11-05-2018

Thanks Paul. Really appreciate your help.

I suspected a loop in the network as well. So what I did last week was disconnect cables from switch 1 one by one. At some point the connection to the gateway was restored, so I traced the last cable I unplugged and it turned out to be a computer in sleep mode. I turned it on and checked ipconfig. It had a normal IP address and there was nothing else but that PC connected to that switchport.

The next day the issue occurred again, so I disconnected the same computer from switch 1 but this time it didn't resolve the problem. In fact, none of the ports on switch 1 fixed it at that time. So I moved on to switch number 2, disconnected cables one by one and again the issue got resolved at some point. This kept happening on and on night after night and every time it was a different workstation I had to disconnect in order to resolve the issue. Upon checking the workstations I don't see anything wrong with them.

As of Friday morning, all devices in the network have been connected and so far we haven't had a single issue, so I really don't see any difference compared to a week ago, that's why I fear it will happen again at some point in time.

I checked the firewall for the maximum concurrent users and it appears we have not hit the limit (right now all the devices are online in the network).

paul driver · ‎11-05-2018

Hello

I think this is going to be a matter of discovery, as it does indeed sound like you have an issue with some device looping. Do these clients have Wi-Fi and wired connectivity at the same time? Is it possible a client is bridging its network cards?

Do your switches have redundant interconnects to the core? As you only have 5 switches, you could record which ports should be in a forwarding state and which should be blocking as a baseline, and then check again at the time of an outage.
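The baseline comparison could be as simple as the sketch below. It assumes the port states have already been copied by hand from each switch into a dictionary. The port names and states shown are hypothetical examples, not output from an SG220.

```python
def diff_port_states(baseline, current):
    """Return {port: (baseline_state, current_state)} for every port that differs.

    A port missing on one side is reported with None on that side.
    """
    changes = {}
    for port in sorted(set(baseline) | set(current)):
        before, after = baseline.get(port), current.get(port)
        if before != after:
            changes[port] = (before, after)
    return changes

# Hypothetical example: states noted while the network was healthy...
baseline = {"po1": "forwarding", "gi12": "forwarding", "gi13": "blocking"}
# ...and again during an outage.
during_outage = {"po1": "forwarding", "gi12": "forwarding", "gi13": "forwarding"}

for port, (before, after) in diff_port_states(baseline, during_outage).items():
    print(f"{port}: {before} -> {after}")
```

A port that was blocking in the baseline but forwarding during the outage would be the first place to look for a loop.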

Apply some port protection such as BPDU guard, a port security maximum with a violation action, and broadcast/multicast storm control on the access ports, make sure you don't have BPDU filtering enabled where you shouldn't, and disable any unused ports.

Kind Regards
Paul

Martijn de Loos · ‎11-05-2018

Thanks. I will check it out if it happens again.
The switches are connected to the coreswitch with LACP etherchannels. 2 ports per etherchannel.

paul driver · ‎11-05-2018

Hello Martijn

Well then, may I suggest you also apply the UDLD and loop guard features. UDLD monitors for physical unidirectional failures and loop guard detects logical failures.

One thing I forgot to mention: DON'T enable any error recovery, so that you can capture any failure detected by the features I have mentioned.

Kind Regards
Paul

Martijn de Loos · ‎11-07-2018

I have been monitoring the network for the past couple of days. We had zero issues for 5 consecutive days, but last night the disruptions to the gateway came back. Despite lots of delays and disconnects, I was able to connect to the Cisco switches and check the logs. I had even turned on debug logging a while back, but again the logs did not show any information that could be related to this issue. I then started to shut down ports one by one. At some point I found the port that was causing the issue. As soon as I shut it down, the connection became stable again. When I re-enabled the port, the issue returned. I did this back and forth a couple of times to confirm that a device behind this connection was indeed causing the disruptions. I then left the port off until the next morning.

This morning we physically traced the cable and found a small 6-port D-Link switch behind it, with 2 computers and a printer connected to it. The computers and printer had no conflicting network settings. While we were checking the computers, strangely enough the disruption happened again. We then physically disconnected the D-Link switch from the wall jack and the connection became stable again, even though the switchport was already shut down on the switch itself, so I don't know why this made a difference.

Only an hour later the disruption to the gateway returned, while that D-Link switch, the computers and the printer were still physically disconnected from the network. Again, the switch logs didn't tell me anything, but even the coreswitch, to which the inside interface of the firewall is directly connected, was unable to ping the gateway. I checked the MAC table and confirmed that the IP address and MAC address of the firewall were associated with the right switchport on the coreswitch. Moments later the connection became stable again without us doing anything.
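A single look at the MAC table only shows where the firewall's MAC sits at that instant. As a sketch, several snapshots taken during an outage could be compared to see whether any MAC moves between ports, which is a common sign of a loop. This assumes each snapshot has been turned into a {mac: port} dictionary by hand or by some export. The MAC and port names are placeholders.

```python
def mac_moves(snapshots):
    """List the port changes per MAC across time-ordered {mac: port} snapshots.

    Returns {mac: [(old_port, new_port), ...]} for MACs that moved at least once.
    """
    last_port = {}
    moves = {}
    for table in snapshots:
        for mac, port in table.items():
            if mac in last_port and last_port[mac] != port:
                moves.setdefault(mac, []).append((last_port[mac], port))
            last_port[mac] = port
    return moves

# Hypothetical snapshots of the core switch MAC table, a few seconds apart.
snapshots = [
    {"00:09:0f:aa:bb:cc": "gi48"},
    {"00:09:0f:aa:bb:cc": "po2"},
    {"00:09:0f:aa:bb:cc": "gi48"},
]

for mac, changes in mac_moves(snapshots).items():
    print(mac, "moved", len(changes), "times:", changes)
```

If the firewall's MAC shows up on an uplink instead of its own port in some snapshots, frames carrying its source address are coming back from that direction. That narrows down which switch to test next.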

I'm running out of ideas and just cannot explain why that particular port caused the issue last night (I switched it off and on to confirm the problem came from that port, and it seemed it did) while this morning it seems to be coming from somewhere else again. And again, it is only the connection to the gateway that is having issues. Any other device in the LAN can be reached without an issue. We already replaced the firewall with a spare unit and that did not fix the issue either.
As these SG220 switches are brand new and the issues came up only a few days after installing them, I am considering switching back to the old switches and see how that goes. I'm running out of options with how inconsistent this problem is.