11-02-2018 09:13 AM - edited 03-08-2019 04:32 PM
Hello,
I am having an odd problem in a client's network and it is causing big issues. Please see the (simple) star topology below:
5x Cisco Small Business Switch SG220-50
1x Fortinet FortiWifi 60D firewall
A whole bunch of desktops and printers and servers
The problem is that at random times, with no consistency whatsoever, internal clients lose connectivity to the gateway only, which is at x.x.x.1. When this happens the entire office loses its internet connection. All other internal resources, such as servers and printers, remain available and reachable.
When the problem occurs, I run a continuous ping (ping -t) to the gateway's IP, and what I see is a mix of replies and timeouts. Because only the gateway is affected, I suspected that a machine in the network was assuming the gateway's IP address and causing an IP conflict, but when I check the ARP table on a computer and the MAC address table on the switches, I do not see anything conflicting. Also, when I disconnect the internal interface of the firewall from the network, all pings time out, so no other device in the network is answering for the gateway's IP address.
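For anyone wanting to repeat the ARP check over time rather than by eye, here is a minimal Python sketch. It assumes the Windows `arp -a` output layout, and the address 192.168.1.1 and the MAC in the sample are made up for illustration (the real gateway is x.x.x.1). It only compares saved snapshots; it does not run `arp` itself.

```python
import re

# Matches one entry of Windows `arp -a` output, for example:
#   192.168.1.1           00-09-0f-aa-bb-cc     dynamic
ARP_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:[-:][0-9A-Fa-f]{2}){5})\s+(\w+)"
)

# Hypothetical snapshot; addresses are illustrative only.
SAMPLE = """\
Interface: 192.168.1.50 --- 0xb
  Internet Address      Physical Address      Type
  192.168.1.1           00-09-0f-aa-bb-cc     dynamic
  192.168.1.255         ff-ff-ff-ff-ff-ff     static
"""


def parse_arp(text):
    """Return {ip: mac} for one snapshot; MACs are lower-cased with dashes."""
    table = {}
    for line in text.splitlines():
        match = ARP_LINE.match(line)
        if match:
            ip, mac, _entry_type = match.groups()
            table[ip] = mac.lower().replace(":", "-")
    return table


def macs_seen_for(ip, snapshots):
    """Collect every distinct MAC the given IP resolved to across snapshots."""
    seen = set()
    for text in snapshots:
        mac = parse_arp(text).get(ip)
        if mac:
            seen.add(mac)
    return seen
```

If `macs_seen_for` returns more than one MAC for the gateway across snapshots taken during and outside an outage, that would point to a second device answering for that address; a single MAC is consistent with what was observed above.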
Now here comes the weird part I cannot explain. While working on this issue I was convinced there was a device in the network causing it. I disconnected cables one by one from the switches, and at some point the connectivity to the gateway was restored. After tracing the cable to the specific workstation I found a computer in sleep mode, so it wasn't even on. I turned it on and ran ipconfig. It had a normal IP address from the DHCP pool. Anyway, the connectivity to the gateway was restored and I called it a night.
The next day the office's connection ran perfectly fine until the end of the day. Then the issue started occurring again. To fix it I had to do the exact same thing, but this time the connection was restored after disconnecting different cables on another switch. Again, when I traced the cable to a workstation, there was no IP conflict on the computer. Also, once the connection to the gateway had been restored, I reconnected the workstations to the switch and everything kept working fine.
However, the connection to the gateway keeps going down randomly, and the only way to fix it is by disconnecting cables from the switches. I can't figure out what is going on: it happens at random times, and each time I have to disconnect different cables to fix the problem.
Also, while the problem was occurring I connected my laptop straight into the inside interface of the Fortinet firewall, and that worked perfectly fine, so I do not think the problem is caused by the firewall.
What can be the issue here?
Any help is greatly appreciated.
11-02-2018 09:28 AM
Hello,
a few initial thoughts:
Make sure the DHCP pool excludes the IP address of the default gateway. Also, since the Fortigate seems to be the exit point for the Internet, do you see anything in its logs?
Also, make sure all the SG switches are running the latest firmware...currently release 1.1.4.1
11-02-2018 09:45 AM
11-02-2018 09:50 AM
Hello,
I don't know what you have already done, so I will just mention it: check the uptime of the switches and maybe even the Fortigate. Did you reboot all devices?
11-02-2018 09:56 AM
You could also try and change the uplink port from the SG220 to the Fortigate...
11-02-2018 10:02 AM - edited 11-02-2018 10:06 AM
The switches were rebooted 2 days ago when I upgraded the firmware. After the reboot, the connection was restored until the next day. It seems that every time the end devices are disconnected and reconnected to the network, that fixes the issue temporarily...
For the Fortinet, I replaced the entire device with a spare they had on the shelf: I restored the config on the spare and swapped it in for the production firewall. While the issue was occurring, I also connected the internal interface of the firewall to multiple ports on multiple switches (one at a time, of course), but that didn't do anything either.
11-05-2018 10:13 AM - edited 11-05-2018 10:23 AM
So far everything has been up and running fine since Friday morning. It happened once more early on Friday morning (after I upgraded the firmware on all the switches on Thursday night). The guy on-site ran arp -a on a workstation to check for incorrect or duplicate entries. There were none, and the MAC address for the gateway was that of the firewall. After he ran arp -a on the workstation, the connection was stable again, and I cannot explain what arp -a would have to do with fixing this issue.
No outages on Friday or over the weekend and still doing fine so far on Monday morning, but I still don't have peace of mind on it as I still don't know what the root cause of the problem is. I feel like it is just a matter of time before it'll happen again.
11-05-2018 10:30 AM
Hello Martijn,
I wonder what happens if you set up a permanent ping (ping -t) from one of the workstations to the Fortigate, as a sort of surrogate keepalive (the SG switches don't have that option)?
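If it helps, a permanent ping can also be scripted so that the outage times are recorded. This is only a rough Python sketch: the function names are my own, the ping flags are the usual Windows and Linux ones (macOS interprets `-W` differently), and on Windows an exit code of 0 does not always mean a genuine echo reply, so treat the results as a hint.

```python
import platform
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Send one echo request with the system ping; True if it exits with 0."""
    if platform.system() == "Windows":
        # -n count, -w timeout in milliseconds
        cmd = ["ping", "-n", "1", "-w", str(int(timeout_s * 1000)), host]
    else:
        # Linux-style flags: -c count, -W timeout in seconds
        cmd = ["ping", "-c", "1", "-W", str(int(timeout_s)), host]
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (first_failure, recovery) pairs.

    An outage still open at the end of the samples is reported with None
    as its recovery time.
    """
    windows = []
    start = None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp
        elif ok and start is not None:
            windows.append((start, timestamp))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def monitor(host, interval_s=1.0, count=60):
    """Ping `host` `count` times and return the outage windows observed."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), ping_once(host)))
        time.sleep(interval_s)
    return outage_windows(samples)
```

Calling `monitor("x.x.x.1", count=3600)` with the real gateway address would give a list of start and end times that can be lined up against the Fortigate and switch logs.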
11-05-2018 12:53 PM
11-05-2018 11:49 AM - edited 11-05-2018 11:50 AM
Hello
@Martijn de Loos wrote:
I can't figure out what is going on and the times it happens is randomly and also every time I have to disconnect different cables in order to fix the problem.
Also, when this problem occurs I tried connecting my laptop straight into the inside interface of the fortinet firewall and that was working perfectly fine so I do not think the problem is caused by the firewall.
What can be the issue here?
Any help is greatly appreciated.
1) Is it possible you are exceeding the maximum number of concurrent registered internet users allowed on the firewall?
2) You may be experiencing an intermittent loop in your network. One possible way to find the source would be to start extended pings from some users and, at the same time, disconnect and reconnect the uplinks from the core switch to each access closet, one at a time. If the ping recovers when a particular uplink is pulled, you have a starting point for where this possible loop is occurring. Then it is just a matter of doing the same test downstream until you find the switch or host port that is causing the problem.
11-05-2018 12:50 PM
11-05-2018 01:22 PM - edited 11-05-2018 01:24 PM
Hello
I think this is going to be a matter of discovery, as it does indeed sound like you have an issue with some device looping. Do these clients have wifi and wired connectivity at the same time? Is it possible a client is doing some sort of bridging between its network cards?
Do your switches have redundant interconnects to the core? As you only have 5 switches, you could record which ports should be in a forwarding state and which should be blocking as a baseline, and then check again at the time of an outage.
Apply some protection on the access ports, such as BPDU guard, port security (maximum MAC count and violation action), and storm control for broadcast and multicast. Make sure you don't have BPDU filtering enabled where you shouldn't, and disable any unused ports.
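The baseline comparison can be as simple as writing down the port states and diffing them. A small Python sketch, assuming you transcribe the spanning-tree state of each port by hand into a dictionary (the port names below are hypothetical, not taken from this network):

```python
def changed_ports(baseline, current):
    """Return {port: (baseline_state, current_state)} for ports that differ.

    A port missing from one side is reported with None for that side.
    """
    changes = {}
    for port in sorted(set(baseline) | set(current)):
        before = baseline.get(port)
        after = current.get(port)
        if before != after:
            changes[port] = (before, after)
    return changes


# Hypothetical example: states noted while the network is healthy...
baseline = {"gi49": "forwarding", "gi50": "blocking"}
# ...and again during an outage.
during_outage = {"gi49": "forwarding", "gi50": "forwarding"}

print(changed_ports(baseline, during_outage))
```

A port that was blocking in the baseline and is forwarding during an outage, as `gi50` is in this made-up example, would be the first place to look for a loop.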
11-05-2018 01:27 PM
11-05-2018 01:29 PM - edited 11-05-2018 01:31 PM
Hello Martijn
Well then, may I suggest you also apply the UDLD and loop guard features. UDLD monitors for physical unidirectional failures and loop guard detects logical failures.
One thing I forgot to mention: do NOT enable any automatic error recovery, so that you can capture any failure caught by the features I have mentioned.
11-07-2018 09:07 AM