09-29-2021 09:52 AM - edited 09-29-2021 09:54 AM
Pretty sure this is a server issue but posting here as there is a lot of knowledge on these forums.
3 hypervisors (Hyper-V, Server 2016) with backend storage. Each hypervisor has a teamed pair of NICs on the frontend bound to a vSwitch, plus 2 x 1Gbps NICs for storage.
management subnet - 192.168.5.128/27
storage1 subnet - 192.168.5.192/27
storage2 subnet - 192.168.5.224/27
ToR switches - 2 x 3750 in a stack running 12.2(55)SE10, which we run on multiple other switches with no issues.
The hypervisor team has 2 NICs, one to each switch in the stack; the storage1 NIC connects to switch 1 in the stack and the storage2 NIC to switch 2.
Each hypervisor has an IP out of each subnet and the default gateway is 192.168.5.129. The storage networks have no default gateway set in TCP/IP properties.
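For reference, the per-NIC addressing and the on-link storage routes can be confirmed on each hypervisor with something like the following (just a rough sketch, the storage2 prefix being the one from the layout above):
Get-NetIPConfiguration
Get-NetRoute -AddressFamily IPv4 -DestinationPrefix 192.168.5.224/27
route print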
So this has been working for over a year, and then today two of the hypervisors lost connectivity to each other and to the storage server on the storage2 subnet. I checked the switches and verified all ports are in the right VLANs, and each port had learnt the correct MAC address of the corresponding NIC on the server.
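For what it's worth, the switch-side checks were along these lines (the port and VLAN numbers below are only examples, not our actual assignments):
show vlan brief
show interfaces status
show mac address-table interface GigabitEthernet1/0/10
show mac address-table vlan 30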
This is the weird part -
From a hypervisor I did a traceroute to the other hypervisor on the storage2 network and -
1) the first time it came back with a destination host unreachable and that was it
2) the second time I ran it, and every time since, it has tried to reach the same IP via the default gateway (192.168.5.129). This makes no sense to me. The storage2 NIC is up, and the routing table on the hypervisor shows the storage2 subnet as on-link, so why would it even try the default gateway?
It is as though, because the server could not reach the other hypervisor via the storage2 NIC, it ruled that NIC out (even though it is up) and now just tries the default gateway.
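One quick way to see which interface and next hop Windows is actually picking for that destination (the .225/.226 addresses below are only example storage2 addresses for the two hypervisors, not our real ones):
Find-NetRoute -RemoteIPAddress 192.168.5.226
ping -S 192.168.5.225 192.168.5.226
arp -a -N 192.168.5.225
Find-NetRoute shows the route and source address selection, ping -S forces the source so the traffic has to use the storage2 NIC, and arp -a -N dumps the ARP cache for just that interface.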
At the moment I am trying to rule out any issues with the switches, but the stack is purely L2 with 3 VLANs and some EtherChannels - about as basic as it gets. As I say, the MAC address table on the switch is correct and matches the MAC addresses from ipconfig /all on each hypervisor.
All this time the 3rd hypervisor can still connect to the storage server, but not to the other two hypervisors.
Has anyone seen anything like this before, or can anyone spot something obvious I may have missed?
Jon
09-29-2021 11:19 AM
Hello Jon,
in the end you are probably going to have to reinstall both hypervisors, but what if you un-team the NICs (so that only one is left and active)?
09-29-2021 11:39 AM
Hi Georg
The storage NICs are not teamed, and they are the issue - not the team, which is the front-facing side, i.e. management and client traffic.
Interestingly, we had an issue a couple of weeks back with the same cluster and connectivity problems, but that time it was on the team side, and I fixed it by undoing the team on the two hypervisors that lost connectivity (the same two as now).
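For anyone following along, undoing the team on a 2016 host was just the standard LBFO cmdlets, roughly as below ("FrontendTeam" and "NIC2" are example names, not our real ones); removing a single member is the equivalent of Georg's suggestion of leaving one NIC active:
Get-NetLbfoTeam
Remove-NetLbfoTeamMember -Name "NIC2" -Team "FrontendTeam"
Remove-NetLbfoTeam -Name "FrontendTeam"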
We are going to totally rebuild the cluster, but I am just trying to rule out any issues with the switches, as we want to use another 3750 stack with the same IOS. We have other setups where this works fine, so I would really like to get to the bottom of this.
I am currently adding static ARP entries on the hypervisors and running packet captures on the servers to see what is happening.
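Roughly along these lines, with placeholder IP/MAC values and interface alias (not our real ones):
New-NetNeighbor -InterfaceAlias "Storage2" -IPAddress 192.168.5.226 -LinkLayerAddress "00-11-22-33-44-55" -State Permanent
netsh trace start capture=yes tracefile=C:\temp\storage2.etl maxsize=512
netsh trace stop
The resulting .etl can be converted with etl2pcapng and opened in Wireshark to confirm whether frames are actually leaving and arriving on the storage2 NICs.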
Jon
09-29-2021 12:14 PM
As a follow up to this, I ran some packet captures on the hypervisor NICs: traffic is seen leaving the storage2 NIC on the first hypervisor, but nothing is received on the storage2 NIC on the second hypervisor.
So it looks like we will have to do some port mirroring on the switch stack tomorrow.
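If it helps anyone later, the plan is just a local SPAN session on the stack, something like the below (source/destination ports are examples only), with a laptop running Wireshark on the destination port:
monitor session 1 source interface GigabitEthernet1/0/10 both
monitor session 1 destination interface GigabitEthernet1/0/24
show monitor session 1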
Jon