We had an outage when 2 of our 30 domains managed by a Central instance became "ungrouped" in Central at the same time. This made the VSANs and VLANs configured on Central unresolvable, which took SAN and network down on all hosts in these domains (18 hosts, 300+ VMs).
To fix:
- We moved the domains back to their respective domain groups.
- Since the VSANs were all configured on Central (they are referenced by global vHBA templates and SPs), when they became unresolvable on the domain, the FC ports (we are using standalone FC ports, not FC port channels) were switched back to the default VSAN 1. We had to manually change the VSAN on all of these ports back to the production ones referenced by our SPs' vHBA templates. After that, the majority of SP faults recovered (and the ESXi hosts became responsive again).
- A handful of hosts (3 of 18, all from one domain) did not recover. We tried disabling and resetting the vHBAs without success. When we rebooted the hosts, they could not find their boot LUN (during the BIOS/UEFI phase the vHBAs did not log into their front-end ports). The SAN team confirmed the settings on their end. In the end, disassociating the SPs from the blades and re-associating them restored the boot-from-SAN functionality.
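For anyone facing the same cleanup, the check for FC ports stuck on the wrong VSAN could be scripted along these lines. This is only a minimal sketch: it assumes the port inventory has already been exported into a list of records by some other means, and the `FcPort` fields, port IDs and VSAN IDs (100, 200) are made up for illustration. It does not call any UCS Manager or Central API.

```python
from dataclasses import dataclass

DEFAULT_VSAN = 1  # the VSAN the FC ports fell back to in this incident


@dataclass(frozen=True)
class FcPort:
    """Hypothetical record for one standalone FC uplink port."""
    fabric: str          # "A" or "B"
    port_id: str         # e.g. "1/1" (illustrative)
    current_vsan: int    # VSAN the port is on right now
    expected_vsan: int   # production VSAN the vHBA templates reference


def ports_needing_fix(ports):
    """Return the ports whose current VSAN differs from the expected one."""
    return [p for p in ports if p.current_vsan != p.expected_vsan]


# Illustrative inventory only; these values are not from a real domain.
inventory = [
    FcPort("A", "1/1", current_vsan=1, expected_vsan=100),
    FcPort("A", "1/2", current_vsan=100, expected_vsan=100),
    FcPort("B", "1/1", current_vsan=1, expected_vsan=200),
]

mismatched = ports_needing_fix(inventory)
for p in mismatched:
    note = " (on default VSAN)" if p.current_vsan == DEFAULT_VSAN else ""
    print(f"Fabric {p.fabric} port {p.port_id}: "
          f"VSAN {p.current_vsan}, expected {p.expected_vsan}{note}")
```

The expected VSAN per port would have to come from your own documentation or a known-good backup, since the Central-side definitions are exactly what became unresolvable.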
Some more background:
- We are running Central 2.0(1v), planning an update to 2.0(1w) shortly. I went through all the resolved and open caveats in the Central release notes and can't find anything relevant.
- Domain infra and blade firmware is 4.2(3h).
- Domains were in different domain groups before becoming ungrouped. There is nothing common about these domains, other than this Central instance, and domains/SPs being managed by referencing some of the same policies in Central. Other domains in the same domain groups were unaffected.
- ESXi 7.0 U3s with the relevant Cisco vendor add-on and certified nfnic & nenic as per the UCS HCL.
- Hosts are booting from SAN.
We logged an SR with Cisco TAC yesterday, but it is still early days in terms of the log analysis.
Has anyone ever experienced something similar?