With an 8-node HX cluster and LAZ enabled (4 zones, 2 nodes per zone), how many node failures can be tolerated while keeping the quorum? If the majority of servers must be online, then only 3 nodes could fail, right?
Or is it different with LAZ enabled?
Hi @smailmilak ,
The answer is one of those "It depends" answers.
If two nodes in the same zone fail, it is no worse than having one fail. In fact, you can lose ALL the servers in one zone with the loss of only one copy of any data - should you be so lucky.
With Replication Factor 3, you could lose ALL the servers in TWO zones, and you'd still have one good copy of all the data somewhere.
So, in your case with 8 servers, you could potentially lose up to four servers and still have a good copy of all the data, so long as those four servers were confined to two LAZs. On the other hand, you could lose three servers in three different LAZs, and you'd be in read-only mode.
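To make the zone counting concrete, here is a minimal Python sketch of the reasoning above. It is not how HyperFlex computes anything internally; the node and zone names are made up, and it assumes RF3 with each of the three copies of a stripe unit placed in a different zone.

```python
# Hypothetical layout: 8 nodes in 4 zones, 2 nodes per zone (names are invented).
ZONE_OF = {
    "node1": "zoneA", "node2": "zoneA",
    "node3": "zoneB", "node4": "zoneB",
    "node5": "zoneC", "node6": "zoneC",
    "node7": "zoneD", "node8": "zoneD",
}
REPLICATION_FACTOR = 3  # RF3: three copies, assumed to sit in three different zones


def zones_hit(failed_nodes):
    """Return the set of zones that contain at least one failed node."""
    return {ZONE_OF[node] for node in failed_nodes}


def copies_guaranteed(failed_nodes, rf=REPLICATION_FACTOR):
    """Worst-case number of copies of any stripe unit still available.

    Every zone with a failure may have cost one copy of some stripe unit,
    so the guarantee drops by one per affected zone, not per failed node.
    """
    return max(rf - len(zones_hit(failed_nodes)), 0)


# Four failed servers confined to two zones: one good copy is still guaranteed.
print(copies_guaranteed(["node1", "node2", "node3", "node4"]))  # 1
# Three failed servers in three different zones: no copy is guaranteed.
print(copies_guaranteed(["node1", "node3", "node5"]))  # 0
```

The point of the sketch is that the number of failed servers never appears in the result, only the number of zones they fall in.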
The idea of LAZ is to increase your chances of NOT losing data if multiple servers fail. If you have four servers, data is striped across four servers. If you have 4 zones, data is striped across 4 zones, so if you are lucky enough to lose two servers in the same zone (with 8 nodes in 4 zones, a random pair of failures lands in the same zone about one time in seven) then the hit is less painful.
I hope this helps
Thanks a lot for replying. I'm aware of LAZ and what it is for (two copies are never in the same zone, etc.), but it's still unclear whether a majority is necessary. Without LAZ we can lose 2 nodes (5+ nodes with RF3) and 3 remain, meaning that we have the quorum.
With 8 nodes/4 zones, if we lose 2 full zones we will have only 4 out of 8 nodes, and this is not a majority, right? In this case, will the cluster go offline?
Hi @smailmilak ,
OK. I think your problem is that you know too much about distributed databases and the concept of a quorum. In Hyperflex there is no quorum concept because, unlike a database that will have each shard replicated three times, each VM is split into many stripe units, which are distributed evenly across the cluster - and each stripe unit is replicated three times.
Without LAZ, the loss of data on any three nodes (be it disk drive or complete node) will put the system offline, because potentially at least one stripe unit of one VM is no longer available.
With LAZ, those three losses have to be across three different zones. Two losses in the same zone mean at most one copy of any stripe unit of any VM has been lost - in fact the entire zone can be lost and still at most one copy of any stripe unit has been lost.
So it's not about how many nodes we need to form a quorum, but about how many nodes we can lose without the potential of losing data. And without LAZ, it makes no difference if you have 5 or 30 nodes, you can't tolerate three losses. With LAZ, you can't tolerate losses in three different zones.
The truth is that I don't work with databases at all, but with HX. A quorum is needed for a standard HX cluster; that's why we can lose only 1 node if we have 4 nodes (RF3).
With the new 2-Node HX Edge we must have a witness that is actually running on Intersight.
I'm not sure how it is with LAZ in that case. Probably the same, meaning that the majority of servers must be UP.
True, Cisco does use the word "quorum" in regard to the two-node Hyperflex Edge, but that is the only place I've seen Cisco use that term in relation to Hyperflex, and it is appropriate because the stripe units for ALL VMs are present on each node. We could use the same logic and probably refer to a quorum with a three-node cluster using RF3 as well.
When it comes to LAZ, you need at least 8 nodes before LAZ kicks in. With fewer than 8 nodes, you can consider a cluster with n nodes to have n Logical Availability Zones, each with one node.
In either case, lose a drive or a whole node or even a whole zone: three strikes and you are out. But the strikes MUST be in different zones.
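As a rough illustration of the "three strikes in different zones" rule, the sketch below enumerates every possible set of three failed nodes in an 8-node cluster and counts how many of them touch three different zones. It treats a cluster without LAZ as one zone per node, as described above. The contiguous assignment of nodes to zones is an assumption made for the example, not a statement about how HyperFlex actually places nodes in zones.

```python
from itertools import combinations


def zone_map(num_nodes, nodes_per_zone=1):
    """Map node index -> zone index.

    nodes_per_zone=1 models a cluster without LAZ (every node is its own zone).
    Nodes are grouped contiguously, which is an assumption for illustration.
    """
    return {node: node // nodes_per_zone for node in range(num_nodes)}


def three_strike_sets(zones):
    """Count the 3-node failure sets that hit three different zones.

    Returns (sets_hitting_three_zones, total_three_node_sets).
    """
    triples = list(combinations(zones, 3))
    risky = sum(1 for t in triples if len({zones[n] for n in t}) == 3)
    return risky, len(triples)


without_laz = three_strike_sets(zone_map(8))      # 8 zones of 1 node each
with_laz = three_strike_sets(zone_map(8, 2))      # 4 zones of 2 nodes each
print(without_laz)  # (56, 56)
print(with_laz)     # (32, 56)
```

Under these assumptions, every one of the 56 possible three-node failures counts as three strikes without LAZ, while with 4 zones of 2 nodes only 32 of the 56 do; the other 24 include two nodes from the same zone and so count as only two strikes.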