
HX Replication Factor (shutdown cluster?)

Soporteco
Level 1

Hi, we need some help understanding why an HX cluster shuts down completely when we have a failure.

 

Let's say we have a 4-node HX cluster with RF2. If we remove 2 disks from different nodes, the cluster shuts down.

We understand that RF2 means 2 copies of the data, so we might be removing the data and its copy. The question is:

Why does the whole cluster need to shut down? Cisco presumably wants to prevent data loss, but why not block or shut down only the affected VMs? It's hard to explain to the customer that removing just 2 disks can impact the whole system, even if those disks only hold data for 1 virtual machine.

 

Appreciate any comments or explanations.

 

Thanks!

 

Accepted Solution

RedNectar
VIP

Hi @Soporteco ,

First of all, let me correct one very minor but very important point. You really need to specify:

If we remove 2 disks from different nodes simultaneously, the cluster is shut down

So let me begin with why the word simultaneously is so important.

Because HyperFlex is a fully distributed system (read more about that on this forum here), you need to understand that if only one disk fails, then after the initial 1-minute detection period ALL the controller VMs work as a team to rebuild the lost disk, so it is rebuilt much faster than in many competing systems. We are talking on the order of seconds to minutes rather than hours.

So it would be an unlucky day indeed to lose a second disk on a different node before the first is rebuilt. In other words, two simultaneous disk failures in two separate nodes is a corner case.
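To get a feel for how small that window of coincidence is, here is a back-of-envelope sketch. All the numbers in it (annual disk failure rate, rebuild window, cluster size) are my own illustrative assumptions, not Cisco figures:

```python
# Back-of-envelope estimate (illustrative numbers only, not Cisco data):
# probability that a second disk on a *different* node fails during the
# short rebuild window that follows a first disk failure.

AFR = 0.02                 # assumed annual failure rate per disk (2%)
HOURS_PER_YEAR = 24 * 365
REBUILD_HOURS = 0.5        # assumed rebuild window (minutes-scale, per the post)
OTHER_NODE_DISKS = 3 * 8   # assumed: 3 other nodes x 8 disks each

# Probability that one specific disk fails within the rebuild window
p_one = AFR * REBUILD_HOURS / HOURS_PER_YEAR

# Probability that at least one disk on a *different* node fails in that window
p_overlap = 1 - (1 - p_one) ** OTHER_NODE_DISKS

print(f"per-disk failure probability in window: {p_one:.2e}")
print(f"probability of an overlapping second failure: {p_overlap:.2e}")
```

Even with pessimistic inputs the overlap probability comes out tiny, which is the sense in which this is a corner case rather than an everyday risk.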

So now you have to make a balanced decision: is the tiny risk of losing two disks simultaneously big enough to justify Replication Factor 3 (RF3) rather than RF2? Remember, Cisco has three rules when it comes to losing customer data:

  1. Don’t lose customer data
  2. Don’t lose customer data
  3. Don’t lose customer data

Which is why Cisco would recommend using RF3 in the first place.

Now, on to your point about why not block or shut down only the affected VMs.

Something about this comment makes me feel that some competitor has figured out a way to make their system look better than HyperFlex by picking this very unlikely corner case of:

  • Using RF2
  • Losing two disks simultaneously. Oh, and conveniently forgetting the word simultaneously.

But to answer the question directly: Cisco gets its consistency in performance and speed by distributing the processing power AND data across multiple nodes, and across multiple disks within each node. This means that every VM potentially has its data distributed across every node and every disk, so losing two disks simultaneously could affect every VM. The case you quote, "even if those disks only have data for 1 virtual machine", is probably only EVER going to be true if there is only ONE VM on the entire system. As soon as you add a second VM, it becomes VERY likely that some of its data lands on one of those disks, especially on a small system.
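A toy simulation makes this concrete. This is not HyperFlex's actual placement algorithm, just a random-placement sketch with made-up cluster dimensions, but it shows how quickly "only 1 VM is affected" stops being true once data is striped everywhere:

```python
# Toy sketch (NOT HyperFlex's real placement logic): spread each VM's
# blocks across all nodes with 2 replicas on distinct nodes (RF2), then
# count how many VMs lose both copies of some block when two disks on
# different nodes are removed at the same time.
import random

random.seed(1)
NODES, DISKS_PER_NODE, VMS, BLOCKS_PER_VM, RF = 4, 4, 10, 200, 2

placement = {}  # (vm, block) -> list of (node, disk) replica locations
for vm in range(VMS):
    for blk in range(BLOCKS_PER_VM):
        replica_nodes = random.sample(range(NODES), RF)  # replicas on distinct nodes
        placement[(vm, blk)] = [(n, random.randrange(DISKS_PER_NODE))
                                for n in replica_nodes]

failed = {(0, 0), (1, 0)}  # disk 0 removed on node 0 AND on node 1

# A VM is affected if every replica of some block sits on a failed disk
affected = {vm for (vm, blk), replicas in placement.items()
            if all(r in failed for r in replicas)}
print(f"VMs that lost both copies of some block: {len(affected)} of {VMS}")
```

With data striped this widely, most VMs end up with at least one block whose two copies happen to sit on the two failed disks, which is why a two-disk loss under RF2 is treated as a cluster-wide event rather than a per-VM one.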

But, as you imply, it would not necessarily impact EVERY VM, so Cisco COULD figure out which VMs are affected and which are not. But at what cost? How much overhead would that take? And all just to cater for an unusual corner case.

So my guess is that Cisco weighed up the options and decided that the disadvantages of figuring out which VMs can keep running, when two nodes have simultaneous disk failures, outweigh the advantages. But it's only a guess.

I hope this helps


Don't forget to mark answers as correct if they solve your problem. This helps others find the correct answer when they search for the same problem.


RedNectar aka Chris Welsh.
Forum Tips: 1. Paste images inline - don't attach. 2. Always mark helpful and correct answers, it helps others find what they need.


3 Replies


Thank you so much for this detailed answer, RedNectar. I take your point, and it's true that losing 2 disks simultaneously is very unlikely.

 

Our case is just that the customer required a hyperconverged solution that could survive losing one entire node, or 3 disks simultaneously, without impacting operation. That was the requirement. But I understand that even with RF3 (on a 4-node cluster) HyperFlex won't survive 3 disk failures on 3 different nodes. We could lose 2 disks from one node and a third disk from another node, but NOT disks from 3 different nodes.
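That rule can be sketched as a tiny availability check. This is a simplification of HyperFlex's actual failure-domain logic (it treats nodes as the only failure domain and ignores rebuilds in progress), but it captures the distinction described above:

```python
# Simplified sketch of the rule above (not HX's exact resiliency logic):
# with replication factor RF, data stays available as long as disk
# failures are confined to fewer than RF distinct nodes.

def cluster_survives(failed_disks_per_node: dict, rf: int) -> bool:
    """failed_disks_per_node maps node name -> number of failed disks there."""
    nodes_with_failures = sum(1 for n in failed_disks_per_node.values() if n > 0)
    return nodes_with_failures < rf

# RF3 cases from the discussion above:
print(cluster_survives({"A": 2, "B": 1}, rf=3))          # failures in 2 nodes: survives
print(cluster_survives({"A": 1, "B": 1, "C": 1}, rf=3))  # failures in 3 nodes: does not
```

Under this model, 2 disks lost in one node plus 1 in another is only 2 affected failure domains, so RF3 still has a surviving copy; 1 disk in each of 3 nodes is 3 affected domains, which can eliminate all 3 copies of a block.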

 

And then came the question about why the whole cluster shuts down, but I guess it was a decision Cisco made, just as you mentioned, where the priority is to avoid losing data, taking into account your point that a single VM can have its data distributed across several disks, which explains why losing 2 disks could impact many VMs.

Hi @Soporteco ,

I strongly suspect that if your customer has specified "a hyperconverged solution that could support losing one entire node or 3 disks simultaneously without impacting the operation" then they have already made up their mind, and are just wasting everybody else's time by asking for other solutions. I used to work for a government department; I know how these things work sometimes.

RedNectar aka Chris Welsh.
Forum Tips: 1. Paste images inline - don't attach. 2. Always mark helpful and correct answers, it helps others find what they need.