Solved: Re: APIC-Cluster fail - Scalability limit

Heinz Kern · 03-06-2023

Hello, we have an APIC cluster of 5 nodes. My question is what happens if 2 of them fail:

Will the remaining ones take over the shards so that the cluster stays read-write, or will I be in read-only mode?
What happens to the scalability limit? Will there be issues with switches, or is it just a soft limit (meaning that I could also run 200 switches with 3 APICs, just without official support)?

hope it is clear

br + thx

Robert Burns · 03-06-2023

If you have a 5-node cluster and 2 nodes fail, then some data shards will go into R/O mode. You'll notice this when you try to change certain objects in the config and the submit throws a cluster health fault. Other objects, whose shards still have replicas on 2 or more of the remaining controllers, will remain fully R/W. You can verify shard health by looking at 'acidiag rvread' (RV = Replica Vector).
Scalability doesn't change: you're still configured for 5 nodes regardless of how many have failed, so the scale limits remain the same.
Robert
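
To illustrate the majority rule described above, here is a minimal Python sketch. It is only a model, not how the APIC actually places data: it assumes 3 replicas per shard and a hypothetical placement with one shard for every possible 3-node combination. The names `shard_mode` and `placements` are made up for this example.

```python
from itertools import combinations

REPLICAS_PER_SHARD = 3  # assumption for this sketch


def shard_mode(replica_nodes, failed_nodes):
    """Return 'rw' if a majority of the shard's replicas sit on healthy nodes, else 'ro'."""
    alive = [n for n in replica_nodes if n not in failed_nodes]
    return "rw" if len(alive) * 2 > len(replica_nodes) else "ro"


nodes = [1, 2, 3, 4, 5]
# Hypothetical placement: one shard per possible 3-node replica set.
placements = list(combinations(nodes, REPLICAS_PER_SHARD))
failed = {4, 5}

modes = {p: shard_mode(p, failed) for p in placements}
read_only = [p for p, m in modes.items() if m == "ro"]
read_write = [p for p, m in modes.items() if m == "rw"]

for placement in placements:
    print(placement, modes[placement])
```

In this model, only the shards with two of their three replicas on the failed nodes drop to read-only; every shard that still has two replicas on healthy controllers stays read-write.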

Heinz Kern · 03-07-2023

OK, so if I understand you correctly: by default, even with 3 working controllers remaining, they do not take over the shards that have only one replica active?

Furthermore, a standby controller would help, because I can use it to replace a failed one and get R/W again?

Robert Burns · 03-07-2023

Correct. Shard distribution doesn't change unless the cluster size changes. If you have a failure of any controller, the expectation is that it should be coming back into operation. There's also the concern about scale support. The only reason to extend beyond 3 APICs is to accommodate a scale increase. If all shards were replicated to the remaining controllers in a larger cluster, it could quickly exceed the supported/tested scale limits for a controller. Leaving the distribution alone also avoids unnecessary DB shuffling activity. If you change the cluster size, then the lost shards will be replicated from the remaining copy on the master, but the fabric scale would decrease as well.

Replacing a failed controller with a standby would help you regain full R/W operations as you would be restored to a majority across all shard replicas.

Robert
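
As a rough illustration of the standby point, the sketch below models what happens when one failed controller is replaced. It assumes 3 replicas per shard and that the standby takes over the failed node's ID together with its shard replicas; the shard names and their placement are hypothetical and only chosen to show the effect.

```python
def writable(replica_nodes, healthy_nodes):
    """A shard is writable when more than half of its replicas are on healthy nodes."""
    alive = sum(1 for n in replica_nodes if n in healthy_nodes)
    return alive * 2 > len(replica_nodes)


# Hypothetical shards and replica placement on a 5-node cluster.
shards = {
    "shard-a": (1, 2, 3),
    "shard-b": (2, 3, 4),
    "shard-c": (3, 4, 5),
    "shard-d": (1, 4, 5),
}

# Nodes 4 and 5 have failed.
healthy = {1, 2, 3}
before = {name: writable(nodes, healthy) for name, nodes in shards.items()}

# Assumption: a standby is promoted in place of node 4; node 5 is still down.
healthy_after = healthy | {4}
after = {name: writable(nodes, healthy_after) for name, nodes in shards.items()}

print("before:", before)
print("after: ", after)
```

Under these assumptions, replacing just one of the two failed controllers is enough to give every shard two healthy replicas out of three again, which matches the "majority across all shard replicas" reasoning.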