Solved: Re: Multiple disk failure in APIC

Abhishek Nanda · 05-10-2022

One of the APICs in a cluster of 5 has had two of its disks fail. I would like to understand the storage layout. From the CIMC I can see two RAID volumes: one is RAID-1 and the other is RAID-0. On the other APICs, the boot variable is set to true on the RAID-1 volume. Should I copy the configuration of the other APICs, that is, PD 1 and 2 as VD-1 (RAID-1, boot) and PD 3 as VD-2 (RAID-0)? Here PD 1 is an SSD and PD 2 and 3 are HDDs.

When I received the replaced disk, I tried to install the OS with the boot option set on different VDs in turn, but each time it got stuck during the boot-up process. Now I have had both disks replaced. Can anyone shed some light on how this works, and on how shards replicate in a 5-controller cluster?

Sergiu.Daniluk · 05-10-2022

Hi @Abhishek Nanda

I believe you will find all the answers to your questions about SSD replacement on APICs here:

https://www.cisco.com/c/en/us/support/docs/cloud-systems-management/application-policy-infrastructure-controller-apic/215166-apic-ssd-replacement.html

About shard replication: APIC uses a replication factor of 3, meaning each shard has 3 instances/replicas (one active and two backup) across the cluster, regardless of the number of APIC nodes present in the cluster. This means that the replicas will be distributed across all 5 APICs. Because of this, it is important to know what happens in case of node failures:

- if one node fails - all good, the cluster is still in RW

- if two nodes fail - depending on the distribution, SOME shards will definitely be in read-only: any shard that had two of its three replicas on the failed nodes is left with a single replica, which is not a majority.
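To make the failure cases above concrete, here is a minimal Python sketch. It is only an illustration: the round-robin placement in `place_replicas` and the shard count of 5 are assumptions made for the example, not the actual APIC placement algorithm. The only facts taken from the post are the replication factor of 3 and the rule that a shard needs a majority of its replicas to stay read-write.

```python
from itertools import combinations

REPLICATION_FACTOR = 3  # from the post: each shard has 3 replicas


def place_replicas(num_nodes, num_shards):
    """Hypothetical placement: shard s sits on 3 consecutive nodes (round-robin).

    This is NOT the real APIC algorithm, just a simple way to spread replicas.
    """
    return {
        shard: {(shard + i) % num_nodes + 1 for i in range(REPLICATION_FACTOR)}
        for shard in range(1, num_shards + 1)
    }


def read_only_shards(placement, failed_nodes):
    """Return the shards that no longer have a majority of replicas alive."""
    majority = REPLICATION_FACTOR // 2 + 1  # 2 out of 3
    failed = set(failed_nodes)
    return sorted(
        shard for shard, nodes in placement.items()
        if len(nodes - failed) < majority
    )


placement = place_replicas(num_nodes=5, num_shards=5)

# One node down: every shard still has at least 2 of 3 replicas.
print(read_only_shards(placement, [1]))      # [] -> cluster stays RW

# Two nodes down: shards with 2 replicas on those nodes lose their majority.
print(read_only_shards(placement, [1, 2]))   # [4, 5] -> read-only

# With this toy placement, every possible pair of failed nodes hits some shard.
pairs = list(combinations(range(1, 6), 2))
print(all(read_only_shards(placement, pair) for pair in pairs))  # True
```

Which shards go read-only depends entirely on the placement, so the shard numbers printed here mean nothing for a real cluster; the point is only that with 5 nodes and 3 replicas per shard, a two-node failure leaves some shards in the minority.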

Hope it helps,

Sergiu

Abhishek Nanda · 05-11-2022

Hi Sergiu, Have a good day!

Thank you for this valuable reply. Is there any way to see how many shards are created and who leads which shard?

Sergiu.Daniluk · 05-11-2022

Hi @Abhishek Nanda

Yes, you can check the states of the shards/replicas.

The basic commands are:

acidiag rvread 
acidiag rvread <service>
acidiag rvread <service> <shard> 
acidiag rvread <service> <shard> <replica>

Check Table 1 for the list of services:

https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/1-x/troubleshooting/b_APIC_Troubleshooting/b_APIC_Troubleshooting_appendix_010001.html#reference_E1C4EF57684F4736AEE735FEC3B35CD3

There is also an example of how to check the leader for a shard:

apic1# acidiag rvread 6 3  
(6,3,1)  st:6 lm(t):3(2014-10-16T08:48:20.238+00:00) le: reSt:LEADER voGr:0 cuTerm:0x19 lCoTe:0x18 
    lCoIn:0x1800000000001b2a veFiSt:0x31 veFiEn:0x31 lm(t):3(2014-10-16T08:48:20.120+00:00) 
    lastUpdt 2014-10-16T09:08:30.240+00:00
(6,3,2)  st:6 lm(t):1(2014-10-16T08:47:25.323+00:00) le: reSt:FOLLOWER voGr:0 cuTerm:0x19 lCoTe:0x18 
    lCoIn:0x1800000000001b2a veFiSt:0x49 veFiEn:0x49 lm(t):1(2014-10-16T08:48:20.384+00:00) lp: clSt:2 
    lm(t):1(2014-10-16T08:47:03.286+00:00) dbSt:2 lm(t):1(2014-10-16T08:47:02.143+00:00) stMmt:1 
    lm(t):0(zeroTime) dbCrTs:2014-10-16T08:47:02.143+00:00 lastUpdt 2014-10-16T08:48:20.384+00:00
(6,3,3)  st:6 lm(t):2(2014-10-16T08:47:13.576+00:00) le: reSt:FOLLOWER voGr:0 cuTerm:0x19 lCoTe:0x18 
    lCoIn:0x1800000000001b2a veFiSt:0x43 veFiEn:0x43 lm(t):2(2014-10-16T08:48:20.376+00:00) 
    lastUpdt 2014-10-16T09:08:30.240+00:00
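If you want to pull the leader out of that output with a script, something like the following Python sketch could work. It assumes only what is visible in the sample above: each replica record starts with `(service,shard,replica)` and carries a `reSt:` field with the role. The function names are made up for this example, and the output format may differ between APIC releases.

```python
import re

# A replica record starts with "(service,shard,replica)", e.g. "(6,3,1)".
RECORD_START = re.compile(r"^\((\d+),(\d+),(\d+)\)")
# The role of the replica is in the reSt field, e.g. "reSt:LEADER".
ROLE = re.compile(r"reSt:(\w+)")


def parse_replica_roles(output):
    """Map (service, shard, replica) -> role string from the reSt field."""
    roles = {}
    current = None
    for line in output.splitlines():
        start = RECORD_START.match(line)
        if start:
            current = tuple(int(g) for g in start.groups())
        role = ROLE.search(line)
        if role and current is not None:
            roles[current] = role.group(1)
    return roles


def leaders(roles):
    """Map (service, shard) -> replica id of the replica reporting LEADER."""
    return {key[:2]: key[2] for key, role in roles.items() if role == "LEADER"}


# First line of each record, copied from the sample output in this post.
sample = """\
apic1# acidiag rvread 6 3
(6,3,1)  st:6 lm(t):3(2014-10-16T08:48:20.238+00:00) le: reSt:LEADER voGr:0 cuTerm:0x19 lCoTe:0x18
(6,3,2)  st:6 lm(t):1(2014-10-16T08:47:25.323+00:00) le: reSt:FOLLOWER voGr:0 cuTerm:0x19 lCoTe:0x18
(6,3,3)  st:6 lm(t):2(2014-10-16T08:47:13.576+00:00) le: reSt:FOLLOWER voGr:0 cuTerm:0x19 lCoTe:0x18
"""

roles = parse_replica_roles(sample)
print(leaders(roles))  # {(6, 3): 1}
```

For service 6, shard 3 in the sample, replica 1 is the one reporting LEADER and the other two report FOLLOWER.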

Stay safe,

Sergiu

Abhishek Nanda · 05-11-2022

Hi Sergiu,

Thank you for the reply and for sharing your knowledge. It is really helpful.

Sergiu.Daniluk · 05-11-2022

You're very welcome! Happy to hear that the information I shared is useful to you.