
fr_fiocre · ‎10-29-2014

Hello everyone,

I have a question about the Cisco solution for avoiding split brain in a UCS domain.

I tested this solution, and the first test was conclusive: it worked well. After that, I wanted to repeat the same test after changing the primary Fabric Interconnect.

After changing my primary FI, I ran the test a second time. This time I completely lost my GUI and could not reconnect to it. Production was not affected; I only lost the GUI. After I reconnected the L1/L2 links, everything was OK again.

I wanted to know whether the problem came from my Fabric Interconnects, so I ran the test on a second UCS domain and got the same problem.

Has anyone seen this problem before?

Configuration:

2 Fabric Interconnects

3 Chassis

3 Blades per chassis

Firmware version: 2.2(3b)

Walter Dey · ‎10-29-2014

Can you be more concrete: what was your test procedure? What do you mean by "...after changing my primary FI..."?

Q. Did you enable management interface monitoring?

Admin -> Communication Management -> Management Interfaces -> Management Interface Monitoring Policy

Keny Perez · ‎10-29-2014

I agree with Walter, you are missing a lot of details about what you did and when you did it... If you changed the FI role (primary/subordinate) and did not give the system enough time to fail over, you might be seeing something expected.

-Kenny

fr_fiocre · ‎10-29-2014

OK, I will try to be more precise.

I will list each step of my procedure:

1 - UCS domain in initial state

- Fabric Interconnect A = Primary

- Fabric Interconnect B = Subordinate

- 3 chassis operable

- Management Interface Monitoring configured in MII Status mode

2 - Disconnect the L1/L2 links

- Wait for quorum

- Check the cluster with the "show cluster extended-state" command: everything is OK

- After several minutes the GUI is OK and I can continue to use it

3 - Go back to the initial state

- Reconnect the L1/L2 links

- Wait for the errors to disappear

4 - Change the Fabric Interconnect roles

- I give the lead to Fabric B with the "cluster lead b" command

- Wait 15 minutes to be sure that everything is OK

- Check in the cluster state and in the GUI that Fabric B is primary

5 - Disconnect the L1/L2 links again

- Wait for quorum

- Check the cluster with the "show cluster extended-state" command: everything is OK

- After several minutes the GUI disconnects and I can't connect anymore (connection timeout when I try to connect to it)

6 - Go back to the initial state

- Reconnect the L1/L2 links

- I can access the GUI again

I hope this description helps.

Thanks for the help.

Keny Perez · ‎10-29-2014

Were you able to ping FI-B (not the cluster IP) when that happened?

-Kenny

Walter Dey · ‎10-29-2014

The master FI is the one that runs the UCS Manager application; moving the master role from A to B (with the CLI!) also moves UCS Manager, so it is expected that you lose the session.

Did you really connect to the VIP address? And was Management Interface Monitoring enabled?

It is totally irrelevant which FI plays master and which plays slave. Both fabrics are active for FC and Ethernet frame switching.

fr_fiocre · ‎10-30-2014

I know that; I was on the correct Fabric Interconnect.

Even when I try to connect to the VIP, I get a connection timeout.

I can ping all of my IPs: Fabric A, Fabric B and the VIP.

And Management Interface Monitoring was enabled.
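
Since ping succeeds while the GUI times out, it may help to test the TCP ports the GUI is served on, separately for each FI address and for the VIP. The following is a minimal sketch, assuming the default HTTP/HTTPS ports 80 and 443; the addresses are placeholders from the documentation range and the function names are illustrative, not part of any Cisco tool.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_targets(targets, ports=(80, 443), timeout=3.0):
    """Map each (name, port) pair to True/False TCP reachability."""
    results = {}
    for name, host in targets.items():
        for port in ports:
            results[(name, port)] = tcp_port_open(host, port, timeout)
    return results


if __name__ == "__main__":
    # Hypothetical addresses; replace with the real FI-A, FI-B and VIP addresses.
    targets = {"FI-A": "192.0.2.11", "FI-B": "192.0.2.12", "VIP": "192.0.2.10"}
    for (name, port), ok in check_targets(targets).items():
        print(f"{name} tcp/{port}: {'open' if ok else 'no response'}")
```

Running this before and after the L1/L2 disconnect would show whether the timeout affects only the VIP or also the management address of the primary FI.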

Walter Dey · ‎10-30-2014

Very strange!

Q. Did you check in 6) above that UCSM is running on FI B?

Could you check the MAC addresses of FI A, FI B and the VIP, and verify whether the ping reply for the VIP is coming from FI B?

I can only guess that UCSM was moved to B, but the VIP is still pointing to A?

Bug?

Keny Perez · ‎10-30-2014

I re-read the thread to be sure I was not missing anything, and I see that you mentioned "Cluster checking with show "cluster extended state" command everything is ok" ... When you checked, did the output specify that HA was ready? And did you also confirm that, at the bottom of the "show cluster extended-state" output, each chassis showed as "active" and not "active with errors" or "pending IO transactions"? It is not just a matter of checking that the FI is the new primary; there are internal changes that need to take place, and even though it should not take 30 minutes, in a non-healthy environment it is not going to take 30 seconds either.

I think a good option here would be to reproduce the scenario and paste here the output of "show cluster extended-state" and of the following commands, captured at different intervals during the testing:

UCS-250-B# connect nxos a|b << use "a" or "b" depending on the FI you're testing

UCS-250-A(nxos)# show int mgmt 0

UCS-250-A(nxos)# exit

UCS-250-B# connect local-mgmt a|b << use "a" or "b" depending on the FI you're testing

UCS-250-A(local-mgmt)# show mgmt-ip-debug

HTH,

-Kenny
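
When capturing "show cluster extended-state" at several intervals, a small script can flag the markers mentioned above in each saved capture. This is only a sketch: the sample text is a hypothetical, abbreviated stand-in and not real output, and the marker strings are simply the ones quoted in this post.

```python
# Markers quoted in the thread that indicate the cluster has not settled yet.
PROBLEM_MARKERS = ("active with errors", "pending io transactions")


def summarize_cluster_state(output: str) -> dict:
    """Report whether 'HA READY' appears and which lines contain a problem marker."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    # "HA NOT READY" does not contain the substring "HA READY", so this is safe.
    ha_ready = any("ha ready" in ln.lower() for ln in lines)
    problem_lines = [
        ln for ln in lines
        if any(marker in ln.lower() for marker in PROBLEM_MARKERS)
    ]
    return {"ha_ready": ha_ready, "problem_lines": problem_lines}


# Hypothetical, abbreviated sample; real output contains more fields.
SAMPLE = """\
A: UP, SUBORDINATE
B: UP, PRIMARY
HA READY
Chassis 1, state: active
Chassis 2, state: active with errors
"""

if __name__ == "__main__":
    print(summarize_cluster_state(SAMPLE))
```

A capture would count as settled only when `ha_ready` is true and `problem_lines` is empty.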

fr_fiocre · ‎11-03-2014

I see that there are some internal changes to wait for. I can't access my UCS domain at the moment, but when I used the "show cluster extended-state" command I saw that all my chassis were active. Do you know how long it takes to apply a fabric failover with the "cluster lead" command? Do you know where I can find this information?

L1/L2 Link Failure Test Issue