10-29-2014 03:57 AM - edited 03-01-2019 11:53 AM
Hello everyone,
I have a question about the Cisco solution for avoiding split-brain in a UCS domain.
I tested this solution, and the first test was conclusive; it worked well. After that, I wanted to run the same test again, but after changing the primary Fabric Interconnect.
After changing my primary FI, I ran the test a second time. The issue is that I completely lost the GUI and it was impossible to reconnect to it. No impact on production; I only lost the GUI. After reconnecting the L1/L2 links, everything was OK again.
To rule out a problem with my Fabric Interconnects, I ran the same test on a second UCS domain and hit the same problem.
Has anyone seen this problem before?
Configuration:
2 Fabric Interconnect
3 Chassis
3 Blades per chassis
Firmware version : 2.2(3b)
10-29-2014 05:33 AM
Can you be more concrete: what was your test procedure? And what do you mean by "...after changing my primary FI..."?
Q. Did you enable management interface monitoring?
Admin->Communication Management-> Management Interfaces-> Management Interface Monitoring Policy
10-29-2014 08:40 AM
I agree with Walter, you are missing a lot of details of what you did and when you did it... If you changed the FI role (primary/subordinate) and did not give the system enough time to fail over, you might be seeing expected behavior.
-Kenny
10-29-2014 09:26 AM
OK, I will try to be more precise.
I will list each step of my procedure:
1 - UCS domain in initial state
-Fabric Interconnect A = Primary
-Fabric Interconnect B = Subordinate
-3 Chassis Operable
-Management Interface Monitoring configured with MII status
2 - Disconnect the L1/L2 links
- Wait for quorum
- Check the cluster with the "show cluster extended-state" command; everything is OK
- After several minutes the GUI is OK and I can continue to use it
3 - Return to the initial state
- Reconnect the L1/L2 links
- Wait for the errors to disappear
4 - Change the Fabric Interconnect roles
- I give the lead to Fabric B with the "cluster lead b" command
- Wait 15 minutes to be sure that everything is OK
- Check in the cluster state and in the GUI that Fabric B is primary
5 - New L1/L2 disconnect
- Wait for quorum
- Check the cluster with the "show cluster extended-state" command; everything is OK
- After several minutes the GUI disconnects and I can't connect anymore (connection timeout when I try to connect to it)
6 - Return to the initial state
- Reconnect the L1/L2 links
- I can access the GUI again
I hope this procedure helps.
Thanks for the help
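For completeness, the role change in step 4 and the health check in steps 2 and 5 were driven from the CLI. A minimal transcript of the commands named above (prompts and ordering illustrative, not copied from my session) would look like:

```
UCS-A# connect local-mgmt
UCS-A(local-mgmt)# cluster lead b
UCS-A(local-mgmt)# show cluster extended-state
```

After "cluster lead b", I repeated "show cluster extended-state" until the cluster reported healthy before running the next disconnect test.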
10-29-2014 09:59 AM
Were you able to ping FI-B (not the cluster IP) when that happened?
-Kenny
10-29-2014 10:20 AM
The primary FI is the one that runs the UCS Manager application; moving the primary role from A to B (via the CLI!) also moves UCS Manager, so it is expected that you lose your session.
Did you really connect to the VIP address? And was Management Interface Monitoring enabled?
It is otherwise irrelevant which FI is primary and which is subordinate: both fabrics are active for FC and Ethernet frame switching.
10-30-2014 03:41 AM
I know that; I was on the correct Fabric Interconnect.
Even if I try to connect to the VIP, I get a connection timeout.
I can ping all of my IPs: Fabric A, Fabric B, and the VIP.
And my Management Interface Monitoring was enabled.
10-30-2014 04:28 AM
Very strange!
Q. Did you check in step 6 above that UCSM is running on FI B?
Could you check the MAC addresses of FI A, FI B, and the VIP, and verify whether the ping reply for the VIP is coming from FI B?
My only guess is that UCSM is moved to B, but the VIP is still pointing to A.
A bug?
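One way to test this guess from a workstation on the management network (all addresses below are placeholders, not values from this thread) is to compare the MAC address that answers for the VIP against the mgmt0 MAC of each FI:

```
$ ping -c 3 192.0.2.10      # VIP (placeholder address)
$ arp -n 192.0.2.10         # MAC currently answering for the VIP
$ arp -n 192.0.2.11         # FI-A mgmt0 (placeholder address)
$ arp -n 192.0.2.12         # FI-B mgmt0 (placeholder address)
```

If the VIP resolves to FI-A's MAC while UCSM is running on FI-B, that would confirm the VIP did not follow the primary role.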
10-30-2014 06:17 AM
I re-read the thread to be sure I was not missing anything, and I see you mentioned "Check the cluster with the "show cluster extended-state" command; everything is OK"... When you checked, did the output state that HA was ready? And did you also confirm that, next to each chassis at the bottom of the "show cluster extended-state" output, the chassis showed as "active" and not "active with errors" or "pending IO transactions"? It is not just a matter of checking that the FI is the new primary; there are internal changes that need to take place, and even though it should not take 30 minutes, in a non-healthy environment it will not take 30 seconds either.
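For reference, on a healthy cluster after a role change, the relevant parts of the output look roughly like this (all identifiers and elided values are illustrative); the lines to check are "HA READY" and the per-chassis "state: active":

```
UCS-250-B(local-mgmt)# show cluster extended-state
Cluster Id: 0x...
B: UP, PRIMARY
A: UP, SUBORDINATE
...
HA READY
Detailed state of the device selected for HA storage:
Chassis 1, serial: ..., state: active
Chassis 2, serial: ..., state: active
Chassis 3, serial: ..., state: active
```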
I think a good option here would be to reproduce the scenario and paste the output of the "show cluster extended-state" command here, along with the following commands, at different time intervals during the test:
UCS-250-B# connect nxos a|b << use "a" or "b" depending on the FI you're testing
UCS-250-A(nxos)# show int mgmt 0
UCS-250-A(nxos)# exit
UCS-250-B# connect local-mgmt a|b << use "a" or "b" depending on the FI you're testing
UCS-250-A(local-mgmt)# show mgmt-ip-debug
HTH,
-Kenny
11-03-2014 02:15 AM
I see that there are some internal changes to wait for. I can't access my UCS domain at the moment, but when I used the "show cluster extended-state" command, I saw that all of my chassis were active. Do you know how long it takes to complete a fabric failover triggered with the "cluster lead" command? And do you know where I can find this information?