12-30-2015 07:23 PM - edited 03-12-2019 10:21 AM
The Split Brain Recovery condition in Unity Connection is an odd state to find your cluster in, for sure. One thing is certain, though: 'something' happened, and this is the result of whatever that 'something' was or is.
When this condition occurs, what you'll notice (either from logs or by watching the cluster node status) is that the Primary and HA servers take turns being the cluster's primary node every few minutes. This happens because the two nodes have somehow ended up with slightly different database versions, so the Server Redundancy Manager service cannot determine which server should be primary. The cluster will try to resolve the issue on its own, but often cannot.
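To make the "taking turns" symptom concrete, here is a minimal Python sketch that flags a flapping primary from a series of observations. Everything in it is an assumption for illustration: the node names, the sample data, the 30-minute window and the threshold of three role changes are hypothetical, and you would have to collect the observations yourself (for example by noting which node reports as primary each time you check the cluster status).

```python
from datetime import datetime, timedelta


def count_primary_changes(samples):
    """Count how many times the primary role moved between nodes.

    samples: list of (timestamp, primary_node_name) tuples, oldest first.
    """
    changes = 0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if previous != current:
            changes += 1
    return changes


def looks_like_flapping(samples, window=timedelta(minutes=30), threshold=3):
    """Return True if the primary changed at least `threshold` times
    within the most recent `window`. Both defaults are arbitrary choices."""
    if not samples:
        return False
    cutoff = samples[-1][0] - window
    recent = [s for s in samples if s[0] >= cutoff]
    return count_primary_changes(recent) >= threshold


# Hypothetical observations, one every five minutes.
start = datetime(2015, 12, 30, 19, 0)
observed = [
    (start + timedelta(minutes=5 * i), name)
    for i, name in enumerate(["node-a", "node-b", "node-a", "node-b", "node-a"])
]
print(looks_like_flapping(observed))
```

A healthy cluster should show the same primary in every sample, so any count above zero is worth a look; the threshold only separates a one-off failover from the back-and-forth pattern described here.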
The good news (maybe not so good) is that in several cases, the cluster will continue to function and answer calls during a split brain. Where I have seen the split brain condition cause a service outage for users is with SCCP integrations that do not have sufficient SCCP ports on both cluster nodes.
A number of things can cause this; often it is the result of the two nodes losing network communication with each other, a service failure, or both. The resolution is fairly simple, although you'd be best served to figure out why it happened in the first place, lest you repeat it.
Since Unity Connection runs the same operating system as Unified Communications Manager, you can run the same health checks on Unity Connection that you would run on your Unified Communications Manager. You'll want to resolve any issues discovered in those health checks before resolving the split brain issue.
Assuming you have a healthy Unity Connection cluster and/or have discovered and resolved the issue that caused the split brain, you'll want to move on to resolution.
In some instances (usually depending on how long the cluster was in this state), even more action is necessary, and you may need to reset the cluster replication (done via the CLI of the primary server with utils dbreplication reset *) if the cluster's database replication is damaged. The need to do this will be discovered in your final set of health checks.
Thanks Ryan for this post, I'm currently troubleshooting this exact problem. In my case Cisco TAC is pointing to Cisco Bug CSCug53756 because of the presence of core dump ServM files on the servers.
https://quickview.cloudapps.cisco.com/quickview/bug/CSCug53756
But I wonder whether the presence of these core dumps is a result of the problem you described, and not the cause. During the Split Brain recovery, the system load goes through the roof on both the Publisher and Subscriber server making the entire Unity system nonresponsive to even CLI, and I'm guessing ServM is crashing because of the high load.
Anyone have any thoughts on this? Cisco assures me upgrading to the latest release of my Unity Connection version will resolve the issue...
While I have never faced the particular bug you're referencing, if TAC is willing to assure resolution by upgrading, I would consider it. Unity Connection, IMO, is the easiest to upgrade and least likely to have compatibility issues.
However, if you are still in a split brain recovery state, you may also consider the resolution I proposed, if you have not tried that already. It may be a quicker resolution than an upgrade, especially if you're not expecting to have to do it.
Thanks Ryan, your resolution actually is what it took to get my Unity Connection working again. I really appreciate finding your post on this issue.
I think my problem was caused by a couple momentary blips in network access due to spanning-tree recalculation that happened right around the time that my Unity problems started.
Prior to shutting down both the Publisher and Subscriber, the symptoms I was having were the Split Brain problem described above, along with periods of enormously high load on both servers (+200 Load Average), making even the CLI unresponsive. I'm guessing the high load is what ultimately led to the ServM core dumps, but I have no way of knowing. In any case, my Unity Connection cluster was completely unusable during those periods. For the moment my Unity Connection is back to functioning correctly while running only the Publisher. The Subscriber is still shut off; I'll probably turn it on next week.
TAC seems to be ignoring the Split Brain symptoms and only wants to focus on the core dumps. TAC points to the Bug article I linked and simply tells me to upgrade to the latest. I'm working on trying to get them back focused on the split brain problem instead of the core dumps.
I'm probably going to upgrade to the latest version anyway in hopes it will help, but from other searching online I see other people with very recent versions of Unity still having the Split Brain problem, so I'm doubtful the latest version of 9.x will resolve the issue...
It is possible that a code weakness makes a particular scenario more susceptible to the conditions that caused the split brain, and that by upgrading, you'll remove the weakness.
Alternatively, you've addressed the symptoms and barring any network anomalies, should be good to go. However, I do think that upgrading will best serve you in the long term.
-Ryan
I greatly appreciate your comments on this. Yeah it's unlikely TAC will come back with anything other than upgrade, so that's what I will do. I won't waste any more time on troubleshooting considering they spent 3+ hours trying to get my Unity Connection system usable again to no avail, finally gave up and collected some logs and said they would have to get back to me.
I ended up finding your fix after getting off the phone with them. Apparently TAC is unaware of your fix for the Split Brain problem.
The health checks link no longer functions. What were they?
Ryan, good article. Thank you. I love getting documents that are succinct especially when they resolve my issue. I did notice that your Health Check Link took me to a gaming site.
Thanks for the article!
Hello, I also ran into the "Split Brain Resolution" error when I entered the command ("show cuc cluster status") on the PUB Unity.
Server Name       Member ID  Server State            Internal State  Reason
----------------  ---------  ----------------------  --------------  -------
server-unity-one  0          Split Brain Resolution  Pri SBR         Normal
server-unity-two  1          Disconnected            Unknown         Unknown
The most effective solution was to shut down the SUB Unity (server-unity-two) and, once it was completely powered off, restart the primary PUB Unity (server-unity-one). I waited for all services to come up (watch them in Cisco Unity Connection Serviceability -> Tools -> Service Management [about 20 minutes]). Finally, I powered on the SUB, waited for the GUI to load again, and entered the command.
admin:show cuc cluster status
Server Name       Member ID  Server State  Internal State  Reason
----------------  ---------  ------------  --------------  ------
server-unity-one  0          Primary       Pri Active      Normal
server-unity-two  1          Secondary     Sec Active      Normal
Finally, place test calls to the IVRs, leave a voice message and listen to it, run the diagnostic and database replication commands, and confirm that all services show as Activated and Started in "Service Management".
That is all.
Thanks
Regards
P.S.: Before anything else, validate the backup and make sure you have access to the VM so you can power the SUB on again.
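The before-and-after outputs in the reply above can also be checked mechanically. The following is a rough Python sketch, not a Cisco tool: it assumes the output of show cuc cluster status has been saved as text and that columns are separated by two or more spaces (single-spaced copies will not parse). The function names and the set of "healthy" states are my own assumptions, based only on the two outputs shown in this thread.

```python
import re

# Assumption: a healthy two-node cluster reports only these server states.
HEALTHY_STATES = {"Primary", "Secondary"}


def parse_cluster_status(text):
    """Parse saved 'show cuc cluster status' output into a list of dicts.

    Assumes columns are separated by two or more spaces.
    """
    rows = []
    for line in text.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        # Skip the prompt, header, dashed rule and blank lines.
        if len(fields) != 5 or not fields[1].isdigit():
            continue
        name, member_id, state, internal, reason = fields
        rows.append({
            "name": name,
            "member_id": int(member_id),
            "state": state,
            "internal_state": internal,
            "reason": reason,
        })
    return rows


def unhealthy_nodes(rows):
    """Return the names of nodes whose server state is not Primary/Secondary."""
    return [r["name"] for r in rows if r["state"] not in HEALTHY_STATES]


before = """\
Server Name       Member ID  Server State            Internal State  Reason
----------------  ---------  ----------------------  --------------  -------
server-unity-one  0          Split Brain Resolution  Pri SBR         Normal
server-unity-two  1          Disconnected            Unknown         Unknown
"""

after = """\
admin:show cuc cluster status
Server Name       Member ID  Server State  Internal State  Reason
----------------  ---------  ------------  --------------  ------
server-unity-one  0          Primary       Pri Active      Normal
server-unity-two  1          Secondary     Sec Active      Normal
"""

print(unhealthy_nodes(parse_cluster_status(before)))
print(unhealthy_nodes(parse_cluster_status(after)))
```

An empty list after the restart sequence matches the recovered state shown above; it does not replace the call tests and replication checks the poster recommends.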
Saved me and our team a lot of work tonight on Unity 11.5 stuck in “split brain” after some SAN issues !! Great step by step guide !! Thanks