IM&P v 11.5(1) SU9
Our IM&P HA configuration has the Publisher at site A and the Subscriber at site B. We run active/passive, so all users run from site A and only run from site B if site A fails or we force a manual failover.
We had a scheduled network outage at site B, so I performed a manual failover of the site B node. The goal was to keep each node from thinking the other was down and trying to fail over. It happened anyway: during the outage, both nodes attempted failover and ended up in a failed state that required manual recovery.
My question: is this expected? If I manually fail over the node at site B, shouldn't the node at site A refrain from attempting a failover after it loses heartbeats from site B? I have done this several times in the past and thought it avoided the scenario I ended up in. Maybe I need to take another step and stop the SRM service?
I have set my heartbeat timers as shown below to try to avoid this failed-node scenario during unexpected network outages. Maybe I adjusted them in a way that caused this behavior.
Critical Service Down Delay: 90
Initialization Keep Alive (Heartbeat) Timeout: 240
Keep Alive (Heartbeat) Timeout: 240
Keep Alive (Heartbeat) Interval: 30
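
To make the timer arithmetic explicit, here is a rough Python sketch of what I think these values imply. It assumes the values are in seconds and uses a simplified model in which a node treats its peer as down once no heartbeat has arrived for the full timeout. The function names and the 30-minute outage length are made up for illustration; this is not how SRM is actually implemented.

```python
# Values copied from the timer settings listed above.
# Assumption: all values are in seconds.
KEEPALIVE_INTERVAL = 30
KEEPALIVE_TIMEOUT = 240


def missed_heartbeats_before_timeout(interval: int, timeout: int) -> int:
    """Number of whole heartbeat intervals that fit inside the timeout window."""
    return timeout // interval


def outage_exceeds_timeout(outage_seconds: int, timeout: int) -> bool:
    """Simplified model: True if the link outage outlasts the heartbeat timeout."""
    return outage_seconds > timeout


if __name__ == "__main__":
    # Heartbeat intervals that must pass with no reply before the timeout expires.
    print(missed_heartbeats_before_timeout(KEEPALIVE_INTERVAL, KEEPALIVE_TIMEOUT))
    # Hypothetical 30-minute outage, used only as an example duration.
    print(outage_exceeds_timeout(30 * 60, KEEPALIVE_TIMEOUT))
```

Under that model, the longer timeout only helps with network blips shorter than 240 seconds. Any planned outage longer than that would still outlast the timers, so the timer values alone would not prevent the nodes from reacting to the lost heartbeats.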