DBReplication Runtimestate showing odd behavior across Nodes

We’ve been fighting CSCvc16004 for a bit on our cluster here. We use 3rd-party CA-signed certs on our IPSEC services and have IPSEC policies set up between the nodes. Fast forward 3 years: those certificates expire on 1 June, and we cannot disable the policies without breaking communications/replication between the nodes. We have been working with a great TAC engineering team, and yesterday we implemented self-signed certs for IPSEC on the cluster and were able to get replication across the cluster without policies enabled. This gets us past our 1 June deadline, and we can begin scheduling an upgrade from 11.5SU6 to 12.5, where the Openswan/Libreswan problem is no longer an issue.


But… we are experiencing some super odd behavior now between nodes. We have a Pub, 2 Subs and 2 TFTP servers on the cluster. You can run utils dbreplication runtimestate one minute and get one result, then run it again a few minutes later and get a completely different result, with no discernible pattern to it. (see attached)


Our servers are on 2 B-series M4 blades across campus from each other: the 250 subnet in Building A and the 251 subnet in Building C. One minute the 250 Sub can ping the 250 Pub and all is well; the next minute it cannot ping the Pub, even though both are on the same blade.
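
For anyone who wants to help pin down a pattern, here is a rough sketch of how the ping flapping could be timestamped from outside the cluster. It is only an illustration under assumptions: it would run on a Linux workstation on the same subnet (not on the CUCM nodes themselves), it relies on the Linux ping flags -c and -W, and the function names and the host address in the usage comment are hypothetical placeholders.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to host succeeds.

    Assumes the Linux `ping` flags: -c (count) and -W (timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_transitions(samples):
    """Given [(timestamp, reachable), ...], return the samples where the
    reachable state differs from the previous sample (i.e. each flap)."""
    transitions = []
    previous = None
    for stamp, reachable in samples:
        if previous is not None and reachable != previous:
            transitions.append((stamp, reachable))
        previous = reachable
    return transitions


def monitor(host, interval_s=5, count=120):
    """Ping host `count` times, `interval_s` seconds apart, and return
    timestamped results suitable for find_transitions()."""
    samples = []
    for _ in range(count):
        stamp = datetime.now().isoformat(timespec="seconds")
        samples.append((stamp, ping_once(host)))
        time.sleep(interval_s)
    return samples


# Example usage (the address is a placeholder, not our real Pub):
#   for stamp, up in find_transitions(monitor("192.0.2.10")):
#       print(stamp, "reachable" if up else "UNREACHABLE")
```

Lining up the transition timestamps against the times the runtimestate output changes might show whether the two are related, though that is a guess on my part.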


If anyone has seen this or has some ideas, let me hear them.


SIDE NOTE: The Publisher always comes back with Y/Y/Y, good ping times, and (2) Setup Complete on all nodes. It is the Subs and TFTPs that exhibit this behavior. Nothing has built up in the Repl Queue so far, and NTP seems to be solid from the Pub to the other nodes, so as far as we can tell this flapping isn't lasting long enough to affect the system, but it just isn't right.


Thanks guys,