
RTMT-ALERT PEPeerNodeFailure

Rene Mueller
Level 5

Hi Folks!

We have three Unified Presence nodes running System version 8.6.4.12900-2. Since I enabled RTMT email alerts, I get the following alert from time to time from different nodes (not always the same one). The strange thing is that if I check the service via the Serviceability interface, it is up and running without any new downtime:

[RTMT-ALERT-StandAloneCluster12487] PEPeerNodeFailure

PEPeerNodeFailureAlarmMessage : Node pe54005002: OUT-OF-SERVICE

AppID : Cisco UP Presence Engine

ClusterID : StandAloneCluster12487

NodeID : cups-01

TimeStamp : Fri Mar 15 09:59:00 CET 2013.

The alarm is generated on Fri Mar 15 09:59:00 CET 2013.
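For what it's worth, this is roughly how I double-check the node from the admin CLI in addition to the Serviceability page (just a sketch; the exact service name string can differ between versions, so confirm it in the output of "utils service list" first):

admin: utils service list
admin: show status

"utils service list" shows each service, including the Cisco UP Presence Engine, with its current state, and "show status" gives a quick view of CPU and memory on the node. In my case both show the service as running even right after the alert fires.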

Hope someone can help me in this.

Thanks.

Regards

4 Replies

John Watkins
Level 4

I'm also seeing a similar problem on System version 8.6.5.10000-12, followed by a "split-brain" effect where users on one node cannot see the status of, or send IMs to, contacts registered to the other node. Any thoughts?

PEPeerNodeFailureAlarmMessage : Node pe456272906:  OUT-OF-SERVICE

AppID : Cisco UP Presence Engine

ClusterID : StandAloneCluster8a2db

NodeID : sdep-cup02 

TimeStamp : Thu Jan 23 22:00:54 CST 2014.

The alarm is generated on Thu Jan 23 22:00:54 CST 2014.


Manish Gogna
Cisco Employee

Please check the following bug

https://tools.cisco.com/bugsearch/bug/CSCuf74738/?reffering_site=dumpcr

PEPeerNodeFailureAlarmMessage alerts seen regularly in RTMT

CSCuf74738

Symptom:
PEPeerNodeFailureAlarmMessage alerts seen regularly in RTMT
Conditions:
No particular conditions are met; it tends to happen overnight.
Workaround:
None

HTH

Manish

Rene Mueller
Level 5

Hi,

Since I set the trace levels on the Presence server back to their default settings (some of them were set to a debug level), the error has not come up again. We also had the same issue with inconsistent availability status.

Thanks, Rene.

We also experienced a network failover that seemed to kick off all our problems. After the CUP servers replicated and started migrating users, the nodes would reach high virtual memory and CPU usage, and the processes would either crash or the servers would hang.

During after-hours maintenance we were able to at least stabilize the servers so they weren't crashing (they still had high virtual memory and swap usage, with some sporadic CPU utilization), and that's when we noticed the "split-brain" effect.

Based on your suggestion Rene, we looked at the trace settings and noticed they were all set to debug. We turned them all off and restarted the Cisco UP XCP Config Manager, and then the Cisco UP XCP Router, on the subscriber node. The subscriber node came back up and we were able to see status and IM with users on the publisher. After a few days of stable operation we restarted the Cisco UP XCP Config Manager and Cisco UP XCP Router on the publisher, which restored virtual memory and swap to normal operating conditions. A rough sketch of the CLI steps we used is below.
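In case it helps the next person, this is roughly what we ran on the subscriber (a sketch only; the exact service name strings can vary between versions, so check the output of "utils service list" on your own node before restarting anything):

admin: utils service list
admin: utils service restart Cisco UP XCP Config Manager
admin: utils service restart Cisco UP XCP Router
admin: show status

Running "show status" afterwards is how we confirmed virtual memory and swap had dropped back to normal before repeating the same restarts on the publisher.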