
RTMT-ALERT PEPeerNodeFailure

Rene Mueller
Level 5

Hi Folks!

We have three Unified Presence nodes running System version 8.6.4.12900-2. Since I enabled RTMT email alerts, I get the following alert from time to time from different nodes (not always the same one). The strange thing is that if I check the service via the Serviceability interface, it is up and running without any new downtime:

[RTMT-ALERT-StandAloneCluster12487] PEPeerNodeFailure

PEPeerNodeFailureAlarmMessage : Node pe54005002: OUT-OF-SERVICE

AppID : Cisco UP Presence Engine

ClusterID : StandAloneCluster12487

NodeID : cups-01

TimeStamp : Fri Mar 15 09:59:00 CET 2013.

The alarm is generated on Fri Mar 15 09:59:00 CET 2013.
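For what it's worth, this is roughly how I double-check the node from the admin CLI in addition to the Serviceability page (just a sketch; the exact service name string can differ between versions, so confirm it in the output of "utils service list" first):

admin: utils service list
admin: show status

"utils service list" shows each service, including the Cisco UP Presence Engine, with its current state, and "show status" gives a quick view of CPU and memory on the node. In my case both show the service as running even right after the alert fires.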

Hope someone can help me in this.

Thanks.

Regards

4 Replies

John Watkins
Level 4

I'm also seeing a similar problem on System version 8.6.5.10000-12, followed by a "split-brain" effect where users on one node cannot see the status of, or send IMs to, contacts registered to the other node. Any thoughts?

PEPeerNodeFailureAlarmMessage : Node pe456272906:  OUT-OF-SERVICE

AppID : Cisco UP Presence Engine

ClusterID : StandAloneCluster8a2db

NodeID : sdep-cup02 

TimeStamp : Thu Jan 23 22:00:54 CST 2014.

The alarm is generated on Thu Jan 23 22:00:54 CST 2014.


Manish Gogna
Cisco Employee

Please check the following bug

https://tools.cisco.com/bugsearch/bug/CSCuf74738/?reffering_site=dumpcr

PEPeerNodeFailureAlarmMessage alerts seen regularly in RTMT

CSCuf74738

Symptom:
PEPeerNodeFailureAlarmMessage alerts seen regularly in RTMT
Conditions:
No particular conditions are met; it tends to happen overnight.
Workaround:
None

HTH

Manish

Rene Mueller
Level 5

Hi,

Since I set the trace levels on the Presence server back to their default settings (some of them were set to a debug level), the error has not come up again. We also had the same issue with inconsistent availability status.

Thanks, Rene.

We also experienced a network failover that seemed to kick off all our problems. After the CUP servers replicated and started migrating users, the nodes would reach high virtual memory and CPU usage, and the processes would either crash or the servers would hang.

During after-hours maintenance we were able to at least stabilize the servers so they weren't crashing (they still had high virtual memory and swap usage, with some sporadic CPU utilization), and that's when we noticed the "split-brain" effect.

Based on your suggestion Rene, we looked at the trace settings and noticed they were all set to debug. We turned them all off and restarted the Cisco UP XCP Config Manager, and then the Cisco UP XCP Router, on the subscriber node. The subscriber node came back up and we were able to see status and IM with users on the publisher. After a few days of stable operation we restarted the Cisco UP XCP Config Manager and Cisco UP XCP Router on the publisher, which restored virtual memory and swap to normal operating conditions. A rough sketch of the CLI steps we used is below.
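In case it helps the next person, this is roughly what we ran on the subscriber (a sketch only; the exact service name strings can vary between versions, so check the output of "utils service list" on your own node before restarting anything):

admin: utils service list
admin: utils service restart Cisco UP XCP Config Manager
admin: utils service restart Cisco UP XCP Router
admin: show status

Running "show status" afterwards is how we confirmed virtual memory and swap had dropped back to normal before repeating the same restarts on the publisher.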