07-05-2024 12:37 PM
Hello,
Cisco 11.5.1.23900-30 - 1 Pub, 1 Sub, 2 IM&P (Pub-Sub) on this cluster.
Here is the pattern of what happens every day now:
When I get up in the morning, I check which node (Pub or Sub) is "hung" from CPU pegging and reboot it from vSphere. It takes about 25-30 minutes to settle down. Both Pub and Sub are then quiet throughout the work day, staying in the normal range of 280-450 MHz of CPU usage. I have a Cisco TAC SR open on this, but so far their suggestions (changing the LDAP interval to weekly) have not helped.
HWM/LWM alerts like the one below become more frequent as the day progresses:
At Fri Jul 05 14:44:21 EDT 2024 on node PSICMSUBPA01.MESSAGING.DOM, the following SyslogSeverityMatchFound events generated:
SeverityMatch : Critical
MatchedEvent : Jul 5 14:44:03 PSICMSUBPA01 local7 2 LpmTool: 2: PSICMSUBPA01.MESSAGING.DOM: Jul 05 2024 18:44:03.400 UTC : %UC_LPMTCT-2-LogPartitionHighWaterMarkExceeded: %[UsedDiskSpace=23][MessageString=Common Disk utilization hits HWM!! Purging files...][AppID=Cisco Log Partition Monitoring Tool][ClusterID=][NodeID=PSICMSUBPA01]: The percentage of used disk space in the log partition has exceeded the configured high water mark.
AppID : Cisco Syslog Agent
ClusterID :
NodeID : PSICMSUBPA01
TimeStamp : Fri Jul 05 14:44:03 EDT 2024
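For anyone who wants to track how these alerts build up over the day, here is a minimal sketch that pulls the bracketed Key=Value fields out of a message like the one above. The helper name `parse_alert_fields` is hypothetical and not part of any Cisco tool; it assumes the alert text has already been saved or copied out of RTMT or the alert email.

```python
import re

# Hypothetical helper (not a Cisco tool): extract the [Key=Value] fields
# from a syslog message such as the LpmTool alert quoted above.
FIELD_RE = re.compile(r"\[(\w+)=([^\]]*)\]")


def parse_alert_fields(message: str) -> dict:
    """Return the bracketed key/value pairs as a dict of strings."""
    return dict(FIELD_RE.findall(message))


# Sample text copied from the MatchedEvent line in this post.
sample = (
    "%UC_LPMTCT-2-LogPartitionHighWaterMarkExceeded: "
    "%[UsedDiskSpace=23][MessageString=Common Disk utilization hits HWM!! "
    "Purging files...][AppID=Cisco Log Partition Monitoring Tool]"
    "[ClusterID=][NodeID=PSICMSUBPA01]"
)

fields = parse_alert_fields(sample)
print(fields["NodeID"], fields["UsedDiskSpace"])
```

Empty fields such as ClusterID come back as empty strings, so the same call works on alerts from either node.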
I now have the HWM/LWM set to the lowest parameters. CPU pegging starts at approx. 12:00 AM, which is also when the DRS backup is scheduled to run. There is then a lull until about 3:00 AM, when constant HWM and LWM alerts start up again. At about 5:00 AM the CPU pegging resumes and doesn’t stop until the server hangs and I have to restart it from vSphere.
I have the HWM/LWM settings configured in RTMT.
Example CPU pegging alert:
Processor load over configured threshold for configured duration of time . Configured high threshold is 91 % ccm (69 percent) uses most of the CPU.
Processor_Info:
For processor instance _Total: %CPU= 99, %User= 67, %System= 32, %Nice= 0, %Idle= 0, %IOWait= 0, %softirq= 1, %irq= 0.
For processor instance 0: %CPU= 99, %User= 67, %System= 32, %Nice= 0, %Idle= 0, %IOWait= 0, %softirq= 1, %irq= 0.
The alert is generated on Fri Jul 05 05:01:21 EDT 2024 on node PSICMPUBPA01.MESSAGING.DOM.
Memory_Info: %Mem Used= 83, %VM Used= 57.
Partition_Info:
Common: %Disk Used=67.
Swap: %Disk Used=18.
Active: %Disk Used=95.
Process_Info: processes with D-State:
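To compare these numbers across several alerts, a small sketch like the following can turn the "%Name= value" pairs into integers. The function name `parse_metrics` is an assumption for illustration, and the pattern only covers the line formats shown in this alert.

```python
import re

# Matches pairs such as "%CPU= 99" or "%Disk Used=67" as they appear
# in the RTMT alert text above. Names may contain letters and spaces.
METRIC_RE = re.compile(r"%([A-Za-z ]+?)=\s*(\d+)")


def parse_metrics(line: str) -> dict:
    """Return {metric name: integer percentage} for one alert line."""
    return {name: int(value) for name, value in METRIC_RE.findall(line)}


# Lines copied from the alert in this post.
cpu_line = (
    "For processor instance _Total: %CPU= 99, %User= 67, %System= 32, "
    "%Nice= 0, %Idle= 0, %IOWait= 0, %softirq= 1, %irq= 0."
)
partition_lines = [
    "Common: %Disk Used=67.",
    "Swap: %Disk Used=18.",
    "Active: %Disk Used=95.",
]

cpu = parse_metrics(cpu_line)
print(cpu["User"], cpu["System"], cpu["IOWait"])

# Report disk usage per partition, keyed by the label before the colon.
for line in partition_lines:
    label = line.split(":", 1)[0]
    print(label, parse_metrics(line)["Disk Used"])
```

Collected over a few nights, the output would show whether the User/System split or the partition usage changes between the midnight and 5:00 AM episodes.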
If anyone has any suggestions, I would greatly appreciate it.
07-18-2024 12:36 PM
Have you logged in to the CLI on the offending server and run either of these commands?
show process using-most cpu
show process using-most memory
Those will show you which processes are consuming the most CPU and memory. That would be a good place to start looking.