Cannot Find Cause of CUCM CPU Pegging

HeatherHoffman73013 · 07-05-2024

Hello,

CUCM 11.5.1.23900-30 with 1 Pub, 1 Sub, and 2 IM&P nodes (Pub-Sub) on this cluster.

Here is the pattern of what happens every day now:

When I get up in the morning, I check which node (Pub or Sub) is "hung" from CPU pegging and reboot it from vSphere. It takes about 25-30 minutes to settle down. Both Pub and Sub are then quiet throughout the work day, in the normal range of 280-450 MHz of CPU usage. I have a Cisco TAC SR open on this, but so far their suggestions (changing the LDAP interval to weekly) have not helped.

HWM/LWM alerts like the one below become more frequent as the day progresses:

At Fri Jul 05 14:44:21 EDT 2024 on node PSICMSUBPA01.MESSAGING.DOM, the following SyslogSeverityMatchFound events generated:

SeverityMatch : Critical

MatchedEvent : Jul 5 14:44:03 PSICMSUBPA01 local7 2 LpmTool: 2: PSICMSUBPA01.MESSAGING.DOM: Jul 05 2024 18:44:03.400 UTC : %UC_LPMTCT-2-LogPartitionHighWaterMarkExceeded: %[UsedDiskSpace=23][MessageString=Common Disk utilization hits HWM!! Purging files...][AppID=Cisco Log Partition Monitoring Tool][ClusterID=][NodeID=PSICMSUBPA01]: The percentage of used disk space in the log partition has exceeded the configured high water mark.

AppID : Cisco Syslog Agent

ClusterID :

NodeID : PSICMSUBPA01

TimeStamp : Fri Jul 05 14:44:03 EDT 2024
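
When lining these alerts up against the DRS schedule, it helps that the MatchedEvent text carries the event time in UTC plus a set of bracketed fields. Here is a minimal Python sketch that pulls those fields out and converts the time to EDT. The function name and the fixed UTC-4 offset are my own assumptions for illustration and are not part of any Cisco tool.

```python
import re
from datetime import datetime, timedelta, timezone

# MatchedEvent text copied from the alert above (shortened to the relevant part).
EVENT = (
    "Jul 05 2024 18:44:03.400 UTC : "
    "%UC_LPMTCT-2-LogPartitionHighWaterMarkExceeded: "
    "%[UsedDiskSpace=23][MessageString=Common Disk utilization hits HWM!! "
    "Purging files...][AppID=Cisco Log Partition Monitoring Tool]"
    "[ClusterID=][NodeID=PSICMSUBPA01]"
)

# Assumption: EDT is a fixed UTC-4 offset, which holds for a July timestamp.
EDT = timezone(timedelta(hours=-4), "EDT")


def parse_lpm_event(event: str) -> dict:
    """Return the [Key=Value] fields plus the event time in UTC and EDT."""
    fields = dict(re.findall(r"\[(\w+)=([^\]]*)\]", event))
    stamp = re.search(r"(\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}) UTC", event)
    utc_time = datetime.strptime(stamp.group(1), "%b %d %Y %H:%M:%S.%f")
    utc_time = utc_time.replace(tzinfo=timezone.utc)
    fields["utc_time"] = utc_time
    fields["local_time"] = utc_time.astimezone(EDT)
    return fields


event = parse_lpm_event(EVENT)
print(event["NodeID"], event["UsedDiskSpace"], event["local_time"].isoformat())
```

Running this over a day's worth of alert emails would give a per-node timeline that can be compared with the 12:00 AM backup window.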

I now have the HWM/LWM set to the lowest parameters. CPU pegging starts at approx. 12:00 AM, which is also when the backup (DRS) is scheduled to run. There is then a lull until about 3:00 AM, when constant HWM and LWM alerts start up again. At about 5:00 AM the CPU pegging starts again and doesn't stop until the server hangs and I have to restart it from vSphere.

I have the HWM/LWM settings in RTMT

Example CPU pegging alert:

Processor load over configured threshold for configured duration of time . Configured high threshold is 91 % ccm (69 percent) uses most of the CPU.

Processor_Info:

For processor instance _Total: %CPU= 99, %User= 67, %System= 32, %Nice= 0, %Idle= 0, %IOWait= 0, %softirq= 1, %irq= 0.

For processor instance 0: %CPU= 99, %User= 67, %System= 32, %Nice= 0, %Idle= 0, %IOWait= 0, %softirq= 1, %irq= 0.

The alert is generated on Fri Jul 05 05:01:21 EDT 2024 on node PSICMPUBPA01.MESSAGING.DOM.

Memory_Info: %Mem Used= 83, %VM Used= 57.

Partition_Info:

Common: %Disk Used=67.

Swap: %Disk Used=18.

Active: %Disk Used=95.

Process_Info: processes with D-State:
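
As a reading aid for the Processor_Info line above, here is a small Python sketch that splits it into its counters and gives a rough label for where the CPU time is going. The function names and the two thresholds are hypothetical and chosen only for illustration. They are not RTMT settings.

```python
import re

# Processor_Info line copied from the alert above.
LINE = (
    "For processor instance _Total: %CPU= 99, %User= 67, %System= 32, "
    "%Nice= 0, %Idle= 0, %IOWait= 0, %softirq= 1, %irq= 0."
)


def parse_processor_info(line: str) -> dict:
    """Turn '%Name= value' pairs into a {name: int} dictionary."""
    return {name: int(value) for name, value in re.findall(r"%(\w+)=\s*(\d+)", line)}


def describe_load(cpu: dict, busy_threshold: int = 90, iowait_threshold: int = 10) -> str:
    """Rough label for where the CPU time is going (thresholds are hypothetical)."""
    if cpu["CPU"] < busy_threshold:
        return "not saturated"
    if cpu["IOWait"] >= iowait_threshold:
        return "saturated, waiting on disk I/O"
    return "saturated, compute-bound (user + system time)"


cpu = parse_processor_info(LINE)
print(cpu)
print(describe_load(cpu))
```

For this alert the label comes out as compute-bound: %IOWait is 0 while %User and %System account for nearly all of the time, which fits the alert's own note that ccm uses most of the CPU.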

If anyone has any suggestions, I would greatly appreciate it.

Andrew Skelly · 07-18-2024

Have you logged in to the CLI on the offending server and run either of these commands?

show process using-most cpu

show process using-most memory

Those will show which processes are consuming the most CPU and memory. That would be a good place to start looking.
