CUCM 8.6 - DRF cpu 99% alert

Ayaz Khan · ‎07-12-2012

Hello,

We have been getting the following alert daily:

From: [mailto:RTMT_Admin@thedominion.ca]
Sent: Tuesday, July 10, 2012 2:04 AM
To: voiperror
Subject: [RTMT-ALERT-StandAloneCluster] CallProcessingNodeCpuPegging

Processor load over configured threshold for configured duration of time . Configured high threshold is 90 % CiscoDRFLocal (52 percent) uses most of the CPU.

Processor_Info:

For processor instance _Total: %CPU= 99, %User= 57, %System= 8, %Nice= 31, %Idle= 0, %IOWait= 0, %softirq= 2, %irq= 0.

For processor instance 0: %CPU= 99, %User= 57, %System= 8, %Nice= 31, %Idle= 0, %IOWait= 0, %softirq= 2, %irq= 0.

The alert is generated on Tue Jul 10 02:03:51 EDT 2012 on node 10.76.160.60.

Memory_Info: %Mem Used= 59, %VM Used= 39.

Partition_Info:
Swap: %Disk Used=0.
Active: %Disk Used=84.
Common: %Disk Used=37.

Process_Info: processes with D-State:

I have restarted the DRF services and will watch to see whether the alert comes back. If anyone has encountered this issue, please let me know how it was resolved. Any ideas or insight would be appreciated.

Thanks,

Ayaz

nikshah · ‎07-12-2012

Ayaz, can you provide the exact CCM version?


Ayaz Khan · ‎07-31-2012

We restarted the DRF services and the issue appeared to be gone, but it's back.

System version: 8.6.2.20000-2

Any insight would be appreciated.

Ayaz

arvindrosunee · ‎08-10-2012

This could be related to BugID: CSCtu18692 - UCS/VMware - CallProcessingNodeCpuPegging alert during DRF/BAT

arunkum3 · ‎08-10-2012

Hi Ayaz,

Please open a TAC case. Normally the restart of the DRF master and local agent should fix the issue. Since you are able to see the issue appearing again and again (even after the restart) AND the CUCM version is 8.6, it could be the bug CSCtu18692 or could be a new bug. If you want to be sure you are hitting CSCtu18692, try the workaround mentioned in that bug to see if it resolves the issue.

Regards,

Arun Kumar

Please rate useful posts !!

steven.mcmaster · ‎12-13-2012

Hi Ayaz.

I have an interesting case here as well: only two of the four nodes in the cluster experience this issue.

PUB, SUB1, and SUB2 were built on UCS using the 8.5 OVA, but later updated with 8GB RAM and 2 vCPU. SUB3 was built on the 8.6 OVA as we had numerous bugs with 8.5 and patched to 8.6 before SUB3 was able to be built.

After each upgrade to an 8.6 ES, we see the TAR archive grow on the FTP server, mainly because of a large TFTP tar archive.

To date, only the PUB and SUB3 experience the issue during the DRF schedule, and only sometimes during a 6 day schedule (CpuPegging is raised once or twice a fortnight/month).

My opinion is that the bug CSCtu18692 is not actually just an issue with the 8.5 OVA, as this environment has two affected nodes, one based on the 8.5 OVA and the other on the 8.6 OVA from CCO.

Either DRF needs to be run as a lower-priority process, or there needs to be a *new* purge tool that cleans out firmware from the TFTP directory that is not listed within the dependency records (i.e. device defaults or the phone load field on the phone device).

On another note, another customer that has moved from 6.1.4a -> 7.1.3 -> 8.6.2a has seen the TFTP archive grow from 325MB to 1.09GB, which confirms some suspicion I have about the system upgrade process.

On further inspection, there are about 3 device packs listed in the 6.1 archive, but I can't view the 8.6 archive, as Cisco seems to use some pre-extraction process prior to tar extraction. Nonetheless, there seems to be no apparent reason to keep all the firmware from every prior release, yet the upgrade process seems to do this.

I'd suggest getting into the CLI -> file list tftp * detail, and starting to clean out the volume. It's ugly.
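To make the purge idea concrete, here is a minimal Python sketch of the comparison such a tool would need to do. It is not a Cisco tool and touches nothing on the server: it only compares a list of TFTP file names (gathered by hand, for example from the `file list tftp * detail` output) against the load names still referenced in device defaults or phone load fields. The function name `unused_firmware` and all file and load names are hypothetical, and matching a file to a load by name prefix is a simplifying assumption, so treat the result as a list to review, not a list to delete.

```python
def unused_firmware(tftp_files, loads_in_use):
    """Return TFTP file names that do not start with any load name still in use.

    Assumption: a file belongs to a load if its name starts with the load name.
    Real firmware bundles may contain component files that break this rule.
    """
    prefixes = tuple(sorted(loads_in_use))
    return sorted(name for name in tftp_files if not name.startswith(prefixes))


# Hypothetical file names and load names, for illustration only.
tftp_files = [
    "phone-load-1.0.loads",
    "phone-load-2.0.loads",
    "tablet-load-1.0.tar.gz",
]
loads_in_use = {"phone-load-2.0"}

candidates = unused_firmware(tftp_files, loads_in_use)
for name in candidates:
    print("review before deleting:", name)
```

Anything the sketch flags still needs a manual check against the dependency records before it is removed from the TFTP path.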

roboliveira · ‎01-17-2013

Ayaz,

We are running into this same issue with a customer of ours. Did TAC ever give you a resolution to this issue?

Thanks in advance.

~ Rob Oliveira ~

steven.mcmaster · ‎01-20-2013

Hi Rob.

I managed to get the TFTP archive total on the 8.6 OVA node down to 830MB, but it is still firing the RTMT alert. I'll try to get some more files off (Cius loads) and bring it down to around 500-600MB, then report back the results. TAC reckons you need to rebuild the affected node using an 8.6 OVA. As I previously mentioned, two nodes report this (one built on the 8.5 OVA and the other on the 8.6 OVA). It really is a strange issue, and I won't be rebuilding these boxes based on a lazy response from TAC/DEV. It's extreme, but anyone out there willing to try it could rebuild/restore. Just make sure you share your results.

UPDATE:

After removal of the Cius loads, which are tar gzipped archives, the issue seems to have been resolved. I will monitor the situation and confirm the workaround following the customer's response on the problem.

JLimone1430 · ‎03-29-2013

I have this issue as well. TAC suspected it was the bug noted earlier, and asked me to add a vCPU to each node. I did that, and the problem went away for about a month or two, but has since come back. Our environment was an 8.6 build from the start using the 8.6 OVA templates. FYI - the only difference I saw between the 8.5 OVA and the 8.6 OVA was a 100MHz bump to the vCPU, so I think this issue goes deeper. In reviewing documentation, it is still an open bug in 9.1. I agree with Steven that this is an issue that Cisco needs to fix by either lowering the DRF priority or cleaning out old/unused files.

Kenneth Russell · ‎04-09-2013

If you examine the alert output you can see that some of the process load is actually low priority / "nice" CPU usage. The gzip and tar processes that get invoked by the backup for TFTP are both run at nice 19.

However, the CallProcessingNodeCpuPegging alert only looks at the total CPU percentage. It doesn't subtract nice CPU usage when making the calculation. It's arguable both ways whether the low priority / "nice" CPU usage should be included in this alert or not.
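As a rough illustration of that point, the Python sketch below takes the `_Total` line from the alert in the original post and subtracts the nice percentage from the total. It assumes, as described above, that %CPU includes %Nice; the helper names `parse_cpu_line` and `load_excluding_nice` are made up for this example and are not part of RTMT.

```python
import re


def parse_cpu_line(line):
    """Extract the '%Name= value' pairs of a Processor_Info line into a dict."""
    return {name: int(value) for name, value in re.findall(r"%(\w+)=\s*(\d+)", line)}


def load_excluding_nice(metrics):
    """Total CPU percent minus the low-priority (nice) share."""
    return metrics["CPU"] - metrics["Nice"]


# The _Total line from the alert quoted at the top of this thread.
line = ("For processor instance _Total: %CPU= 99, %User= 57, %System= 8, "
        "%Nice= 31, %Idle= 0, %IOWait= 0, %softirq= 2, %irq= 0.")

THRESHOLD = 90  # configured high threshold reported in the alert text

metrics = parse_cpu_line(line)
total = metrics["CPU"]
adjusted = load_excluding_nice(metrics)

print(f"total: {total}%  over threshold: {total > THRESHOLD}")
print(f"excluding nice: {adjusted}%  over threshold: {adjusted > THRESHOLD}")
```

For this particular alert, 99% total minus 31% nice leaves 68%, which is under the 90% threshold, so the alert would not have fired if nice usage were excluded.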

It would be interesting to see some successive runs of "show process load" output, taken around the time of the alert, from a system that triggers this alert during DRS even with 2 vCPUs. It isn't hard to see the CallProcessingNodeCpuPegging alert during DRS with 1 vCPU, but it should be very rare with 2 vCPUs, short of lots of other load on the VM.

I also agree we need a better way to clean up old phone firmware from the TFTP path. It's a very manual process currently.

The OVA concern is that the original install has to be done with an OVA (instead of manual VM creation) to ensure that the partitions are aligned. If they are unaligned then the I/O impact of any operation is magnified on the underlying datastore.

neil.woolloff · ‎07-05-2013

Hello,

I also have a customer with this issue, but on Unity Connection, not CUCM.

Do we think this bug could hit them both?

Also,

How do you shrink the TFTP areas you are describing? And does it make any difference to a restore, or even shrink the size of the DRF files and the backup time?

Thanks,

Neil