Vladimir Savostin
Cisco Employee

Introduction

 

While analyzing backtrace output for a process crash on UCCX servers and matching it against existing defects, you may end up with one of the following scenarios:

 

Exact backtrace match to a single defect

 

  1. Check the system version the backtrace was taken from to confirm that the defect impacts the product deployment ('Known Fixed Releases' in the BST Tool should be higher than the current system version)
  2. Check the defect Release Notes to see if there are specific conditions matching the failing scenario
  3. Check the defect Release Notes to see if a workaround exists to quickly fix the issue, provided the system's technical requirements allow the workaround to be applied
  4. To permanently fix the issue and avoid further service crashes, consider upgrading the system to the highest release specified in the 'Known Fixed Releases' section of the BST Tool for the matching defect, or above
  5. If the defect does not match the system version (the current version is higher than any 'Known Fixed Releases' in the matching defect for the respective system release), you are potentially facing a new crash which is not yet known to the Cisco engineering team
  6. For any new crash the following dataset needs to be collected prior to opening a TAC Service Request (see the CLI example after this list):
    • Detailed/Debug level traces of <crashing process> spanning from 30 minutes before the crash to 30 minutes after the crash
    • Cisco RIS Data Collector PerfMon logs for the week leading up to the crash date
    • Event Viewer-Application and Event Viewer-System logs for the week leading up to the crash date
    • Core file acquired from the system
    • Output of the CLI command 'utils core active analyze <core filename>'
    • Additionally, based on the crashing service, collect:
      • For a "Unified CM telephony subsystem" crash, make sure to collect the "CCX Engine" logs along with the "Unified CM Telephony Client" logs
      • For an nmon process crash, make sure to collect the "Finesse Tomcat" logs along with the "Cisco Tomcat" logs
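As a minimal CLI sketch of the version check in step 1 and the core analysis in step 6 (the core filename shown by your system will differ; the commands below are illustrative):

show version active (confirms the exact UCCX version to compare against 'Known Fixed Releases')

utils core active list (lists the core files currently present on the node)

utils core active analyze <core filename> (prints the backtrace for the selected core; attach this output to the Service Request)

Analyzing a large core file can take a few minutes to complete.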

 

Exact backtrace match to multiple defects

 

  1. Several defects are returned, which means that each defect resolves an issue where exactly the same backtrace is generated. These defects might be related to each other or have different root causes
  2. Check the system version the backtrace was taken from and choose the defects which potentially impact the product deployment ('Known Fixed Releases' in the BST Tool should be higher than the current system version; see the example after this list)
  3. For the defects selected in the previous step, check the Release Notes to see if there are specific conditions matching the failing scenario
  4. For the defects selected in the previous step, check the Release Notes to see if a workaround exists to quickly fix the issue, provided the system's technical requirements allow the workaround to be applied
  5. To permanently fix the issue and avoid further service crashes, consider upgrading the system to the highest release specified in the 'Known Fixed Releases' section of the BST Tool for any matching defect, or above
  6. If there are no resolved defects matching the system version (the current version is higher than any 'Known Fixed Releases' in any matching defect for the respective system release), you are potentially facing a new crash which is not yet known to the Cisco engineering team
  7. For any new crash the following dataset needs to be collected prior to opening a TAC Service Request:
    • Detailed/Debug level traces of <crashing process> spanning from 30 minutes before the crash to 30 minutes after the crash
    • Cisco RIS Data Collector PerfMon logs for the week leading up to the crash date
    • Event Viewer-Application and Event Viewer-System logs for the week leading up to the crash date
    • Core file acquired from the system
    • Output of the CLI command 'utils core active analyze <core filename>'
    • Additionally, based on the crashing service, collect:
      • For a "Unified CM telephony subsystem" crash, make sure to collect the "CCX Engine" logs along with the "Unified CM Telephony Client" logs
      • For an nmon process crash, make sure to collect the "Finesse Tomcat" logs along with the "Cisco Tomcat" logs
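For the version comparison in step 2, the running and staged versions can be read directly from the CLI (output format varies slightly by release):

show version active (the version currently running; compare it against each defect's 'Known Fixed Releases')

show version inactive (the version on the inactive partition, useful if an upgrade or switch-version is already staged)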

 

Generic core – performance issue

 

  1. This core dump was generated because the service process could not get the CPU resources necessary to continue operation
  2. Usually this means that the underlying infrastructure (hardware or virtualization layer) was not able to provide enough CPU resources when the application needed to run
  3. This may happen due to high IO, a blocking operation at the VM level (for example, with snapshots), or intensive CPU operations within the virtual machine
  4. Because of that, you need to look for the root cause outside of the application itself, i.e. verify that:
    • Hardware components and firmware versions on the host server are compatible with the VMware vSphere version. For Cisco UCS, please use this link: https://ucshcltool.cloudapps.cisco.com/public/
    • The VM was deployed from the template downloaded from the cisco.com website
    • The VM template used to deploy the application virtual machine is compatible with the current VMware vSphere version
    • The resources allocated to the VM (number of vCPUs, memory, disk size) correspond to the deployment type (number of users serviced by the application)
    • When using a host server with DAS storage (internal disks), Thick provisioning is used for the vDisk of the application virtual machine. Check the Collaboration Virtualization Hardware guide for details: https://www.cisco.com/c/dam/en/us/td/docs/voice_ip_comm/uc_system/virtualization/collaboration-virtu...
    • No unsupported configurations are applied to the VM at the VMware layer (for example, snapshots). Please refer to the following document for a complete list of supported and unsupported VMware features per application: https://www.cisco.com/c/dam/en/us/td/docs/voice_ip_comm/uc_system/virtualization/virtualization-soft...
    • There are enough IOPS to serve all VMs placed on the datastore
    • No IO-blocking operation is happening on the datastore which may 'freeze' the VM for a certain period of time
    • The VM is not oversubscribed: the number of users and applications serviced corresponds to the provisioned VM resources
    • VMware Tools are installed on the application VM and updated to the version corresponding to the VMware vSphere version
  5. If you have checked/confirmed/fixed all of the above but the application continues to crash, or you are unable to perform the above checks and need escalated support, please collect the following information prior to opening a TAC Service Request (a hypervisor-side spot check is sketched after this list):
    • UCS tech-support bundle (if the application VM is hosted on a Cisco UCS server)
    • VMware support logs (including performance data)
    • Detailed/Debug level traces of <crashing process> spanning from 30 minutes before the crash to 30 minutes after the crash
    • Cisco RIS Data Collector PerfMon logs for the week leading up to the crash date
    • Event Viewer-Application and Event Viewer-System logs for the week leading up to the crash date
    • Core file acquired from the system
    • Output of the CLI command 'utils core active analyze <core filename>'
    • Additionally, based on the crashing service, collect:
      • For a "Unified CM telephony subsystem" crash, make sure to collect the "CCX Engine" logs along with the "Unified CM Telephony Client" logs
      • For an nmon process crash, make sure to collect the "Finesse Tomcat" logs along with the "Cisco Tomcat" logs
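One quick way to confirm CPU starvation at the hypervisor layer, assuming ESXi shell access (the counters named below are standard VMware metrics; interpretation thresholds are common rules of thumb, not Cisco-published limits):

esxtop (press 'c' for the CPU view; sustained high %RDY or %CSTP on the application VM means it is waiting for physical CPU)

esxtop (press 'u' for the disk device view; consistently high DAVG/cmd latency points to a storage or IOPS problem on the datastore)

The same counters (CPU Ready, co-stop, device latency) are also visible in the vCenter performance charts if shell access is not available.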

 

Generic core – memory leak

 

  1. From the backtrace patterns this looks like a memory leak issue. To confirm it, verify the core file size: if it is close to 4 GB for versions 11.x and later, or close to 3 GB for versions prior to 11.x, then it is a memory leak problem.
  2. The process crash happened because the OS could not allocate the new portion of memory requested by the application, since the application had reached its maximum allowed virtual memory size
  3. The process backtrace did not match any known defect, and the problem needs to be investigated further to find the root cause of the memory leak
  4. Please collect the following dataset before proceeding with opening a TAC Service Request:
    • Detailed/Debug level traces of <crashing process> spanning from 30 minutes before the crash to 30 minutes after the crash
    • Cisco RIS Data Collector PerfMon logs for the week leading up to the crash date
    • Event Viewer-Application and Event Viewer-System logs for the week leading up to the crash date
    • Core file acquired from the system
    • Output of the CLI command 'utils core active analyze <core filename>'
    • Additionally, based on the crashing service, collect:
      • For a "Unified CM telephony subsystem" crash, make sure to collect the "CCX Engine" logs along with the "Unified CM Telephony Client" logs
      • For an nmon process crash, make sure to collect the "Finesse Tomcat" logs along with the "Cisco Tomcat" logs
  5. If you don't have the corresponding traces for the crashing service, check the other servers in the cluster to see whether the crashing service's memory utilization is close to the maximum (or unusually high) on any of them (a CLI sketch for this check follows this section).

    If so, complete the following steps:

    • verify that the crashing service trace level is set to Detailed/Debug
    • force a core of the crashing process on the impacted server for root cause analysis (this should be done outside of business hours)
    • force the process crash by logging into the CLI and running 2 commands:

show process list (look for the <pid number> of the crashing service [for example, /usr/local/xcp/bin/jabberd] in the output)

delete process <pid number> crash (use the <pid number> copied from the previous command)

    • collect the above dataset from the impacted server and open a TAC Service Request
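A minimal sketch of the memory check described in step 5, assuming the platform CLI on your release includes the 'show process using-most' command (present on recent UC OS releases; RTMT's Process counters such as VmSize show the same data graphically):

show process using-most memory (lists the top memory consumers on the node; check whether the crashing service is near the top and trending upward between checks)

show process list (note the PID of the crashing service so its memory footprint can be compared across servers in the cluster)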

 

No matches - potentially new defect (not caused by performance or memory leak)

 

For any unidentified crash the following dataset needs to be collected prior to opening a TAC Service Request (an example of pulling these files off the server over SFTP follows the list):

 

  • Detailed/Debug level traces of <crashing process> spanning from 30 minutes before the crash to 30 minutes after the crash
  • Cisco RIS Data Collector PerfMon logs for the week leading up to the crash date
  • Event Viewer-Application and Event Viewer-System logs for the week leading up to the crash date
  • Core file acquired from the system
  • Output of the CLI command 'utils core active analyze <core filename>'
  • Additionally, based on the crashing service, collect:
    • For a "Unified CM telephony subsystem" crash, make sure to collect the "CCX Engine" logs along with the "Unified CM Telephony Client" logs
    • For an nmon process crash, make sure to collect the "Finesse Tomcat" logs along with the "Cisco Tomcat" logs
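While RTMT is the usual collection tool, the traces and the core can also be transferred to an SFTP server from the CLI. The directory paths below are illustrative and vary by service and release (the MIVR directory is typically where CCX Engine traces live); adjust them for the actual crashing service:

file get activelog uccx/log/MIVR/* reltime days 7 (prompts for SFTP server details and transfers the last week of engine traces)

file get activelog core/<core filename> (transfers the core file itself; large cores can take a long time to copy)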

 
