
Troubleshooting process crashes on CUC servers


Introduction

 

When you analyze the backtrace output of a process crash on CUC servers and match it against existing defects, you may end up in one of the following scenarios:

 

Exact backtrace match to single defect

 

  1. Check the system version the backtrace was taken from to confirm that the defect affects the deployment (a 'Known Fixed Releases' entry in BST Tool should be higher than the current system version)
  2. Check the defect Release Notes for specific conditions that match the failing scenario
  3. Check the defect Release Notes for a workaround that can quickly resolve the issue, provided the system's technical requirements allow it to be applied
  4. To permanently fix the issue and avoid further service crashes, consider upgrading the system to the highest release (or above) listed in the 'Known Fixed Releases' section of BST Tool for the matching defect
  5. If the defect does not match the system version (the current version is higher than every 'Known Fixed Releases' entry of the matching defect for the respective system release), you are potentially facing a new crash that is not yet known to the Cisco engineering team
  6. For any new crash, collect the following dataset before opening a TAC Service Request:
    • Detailed/Debug level traces of <crashing process> spanning from 30 minutes before the crash to 30 minutes after the crash
    • Cisco RIS Data Collector PerfMon Logs for the week leading up to the crash date
    • EventViewer-Application Log and EventViewer-System Log for the week leading up to the crash date
    • Core file acquired from the system
    • Output of the CLI command 'utils core active analyze <core filename>'
    • Additionally, depending on the crashing service, collect:
      • For a "Connection Notifier" crash, include "Connection Conversation Manager" logs along with "Connection Tomcat" and "Cisco Tomcat" logs
      • For a "Connection Message Transfer Agent" crash, include "Connection Conversation Manager" logs
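
The version comparison in steps 1 and 5 can be sketched in Python as shown here. This is only an illustration: the dotted build-string format (for example 12.5.1.13900-152), the sample values, the function names, and the rule that the first two numeric fields identify the release train are all assumptions, not something BST Tool defines.

```python
import re

def parse_version(text):
    """Turn a build string such as '12.5.1.13900-152' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text))

def classify(current, known_fixed):
    """Compare the running version with a defect's Known Fixed Releases.

    Assumption: two versions belong to the same release train when their
    first two numeric fields (for example 12.5) are equal.
    """
    cur = parse_version(current)
    in_train = [parse_version(v) for v in known_fixed
                if parse_version(v)[:2] == cur[:2]]
    if not in_train:
        return "no fix listed for this release train"
    if any(fixed > cur for fixed in in_train):
        # A fixed release newer than the running one exists: defect applies.
        return "exposed"
    # Running version already contains every listed fix for this train.
    return "possible new crash"

# Hypothetical sample values, for illustration only.
print(classify("12.5.1.13900-152", ["12.5.1.14900-11", "14.0.1.10000-20"]))
```

A result of "exposed" corresponds to steps 1 through 4, while "possible new crash" corresponds to step 5 and the data collection in step 6.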

 

Exact backtrace match to multiple defects

 

  1. Several defects are returned, which means that each of them resolves an issue that generates exactly the same backtrace. These defects might be related to each other or have different root causes
  2. Check the system version the backtrace was taken from and select the defects that potentially affect the deployment (a 'Known Fixed Releases' entry in BST Tool should be higher than the current system version)
  3. For the defects selected in the previous step, check the Release Notes for specific conditions that match the failing scenario
  4. For the defects selected in the previous step, check the Release Notes for a workaround that can quickly resolve the issue, provided the system's technical requirements allow it to be applied
  5. To permanently fix the issue and avoid further service crashes, consider upgrading the system to the highest release (or above) listed in the 'Known Fixed Releases' section of BST Tool for any matching defect
  6. If no resolved defect matches the system version (the current version is higher than every 'Known Fixed Releases' entry of every matching defect for the respective system release), you are potentially facing a new crash that is not yet known to the Cisco engineering team
  7. For any new crash, collect the following dataset before opening a TAC Service Request:
    • Detailed/Debug level traces of <crashing process> spanning from 30 minutes before the crash to 30 minutes after the crash
    • Cisco RIS Data Collector PerfMon Logs for the week leading up to the crash date
    • EventViewer-Application Log and EventViewer-System Log for the week leading up to the crash date
    • Core file acquired from the system
    • Output of the CLI command 'utils core active analyze <core filename>'
    • Additionally, depending on the crashing service, collect:
      • For a "Connection Notifier" crash, include "Connection Conversation Manager" logs along with "Connection Tomcat" and "Cisco Tomcat" logs
      • For a "Connection Message Transfer Agent" crash, include "Connection Conversation Manager" logs
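
When several defects match, the selection in step 2 and the upgrade target in step 5 can be sketched as a small filter. This is a rough illustration under stated assumptions: the defect IDs are placeholders, the build strings are made up, the function names are hypothetical, and treating the first two numeric fields as the release train is an assumption rather than a documented rule.

```python
import re

def parse_version(text):
    """Turn a build string such as '12.5.1.13900-152' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text))

def select_defects(current, defects):
    """Keep defects with a fixed release newer than the running version.

    `defects` maps a defect ID to its list of Known Fixed Releases.
    Assumption: only fixes in the same release train (first two numeric
    fields) are relevant to the running system.
    """
    cur = parse_version(current)
    selected = {}
    for defect_id, fixed_releases in defects.items():
        newer = [v for v in fixed_releases
                 if parse_version(v)[:2] == cur[:2] and parse_version(v) > cur]
        if newer:
            selected[defect_id] = newer
    return selected

def upgrade_target(selected):
    """Highest fixed release across the selected defects, or None."""
    candidates = [v for fixed in selected.values() for v in fixed]
    return max(candidates, key=parse_version) if candidates else None

# Placeholder defect IDs and made-up versions, for illustration only.
defects = {
    "DEFECT-A": ["12.5.1.14900-11"],
    "DEFECT-B": ["12.5.1.13000-1"],
    "DEFECT-C": ["12.5.1.16900-48", "14.0.1.10000-20"],
}
selected = select_defects("12.5.1.13900-152", defects)
print(sorted(selected), upgrade_target(selected))
```

An empty selection corresponds to step 6: none of the matching defects applies to the running version, so the crash is potentially new.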

 

Generic core – performance issue

 

  1. This core dump was generated because the service process could not get the CPU resources it needed to continue operating
  2. Usually this means that the underlying infrastructure (hardware or virtualization layer) could not provide enough CPU resources when the application needed to run
  3. This may happen because of high IO, a blocking operation at the VM level (for example, with snapshots), or intensive CPU operations within the virtual machine
  4. For that reason, look for the root cause outside the application itself. Verify that:
    • Hardware components and firmware versions on the host server are compatible with the VMware vSphere version. For Cisco UCS, use this link: https://ucshcltool.cloudapps.cisco.com/public/
    • The VM was deployed from the template downloaded from the cisco.com website
    • The VM template used to deploy the application virtual machine is compatible with the current VMware vSphere version
    • Resources allocated to the VM (number of vCPUs, memory, disk size) correspond to the deployment type (number of users serviced by the application)
    • When the host server uses DAS storage (internal disks), Thick provisioning is used for the vDisk of the application virtual machine. Check the Collaboration Virtualization Hardware guide for details: https://www.cisco.com/c/dam/en/us/td/docs/voice_ip_comm/uc_system/virtualization/collaboration-virtu...
    • No unsupported configuration is applied to the VM at the VMware layer (for example, snapshots). Refer to the following document for a complete list of supported and unsupported VMware features per application: https://www.cisco.com/c/dam/en/us/td/docs/voice_ip_comm/uc_system/virtualization/virtualization-soft...
    • There are enough IOPS to serve all VMs placed on the datastore
    • No IO blocking operation is happening on the datastore that may 'freeze' the VM for a certain period of time
    • The VM is not oversubscribed: the number of users and applications serviced corresponds to the provisioned VM resources
    • VMware Tools are installed on the application VM and updated to the version corresponding to the VMware vSphere version
  5. If you have checked, confirmed, or fixed all of the above but the application continues to crash, or you are unable to perform the above checks and need escalated support, collect the following information before opening a TAC Service Request:
    • UCS tech-support bundle (if the application VM is hosted on a Cisco UCS server)
    • VMware Support Logs (including performance data)
    • From the CLI of the affected server, run the 'file build log <crashing process>_core' command to generate a log bundle
    • Download the log bundle with 'file get activelog <log bundle archive>'

If the logs of the crashing service are not available for download through the log bundle CLI command, or you prefer to use RTMT, collect and provide:

    • Detailed/Debug level traces of <crashing process> spanning from 30 minutes before the crash to 30 minutes after the crash
    • Cisco RIS Data Collector PerfMon Logs for the week leading up to the crash date
    • EventViewer-Application Log and EventViewer-System Log for the week leading up to the crash date
    • Core file acquired from the system
    • Output of the CLI command 'utils core active analyze <core filename>'
    • Additionally, depending on the crashing service, collect:
      • For a "Connection Notifier" crash, include "Connection Conversation Manager" logs along with "Connection Tomcat" and "Cisco Tomcat" logs
      • For a "Connection Message Transfer Agent" crash, include "Connection Conversation Manager" logs

 

Generic core – memory leak

 

  1. The backtrace pattern suggests a memory leak. To confirm, check the core file size: if it is close to 4 GB for versions 11.x and later, or close to 3 GB for versions prior to 11.x, the problem is a memory leak.
  2. The process crashed because the OS could not allocate the additional memory requested by the application, as the application had reached its maximum allowed virtual memory size
  3. The process backtrace did not match any known defect, so the problem needs further investigation to find the root cause of the memory leak
  4. Collect the following dataset before opening a TAC Service Request:
    • Detailed/Debug level traces of <crashing process> spanning from 30 minutes before the crash to 30 minutes after the crash
    • Cisco RIS Data Collector PerfMon Logs for the week leading up to the crash date
    • EventViewer-Application Log and EventViewer-System Log for the week leading up to the crash date
    • Core file acquired from the system
    • Output of the CLI command 'utils core active analyze <core filename>'
    • Additionally, depending on the crashing service, collect:
      • For a "Connection Notifier" crash, include "Connection Conversation Manager" logs along with "Connection Tomcat" and "Cisco Tomcat" logs
      • For a "Connection Message Transfer Agent" crash, include "Connection Conversation Manager" logs

  5. If you don't have the corresponding traces for the crashing service, check the other servers in the cluster to see whether the memory utilization of the crashing service is close to the maximum (or unusually high) on any of them.

    If so, complete the following steps:

    • Verify that the trace level of the crashing service is set to Detailed/Debug
    • Force a core of the crashing process on the impacted server for root cause analysis (do this outside business hours)
    • Force the process crash by logging in to the CLI and running two commands:

show process list (look for the <pid number> of the crashing service [for example, /usr/local/xcp/bin/jabberd] in the output)

delete process <pid number> crash (use the <pid number> copied from the previous command)

    • Collect the above dataset from the impacted server and open a TAC Service Request
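
The core file size check from step 1 of this section can be sketched as follows. This is a rough illustration only: the 10% tolerance used to decide what "close to" means, the interpretation of 3 GB and 4 GB as binary gigabytes, and the function names are assumptions, not values stated by Cisco.

```python
GIB = 1024 ** 3  # assumption: the limits are interpreted as binary gigabytes

def expected_limit_bytes(major_version):
    """Limit from step 1: about 4 GB for 11.x and later, about 3 GB before."""
    return 4 * GIB if major_version >= 11 else 3 * GIB

def looks_like_memory_leak(core_size_bytes, major_version, tolerance=0.10):
    """True when the core file is within `tolerance` of the expected limit.

    The 10% default is a hypothetical threshold chosen for illustration.
    """
    limit = expected_limit_bytes(major_version)
    return core_size_bytes >= limit * (1 - tolerance)

# Hypothetical core sizes, for illustration only.
print(looks_like_memory_leak(int(3.9 * GIB), major_version=12))
print(looks_like_memory_leak(int(1.2 * GIB), major_version=12))
```

A core file well below the limit points away from the memory leak scenario, so one of the other scenarios in this document is more likely to apply.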

 

No matches - potentially new defect (not caused by performance or memory leak)

 

For any unidentified crash, collect the following dataset before opening a TAC Service Request:

  • Detailed/Debug level traces of <crashing process> spanning from 30 minutes before the crash to 30 minutes after the crash
  • Cisco RIS Data Collector PerfMon Logs for the week leading up to the crash date
  • EventViewer-Application Log and EventViewer-System Log for the week leading up to the crash date
  • Core file acquired from the system
  • Output of the CLI command 'utils core active analyze <core filename>'
  • Additionally, depending on the crashing service, collect:
    • For a "Connection Notifier" crash, include "Connection Conversation Manager" logs along with "Connection Tomcat" and "Cisco Tomcat" logs
    • For a "Connection Message Transfer Agent" crash, include "Connection Conversation Manager" logs
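
The collection windows in the dataset above (30 minutes on either side of the crash for traces, one week up to the crash for PerfMon and EventViewer logs) can be worked out from the crash timestamp. The sketch below is only a convenience illustration: the function name, the dictionary keys, and the sample timestamp are hypothetical, and the resulting times still have to be entered manually in whichever collection tool you use.

```python
from datetime import datetime, timedelta

def collection_windows(crash_time):
    """Return (start, end) time ranges for each part of the dataset."""
    return {
        # Detailed/Debug traces: 30 minutes before to 30 minutes after.
        "traces": (crash_time - timedelta(minutes=30),
                   crash_time + timedelta(minutes=30)),
        # PerfMon and EventViewer logs: the week leading up to the crash.
        "perfmon_and_event_logs": (crash_time - timedelta(days=7),
                                   crash_time),
    }

# Hypothetical crash timestamp, for illustration only.
crash = datetime(2024, 3, 10, 14, 5)
for name, (start, end) in collection_windows(crash).items():
    print(f"{name}: {start:%Y-%m-%d %H:%M} to {end:%Y-%m-%d %H:%M}")
```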

 
