Chalk Talk: What Not To Do With UCCX

At TAC, we know that a Contact Center Express environment is rather sensitive and needs careful handling, especially since the High Availability UCCX cluster is often the main “face” of our customers' business; the cash cow, so to speak.

We know that our customers handle everything to do with networking and Cisco at their end, and we try to help by explaining as much as possible: pointing out good reference documents, and describing what we do and why. This helps you resolve your issues before you need to reach out to TAC, and hence this article: “What Not To Do With UCCX”.

DRS Backups

We have a great backup and restore system built into CUCM and UCCX (DRS) that is practically as good as a snapshot. However, when we ask “How old is your last-known-good backup?”, we seldom hear the answer we want.

Taking frequent backups is a good thing to do; it's a great way to ensure that a catastrophe does not make your historical reporting data, or those skillsets and application configurations you so carefully fine-tuned, vanish. Whether you take scheduled backups or a manual backup, always spare five minutes to read the Cisco Unified CCX DRF logs and confirm that the backups were successful. Just searching for "error", “exception”, or "fail" using WinGrep or a similar text file search tool is good enough.

We saw quite a few customers face major issues with silently-failing DRS backups, documented in Field Notice 63565 (http://www.cisco.com/en/US/ts/fn/635/fn63565.html): the backups would be reported as “successful”, but a crucial component (the LDAP database, a.k.a. the CAD component) would silently fail. Months later, when customers tried to restore these backups, they would be faced with a blank LDAP database and no way to recover it unless the High Availability server had been left untouched by the DRS restore.

This of course meant that agents and supervisors could not log in, and all workflows, skill information, and resource information was lost. All that pain could have easily been avoided by simply grepping through the backup logs for "error", “exception”, or "fail" and contacting TAC when the search turned up a match. In fact, if you use OpenSSH on Cygwin or a Linux server, you can automate this grepping with a simple BASH script.
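
Here is a minimal sketch of such a script, assuming the DRF backup logs have already been collected (for example via RTMT) into a local folder; the folder path and script name are placeholders:

    #!/bin/bash
    # check_drf_logs.sh - scan collected Cisco Unified CCX DRF logs for signs
    # of a silently failing backup. Pass the folder holding the logs as $1.
    LOGDIR="${1:-./drf-logs}"
    if grep -rEil 'error|exception|fail' "$LOGDIR"; then
        echo "Possible backup problems in the files listed above; review them and contact TAC."
    else
        echo "No error/exception/fail strings found under $LOGDIR."
    fi

Scheduled through cron right after your backup window, a check like this gives you a daily sanity report with no manual effort.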

Switch Version Failure

Another way data often gets lost is when customers lose patience with the switch-version stage of a UCCX 8.x or 9.x upgrade: the switch-version process does not show a progress update, and all too often customers reboot the server because they think the process has “hung”. During a switch-version, all data from the old version is copied across to the new version; interrupting this process will leave the data corrupt and unusable. Switching version to UCCX 8.5(1)SU3 and above includes an option to recover from this, but prevention is simply better: you can read the logs instead of wondering whether the process hung. Use the CLI command

“file tail install /uccx-install.log”


to get a scrolling display of the current status of the switch-version, straight from the install logs. That gives you a continuous progress update.

Figure 1: Scrolling progress for "utils system switch-version"

Reading Logs

Now that logs have been mentioned, let's talk about reading them. The UCCX guides are a great place to start learning how to read logs. You can start with the Command Line Reference, which gives you the commands to view and search through the relevant logs. You can also use the UCCX Serviceability & Administration Guide to determine which logs to read for which issue, and how to set the logging levels so that you get the information you need.

Again, searching for "error", "exception", or "fail" is a good way to start, although you will get false positives: “Exception=null” just means that there was no exception. Of course, if the Unified CM Telephony Subsystem is in PARTIAL SERVICE, just search for “partial” in the logs of its parent service, the Cisco Unified CCX Engine. The Serviceability & Administration Guide will tell you that “SS_TEL” refers to this subsystem, and searching for “partial” will show you the lines where SS_TEL explains why it is in partial service.
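
For instance, here is a minimal sketch of that search, assuming you have downloaded the current Engine log to a Linux or Cygwin machine (the file name below is hypothetical):

    # Show only the Telephony Subsystem lines that mention partial service
    grep "SS_TEL" Cisco001MIVR1234.log | grep -i "partial"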

Of course, you can also use the RTMT “Telephony Data” feature under the “Cisco Unified CCX” tab to identify why your Unified CM Telephony Subsystem is in PARTIAL SERVICE. This feature will clearly tell you which triggers (CTI Route Points), call control group members (CTI Ports), or JTAPI Provider users are in trouble and need to be fixed.

Log analysis is quite easy once you understand how to go about it. Of course, you may want to start slow, perhaps by analyzing a log file where only one call is present. Here's a simple way to read a UCCX call as you would read a debug on a voice gateway:

  1. Find out which UCCX Engine log file is currently active, i.e. which log file is currently open. We know that UCCX Engine logs are named “Cisco001MIVRxxxx.log”, so running this command should tell us which one is the current active log:

    show open files regexp Cisco001MIVR

  2. Next, get all the telephony-related information in a scrolling display. Run “file tail” again as with the switch-version, but this time use the Engine log and include an option to display only the lines matching the regular expression “SS_TEL”, i.e. the Unified CM Telephony Subsystem (a worked example follows Figure 2).

    file tail activelog /uccx/log/MIVR/Cisco001MIVRxxxx.log regexp SS_TEL

  3. Dial the UCCX trigger number and see the call scroll on your screen.


Figure 2: Scrolling progress for Engine Telephony Subsystem logs, with a call in progress
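
For instance, suppose step 1 showed that the currently open file was Cisco001MIVR217.log (the sequence number here is hypothetical); the command in step 2 then becomes:

    file tail activelog /uccx/log/MIVR/Cisco001MIVR217.log regexp SS_TEL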

The Control Center and Datastore Control Center in the Cisco Unified CCX Serviceability portal are the best places to start gathering information on what's going wrong with your system.

Ensure that all your services in the Control Center show up with the status “IN SERVICE”. If any show a different status, use the drop-down button next to the service name to identify which child service is not IN SERVICE.

For example: if your agents are unable to log in, check the status of the RmCm subsystem under Cisco Unified CCX Engine > Subsystem Manager.

The Datastore Control Center shows the state of database replication. If historical reports say there is no data when the Publisher is the Master Node, but produce data when the Subscriber has Mastership, check the replication status.

On the “Replication Servers” page, both servers should be listed as “Active” and “Connected”.

On the “Datastores” page, each datastore should have read and write access enabled, and the replication status should be “Running”.

Network Time Protocol

The importance of logs cannot be overstated. Log analysis is how TAC operates most of the time, especially when providing the root cause for issues that happened in the past and no longer occur. As UCCX is integrated with CUCM, TAC often needs to read logs from both products to fully understand the situation and provide the most effective solution. It is essential to get the right time frame from the customer for log analysis, as this narrows the focus and ensures that the same calls are looked at within UCCX as well as in CUCM. However, time loses all meaning when the servers are not keeping the right time and have drifted apart. Unified Communications products use the Network Time Protocol (NTP) to ensure that the correct time is always maintained.

NTP is highly important for a Contact Center Express deployment: not only for the correctness of logs, but also for historical reporting (imagine calls being reported as days long because of NTP drift!) and for all communication going out of the UCCX server.

When NTP goes out of sync on the UCCX or CUCM, or worse, when your NTP server itself goes awry and starts reporting the wrong time, havoc breaks out. I have seen deployments where agents are unable to log in because their PCs report a different time than the UCCX, and where all callers to UCCX hear a reorder tone because CUCM and UCCX are on different times.

Suppose UCCX sends a message timestamped “2013, March 1st, 1200 hours” while the current time on CUCM is “2013, March 2nd, 1200 hours”. CUCM thinks the message it received is a day old, and of course ignores it. Hence CUCM and UCCX, as well as CAD and UCCX, are unable to talk to each other.

The lesson here? Always ensure that NTP is in sync on your CUCM and UCCX, and that the UTC time on UCCX and CUCM is the same. Try to use the same NTP server(s) for both, so that they follow the same leader.

The command

“utils ntp status”

will show you the NTP server currently being referenced, as well as the current system time.

Please ensure that you only use Linux servers or IOS gateways as NTP sources; Windows NTP servers use a slightly different version of the protocol, and this will cause issues. Also, ensure that the stratum of the NTP server used is between 1 and 4.

Figure 3: Output of "utils ntp status"
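
A quick sanity check, sketched below with a hypothetical Linux NTP source named ntp01, is to run the status command on every UCCX and CUCM node and then query the Linux source itself, confirming that its peers and their stratum (the “st” column) keep the UCCX within the 1-4 range:

    # On each UCCX and CUCM node (platform CLI): verify the reference server and stratum
    utils ntp status

    # On the Linux NTP source (hypothetical host ntp01): list its upstream peers and their stratum
    ntpq -p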

High Availability over WAN

CUCM-UCCX communication and intra-UCCX communication should experience as little lag as possible. Customers who have deployed the High Availability over WAN (HAoWAN) option will be familiar with this requirement.

However, we at TAC often see that the server communication is hampered by a very small mistake, which also prevents the HAoWAN feature from working as designed. Let's talk about HAoWAN before I get to this mistake.

There are two sites in a HAoWAN deployment, separated by a WAN link. This deployment option is designed so that in case of a WAN link failure, both sites split into independent UCCX sites: they operate in “island mode” until the WAN link comes back up and the cluster can communicate within itself again.

Logically, there should be a CUCM and a UCCX at each site, at the bare minimum. Now imagine this scenario when the WAN link goes down between Sites A and B:

  1. Site A's UCCX refuses to communicate with its local CUCM and instead reaches out to Site B's CUCM first. Site B's UCCX does the same, preferring to communicate with Site A's CUCM. Valuable time is lost in recovering from the WAN link failure.
  2. When callers are put on hold by Site A, they hear the familiar hold music, pleasant and soothing as always. However, callers who reach Site B hear only a disquieting silence, and drop out of queue in bunches because they suspect their calls have hung.
  3. Callers who reach Site A get transcoded from G.729 to G.711 with no issues. However, callers who reach Site B get disconnected because their calls are not getting transcoded.

All this could have been avoided by following a simple mantra: ensure that the UCCX first reaches out to the local CUCM. This entails the following configuration:

  1. The local CUCM is listed as the first AXL Provider, Unified CM Telephony Provider, and RmCm Provider for each UCCX node. This can be found in Appadmin > System > Cisco Unified CM Configuration.
  2. The Call Control Group configuration in HAoWAN is split in two: one half per UCCX node. Ensure that the Device Pool and Hold Audio Sources selected for each half point to the local CUCM and to local sources.

Unless the HA over WAN Best Practices are adhered to, the cluster will misbehave during a WAN outage, and WAN bandwidth consumption will needlessly shoot up.

Let's go back to what my colleague said: “If you don't have a phone, you cannot make calls”. This may seem fairly obvious, yes. However, what about a situation where the agent phone is present and does have the IPCC extension on it, but UCCX does not know that this is the phone to which the call must be delivered?

I'm talking about sharing UCCX directory numbers, which is an unsupported configuration. UCCX requires that its directory numbers are unique across all partitions: a CUCM Route Plan Report search (found under the Call Routing menu) for the DN should return exactly one result.

This applies to CTI Ports (Call Control Groups), CTI Route Points (triggers), and agent phone extensions.

This rule goes for the non-ACD extensions on the agent phones, too. The reasoning behind this? UCCX places JTAPI observers on these lines so that Cisco Agent Desktop can recognize when the agent is on a non-ACD call. Sharing a line that has a JTAPI observer on it will result in the observer getting confused.

Troubleshooting

We really love it when we get a case with all the initial data already present in it. This helps us troubleshoot faster, which in turn brings your call center back up faster.

This includes:

  • The exact UCCX version string (for example, 7.0(1)SR05_Build504, 8.5(1)SU4, or 8.5.1.11004-25).
  • Whether it is a single node or an HA cluster.
  • The time when the issue started happening, or at least the time when the issue was first reported.
  • The frequency of the issue, and the breadth of its impact. For example: are all agents impacted, or only some? If only some, what makes them different from those who are not impacted?
  • What changed just prior to the start of the issue. If you are aware of any changes to the network, the CUCM, the UCCX, or anything even slightly related to the UCCX, do let us know. It's better for us to get irrelevant information than to miss relevant information. I once had a case where the agent PCs were entering sleep mode, causing CAD to disconnect! If only the network admin had informed me of that...
  • Any steps you have taken to try to correct the issue, or to work around it.

Updates

Cisco regularly releases Service Updates (SUs) for UCCX, which contain important bug fixes and product enhancements. We also release Field Notices whenever issues of high importance crop up and we want our customers to be aware of them. Please try to be on the latest SU release; this is a proactive step towards protecting and nurturing the health of your Contact Center. Keep an eye out especially for Field Notices, and do follow the instructions closely when one comes out.

Conclusion

I have talked about DRS backups, reading and understanding logs using the tools and guides provided, and ensuring that best practices are followed. I hope this gave you some food for thought, at least for those waits on hold with the TAC Front Line!

About the Author

Presently on the APAC shift, Anirudh Ramachandran has been working with the Unified Contact Center Express TAC since the day he joined Cisco. While not solving service requests, he enjoys tea, photography, Linux, and using the kitchen as a laboratory.
