Robert Sherwin
Cisco Employee

Root Cause Analysis (RCA) Report

Summary of Incident Description

On Oct 21, 2014, Cisco’s Anti-Spam backend infrastructure experienced network downtime that prevented Email Security Appliances (ESAs) from pulling down updates for several hours. The lack of updates exposed a performance issue in which CPU utilization rises until the mail work queue becomes overloaded, delaying mail delivery.

Incident Categorization

Incident Date(s): 10/21/2014
Incident Type: Service impacting
Severity Level: 1

Detection of Incident

The incident was detected when TAC received multiple customer escalations.

Timeline:

Time (PDT) | From T0 | Details
23:30 (Oct 20) | T0 + 00:00 | Network issue occurs in a Cisco data center, affecting multiple internal systems. Anti-Spam update frequency is affected, but updates do not stop. Spam processing on customer ESAs still operates as expected.
04:50 (Oct 21) | T0 + 05:20 | Anti-Spam updates stop. Depending on load, spam processing begins to slow down and customer ESA work queues begin to back up.
06:39 | T0 + 07:09 | TAC begins to receive customer escalations related to work-queue backups.
07:02 | T0 + 07:32 | Network fixes partially restore Anti-Spam updates. Customer ESAs update within 5-10 minutes, spam scanning resumes normal operation, and ESAs begin to drain their work-queue backups. The majority of customers’ work queues return to normal mail processing within 20-30 minutes of the update.
12:30 | T0 + 13:00 | All network issues resolved; Anti-Spam updates fully restored to normal frequency.

 

Root Cause

Cisco’s Anti-Spam backend infrastructure suffered a customer-impacting outage on Oct 21, 2014. Initial investigations suggest that the root cause is associated with a required network VIP maintenance effort conducted on Oct 20, 2014 at approximately 11:30pm PDT. Further complications led to an application-based outage at approximately 4:00am on Oct 21, affecting more than 50 external Email Security customers. Cisco fixed the underlying problem, and services were restored on Oct 21 at around 12:30pm PDT. To explain the complexities of this outage, we have compiled the following list of relevant details.

Preliminary Details:

  • The Anti-Spam infrastructure’s Monitoring team alerted the Network Operations team at approximately 11:00pm PDT on Oct 20 that production VIPs associated with VLAN103 were exhibiting a problem similar to an earlier remediated VLAN603 issue. The VLAN603 issue involved routers having trouble relearning the MAC addresses of VLAN hosts for proper routing using the DSR protocol; it was remediated by registering the MAC addresses statically in the route tables (see the first sketch after this list). Applying the same workaround to VLAN103 appeared to clear the alerts generated by the Monitoring systems. The Network Operations team resolved the underlying problem, removing the need for the static MAC address workaround, at 12:30pm PDT on Oct 21.

  • An initial investigation suggests that a number of production applications responsible for email Anti-Spam updates started having problems in the early hours of Oct 21. One such application was an update packager, which packages Anti-Spam rules for delivery to Email Security solutions via an updater service.

  • Email Security solutions rely on robust cloud services to provide updates that are available at all times. In this incident, because updates were not available from the cloud, a potential bug in the Email Security solution caused the scanning process’s memory to grow beyond available resources. Depending on the volume of incoming email, that memory growth can create a bottleneck in the mail handling service that restricts or stops mail delivery; the second sketch after this list illustrates how this failure signature can be watched for. TAC received the first reports of these incidents from customers who noticed stopped or slowed email flow at around 4:00am PDT on Oct 21.

  • The Network Operations team was able to resolve the root of the problem, which involved the MAC address expiration settings on the Nexus 7K routers. As a result, the condition appears to have been fixed at approximately 12:30pm PDT, and customer queues began to clear as updates began to flow again. The Network Operations team identified the root cause of the intermittent connectivity issue as the synchronization of MAC address aging settings between the Nexus 5K and Nexus 7K routers (also illustrated in the first sketch after this list).

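As an illustration of the interim workaround and the eventual fix described above, the sketch below shows how a static MAC entry for VLAN103 and a common MAC aging time might be pushed to the Nexus switches with the open-source Netmiko library. It is a minimal sketch, not the configuration actually used during the incident: the hostnames, credentials, MAC address, interface, and timer value are placeholders, and the exact NX-OS syntax varies by platform and release.

```python
#!/usr/bin/env python3
"""Illustrative only -- not the configuration used during this incident.

Assumes the open-source Netmiko library. Hostname, credentials, MAC address,
interface, VLAN, and aging time below are placeholders, not values from the RCA.
"""

from netmiko import ConnectHandler

# Placeholder device details.
nexus_7k = {
    "device_type": "cisco_nxos",
    "host": "nexus7k-a.example.net",
    "username": "netops",
    "password": "REDACTED",
}

# Interim workaround described in the RCA: statically register a VLAN103 host
# MAC so the router no longer depends on relearning it for DSR routing.
static_mac_workaround = [
    "mac address-table static 0000.5e00.0101 vlan 103 interface Ethernet1/10",
]

# Eventual fix described in the RCA: keep MAC aging settings consistent across
# the Nexus 5K/7K pair so entries do not expire out of step.
align_aging_time = [
    "mac address-table aging-time 1800",
]

with ConnectHandler(**nexus_7k) as conn:
    print(conn.send_config_set(static_mac_workaround))
    print(conn.send_config_set(align_aging_time))
```

The same aging-time change would also be applied on the Nexus 5K side so that both platforms age MAC entries consistently.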
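The work-queue symptom described in the third bullet can also be watched for directly. The following sketch is not taken from the RCA: it assumes a hypothetical collector, get_esa_metrics(), that returns the work-queue depth and the age of the last successful Anti-Spam rule update for an appliance (in practice gathered via SNMP or the appliance’s reporting interface), and it simply flags the combination of a stale rule set and a growing queue that characterized this incident.

```python
#!/usr/bin/env python3
"""Illustrative watchdog for the failure signature described in this RCA.

get_esa_metrics() is a hypothetical stub: in a real deployment these values
would come from SNMP or the appliance's reporting interface. The thresholds
and hostnames are examples only.
"""

MAX_RULE_AGE_SECONDS = 2 * 3600   # example: alert if no Anti-Spam update for 2 hours
MAX_WORKQUEUE_DEPTH = 500         # example: alert once the work queue backs up


def get_esa_metrics(host: str) -> tuple[int, int]:
    """Placeholder stub returning (workqueue_depth, seconds_since_last_update)."""
    return 1200, 3 * 3600          # static example values for illustration


def check_appliance(host: str) -> None:
    depth, rule_age = get_esa_metrics(host)
    if rule_age > MAX_RULE_AGE_SECONDS and depth > MAX_WORKQUEUE_DEPTH:
        # Matches the incident signature: stalled updates plus a growing queue.
        print(f"{host}: Anti-Spam updates stale for {rule_age}s and work queue "
              f"at {depth} messages -- investigate before mail delivery stalls")


if __name__ == "__main__":
    for esa in ("esa1.example.com", "esa2.example.com"):  # placeholder hosts
        check_appliance(esa)
```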
Corrective Actions

As Cisco continues to analyze and investigate, the defects listed below will contain the details of the Corrective Actions intended to prevent this Anti-Spam service outage from happening again. This document will continue to be updated until it is marked FINAL.

Current defects that are attributed to this issue are:

CSCzv10857: When CASE is not updating, the service becomes unstable

CSCur40350: Unavailable update services cause the email process to halt
