
Upgrade 11.5(1) to 11.5(1)SU6 broke last night (IPSEC)

RAustin70
Level 1

Well, it happened.  Dammit.

 

     Last night I began my upgrade to SU6 after a couple of weeks of preparation, stressing, and praying.  It started off great: the UCM Pub upgraded in just under 2 hours, YAAY!!  I kicked off Sequence 2 of the guide (upgrade Primary and Secondary Subs, and CUP Pub in parallel) and they all failed pretty much within the first 15 minutes.

 

     I spent a couple of hours with Cisco TAC via Webex poring over everything they could think of, to no avail.  They told me my network was broken and I had to have my Network Team come fix it.  Uh, what?

 

     Here are my symptoms:  All Subs and other nodes can ping each other as well as the gateways, my VGs, ISRs, etc.  None of my Subs can ping my Pub.  My Pub can ping everything under the sun except my Subs and my CUC servers.  Two of the four Subs are on the same ESXi host and subnet as my Pub....but the tech says my network was broken.  SMH

 

     By this time it was stretching into a 16-hour workday, so I went home to grab three hours of sleep.  Driving in, it hit me....IPSEC policies!!  Sure enough, when I got in, every single device that the Pub could not ping was on an IPSEC Transport policy.  Dammit.  Nowhere in any release notes, upgrade documentation, or even readmes was there anything about IPSEC, except for a blurb about upgrading to 11.5(1)SU6 from Release 6.1(5)!!!
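The correlation described above (every peer the Pub could not ping was on an IPSEC policy) can be checked mechanically. This is only a minimal sketch: the host names are hypothetical placeholders, and the two input sets are assumed to be collected by hand from ping tests and from the IPSEC policy list; it does not talk to CUCM in any way.

```python
def unexplained_failures(unreachable, ipsec_peers):
    """Return peers that fail to ping but are NOT covered by an IPSEC policy."""
    return set(unreachable) - set(ipsec_peers)


def ipsec_is_plausible_cause(unreachable, ipsec_peers):
    """True when at least one peer is unreachable and all of them have a policy."""
    unreachable = set(unreachable)
    return bool(unreachable) and not unexplained_failures(unreachable, ipsec_peers)


# Hypothetical node names, for illustration only.
unreachable_from_pub = {"sub1", "sub2", "sub3", "sub4", "cuc1", "cuc2"}
nodes_with_ipsec_policy = {"sub1", "sub2", "sub3", "sub4", "cuc1", "cuc2"}

print(ipsec_is_plausible_cause(unreachable_from_pub, nodes_with_ipsec_policy))
```

If `unexplained_failures` returns anything, some other cause (routing, firewall, a down node) would still need to be ruled out for those peers.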

 

     I reached out to a CCIE Voice buddy of mine and he quickly came back with this tidbit I wanted to share with all of you:

 

"There is a bug that started somewhere in 11.5 in which Red Hat stopped supporting OpenSwan (IPSec). Cisco didn't catch this until 12.x and continued to use it after the support ended. This was fixed in 12.x with Libreswan. The bug ID is CSCvc16004. This was supposed to be back channeled into 11.5 to be fixed, but I have not seen if this happened or not. The effects of this issue are intermittent, meaning we have customers on 11.5 with IPSec enabled without issue, while others have some wonky things going on."

 

     So that is where I am right now:  I know what is most likely the issue, and I am waiting on Cisco TAC to call me back so I can share this with them and develop a plan to get my systems back up and behaving.  Hoping a simple Pub reboot (with old version still active) works, or removing the IPSEC policies, or <insert fix here>

 

     I will post the fix when it comes so the next time someone has an issue similar to this, hopefully I will save them some stress.

 


7 Replies

Anthony Holloway
Cisco Employee

Damn!  I'm sorry to hear that.  I can't wait to hear what TAC has to say about the IPsec theory.

 

Are you using IPsec between CUCM nodes?  If so, why?  Just curious, as I have never seen this before.

Yes we are (were) using IPSEC Transport mode between all CUCM nodes as well as CUC nodes per the Military Unique Deployment Guide (MUD-G) we have to follow.  Not my choice I promise you lol.

 

I have everything talking again by disabling the policies, but some nodes I had to enable/disable the policy a time or two to get it working right.

 

TAC has shuffled me between three engineers so far, with only the first one even talking to me so I am going this one alone pretty much.

 

Now that I have everything up and talking, and utils dbreplication runtimestate shows 2's across the board, I think I am ready to proceed with the upgrade process.  But now the question becomes:  what do I do with the 4 subscriber nodes that failed last night?  Do I reboot them and try again, or just try again?  Have to find that out.

 

Rob

Oh, right, US military. Makes sense why I've never seen anyone do that before.

Yes, that's the typical action: reboot and try again. Good luck!

Upgrade is completed, have a couple lessons learned from the experience.

 

     If you are using IPSEC tunnels with certificates that are not self-signed, that looks to cause issues between the nodes.  Disable the policies via the GUI and verify database replication before beginning the upgrade.
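For the "verify database replication" step, here is a minimal sketch of checking pasted `utils dbreplication runtimestate` output for nodes that are not in state 2. The sample text is an approximation of the node table written from memory, not captured output; the column layout, host names, and addresses are assumptions, so adjust the pattern to whatever your version actually prints.

```python
import re

# ASSUMPTION: illustrative approximation of the node table, not real output.
SAMPLE = """\
SERVER-NAME   IP ADDRESS   PING(msec)  DB/RPC/DbMon  REPL.QUEUE  GROUP ID  SETUP (RTMT) & Details
cucm-pub      10.0.0.10    0.030       Y/Y/Y         0           (g_2)     (2) Setup Completed
cucm-sub1     10.0.0.11    0.250       Y/Y/Y         0           (g_3)     (2) Setup Completed
cucm-sub2     10.0.0.12    0.310       Y/Y/Y         0           (g_4)     (4) Setup Failed
"""

# Host name, IPv4 address, then the single-digit "(N)" replication state.
STATE_RE = re.compile(r"^(\S+)\s+(\d{1,3}(?:\.\d{1,3}){3})\s.*\((\d)\)\s")


def replication_states(text):
    """Map each node name to its numeric replication state."""
    states = {}
    for line in text.splitlines():
        match = STATE_RE.match(line)
        if match:
            states[match.group(1)] = int(match.group(3))
    return states


def nodes_not_ready(text):
    """Return the sorted names of nodes whose state is anything other than 2."""
    return sorted(name for name, state in replication_states(text).items() if state != 2)


print(nodes_not_ready(SAMPLE))
```

An empty list corresponds to "2's across the board"; anything else is worth fixing before kicking off the upgrade.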

 

     If you are using Tomcat with certificates that are not self-signed, that also looks to cause issues during the upgrade (the web browser stops responding).  If that happens, jump into the CLI if you are not already there and run utils service restart Cisco Tomcat.  The browser will begin reporting again.  You can also monitor the upgrade via CLI by typing utils system upgrade status

I have never seen an upgrade hang with CA signed Tomcat certs, so this could be a one-off issue you faced. Tomcat does take a really long time to respond, even after the service is stated as being STARTED.

You can also monitor the upgrade, at a very high level, with: file tail install system-history.log. This works for backups and switch versions too. As a bonus, this file grows over time, and you can see the upgrade history of a system with: file search install system-history.log "Upgrade.*Success"
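If you have pulled system-history.log off the box, the same "Upgrade.*Success" pattern can be applied offline. A minimal sketch follows; the sample lines, timestamps, and version placeholders are made up for illustration and are only an assumption about what the log looks like.

```python
import re

# Same regular expression as the file search command above.
UPGRADE_OK = re.compile(r"Upgrade.*Success")

# ASSUMPTION: made-up lines with placeholder versions, not a real log.
SAMPLE_LOG = """\
01/10/2019 21:02:11 | root: Upgrade <new-version> Start
01/10/2019 22:55:40 | root: Upgrade <new-version> Success
01/10/2019 23:10:05 | root: Switch Version <old-version> to <new-version> Start
01/10/2019 23:41:37 | root: Switch Version <old-version> to <new-version> Success
"""


def successful_upgrades(log_text):
    """Return every log line that records a successful upgrade."""
    return [line for line in log_text.splitlines() if UPGRADE_OK.search(line)]


for entry in successful_upgrades(SAMPLE_LOG):
    print(entry)
```

The switch-version lines are skipped because they do not contain the word "Upgrade"; loosen the pattern if you want those as well.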

Correct, the upgrade indeed didn't hang, but the WebGUI visual indications were that it hung because the install log wasn't scrolling along.  So I went in and restarted the service and it began working again.

 

I found this out because, post upgrade, my Publisher WebGUI stopped working.  When I went in and ran utils diagnose test, the tomcat_connect test failed.

 

Rob

Oooh, you kicked off the upgrade in the GUI, and that's what hung. I see. Sorry, I thought you meant like after the switch version, Tomcat was hung. I kick off 100% of my upgrades via the Console, for two reasons:

1) It's not tied to my laptop (or any machine) so I can suffer a blue screen of death, or just unplug and the upgrade keeps running

2) It's not tied to the browser working, and browsers are notorious for flaking out (it's 2019 people, let's get command over web pages already)