
Troubleshooting: Error: Sync after switch version failed (from 8.x to 9.x)


Problem:

Upgrading UCCX and performing a switch version from 8.x to 9.x.

The switch version fails with the message: Error: Sync after switch version failed.

Logs:

Collect the install/upgrade logs and look at the point where the process failed. The most important log files here are uccx-install.log and system-history.log.
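As an illustration only (not an official Cisco tool), a short script like the following could scan a collected copy of uccx-install.log for the failure markers quoted later in this article; the marker strings are taken directly from the log excerpts below.

```python
# Illustrative sketch: scan a collected copy of uccx-install.log for the
# failure markers quoted in this article. The marker strings come from the
# log excerpts below; everything else here is an assumption for demo purposes.
FAILURE_MARKERS = [
    "oninit: Fatal error in shared memory initialization",
    "WARNING: server initialization failed",
    "1809: Server rejected the connection.",
]

def find_failures(log_text):
    """Return (line_number, line) pairs for lines containing a known marker."""
    hits = []
    for num, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in FAILURE_MARKERS):
            hits.append((num, line.strip()))
    return hits

if __name__ == "__main__":
    sample = (
        "Initializing DBSPACETEMP list...succeeded\n"
        "oninit: Fatal error in shared memory initialization\n"
        "WARNING: server initialization failed, or possibly timed out\n"
    )
    for num, line in find_failures(sample):
        print(f"line {num}: {line}")
```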

Issue:

In 9.x, there is a minor change in the Informix IDS version, which leads to a couple of problems during the switch version. The above error can also occur when the system is rebooted in the middle of a switch version; an interrupted switch version cannot be recovered, can cause serious database issues, and may require the system to be rebuilt. Rebooting during a switch version is not supported by Cisco.

Root Cause:

There are three possible causes for this error:

1. CSCue38031 - This occurs when the database on the inactive partition (9.x) is started after moving the dbspaces, which leads to heuristic transactions.

Workaround: Contact TAC to remove these transactions from the DB.

uccx-install.log:

Validating chunks...succeeded

Initialize Async Log Flusher...succeeded

Starting B-tree Scanner...succeeded

Initializing DBSPACETEMP list...succeeded

oninit: Fatal error in shared memory initialization

WARNING: server initialization failed, or possibly timed out

shared memory not initialized for INFORMIXSERVER 'sdipccxprd01_uccx'

online.uccx.log (from partB):

04:48:07  Open transaction detected when changing log versions.

04:48:07  Cannot Rollforward from Checkpoint.

04:48:07  oninit: Fatal error in shared memory initialization

2. Timeout when starting IDS on partB. This is documented in CSCuf48469.

uccx-install.log

Initializing DBSPACETEMP list...succeeded

Checking database partition index...succeeded

Initializing dataskip structure...succeeded

Checking for temporary tables to drop...succeeded

Forking onmode_mon thread...succeeded

Creating periodic thread...succeeded

Verbose output complete: mode = 1

WARNING: server initialization failed, or possibly timed out (if -w was used).

Check the message log, online.log, for errors.

1809: Server rejected the connection.

online.uccx.log (from partB)

listener-thread: err = -27010: oserr = 0: errstr = : Only an administrative user or informix can connect in single user mode.

Workaround: Contact TAC to apply the patch.

3. Reboot in the middle of the switch version - The switch version process takes a long time, depending on the size of the database. Size here means not only the configuration but the historical DB as well (CDS, HDS, RDS, ADS).

Rebooting the server in the middle of the switch version can corrupt the DB. The system goes for an automatic reboot on its own after the switch version completes, so wait for it to finish.

Workaround: The fastest method is to rebuild the server and restore from a DRS backup.

I have seen it take anywhere from 15 minutes to as long as 8 hours!

Comments
Beginner

I tried to upgrade a UCCX HA cluster from 8.0(2)SU4 to 9.0(2)SU1 and got "Error: Sync after switch version failed." when doing the switch version on the secondary/subscriber node. The first node completed fine.

It seems that the errors in the install log for this case do not match any of the possible causes you mention in your post above. Below are the errors I got during this process:

As a side note, I upgraded this same system in a test environment last week (isolated VLAN, same database, just about 4 days older than the one in production, with no calls received during the upgrade), and the switch version completed successfully on both nodes. Should I attempt this in production again? Could the fact that the production system was receiving a few calls on the IVR during the upgrade be a possible cause of the failure?

Would you please help me determine how I can fix this issue? Please let me know if you need the complete install log and the online.uccx.log files.

I have a TAC case open, but I haven't gotten a response in three days now on what happened and the best way forward.

Thanks,

*****************************************************************

.

.

.

12100000 Row(s) loaded so far to table contactcalldetail.

12200000 Row(s) loaded so far to table contactcalldetail.

Table contactcalldetail had 12206558 row(s) loaded into it.

Data Migration done ..

------------ Done ----------------

Sun Jul 14 01:17:56 CDT 2013 :: /opt/cisco/uccx/sql/rds_delta_802_to_803.sql is not available

Sun Jul 14 01:17:56 CDT 2013 :: /opt/cisco/uccx/sql/fds_delta_802_to_803.sql is not available

Sun Jul 14 01:17:56 CDT 2013 :: running file /opt/cisco/uccx/sql/cra_delta_803_to_804.sql

Applying the migration script: /opt/cisco/uccx/sql/cra_delta_803_to_804.sql

Sourcing IDS environment variables

Database selected for migration: db_cra

Removing the older migration command file: cmd_cra_delta_803_to_804_db_cra_20130714

old command file does not exist

alter table contactcalldetail add(dialinglistid int);

SQL here --- output to pipe cat without headings select distinct cdrserver from contactcalldetail

The table contactcalldetail has the replication set on it

SQL command used for unloading the data: unload to "/tmp/contactcalldetail.dat" delimiter ";"  select * from contactcalldetail

Unload of the data from the table : contactcalldetail has succeeded

Truncating the table: contactcalldetail

Truncating the table contactcalldetail failed

Sun Jul 14 01:26:49 CDT 2013 :: Error updating schema

Sun Jul 14 01:26:49 CDT 2013 :: Stopping DB

Sun Jul 14 01:26:49 CDT 2013 :: ------Stopping uccx database-------

Sun Jul 14 01:27:03 CDT 2013 :: Waiting for port to be released

(No info could be read for "-p": geteuid()=0 but you should be root.)

(No info could be read for "-p": geteuid()=0 but you should be root.)

tcp        0      0 10.64.8.26:42652            10.64.8.26:1504             TIME_WAIT   -                  

tcp        0      0 10.64.8.26:43158            10.64.8.26:1504             TIME_WAIT   -                  

tcp        0      0 10.64.8.26:1504             10.64.8.26:43142            TIME_WAIT   -                  

Sun Jul 14 01:27:03 CDT 2013 :: Try 0: waiting for 10 seconds

(No info could be read for "-p": geteuid()=0 but you should be root.)

(No info could be read for "-p": geteuid()=0 but you should be root.)

tcp        0      0 10.64.8.26:42652            10.64.8.26:1504             TIME_WAIT   -                  

tcp        0      0 10.64.8.26:43158            10.64.8.26:1504             TIME_WAIT   -                  

tcp        0      0 10.64.8.26:1504             10.64.8.26:43142            TIME_WAIT   -                  

Sun Jul 14 01:27:14 CDT 2013 :: Try 1: waiting for 10 seconds

(No info could be read for "-p": geteuid()=0 but you should be root.)

(No info could be read for "-p": geteuid()=0 but you should be root.)

tcp        0      0 10.64.8.26:42652            10.64.8.26:1504             TIME_WAIT   -                  

tcp        0      0 10.64.8.26:43158            10.64.8.26:1504             TIME_WAIT   -                  

tcp        0      0 10.64.8.26:1504             10.64.8.26:43142            TIME_WAIT   -                  

Sun Jul 14 01:27:24 CDT 2013 :: Try 2: waiting for 10 seconds

(No info could be read for "-p": geteuid()=0 but you should be root.)

(No info could be read for "-p": geteuid()=0 but you should be root.)

tcp        0      0 10.64.8.26:42652            10.64.8.26:1504             TIME_WAIT   -                  

tcp        0      0 10.64.8.26:43158            10.64.8.26:1504             TIME_WAIT   -                  

tcp        0      0 10.64.8.26:1504             10.64.8.26:43142            TIME_WAIT   -                  

Sun Jul 14 01:27:34 CDT 2013 :: Try 3: waiting for 10 seconds

(No info could be read for "-p": geteuid()=0 but you should be root.)

(No info could be read for "-p": geteuid()=0 but you should be root.)

tcp        0      0 10.64.8.26:42652            10.64.8.26:1504             TIME_WAIT   -                  

tcp        0      0 10.64.8.26:43158            10.64.8.26:1504             TIME_WAIT   -                  

Sun Jul 14 01:27:44 CDT 2013 :: Try 4: waiting for 10 seconds

(No info could be read for "-p": geteuid()=0 but you should be root.)

(No info could be read for "-p": geteuid()=0 but you should be root.)

Sun Jul 14 01:27:54 CDT 2013 :: The port is released

Sun Jul 14 01:27:54 CDT 2013 :: ------UCCX database stopped--------

Sun Jul 14 01:27:54 CDT 2013 :: DB upgrade script failed

Sun Jul 14 01:27:54 CDT 2013 :: Restoring repicaition status of the databse

Unified CCX Database is currently not on-line.

The requested operation will not be performed.

Sun Jul 14 01:27:55 CDT 2013 :: ./uccx_sv_db.sh 8.0.2.11005-20 9.0.2.11001-24 8.0.2.11005 rpmdb: Program version 4.2 doesn't match environment version error: db4 error(22) from dbenv->open: Invalid argument error: cannot open Packages index using db3 - Invalid argument (22) error: cannot open Packages database in /partB/var/lib/rpm package UCCX02_lib is not installed /var/log/active/platform/log/cli.log

Script uccx_sv_db.sh failed with exit code 255.

Sun Jul 14 01:27:55 CDT 2013 :: Staring command: /partB/opt/cisco/uccx/bin/uccx_db_l2_rollback.sh installFailureRollBack 8.0.2.11005-20 9.0.2.11001-24

Sun Jul 14 01:27:55 CDT 2013 :: In L2 upgrade DB rollback script running command installFailureRollBack

Sun Jul 14 01:27:55 CDT 2013 :: Setting environment variables after chroot

Sun Jul 14 01:27:55 CDT 2013 :: INFORMIXSERVER=ipccserv2_uccx

Sun Jul 14 01:27:55 CDT 2013 :: ------Stopping uccx database-------

shared memory not initialized for INFORMIXSERVER 'ipccserv2_uccx'

Sun Jul 14 01:27:55 CDT 2013 :: Waiting for port to be released

Sun Jul 14 01:27:55 CDT 2013 :: The port is released

Sun Jul 14 01:27:55 CDT 2013 :: ------UCCX database stopped--------

Sun Jul 14 01:27:55 CDT 2013 :: Attempting Restore of DB from backup.

Sun Jul 14 01:27:55 CDT 2013 :: comparing uccx versions

Sun Jul 14 01:27:55 CDT 2013 :: result of comparing version = 1

Sun Jul 14 01:27:55 CDT 2013 :: Restoring backup of DB

Sun Jul 14 01:27:55 CDT 2013 :: Restoring from upgrade backup

Sun Jul 14 01:36:34 CDT 2013 :: Stopping DB after performing the DB restore from backup

Sun Jul 14 01:36:34 CDT 2013 :: ------Stopping uccx database-------

Sun Jul 14 01:36:38 CDT 2013 :: Waiting for port to be released

tcp        0      0 10.64.8.26:43253            10.64.8.26:1504             TIME_WAIT   -                  

Sun Jul 14 01:36:38 CDT 2013 :: Try 0: waiting for 10 seconds

tcp        0      0 10.64.8.26:43253            10.64.8.26:1504             TIME_WAIT   -                  

Sun Jul 14 01:36:48 CDT 2013 :: Try 1: waiting for 10 seconds

tcp        0      0 10.64.8.26:43253            10.64.8.26:1504             TIME_WAIT   -                  

Sun Jul 14 01:36:58 CDT 2013 :: Try 2: waiting for 10 seconds

tcp        0      0 10.64.8.26:43253            10.64.8.26:1504             TIME_WAIT   -                  

Sun Jul 14 01:37:09 CDT 2013 :: Try 3: waiting for 10 seconds

tcp        0      0 10.64.8.26:43253            10.64.8.26:1504             TIME_WAIT   -                  

Sun Jul 14 01:37:19 CDT 2013 :: Try 4: waiting for 10 seconds

Sun Jul 14 01:37:29 CDT 2013 :: The port is released

Sun Jul 14 01:37:29 CDT 2013 :: ------UCCX database stopped--------

Sun Jul 14 01:37:29 CDT 2013 :: Staring command: /partB/opt/cisco/uccx/bin/uccx_db_l2_rollback.sh cleanupTempFiles 8.0.2.11005-20 9.0.2.11001-24

Sun Jul 14 01:37:29 CDT 2013 :: In L2 upgrade DB rollback script running command cleanupTempFiles

Sun Jul 14 01:37:29 CDT 2013 :: Cleaning up temporary files

Sun Jul 14 01:37:29 CDT 2013 :: comparing uccx versions

Sun Jul 14 01:37:29 CDT 2013 :: result of comparing version = 1

*********************************************************************

Cisco Employee

Hi Sanotto,

Looking at the log snippets that have been posted, I can see that the contactcalldetail table is huge! The switch version would have taken a couple of hours before it failed:

12200000 Row(s)

During the switch version, a DB load utility runs that unloads the contents of the DB tables to .dat files and then truncates the tables:

unload to "/tmp/contactcalldetail.dat" delimiter ";"  select * from contactcalldetail;

After this, the tables are truncated using:

truncate table contactcalldetail;

Now, while this is being done, if there are active calls in the system, records will be written to contactcalldetail, which can cause the truncate to fail.
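The effect of a concurrent writer can be sketched with SQLite, used here purely as a stand-in for Informix (an assumption for illustration; the table name is borrowed from the logs above, and DELETE stands in for TRUNCATE): a connection holding an open write transaction on the table blocks the cleanup statement, just as active calls writing records can block the upgrade script's truncate.

```python
import os
import sqlite3
import tempfile

# Sketch under assumptions: SQLite stands in for Informix purely to show how
# an in-flight write on a table can block a table-clearing statement.
def truncate_blocked_by_active_writer():
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    setup = sqlite3.connect(path)
    setup.execute("CREATE TABLE contactcalldetail (sessionid INTEGER)")
    setup.commit()
    setup.close()

    # Simulates an active call holding an open write transaction.
    writer = sqlite3.connect(path, isolation_level=None)
    writer.execute("BEGIN IMMEDIATE")
    writer.execute("INSERT INTO contactcalldetail VALUES (1)")

    # Simulates the upgrade script; DELETE stands in for TRUNCATE here.
    cleaner = sqlite3.connect(path, timeout=0.1)
    try:
        cleaner.execute("DELETE FROM contactcalldetail")
        result = "cleanup succeeded"
    except sqlite3.OperationalError as exc:   # "database is locked"
        result = f"cleanup failed: {exc}"
    finally:
        writer.rollback()
        writer.close()
        cleaner.close()
    return result

print(truncate_blocked_by_active_writer())
```

With the writer's transaction rolled back (or committed) first, the same DELETE goes through, which is the point of taking a call-free downtime window.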

This (along with several other reasons) is why we ask for downtime during the switch version. I would advise you to try again with downtime, and if it still fails, we will need to look at it from the TAC case perspective.

Regards,

Arundeep

Beginner

Thanks for your response Arundeep,

The switch version did indeed take about 2.5 hours per node. Also, there was downtime; it's just that the system received a few calls on the IVR while the upgrades were being performed (no agents, supervisors, or admins active, etc.).

A few questions:

1.- Would you recommend purging the db_hist database, even though the free space percentage on this database is 65%? How small does the database have to be for the switch version to complete successfully?

admin:show uccx dbserver disk

SNO. DATABASE NAME      TOTAL SIZE (MB) USED SIZE (MB) FREE SIZE (MB) PERCENT FREE

---- ------------------ --------------- -------------- -------------- ------------

1    rootdbs                     358.4           58.4          300.0          83%

2    log_dbs                     317.4          307.3           10.1           3%

3    db_cra                      512.0           34.7          477.3          93%

4    db_hist                   34508.6        11875.5        22633.2          65%

5    db_cra_repository            10.2            3.8            6.4          62%

6    db_frascal                  512.0          153.4          358.6          70%

7    temp_uccx                  1572.9            0.3         1572.6          99%

8    uccx_sbspace               3145.7         2988.1          157.6           5%

9    uccx_er                     204.8            3.7          201.1          98%

10   uccx_ersb                  1572.9         1537.2           35.7           2%

11   sadmin                      102.4            3.9           98.5          96%
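For what it's worth, the `show uccx dbserver disk` output above can be post-processed to flag dbspaces that are nearly full, such as log_dbs (3%) and uccx_sbspace (5%) in this table. A rough sketch follows; the 10% threshold is my own assumption, not a Cisco recommendation.

```python
# Rough sketch: flag dbspaces from `show uccx dbserver disk` output whose
# PERCENT FREE falls below a threshold. The 10% default is an assumption.
def low_free_dbspaces(output, threshold=10):
    flagged = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows end with a percentage like "65%"; headers/rules do not.
        if parts and parts[-1].endswith("%") and parts[-1][:-1].isdigit():
            name, pct = parts[1], int(parts[-1][:-1])
            if pct < threshold:
                flagged.append((name, pct))
    return flagged

sample = """\
1    rootdbs                     358.4           58.4          300.0          83%
2    log_dbs                     317.4          307.3           10.1           3%
8    uccx_sbspace               3145.7         2988.1          157.6           5%
"""
print(low_free_dbspaces(sample))  # flags log_dbs and uccx_sbspace
```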

2.- By downtime, do you mean 0 calls going through the system during an upgrade? If so, I think you need to be more specific about this in the documentation.

Also, as a side comment: I did open a TAC case for this issue (I engaged everyone possible, duty managers, BU engineers, etc.), and still, three days later, I don't have a response; I'm actually receiving faster feedback from you. This is unacceptable in a production environment with maybe a 12-hour maintenance window, so I just decided to roll back after maybe 8 hours of work. I need to figure out a plan before thinking of opening a TAC case. Getting to the proper resources takes, in the best-case scenario, more than 4-5 hours, and under normal circumstances, days.

thanks,

Beginner

Arun,

We tried the upgrade one more time in the production environment, with no calls during the whole upgrade sequence, and got exactly the same results. We have tried this upgrade on the same system during the daytime (in a test environment) successfully under the same conditions (0 calls), and the upgrade went through without any issues.

For the production environment upgrade, the only difference I can see is the time we performed the upgrades and switch versions, as we are doing the upgrades around midnight.

My question for you is: are there any dependencies of the upgrades or switch versions on the day, i.e., do the upgrade and switch version for a node need to be completed on the same day?

To work around this issue, we gave up troubleshooting the switch version failure on the second node and rebuilt it from scratch.

Thanks,

Otto

Cisco Employee

Hi Otto,

There is no requirement regarding the time of day for the switch version. The SV shell script does not look at the system clock, so there is no such dependency.

I can think of scheduled backups running at midnight, which may be the cause of the failure. Can you confirm whether backups were running at that time?

Also, could I have a reference to the TAC case that was opened for this? You can unicast it to me for privacy.

Regards,

Arundeep

Beginner

We checked that scheduled backups were not interfering with the upgrade. They do not run until 3:00am Central, and the two times we faced issues with the switch version were between 1:00 and 2:00pm Central. Also, during the second upgrade the backup server was not reachable (the first time it was), with the same results both times.

I will send you the TAC case number.

Thanks,

Otto

Beginner

Hi Arundeep,

Were you able to take a look at this TAC case?

Thanks for your help.

Otto
