aqrxz
Beginner

MSO Upgrade - Upgrade script abort after Node1 Kernel update

 

Hello folks,

I was testing an upgrade of an MSO cluster (VM deployment) from version 3.0(2j) to 3.1(1h). All 3 VMs run on the same ESX host and are connected through the same subnet. I followed the upgrade procedure from this document and ran the upgrade script from my laptop, which is on the same subnet as all 3 VM nodes.

The script ran well and did the kernel update on MSO Node2 and Node3 first. Each node rebooted once after completing its kernel update, and only then did the script proceed to the next one.

For the kernel update on Node1, the script waits 120 seconds for the node to come up, but the VM took longer than that to boot. The script printed a connection error for the node and aborted the upgrade before the network adapter came back into service. I also noted that the MSC services took about 45 minutes to 1 hour to come up after the reboot, which is longer than usual, while the upgrade script kept running as if it were still waiting for the MSO node.

 

I've retried the same process several times and hit the same problem every time. I also tried upgrading with a simple configuration.

Every time the upgrade fails, I need to restore the snapshot on all nodes before trying the upgrade procedure again.

 

Could you advise whether anyone in the forums has faced an issue like this when upgrading an MSO cluster?
Am I missing something in the upgrade procedure?

 

PS D:\Downloads\mso-upgrade\tools-msc-3.1.1h> python .\msc_vm_util.py -c .\msc_cfg_upgrade.yml
Please enter MSC node1's IP address :
x.x.x.1
Please enter MSC node2's IP address :
x.x.x.2
Please enter MSC node3's IP address :
x.x.x.3
node1 IP = x.x.x.1
node2 IP = x.x.x.2
node3 IP = x.x.x.3
Please enter node1 password:
3.0.2j
Upgrade path check successful!
tools dir validation successful!
Please enter node2 password (Press Enter to use same password as node1 password):
Please enter node3 password (Press Enter to use same password as node1 password):
Extracting tools files from upgrade image file msc-3.1.1h.tar.gz
Copying msc_setup.py file to node1
Copying Node.py file to node1
Copying msc_lib.py file to node1
///// skipped ///
\Mar 30 2021 16:16:39.573 INFO: Package polkit is installed to latest..
|Mar 30 2021 16:16:40.301 INFO: Package libssh2 is installed to latest..
\Mar 30 2021 16:16:40.951 INFO: Package vim-minimal is installed to latest..
Mar 30 2021 16:16:40.962 WARNING: kernel update detected. System reboot is needed.
Mar 30 2021 16:16:40.971 INFO: ############################################################.
Mar 30 2021 16:16:40.976 INFO: # #.
Mar 30 2021 16:16:40.981 INFO: # Some packages were updated that require system reboot. #.
Mar 30 2021 16:16:40.986 INFO: # System will automatically reboot in 30 seconds #.
Mar 30 2021 16:16:40.990 INFO: # Press Ctrl + C to abort automatic reboot. #.
Mar 30 2021 16:16:40.998 INFO: # #.
Mar 30 2021 16:16:41.003 INFO: ############################################################.
ACI-Multiservice nodes rebooted after updating kernel
Waiting 120 seconds for Nodes to come up .../Checking if services have converged
-Error: Connection to node 'x.x.x.1' timed out. Please check if node is accessible. Aborting!
/ >>>> it stuck rotating here for a very long time

 

9 REPLIES 9
aqrxz
Beginner

Update: I waited for the MSO services to come up after the kernel update and ran the script again. The kernel update completed and the upgrade started. However, there was an error when the script tried to run the MSO upgrade from the master node.

-Mar 30 2021 17:34:05.985 INFO:   Package libssh2 is installed to latest..
|Mar 30 2021 17:34:06.635 INFO:   Package vim-minimal is installed to latest..
Mar 30 2021 17:34:06.648 INFO: *** MSC Upgrade begins ! ***.
Mar 30 2021 17:34:06.653 INFO:   get current running msc version.
\msc_setup: Error in executing cmd: cd /opt/cisco/msc/builds/msc_3.1.1h/upgrade/; ./*upgrade.sh
Error in executing msc_setup script in master node:
PS D:\Downloads\mso-upgrade\tools-msc-3.1.1h>

Do you see the upgrade.sh script file in the folder of your MSO node?

/opt/cisco/msc/builds/msc_3.1.1h/upgrade/

 Robert

Yes, the upgrade script was there

[root@mso-node-1 ~]# ls -l /opt/cisco/msc/builds/msc_3.1.1h/upgrade/
total 8
-rw-r--r--. 1 root root 1439 Feb 26 10:18 readme.txt
-rwxr-xr-x. 1 root root 3974 Feb 26 10:18 upgrade.sh
[root@mso-node-1 ~]# 

 

What does your /var/log/msc_upgrade.log show?

Robert

It seems like it just stopped after verifying the packages. I re-ran it a couple of times and it stopped at the 'vim-minimal' package each time.

Full logs are attached.

Mar 30 2021 17:31:57.200 INFO: Verify and update packages for system vulnerabilities resolution.
Mar 30 2021 17:31:57.208 INFO:   Checking for any 32-Bit packages on the system.
Mar 30 2021 17:31:58.833 INFO:   Checking if any Unwanted packages exist on the system.
Mar 30 2021 17:32:00.367 WARNING: Cleaning up existing rpms directory '/opt/cisco/msc/rpms/' and creating a fresh one..
Mar 30 2021 17:32:03.919 INFO: Successfully created rpm repository at '/opt/cisco/msc/rpms/'.
Mar 30 2021 17:32:03.926 INFO: Executing 'yum clean all'.
Mar 30 2021 17:32:04.916 INFO:   Package kernel is installed to latest..
Mar 30 2021 17:32:05.517 INFO:   Package accountsservice is installed to latest..
//skipped
Mar 30 2021 17:33:02.630 INFO:   Package yum-utils is installed to latest..
Mar 30 2021 17:33:03.263 INFO:   Package python2-pip is installed to latest..
Mar 30 2021 17:33:03.965 INFO:   Package polkit is installed to latest..
Mar 30 2021 17:33:04.635 INFO:   Package libssh2 is installed to latest..
Mar 30 2021 17:33:05.382 INFO:   Package vim-minimal is installed to latest..
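
When comparing several attempts, it can help to pull out the last timestamped entry automatically instead of eyeballing the log. The following is only a minimal sketch: it assumes the line layout shown in the excerpt above (`Mon DD YYYY HH:MM:SS.mmm LEVEL: message`, optionally preceded by a console spinner character), and the names `parse_log` and `last_entry` are hypothetical helpers, not part of the MSO tooling.

```python
import re
from datetime import datetime

# Assumed layout, based on the excerpt in this thread:
#   "Mar 30 2021 17:33:05.382 INFO:   Package vim-minimal is installed to latest.."
# A leading spinner character such as "\", "|", "/" or "-" is tolerated.
LINE_RE = re.compile(
    r"^\W*(\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}) (\w+): (.*)$"
)

def parse_log(lines):
    """Return a list of (timestamp, level, message) for lines that match."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%b %d %Y %H:%M:%S.%f")
            entries.append((stamp, match.group(2), match.group(3).strip()))
    return entries

def last_entry(entries):
    """Return the final parsed entry, or None if nothing matched."""
    return entries[-1] if entries else None

sample = [
    "Mar 30 2021 17:33:03.965 INFO:   Package polkit is installed to latest..",
    "|Mar 30 2021 17:33:04.635 INFO:   Package libssh2 is installed to latest..",
    "Mar 30 2021 17:33:05.382 INFO:   Package vim-minimal is installed to latest..",
]
entries = parse_log(sample)
print(last_entry(entries))
```

To run it against a real file, pass `open("/var/log/msc_upgrade.log")` to `parse_log` on the node. Lines that do not match the assumed layout are skipped silently.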

Can you check the free disk space on Node-1?

df -kh

You need at least 20G free space.
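
If you want to script that check across nodes, a small Python sketch along these lines might do. It assumes the 20G figure applies to the filesystem holding the path you pass in (which mount point matters on an MSO node is not stated here), and `has_enough_free_space` is a hypothetical helper name.

```python
import shutil

REQUIRED_FREE_GIB = 20  # threshold quoted in this reply

def has_enough_free_space(path="/", required_gib=REQUIRED_FREE_GIB):
    """Return (ok, free_gib) for the filesystem that contains `path`."""
    free_bytes = shutil.disk_usage(path).free
    free_gib = free_bytes / (1024 ** 3)
    return free_gib >= required_gib, free_gib

if __name__ == "__main__":
    ok, free_gib = has_enough_free_space("/")
    status = "OK" if ok else f"below {REQUIRED_FREE_GIB} GiB"
    print(f"/ has {free_gib:.1f} GiB free: {status}")
```

This only reports free space; it does not clean anything up.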

 

If that looks ok, you can try to ensure the MSC version is reset back by running:

save_msc_version.sh <original MSC version>
Example: # /opt/cisco/msc/builds/msc_3.0.3j/bin/save_msc_version.sh 3.0.3j

 Then re-attempt the upgrade process.

Robert


I checked: centos-root and the docker overlay have ~29G available, so that should be fine.

Thanks for your suggestion anyway. I will reattempt it in the office tomorrow; otherwise I may need to raise a TAC case to find a solution.

 

In addition, there is a high chance we will hit the first problem on the production cluster, since it's deployed in a separate DC and the node reboot time will probably exceed 120 seconds every time.

I think this 120-second value is hard-coded in the upgrade script, and it is a bit short as a wait time for the OS to bring up its network interfaces and services. Is there any way we can avoid this?

ACI-Multiservice nodes rebooted after updating kernel
Waiting 120 seconds for Nodes to come up .../Checking if services have converged
-Error: Connection to node 'x.x.x.1' timed out. Please check if node is accessible. Aborting!
/ >>>> it stuck rotating here for a very long time
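
One workaround that does not touch the upgrade script is to poll the node yourself after the reboot and only re-run the script once it answers. Below is a minimal sketch of that idea. It assumes TCP port 22 (SSH) is a usable reachability signal, the 30-minute default is arbitrary, and `wait_for_port` is a hypothetical helper. A reachable port does not mean the MSC services have converged, so it only covers the network-adapter part of the wait.

```python
import socket
import time

def wait_for_port(host, port=22, timeout_s=1800, interval_s=10,
                  connect_timeout_s=5):
    """Poll host:port until a TCP connection succeeds or timeout_s elapses.

    Returns True as soon as one connection attempt succeeds, else False.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port),
                                          timeout=connect_timeout_s):
                return True
        except OSError:
            pass  # not reachable yet (refused, timed out, no route, ...)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

if __name__ == "__main__":
    node = "x.x.x.1"  # placeholder, replace with the node's real address
    print("reachable" if wait_for_port(node) else "still unreachable")
```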

 

Thanks a lot!! It worked.

 

I'm curious whether the upgrade failed the first time because the MSC version hadn't been saved on the cluster. Does your command then mark the version as the original one?

 

Also, is the only way to avoid the script failure after the kernel update to wait until the services come up and then run the upgrade script again?

Please correct me if I'm wrong

Yes, that's a possible explanation. If the upgrade script doesn't complete successfully, the msc-version may not be properly set on one or more of the nodes, which will cause any re-attempts to upgrade to fail. The best advice I can offer is to be patient with these scripts and allow them to complete, or to time out and fail on their own (which they should). Also make sure you run the script from a local host close to your MSO nodes (avoid doing it remotely over VPN, for example).

Robert