Re: vManage (Manager) upgrade failing - cannot activate new image

UnspokenDrop7 · ‎05-13-2024

Background:
I have a vManage (Manager), a vBond (Validator) and a couple of edge routers (1100-series running 17.12.x and 17.13.x) set up in the lab for testing SD-Routing ("SD-WAN lite"). A vSmart (Controller) is not necessary for SD-Routing. Everything is up and running, connected and working as expected.
In late April, Cisco released a new version for the controllers, 20.14.x (https://software.cisco.com/download/home/286320995/type/286321394/release/20.14.1). This new version comes with some new features compared to 20.13.x, the version I am currently testing in the lab. So, naturally, I wanted to explore the new features and started the upgrade. I followed the upgrade procedure in this guide: https://www.cisco.com/c/en/us/support/docs/wan/dpt/220424-upgrade-sd-wan-controllers-with-the-use.html. Using the same guide, I successfully upgraded the controllers from 20.12.x to 20.13.x just a few weeks earlier.

The problem:
I have successfully installed the new vManage version, but in the next step, when I initiate the activation of the new version, it fails with the following message in the GUI log:
[13-May-2024 14:43:19 UTC] Checking the configuration-dbStatus it may take up to 40 mins or longer
[13-May-2024 14:43:19 UTC] Change Partition action submitted for execution
[13-May-2024 14:43:20 UTC] Executing device action Change Partition
[13-May-2024 14:43:20 UTC] Checking available software image on the device
[13-May-2024 14:43:20 UTC] Failure in triggering upgrade coordinator service, please retry activate.

Actions taken:
I have retried the activation many times, and even rebooted vManage and retried, but I still get the same message.
I have not found any clues online about this error message or a possible solution.

Future actions:
Maybe I need to do some more extensive debugging, to get the details about what exactly is not working and why?

Hopefully, someone here can point me in the right direction! Any help is much appreciated!

Rajeev Sharma · ‎05-14-2024

As you pointed out, this needs some additional debugging, so for starters:

1. Please run "request nms all status" in the CLI and check whether the coordinator service is running.

2. Log in to vshell and tail the server log: tail -f /var/log/nms/vmanage-server.log

You may find some additional clue about the problem.
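
If the status output is long, step 1 can be checked mechanically after saving the output to a text file. Below is a minimal Python sketch, not a Cisco tool. It assumes each service is printed as a name line followed by "Enabled:" and "Status:" lines; the function names and the embedded sample are illustrative only.

```python
def parse_nms_status(text):
    """Map each service name to its enabled/running flags.

    Assumes every service is reported as a name line followed by
    'Enabled: ...' and 'Status: ...' lines; other lines are ignored.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    services = {}
    for i, line in enumerate(lines):
        if i == 0 or not line.startswith("Enabled:"):
            continue
        name = lines[i - 1]  # the line just above 'Enabled:' is the service name
        enabled = line.split(":", 1)[1].strip().lower() == "true"
        status = ""
        if i + 1 < len(lines) and lines[i + 1].startswith("Status:"):
            status = lines[i + 1].split(":", 1)[1].strip()
        # 'not running' does not start with 'running', so it maps to False
        services[name] = {"enabled": enabled, "running": status.startswith("running")}
    return services


def enabled_but_stopped(services):
    """Names of services that are enabled yet not reported as running."""
    return [name for name, s in services.items() if s["enabled"] and not s["running"]]


# Illustrative sample in the assumed format.
sample = """NMS coordination server
Enabled: true
Status: running PID:44911 for 757790s
NMS SDAVC server
Enabled: false
Status: not running
"""
print(enabled_but_stopped(parse_nms_status(sample)))
```

An empty list means every enabled service reports as running, so the problem is more likely to show up in the server log than in the service state.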

HTH.

UnspokenDrop7 · ‎05-16-2024

Hi @Rajeev Sharma

First, thanks for replying!

1. The output of "request nms all status" looks good.

NMS service proxy
Enabled: true
Status: running PID:43072 for 757798s
NMS service proxy rate limit
Enabled: true
Status: running PID:47079 for 757693s
NMS application server
Enabled: true
Status: running PID:66072 for 756851s
NMS configuration database
Enabled: true
Status: running PID:44639 for 757799s
Checking configuration-db metrics generation status...
Native metrics status: ENABLED
Server-load metrics status: ENABLED
NMS coordination server
Enabled: true
Status: running PID:44911 for 757790s
NMS messaging server
Enabled: true
Status: running PID:47232 for 757684s
NMS statistics database
Enabled: true
Status: running PID:45427 for 757790s
NMS data collection agent
Enabled: true
Status: running PID:43546 for 757793s
NMS CloudAgent v2
Enabled: true
Status: running PID:47157 for 757689s
NMS cloud agent
Enabled: true
Status: running PID:42718 for 757823s
NMS SDAVC server
Enabled: false
Status: not running
NMS SDAVC gateway
Enabled: false
Status: not running
vManage Device Data Collector
Enabled: true
Status: running PID:51605 for 757477s
NMS OLAP database
Enabled: true
Status: running PID:45352 for 757797s
vManage Reporting
Enabled: true
Status: running PID:47080 for 757692s

2. The log output (tail -f /var/log/nms/vmanage-server.log) during the activation attempt included the errors below. I am not sure the last one is related to this issue; I guess the SSE credentials relate to Security Service Edge, which I do not use in the lab.

16-May-2024 07:19:40,223 UTC ERROR [] [] [ChangePartitionActionProcessor] (device-action-change_partition-5) || Failed to process change partition for device 10.171.247.220
com.viptela.vmanage.server.deviceaction.DeviceActionException: Failure in triggering upgrade coordinator service, please retry activate.
at com.viptela.vmanage.server.deviceaction.processor.software.ChangePartitionActionProcessor$ChangePartitionlActionWorker.executeUpgradeCoordinatorWorkflow(ChangePartitionActionProcessor.java:631) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.processor.software.ChangePartitionActionProcessor$ChangePartitionlActionWorker.startMaintenanceDeviceActions(ChangePartitionActionProcessor.java:347) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.DefaultActionWorker.startDeviceAction(DefaultActionWorker.java:211) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.scheduler.AbstractSchedulerWorker.call(AbstractSchedulerWorker.java:100) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.scheduler.AbstractSchedulerWorker.call(AbstractSchedulerWorker.java:57) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.RuntimeException: Failed : HTTP error code : 409
at com.viptela.vmanage.server.deviceaction.processor.software.UpgradeCoordinatorServiceManager.enableUCService(UpgradeCoordinatorServiceManager.java:171) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.processor.software.UpgradeCoordinatorServiceManager.startService(UpgradeCoordinatorServiceManager.java:193) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.processor.software.ChangePartitionActionProcessor$ChangePartitionlActionWorker.executeUpgradeCoordinatorWorkflow(ChangePartitionActionProcessor.java:624) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
... 8 more


16-May-2024 07:19:40,295 UTC INFO [] [] [NetConfClient] (device-action-change_partition-5) || processing other exception {}
com.viptela.vmanage.server.deviceaction.DeviceActionException: Failure in triggering upgrade coordinator service, please retry activate.
at com.viptela.vmanage.server.deviceaction.processor.software.ChangePartitionActionProcessor$ChangePartitionlActionWorker.executeUpgradeCoordinatorWorkflow(ChangePartitionActionProcessor.java:631) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.processor.software.ChangePartitionActionProcessor$ChangePartitionlActionWorker.startMaintenanceDeviceActions(ChangePartitionActionProcessor.java:347) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.DefaultActionWorker.startDeviceAction(DefaultActionWorker.java:211) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.scheduler.AbstractSchedulerWorker.call(AbstractSchedulerWorker.java:100) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.scheduler.AbstractSchedulerWorker.call(AbstractSchedulerWorker.java:57) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.RuntimeException: Failed : HTTP error code : 409
at com.viptela.vmanage.server.deviceaction.processor.software.UpgradeCoordinatorServiceManager.enableUCService(UpgradeCoordinatorServiceManager.java:171) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.processor.software.UpgradeCoordinatorServiceManager.startService(UpgradeCoordinatorServiceManager.java:193) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.deviceaction.processor.software.ChangePartitionActionProcessor$ChangePartitionlActionWorker.executeUpgradeCoordinatorWorkflow(ChangePartitionActionProcessor.java:624) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
... 8 more

16-May-2024 07:19:54,858 UTC ERROR [] [] [SigDao] (ScheduleManager-3) || SSE credentials are missing. Please configure it in Admin->Settings
com.viptela.vmanage.server.sse.cisco.CiscoSseException: SSE credentials are missing. Please configure it in Admin->Settings
at com.viptela.vmanage.server.sse.SseUtils.getAllCiscoSseTunnelStatus(SseUtils.java:456) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.device.sig.SigDao.updateSSETunnelStatus(SigDao.java:174) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at com.viptela.vmanage.server.scheduler.SSETunnelUpdate.run(SSETunnelUpdate.java:62) ~[vmanage-server-1.0.0-SNAPSHOT.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
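
For anyone reading traces like these, the useful part is usually the innermost "Caused by:" line. Here it is "HTTP error code : 409", and 409 is the standard HTTP "Conflict" status. The log does not say what the request to enable the upgrade coordinator service conflicted with. A minimal Python sketch for pulling each log entry together with its deepest cause follows. The regular expression and function name are assumptions based only on the line format shown above.

```python
import re

# Matches the start of an entry such as:
# 16-May-2024 07:19:40,223 UTC ERROR [] [] [Component] (thread) || message
ENTRY_RE = re.compile(r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2},\d{3}) UTC (\w+)\s")


def summarize_entries(log_text):
    """Return one dict per log entry with its level, message and deepest cause."""
    entries = []
    for raw in log_text.splitlines():
        line = raw.strip()
        match = ENTRY_RE.match(line)
        if match:
            message = line.split("||", 1)[-1].strip()
            entries.append({
                "time": match.group(1),
                "level": match.group(2),
                "message": message,
                "cause": None,
            })
        elif line.startswith("Caused by:") and entries:
            # Later 'Caused by:' lines overwrite earlier ones, leaving the deepest.
            entries[-1]["cause"] = line[len("Caused by:"):].strip()
    return entries


# Two lines copied from the trace above, used as sample input.
sample = (
    "16-May-2024 07:19:40,223 UTC ERROR [] [] [ChangePartitionActionProcessor] "
    "(device-action-change_partition-5) || Failed to process change partition "
    "for device 10.171.247.220\n"
    "Caused by: java.lang.RuntimeException: Failed : HTTP error code : 409\n"
)
for entry in summarize_entries(sample):
    print(entry["time"], entry["level"], entry["message"], "<-", entry["cause"])
```

Run against the full log, this makes it easier to separate the change-partition failure from unrelated entries such as the SSE credentials error, which has no "Caused by:" line at all.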

Again, any help moving forward is much appreciated!

Rajeev Sharma · ‎06-17-2024

Could you also capture "request nms all diag"?

UnspokenDrop7 · ‎07-24-2024

Sure, the output is attached!

Osvaldo Salazar Tovar · ‎07-27-2024

Hi,

Does it give you the same error if you try from CLI?
vManage# request software activate [your version number]

UnspokenDrop7 · ‎07-29-2024

The new version (20.14.1) is not listed as an option when I try to run that command. Only the old and current versions, 20.12.x and 20.13.x, are listed.

So, now I'm trying to do the software install from CLI.

vmanage# request software install /opt/data/app-server/software/package/vmanage-20.14.1-x86_64.tar.gz

However, the CLI has been hanging on this command for more than 2 hours now, I think. I am not sure whether the session has timed out or whether it is still working on the installation. I think I saw that the maximum run time is 60 minutes by default, at least when you run these things from the WebUI.

In the meantime, I managed to install a fresh Manager and Validator last week, running 20.14.1. I have moved my 2 Edge devices to the new installation, which went really smoothly: download the "profile" file from the Smart Account PnP portal, install it on the new Manager, and then reconfigure the Edge devices to point to the new Validator (vBond) IP address.

So, let's drop this case!
Thanks for the help, much appreciated!