SD-WAN manager refuses to start nms services

stevesmith741 · ‎08-01-2024

Hi community, I've been beating my head against a brick wall for days now and I'm over it. I installed the controller nodes in a CML2 lab following the procedure at https://www.youtube.com/watch?v=CQruni5x8Vk, and despite many attempts I cannot get the NMS services to start on the manager node. Outputs from the nms requests are below; they were taken 2 hours after booting. All the search results for these conditions just talk about ensuring there is enough RAM (32GB) and that a second data volume has been created for the database (I have one: 250GB, formatted, which was a very slow process). VPN 0 is configured per the advice in the video and all controllers can ping each other. There must surely be some useful debugging tips out there? Am I the only one unable to get this to work?

vmanage# request nms all status
NMS service proxy
Enabled: false
Status: not running
NMS service proxy rate limit
Enabled: false
Status: not running
NMS application server
Enabled: false
Status: not running
NMS configuration database
Enabled: false
Status: not running
NMS coordination server
Enabled: false
Status: not running
NMS messaging server
Enabled: false
Status: not running
NMS statistics database
Enabled: false
Status: not running
NMS data collection agent
Enabled: false
Status: not running
NMS CloudAgent v2
Enabled: false
Status: not running
NMS cloud agent
Enabled: false
Status: not running
NMS SDAVC server
Enabled: false
Status: not running
NMS SDAVC gateway
Enabled: false
Status: not running
vManage Device Data Collector
Enabled: false
Status: not running
NMS OLAP database
Enabled: false
Status: not running
vManage Reporting
Enabled: false
Status: not running

vmanage# request nms all diag
NMS service server is disabled
NMS application server is disabled
NMS configuration database is disabled
NMS statistics database is disabled
NMS data collection agent is disabled
NMS coordination server is disabled
NMS container manager is disabled
NMS SDAVC server is disabled on this vmanage node
NMS Device Data Collector is disabled
NMS OLAP database is disabled
This action is not supported

vmanage# request nms all start
NMS statistics database configuration setting is currently unavailable
NMS configuration database configuration setting is currently unavailable
NMS coordination server configuration setting is currently unavailable
NMS messaging server configuration setting is currently unavailable
NMS application server configuration setting is currently unavailable
NMS service proxy configuration setting is currently unavailable
NMS cloud agent is disabled
NMS CloudAgent v2 configuration setting is currently unavailable
NMS OLAP database configuration setting is currently unavailable
vManage Device Data Collector configuration setting is currently unavailable
vManage Reporting configuration setting is currently unavailable
NMS SDAVC server configuration setting is currently unavailable

vmanage# request nms all restart
Stop was not successful for the service sdavc. Retrying
Stop was not successful for the service cloudagent-v2. Retrying
Successfully stopped NMS cloud agent
Stop was not successful for the service service-proxy. Retrying
Stop was not successful for the service application-server. Retrying
Stop was not successful for the service messaging-server. Retrying
Stop was not successful for the service coordination-server. Retrying
Stop was not successful for the service configuration-db. Retrying
Stop was not successful for the service statistics-db. Retrying
Stop was not successful for the service device-data-collector. Retrying
Stop was not successful for the service olap-db. Retrying
Stop was not successful for the service reporting. Retrying
NMS statistics database configuration setting is currently unavailable
NMS configuration database configuration setting is currently unavailable
NMS coordination server configuration setting is currently unavailable
NMS messaging server configuration setting is currently unavailable
NMS application server configuration setting is currently unavailable
NMS service proxy configuration setting is currently unavailable
NMS cloud agent is disabled
NMS CloudAgent v2 configuration setting is currently unavailable
NMS OLAP database configuration setting is currently unavailable
vManage Device Data Collector configuration setting is currently unavailable
vManage Reporting configuration setting is currently unavailable
NMS SDAVC server configuration setting is currently unavailable
vmanage#
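
For anyone comparing outputs like the status listing above, a minimal Python sketch that turns pasted `request nms all status` text into a per-service table may help. It assumes the three-line layout shown in this post (service name, then `Enabled:`, then `Status:`). The names `parse_nms_status` and `not_running` are made up for illustration and are not part of any Cisco tooling.

```python
def parse_nms_status(text):
    """Parse pasted 'request nms all status' output into {service: {...}}.

    Assumes each service is a name line followed by 'Enabled:' and
    'Status:' lines, as in the output pasted in this thread.
    """
    services = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and CLI prompt lines such as 'vmanage# request ...'.
        if not line or line.split()[0].endswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep and key in ("Enabled", "Status") and current is not None:
            services[current][key.lower()] = value.strip()
        else:
            # Any other line is treated as the start of a new service entry.
            current = line
            services[current] = {}
    return services


def not_running(services):
    """List services whose status text does not begin with 'running'."""
    return [
        name
        for name, info in services.items()
        if not info.get("status", "").startswith("running")
    ]


sample = """vmanage# request nms all status
NMS service proxy
Enabled: false
Status: not running
NMS application server
Enabled: false
Status: not running
"""

if __name__ == "__main__":
    for name in not_running(parse_nms_status(sample)):
        print(name)
```

The status check uses a prefix match rather than equality on the assumption that a running service may report extra detail after the word "running".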

ericgar · ‎08-12-2024

Hi Steve, this is Eric from the SD-WAN TAC team.

I understand you have issues with the NMS services of your lab.

What is your vManage version?
Is this a brand new deployment?
Was it working before? If so, what was the version that was running before?
Please follow the process below:
- Access vshell:
  vshell
- Check the logs:
  tail -f /var/log/nms/vmanage-server.log
- Try to bring the services up:
  request nms all restart
- Capture any errors you see on the command line and share them with me.
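
If the expected log file is not present, it can help to see which files in the log directory were written most recently. The following is a minimal Python sketch of that check. It assumes Python 3 is available wherever the logs are (for example on a workstation holding a copy of `/var/log/nms`); the helper names `newest_logs` and `tail_lines` are hypothetical.

```python
from pathlib import Path


def newest_logs(log_dir, limit=5):
    """Return up to `limit` regular files in log_dir, newest first."""
    directory = Path(log_dir)
    if not directory.is_dir():
        return []  # directory missing: nothing to report
    files = [p for p in directory.iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


def tail_lines(path, count=20):
    """Return the last `count` lines of a text file."""
    with open(path, errors="replace") as handle:
        return handle.readlines()[-count:]


if __name__ == "__main__":
    # Directory taken from the tail command above; adjust for a local copy.
    for log in newest_logs("/var/log/nms"):
        print(log.name)
```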

stevesmith741 · ‎08-12-2024

Hi Eric, thanks for coming to my aid! The details you requested are copied below.

Steve

vManage version 20.13.1

New deployment

Never worked

vmanage:~$ tail -f /var/log/nms/vmanage-server.log
tail: cannot open '/var/log/nms/vmanage-server.log' for reading: No such file or directory
tail: no files remaining

vmanage# request nms all restart
Stop was not successful for the service sdavc. Retrying
Found network_error when stopping NMS SDAVC server
Stop was not successful for the service cloudagent-v2. Retrying
Found network_error when stopping NMS CloudAgent v2
Successfully stopped NMS cloud agent
Stop was not successful for the service service-proxy. Retrying
Found network_error when stopping NMS service proxy
Stop was not successful for the service application-server. Retrying
Found network_error when stopping NMS application server
Stop was not successful for the service messaging-server. Retrying
Found network_error when stopping NMS messaging server
Stop was not successful for the service coordination-server. Retrying
Found network_error when stopping NMS coordination server
Stop was not successful for the service configuration-db. Retrying
Found network_error when stopping NMS configuration database
Stop was not successful for the service statistics-db. Retrying
Found network_error when stopping NMS statistics database
Stop was not successful for the service device-data-collector. Retrying
Found network_error when stopping vManage Device Data Collector
Stop was not successful for the service olap-db. Retrying
Found network_error when stopping NMS OLAP database
Stop was not successful for the service reporting. Retrying
Found network_error when stopping vManage Reporting
NMS statistics database configuration setting is currently unavailable
NMS configuration database configuration setting is currently unavailable
NMS coordination server configuration setting is currently unavailable
NMS messaging server configuration setting is currently unavailable
NMS application server configuration setting is currently unavailable
NMS service proxy configuration setting is currently unavailable
NMS cloud agent is disabled
NMS CloudAgent v2 configuration setting is currently unavailable
NMS OLAP database configuration setting is currently unavailable
vManage Device Data Collector configuration setting is currently unavailable
vManage Reporting configuration setting is currently unavailable
NMS SDAVC server configuration setting is currently unavailable
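
The restart output above repeats three kinds of line: a failed stop, a named error, and an unavailable configuration setting. A minimal Python sketch for grouping pasted output by those three patterns follows. The regular expressions are written against the exact wording in this thread and may not match other releases; `summarize_restart` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Patterns copied from the wording of the output pasted in this thread.
STOP_FAILED = re.compile(r"^Stop was not successful for the service (\S+)\. Retrying$")
ERROR_FOUND = re.compile(r"^Found (\w+) when stopping (.+)$")
UNAVAILABLE = re.compile(r"^(.+) configuration setting is currently unavailable$")


def summarize_restart(text):
    """Group 'request nms all restart' output lines by failure type."""
    stop_failed = []
    errors = defaultdict(list)
    unavailable = []
    for raw in text.splitlines():
        line = raw.strip()
        match = STOP_FAILED.match(line)
        if match:
            stop_failed.append(match.group(1))
            continue
        match = ERROR_FOUND.match(line)
        if match:
            # Key by error name, e.g. 'network_error', value is the service.
            errors[match.group(1)].append(match.group(2))
            continue
        match = UNAVAILABLE.match(line)
        if match:
            unavailable.append(match.group(1))
    return {
        "stop_failed": stop_failed,
        "errors": dict(errors),
        "unavailable": unavailable,
    }


if __name__ == "__main__":
    sample = (
        "Stop was not successful for the service sdavc. Retrying\n"
        "Found network_error when stopping NMS SDAVC server\n"
        "NMS SDAVC server configuration setting is currently unavailable\n"
    )
    print(summarize_restart(sample))
```

Lines that match none of the patterns (for example "Successfully stopped NMS cloud agent") are ignored, so the summary only shows what went wrong.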

stevesmith741 · ‎08-12-2024

Not sure if any of these outputs could suggest a cause:

This one shows up roughly once a minute:

vmanage:/var/log$ tail nms/vmanage-runutils.log
13-Aug-2024 12:40:25,598 AEST INFO [] [EncryptionFacility] (main) || CryptoKey store initialized
13-Aug-2024 12:40:25,605 AEST ERROR [] [Multiplexer] (main) || Failed to generate password for UC user

This one seems to be to do with the vbond session:

vmanage:/var/log$ tail vdebug
Aug 13 01:19:28 inserthostname-here VTRACKER[4991]: monitor_recv_icmp_echo_rsp[366]: Received unexpected session id 0x004d from 10.1.1.2, expected 0x606e
Aug 13 01:19:28 inserthostname-here VTRACKER[4991]: monitor_probe_handle_recv_result[284]: Error processing response from 0/10.1.1.2/0 via none
Aug 13 01:49:32 inserthostname-here SYSMGR[919]: %Viptela-vmanage-sysmgrd-6-INFO-1400002: Notification: system-logout-change severity-level:minor host-name:"vmanage" system-ip:100.1.1.1 user-name:"cisco" user-id:17 generated-at:8-13-2024T1:49:32
Aug 13 02:10:27 inserthostname-here SYSMGR[919]: %Viptela-vmanage-sysmgrd-6-INFO-1400002: Notification: system-login-change severity-level:minor host-name:"vmanage" system-ip:100.1.1.1 user-name:"cisco" user-id:22 generated-at:8-13-2024T2:10:27
Aug 13 02:10:57 inserthostname-here VTRACKER[4991]: monitor_recv_icmp_echo_rsp[366]: Received unexpected session id 0x450a from 10.1.1.2, expected 0x5dbc
Aug 13 02:10:57 inserthostname-here VTRACKER[4991]: monitor_probe_handle_recv_result[284]: Error processing response from 0/10.1.1.2/0 via none
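
To see whether the session ID mismatches cluster on one peer, the VTRACKER lines can be pulled out of a copy of the vdebug log and counted. This is a minimal Python sketch assuming the message wording shown above; it only extracts and counts the values and does not interpret what the mismatch means. The function names are hypothetical.

```python
import re
from collections import Counter

# Matches the 'Received unexpected session id' wording seen in this thread.
MISMATCH = re.compile(
    r"Received unexpected session id (0x[0-9a-fA-F]+) from (\S+), "
    r"expected (0x[0-9a-fA-F]+)"
)


def session_mismatches(text):
    """Return one dict per mismatch line: peer, received ID, expected ID."""
    found = []
    for line in text.splitlines():
        match = MISMATCH.search(line)
        if match:
            received, peer, expected = match.groups()
            found.append({"peer": peer, "received": received, "expected": expected})
    return found


def mismatches_per_peer(text):
    """Count mismatch lines per peer address."""
    return Counter(item["peer"] for item in session_mismatches(text))


if __name__ == "__main__":
    sample = (
        "Aug 13 01:19:28 inserthostname-here VTRACKER[4991]: "
        "monitor_recv_icmp_echo_rsp[366]: Received unexpected session id "
        "0x004d from 10.1.1.2, expected 0x606e\n"
    )
    print(mismatches_per_peer(sample))
```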

SamuelGLN · ‎08-12-2024

Hi @stevesmith741

I had the same problem in my study lab, on two occasions. The first was caused by insufficient CPU and vRAM resources. You can check the requirements at the following link: https://www.cisco.com/c/en/us/td/docs/routers/sdwan/release/notes/compatibility-and-server-recommendations/server-requirements.html

The second occurred when I upgraded vManage. I solved it by creating a new VM and restoring vManage from a configuration-db backup. The procedure is at the following link: https://www.cisco.com/c/en/us/support/docs/routers/sd-wan/220305-standalone-vmanage-disaster-recovery.html

Best regards

stevesmith741 · ‎08-12-2024

Thanks Samuel, I've checked the resources allocated and I have 32G RAM and 16 vCPUs, both as recommended. The only difference was the disk allocation: I had 256G where 512G is recommended. I'm increasing that now and will report back if anything changes.

Steve

stevesmith741 · ‎08-12-2024

Changed disk size to 512. No change in behaviour so far.

stevesmith741 · ‎08-13-2024

OK, now I have a working manager again. I gave up on the one supplied with the CML2 reference platform as nothing seemed to get it working. I decided to go all the way and build a complete node definition from scratch, following the instructions at https://ether-net.com/2020/08/24/ccie-6-how-to-lab-cisco-sd-wan-in-cml2/. I chose to download the stable release recommended on CCO (20.9) rather than 20.13. So now I have the manager running 20.9, with the controller and orchestrator both on 20.13. Sounds like trouble! I'm not sure what the path of least resistance is going forward: downgrade the orchestrator and controller to 20.9, or upgrade the manager to 20.13? Or maybe just hope the two versions play nicely in the schoolyard!
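
When tracking a mixed-version setup like this one, note that release strings need to be compared numerically: as plain text "20.9" sorts after "20.13", which is the wrong way round. Below is a minimal Python sketch that parses the versions and lists which components are behind the newest one. The component names and the helper names are illustrative only, and whether a given mix is supported is a question for Cisco's compatibility documentation, not for this script.

```python
def version_tuple(version):
    """Convert a dotted release string such as '20.13.1' to (20, 13, 1)."""
    return tuple(int(part) for part in version.split("."))


def behind_newest(components):
    """Return the components whose version is lower than the newest one."""
    parsed = {name: version_tuple(ver) for name, ver in components.items()}
    newest = max(parsed.values())
    return {name: ver for name, ver in components.items() if parsed[name] < newest}


if __name__ == "__main__":
    # Versions as described in the post above.
    lab = {"manager": "20.9", "controller": "20.13", "orchestrator": "20.13"}
    for name, ver in behind_newest(lab).items():
        print(f"{name} is on {ver}, older than the newest component")
```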

stevesmith741 · ‎08-15-2024

I'm beginning to wonder whether it's actually possible to build a stable Cisco SD-WAN environment. After finally breaking through, getting the manager component running and being able to access the GUI, I now find the GUI has stopped responding for no apparent reason, despite my not having made any changes to it or the other controllers. I am pretty much on the verge of giving up. The product seems to be extremely fussy and unstable, the documentation is woeful, there are too many ways to run it up, and the instructions for various processes fail unless your environment is set up in a very specific manner. Clearly, if you don't have access to the TAC you shouldn't bother trying to run up this environment. It's taken weeks out of my life and I think I'm no closer than I was at the beginning!