10-26-2016 06:00 PM - edited 03-01-2019 04:32 AM
Lost a lot of time today doing all the troubleshooting I can. A few of the core services won't stay running, so localhost:14141 doesn't work locally or remotely. TAC or anyone want to run with it from here?
router=4.0.0.14944 goes into a FATAL state, and some of the other services just exit and back off.
Since the services won't come up, I can't evacuate any host or manage the grape hosts, services, etc.
$ sudo service grapevine status
grapevine is running
grapevine_capacity_manager RUNNING pid 4372, uptime 0:13:25
grapevine_capacity_manager_lxc_plugin RUNNING pid 9794, uptime 0:00:31
grapevine_cassandra RUNNING pid 3799, uptime 0:13:42
grapevine_client BACKOFF Exited too quickly (process log may have details)
grapevine_coordinator_service RUNNING pid 3808, uptime 0:13:42
grapevine_dlx_service BACKOFF Exited too quickly (process log may have details)
grapevine_log_collector RUNNING pid 3811, uptime 0:13:42
grapevine_root RUNNING pid 5869, uptime 0:08:09
grapevine_supervisor_event_listener STARTING
grapevine_ui RUNNING pid 3797, uptime 0:13:42
reverse-proxy=4.0.0.14944 RUNNING pid 3802, uptime 0:13:42
router=4.0.0.14944 FATAL Exited too quickly (process log may have details)
(grapevine)
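For quick triage, supervisor-style status output like the above can be scanned programmatically. A minimal sketch; the helper name and the state list are mine, not part of the product, though the states (RUNNING, STARTING, BACKOFF, FATAL) match the output shown:

```python
# Hypothetical helper: flag any service in supervisord-style status
# output whose state is not RUNNING.
def unhealthy_services(status_text):
    """Return {service: state} for every line in a bad state."""
    bad_states = {"STARTING", "BACKOFF", "FATAL", "STOPPED", "EXITED"}
    bad = {}
    for line in status_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] in bad_states:
            bad[parts[0]] = parts[1]
    return bad

sample = """\
grapevine_client        BACKOFF  Exited too quickly
router=4.0.0.14944      FATAL    Exited too quickly
grapevine_ui            RUNNING  pid 3797, uptime 0:13:42
"""
print(unhealthy_services(sample))
# → {'grapevine_client': 'BACKOFF', 'router=4.0.0.14944': 'FATAL'}
```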
EDIT: I can probably hack my way through this and figure a few things out, but I'd rather work here with some visibility. I don't want to bother Nick in TAC. I'll jump through some services, and if I end up breaking it enough I'll rebuild.
11-01-2016 08:45 AM
I fixed it without running a reset, but here is what I got from TAC after escalation:
Please perform the following steps to bring the cluster back to a clean/running state:
1. Ensure both VMs are powered "on"
2. SSH into one of the VMs and run the following command:
* reset_grapevine
3. A series of prompts will be presented, asking whether to delete specific data/configuration. Since the customer wants to keep their cluster data, answer "no" to each prompt.
After answering all the prompts, the command will proceed to reset the cluster back to a clean/running state with their data.
Depending on the speed of their hardware, this operation will take around 30-60 minutes to complete.
10-26-2016 08:23 PM
Hi,
We have changed port 14141; it should now redirect to
https://<apic>/controllerDevelopment
A common startup issue is NTP. Can you verify that the NTP server is reachable?
10-27-2016 07:19 AM
Sure, that redirect is handled by the reverse-proxy/router, but over SSH on the box the CLI tries to connect to localhost:14141 to run service commands. All 'grape' commands return "localhost:14141 unavailable".
NTP is good.
10-27-2016 08:23 AM
The core services can't connect to the message broker, which is RabbitMQ, right?
169.254.1.1:5672
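A quick way to test whether the broker port answers at all is a plain TCP connect. A minimal sketch; 169.254.1.1:5672 is the RabbitMQ endpoint mentioned above, so substitute whatever address your deployment actually uses:

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From a healthy node, port_reachable("169.254.1.1", 5672) should
# return True; False here would explain the core services failing.
```

This only proves the TCP port is open, not that AMQP authentication works, but it separates a networking problem from a broker-configuration problem.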
10-27-2016 10:39 AM
When you are redirected to controllerDevelopment, are you able to log in, or do you still get a login error? (Assuming you are not already logged in through the cluster and redirected, but making a fresh login attempt by typing the direct link to controllerDevelopment.)
A couple of grapevine core services in your previous output look to be in BACKOFF state. Did they recover at all?
I understand "grape instance status" might not be working. Is it the same case with the "grape application status" command?
Worst case, I'd suggest resetting grapevine so ALL the services come back up in their own sweet time; but of course, that's the last resort.
11-01-2016 10:17 AM
As suspected, TAC also recommends reset_grapevine. However, since you say you fixed it without a reset, I'm curious how you managed that.
That said, here are the recommended specs for UCS hardware for the cluster to deploy and run smoothly. Is your UCS hardware compliant with the following?
Requirement | Specification
Server Image Format | ISO
VMware ESXi Version | 5.1/5.5/6.0
Virtual CPU (vCPU) | Minimum required: 6; Recommended: 12
CPU (speed) | 2.4 GHz
Memory | 64 GB (for a multi-host deployment of 2 or 3 hosts, only 32 GB of RAM is required per host)
Disk Capacity | 500 GB
Disk I/O Speed | 200 MBps
Network Adapter | 1
Web Access | Required
Browser | Google Chrome version 50.0 or later; Firefox version 46.0 or later
Network Timing | To avoid conflicting time settings, it is recommended to disable time synchronization between the guest VM running Cisco APIC-EM and the ESXi host. Use NTP instead.
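A host can be checked against the minimums in the table above programmatically. A small sketch; the numbers come from the table (32 GB is the per-host multi-host minimum), while the host dict and metric names are illustrative, not real inventory data:

```python
# Minimums from the spec table above; ram_gb uses the multi-host
# per-host figure (a single-host deployment needs 64 GB).
MINIMUMS = {"vcpu": 6, "ram_gb": 32, "disk_gb": 500, "disk_mbps": 200}

def spec_gaps(host):
    """Return the checks a host fails, as {metric: (have, need)}."""
    return {k: (host[k], need) for k, need in MINIMUMS.items()
            if host[k] < need}

host = {"vcpu": 4, "ram_gb": 64, "disk_gb": 500, "disk_mbps": 200}
print(spec_gaps(host))
# → {'vcpu': (4, 6)}
```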
11-09-2016 06:45 AM
A process of shutdown, boot, evacuate, shutdown, boot, enable for each node. The evacuate/enable seems to be the trick for me, done during the brief window when the core services are working.
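Written out as data, the per-node sequence above looks like this. The action names paraphrase the post and are not verified grape CLI syntax; this sketch just makes the ordering explicit:

```python
# Per-node recovery sequence described above; each node is cycled
# fully before moving to the next.
def recovery_steps(nodes):
    """Yield (node, action) pairs in the order they should run."""
    sequence = ("shutdown", "boot", "evacuate", "shutdown", "boot", "enable")
    for node in nodes:
        for action in sequence:
            yield (node, action)

steps = list(recovery_steps(["node-1", "node-2"]))
print(steps[:3])
# → [('node-1', 'shutdown'), ('node-1', 'boot'), ('node-1', 'evacuate')]
```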