
1.3 upgrade and core services failure

Joshua Warcop
Level 5

Lost a lot of time so far today doing all the troubleshooting I can. A few of the core services won't stay running, so localhost:14141 doesn't work locally or remotely. TAC or anyone want to run with it from here?

router=4.0.0.14944 goes into a FATAL state, and some of the others just exit and back off.

Since the services are not coming up, I can't evacuate any host or deal with the grape hosts, services, etc.

$ sudo service grapevine status

grapevine is running

grapevine_capacity_manager              RUNNING   pid 4372, uptime 0:13:25

grapevine_capacity_manager_lxc_plugin   RUNNING   pid 9794, uptime 0:00:31

grapevine_cassandra                     RUNNING   pid 3799, uptime 0:13:42

grapevine_client                        BACKOFF   Exited too quickly (process log may have details)

grapevine_coordinator_service           RUNNING   pid 3808, uptime 0:13:42

grapevine_dlx_service                   BACKOFF   Exited too quickly (process log may have details)

grapevine_log_collector                 RUNNING   pid 3811, uptime 0:13:42

grapevine_root                          RUNNING   pid 5869, uptime 0:08:09

grapevine_supervisor_event_listener     STARTING

grapevine_ui                            RUNNING   pid 3797, uptime 0:13:42

reverse-proxy=4.0.0.14944               RUNNING   pid 3802, uptime 0:13:42

router=4.0.0.14944                      FATAL     Exited too quickly (process log may have details)

(grapevine)

EDIT: I can probably hack my way through this and figure a few things out, but I'd rather work here with some visibility; I don't want to bother Nick in TAC. I'll jump through some services, and if I end up breaking it badly enough, I'll rebuild.
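
The status format above (RUNNING/BACKOFF/FATAL with pids and uptimes) looks like supervisord output, so, assuming the grapevine services really are managed by supervisord, the process logs that the "process log may have details" messages point at can be pulled with supervisorctl. A sketch; the service names are copied from the output above:

$ # Tail the log of the FATAL router process (stderr often has the exit reason):
$ sudo supervisorctl tail router=4.0.0.14944
$ sudo supervisorctl tail router=4.0.0.14944 stderr
$ # Same for the services stuck in BACKOFF:
$ sudo supervisorctl tail grapevine_client
$ sudo supervisorctl tail grapevine_dlx_service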


7 Replies

aradford
Cisco Employee

Hi,

We have changed port 14141. It should redirect to

https://<apic>/controllerDevelopment

A common startup issue is NTP. Can you verify that the NTP server is reachable?
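
If it helps, a minimal sketch for checking NTP reachability from the VM, assuming the usual ntpq/ntpdate tools are present; <ntp-server> is a placeholder for whatever server is configured:

$ # Peer list; a '*' in the first column marks a synced peer:
$ ntpq -pn
$ # Query-only check against a specific server (does not step the clock):
$ sudo ntpdate -q <ntp-server>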

Sure, that redirect is handled by the reverse-proxy/router, but from an SSH session on the box, the CLI is still trying to connect to localhost:14141 to run service commands. All 'grape' commands return localhost:14141 unavailable.

NTP is good.

Joshua Warcop
Level 5

The core services can't connect to the message broker, which is RabbitMQ, right?

169.254.1.1:5672
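
A quick way to confirm whether the broker port is even reachable, assuming nc is available on the box; the rabbitmqctl check only applies if the RabbitMQ tooling is installed locally rather than inside a container:

$ # TCP-level check against the broker address above:
$ nc -zv 169.254.1.1 5672
$ # If the RabbitMQ tooling is local, its own status report is more telling:
$ sudo rabbitmqctl status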

When you are redirected to controllerDevelopment, are you able to log in? Or do you still get a login error? (Assuming you are not already logged in through the cluster and redirected, but are making a fresh login attempt by typing the direct link to controllerDevelopment.)

A couple of grapevine core services from your previous output look to be in a BACKOFF state. Did they recover at all?

I understand "grape instance status" might not be working. Is it the same with the "grape application status" command?
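
For comparison, these are the two status commands referenced in this thread; what each returns depends on which core services are up:

$ grape instance status
$ grape application status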

Worst case, I'd suggest resetting the grapevine so ALL the services come back up in their own sweet time; but of course, that's the last resort.

I fixed without running a reset, but here is what I got from TAC after escalation -

Please perform the following steps to bring the cluster back to a clean/running state:

   1.  Ensure both VMs are powered on.

   2.  SSH into one of the VMs and run the following command:

      *   reset_grapevine

   3.  A series of prompts will be presented, asking whether to delete specific data/configuration. Since the customer wants to save their cluster data, specify "no" for each prompt/question presented.

After answering all the prompts, the command will proceed to reset the cluster back to a clean/running state with their data.

Depending on the speed of their hardware, this operation will take around 30-60 minutes to complete.

As suspected, TAC also recommends reset_grapevine. However, since you say you fixed it without a reset, I'm curious to know how you managed that.

Having said that, here are the recommended specs for the UCS hardware for the cluster to deploy and run smoothly. Is your UCS hardware compliant with the following specs?

Requirement              Specification
Server Image Format      ISO
VMware ESXi Version      5.1/5.5/6.0
Virtual CPU (vCPU)       Minimum required: 6; recommended: 12
CPU (speed)              2.4 GHz
Memory                   64 GB (for a multi-host deployment of 2 or 3 hosts, only 32 GB of RAM is required per host)
Disk Capacity            500 GB
Disk I/O Speed           200 MBps
Network Adapter          1
Web Access               Required
Browser                  Google Chrome version 50.0 or later, or Firefox version 46.0 or later
Network Timing           To avoid conflicting time settings, it is recommended to disable time synchronization between the guest VM running Cisco APIC-EM and the ESXi host. Use NTP instead.

A process of shutdown, boot, evacuate, shutdown, boot, enable for each node. The evacuate/enable sequence seems to be the trick for me during the brief window the core services are working.
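
In shell terms, roughly this per node, using the grape host verbs mentioned in this thread; <host_id> and the exact argument syntax are assumptions, so verify against the CLI help on the box:

$ grape host display                # identify the host
$ # ...shut the node down, boot it, then during the window the services are up:
$ grape host evacuate <host_id>
$ # ...shut the node down and boot it again, then:
$ grape host enable <host_id>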