396 Views · 1 Helpful · 10 Replies
khgrant
Cisco Employee

NSO CDB Transaction Backlog and Locking

 

I am looking for troubleshooting guidance on NSO. We are working on VMS packages, and NSO got stuck for 10-15 minutes while executing a command. After that, the Java VM restarted and operations were restored. In between, transactions were still coming in.

 

<ERR> 16-Apr-2016::22:44:07.994 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Internal error: PAM error - "epam died"

<INFO> 17-Apr-2016::20:45:35.449 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - The NCS Java VM terminated
<CRIT> 17-Apr-2016::20:45:35.459 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-1-cisco-ios:IOSDp2 died
<CRIT> 17-Apr-2016::20:45:35.461 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-2-cisco-ios:IOSDp died
<CRIT> 17-Apr-2016::20:45:35.466 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-3-cisco-pnp:pnp-rfs died
<CRIT> 17-Apr-2016::20:45:35.473 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-15-day0:escday0- died
<CRIT> 17-Apr-2016::20:45:35.479 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-16-ipaddress-allocator:Addressallocationvalidation died
<CRIT> 17-Apr-2016::20:45:35.483 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-17-perfdata:PerfDataDp died
<CRIT> 17-Apr-2016::20:45:35.486 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-18-tailf-hcc:tcmCmdDp died
<CRIT> 17-Apr-2016::20:45:35.487 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-19-tailf-hcc:tcmDp died
<CRIT> 17-Apr-2016::20:45:35.487 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-20-vm-manager-esc:ServiceforVMstarting died
<CRIT> 17-Apr-2016::20:45:35.494 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-32-cloudvpn:CloudvpnActionCommands died
<CRIT> 17-Apr-2016::20:45:35.494 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-33-cloudvpn:Easydevicelisting died
<CRIT> 17-Apr-2016::20:45:35.495 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-34-cloudvpn:CVPNreleaseinfo died
<CRIT> 17-Apr-2016::20:45:35.495 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-35-cloudvpn:IPSManagerStackedRFS died
<CRIT> 17-Apr-2016::20:45:35.497 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-36-cloudvpn:IPSSensorStackedRFS died
<CRIT> 17-Apr-2016::20:45:35.497 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-37-cloudvpn:TemplateStackedRFS died
<CRIT> 17-Apr-2016::20:45:37.112 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Internal error: Supervision terminated

10 Replies
khgrant
Cisco Employee

 

The key is here:

 

 

<ERR> 16-Apr-2016::22:44:07.994 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Internal error: PAM error - "epam died"

 

Seems like a bug.

 

 

Check whether the file ncserr.log.1 is non-empty and run:

 

 

$ ncs --printlog ncserr.log.1

 

 

Just check whether there is a message from the same time as the issue.
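 

A minimal shell sketch of that check (the log directory path is an assumption; use wherever your NSO instance writes its logs):

$ cd /var/opt/ncs/logs                  # hypothetical path; use your installation's log directory
$ ls -l ncserr.log.1                    # confirm the error log is non-empty
$ ncs --printlog ncserr.log.1 | less    # decode it and look for entries around the incident time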

 

 

If so, you should open a ticket with support, quoting your customer ID.

 

khgrant
Cisco Employee

 

I'm not so sure about that.

 

 

During CVPN development (and deployment in DT) we saw this message over and over again - although we could never link the message to any specific badness. Actually we had PAM disabled in CVPN, so it was all a major mystery.

 

 

I (together with Per Hedeland) spent quite some time on that one at the time, but we could never figure out what was happening.

 

It wasn't possible to reproduce.

 

 

At the time I concluded it was a Red Hat PAM bug, but since we weren't using PAM (DT wanted explicit NSO users stored in CDB) we ignored it and moved on.

 

 

Anyways, the issue is still unresolved as far as I know.

 

 

/klacke

 

khgrant
Cisco Employee

 

+1. And it's not really fatal: it "should" just mean that an attempt to authenticate via PAM failed, but "epam" will be started again for the next attempt, and if the failure doesn't reproduce, that should be sufficient.

 

I guess it *could* be some bad stuff given in username/password that we haven't been able to figure out.

 

 

Also, in the log excerpt given by the OP, this occurred some 22 *hours* before the real badness started:

 

 

>>> <ERR> 16-Apr-2016::22:44:07.994 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Internal error: PAM error - "epam died"
>>>
>>> <INFO> 17-Apr-2016::20:45:35.449 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - The NCS Java VM terminated

 

 

Not likely to be related.

 

 

--Per

 

khgrant
Cisco Employee

 

Hmm, PAM is currently in use and I can see this message logged 1600 times a day. I will log a defect for it.

 

 

 

Here is the interesting part from when the JVM got restarted:

 

17-Apr-2016::20:45:33.959 <0.288.0> <0.291.0> ** Generic server pam_server terminating
** Last message in was {close,99641}
** When Server state == {state,99654,#Port<0.1087625>,"AS",[],"common-auth",
                               []}
** Reason for termination ==
** {badarg,[{erlang,port_command,[#Port<0.1087625>,[99,"99641",0]],[]},
            {pam_server,handle_call,3,[]},
            {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,601}]},
            {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
{"17-Apr-2016::20:45:33.961",
{error_report,<0.288.0>,
     {<0.291.0>,crash_report,
      [[{initial_call,{pam_server,init,['Argument__1']}},
        {pid,<0.291.0>},
        {registered_name,pam_server},
        {error_info,
            {exit,
                {badarg,
                    [{erlang,port_command,
                         [#Port<0.1087625>,[99,"99641",0]],
                         []},
                     {pam_server,handle_call,3,[]},
                     {gen_server,handle_msg,5,
                         [{file,"gen_server.erl"},{line,601}]},
                     {proc_lib,init_p_do_apply,3,
                         [{file,"proc_lib.erl"},{line,239}]}]},
                [{gen_server,terminate,6,[{file,"gen_server.erl"},{line,760}]},
                 {proc_lib,init_p_do_apply,3,
                     [{file,"proc_lib.erl"},{line,239}]}]}},
        {ancestors,[pam_sup,<0.289.0>]},
        {messages,[{'EXIT',#Port<0.1087625>,normal}]},
        {links,[<0.290.0>]},
        {dictionary,[]},
        {trap_exit,true},
        {status,running},
        {heap_size,1598},
        {stack_size,27},
        {reductions,12995273}],
       []]}}}
{"17-Apr-2016::20:45:33.963",
{error_report,<0.288.0>,
     {<0.290.0>,supervisor_report,
      [{supervisor,{local,pam_sup}},
       {errorContext,child_terminated},
       {reason,
           {badarg,
               [{erlang,port_command,[#Port<0.1087625>,[99,"99641",0]],[]},
                {pam_server,handle_call,3,[]},
                {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,601}]},
                {proc_lib,init_p_do_apply,3,
                    [{file,"proc_lib.erl"},{line,239}]}]}},
       {offender,
           [{pid,<0.291.0>},
            {name,pam_server},
            {mfargs,{pam_server,start_link,[]}},
            {restart_type,permanent},
            {shutdown,2000},
            {child_type,worker}]}]}}}
{"17-Apr-2016::20:45:33.964",
{error_report,<0.288.0>,
               {<0.290.0>,supervisor_report,
                [{supervisor,{local,pam_sup}},
                 {errorContext,shutdown},
                 {reason,reached_max_restart_intensity},
                 {offender,[{pid,<0.291.0>},
                            {name,pam_server},
                            {mfargs,{pam_server,start_link,[]}},
                            {restart_type,permanent},
                            {shutdown,2000},
                            {child_type,worker}]}]}}}
17-Apr-2016::20:45:35.240 <0.28.0> <0.121.0> confd_ia:494: Couldn't map 4 ("CLI") to server: {error,
                                                                                              undefined}
[{confd_ia,handle_connection,6,[]},
{confd_ia,acceptor,5,[]},
{proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]
17-Apr-2016::20:45:36.270 <0.9.0> <0.164.0> ha_vip: PID shutdown died restarting
{"17-Apr-2016::20:45:36.276",
{error_report,<0.28.0>,
     {<0.11571.157>,crash_report,
      [[{initial_call,{cs_usess,'-handle_info/2-fun-0-',[]}},
        {pid,<0.11571.157>},
        {registered_name,[]},
        {error_info,
            {exit,
                {noproc,
                    {gen_server,call,[cdb,{req,close_usess,1,[]},infinity]}},
                [{gen_server,call,3,[{file,"gen_server.erl"},{line,199}]},
                 {lists,foreach,2,[{file,"lists.erl"},{line,1323}]},
                 {cs_usess,cleanup_usess,3,[]},
                 {proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]}},
        {ancestors,[cs_usess,confd_second_sup,confd_sup,<0.29.0>]},
        {messages,[]},
        {links,[]},
        {dictionary,[]},
        {trap_exit,false},
        {status,running},
        {heap_size,610},
        {stack_size,27},
        {reductions,184}],
       []]}}}
17-Apr-2016::20:45:36.283 <0.9.0> <0.754.0> restarting pnp server
{"17-Apr-2016::20:45:36.288",
{error_report,<0.28.0>,
     {<0.10924.157>,crash_report,
      [[{initial_call,{cs_usess,'-handle_info/2-fun-0-',[]}},
        {pid,<0.10924.157>},
        {registered_name,[]},
        {error_info,
            {exit,
                {noproc,
                    {gen_server,call,[cdb,{req,close_usess,35,[]},infinity]}},
                [{gen_server,call,3,[{file,"gen_server.erl"},{line,199}]},
                 {lists,foreach,2,[{file,"lists.erl"},{line,1323}]},
                 {cs_usess,cleanup_usess,3,[]},
                 {proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]}},
        {ancestors,[cs_usess,confd_second_sup,confd_sup,<0.29.0>]},
        {messages,[]},
        {links,[]},
        {dictionary,[]},
        {trap_exit,false},
        {status,running},
        {heap_size,376},
        {stack_size,27},
        {reductions,167}],
       []]}}}
17-Apr-2016::20:45:36.293 <0.9.0> <0.10376.157> econfd_internal:79: econfd_internal: Client Id 3 not registered by confd_ia: {error,
                                                                                                                              undefined}
[{econfd_internal,connect,4,[]},
{econfd_daemon,ctl_cnct,3,[]},
{econfd_daemon,do_cnct,3,[]},
{econfd_daemon,init,1,[]},
{gen_server,init_it,6,[{file,"gen_server.erl"},{line,315}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]
{"17-Apr-2016::20:45:36.295",
{error_report,<0.9.0>,
     {<0.10376.157>,crash_report,
      [[{initial_call,{econfd_daemon,init,['Argument__1']}},
        {pid,<0.10376.157>},
        {registered_name,[]},
        {error_info,
            {exit,econnrefused,
                [{gen_server,init_it,6,[{file,"gen_server.erl"},{line,331}]},
                 {proc_lib,init_p_do_apply,3,
                     [{file,"proc_lib.erl"},{line,239}]}]}},
        {ancestors,[pnp,<0.754.0>,<0.173.0>,<0.172.0>]},
        {messages,[]},
        {links,[<0.11939.157>]},
        {dictionary,[{sock_int,1}]},
        {trap_exit,false},
        {status,running},
        {heap_size,376},
        {stack_size,27},
        {reductions,147}],
       [{neighbour,
            [{pid,<0.11939.157>},
             {registered_name,pnp},
             {initial_call,{pnp,init,['Argument__1']}},
             {current_function,{proc_lib,sync_wait,2}},
             {ancestors,[<0.754.0>,<0.173.0>,<0.172.0>]},
             {messages,[{ack,<0.10376.157>,{error,econnrefused}}]},
             {links,[<0.10376.157>]},
             {dictionary,[]},
             {trap_exit,false},
             {status,runnable},
             {heap_size,233},
             {stack_size,19},
             {reductions,92}]}]]}}}
{"17-Apr-2016::20:45:36.296",
{error_report,<0.9.0>,
     {<0.754.0>,crash_report,
      [[{initial_call,{pnp,'-start/0-fun-0-',[]}},
        {pid,<0.754.0>},
        {registered_name,[]},
        {error_info,
            {error,
                {badmatch,{error,econnrefused}},
                [{pnp,monitor_pnp,1,[{file,"pnp.erl"},{line,333}]},
                 {proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]}},
        {ancestors,[<0.173.0>,<0.172.0>]},
        {messages,[]},
        {links,[]},
        {dictionary,[]},
        {trap_exit,false},
        {status,running},
        {heap_size,376},
        {stack_size,27},
        {reductions,113}],
       []]}}}
{"17-Apr-2016::20:45:36.297",
{error_report,<0.28.0>,
     {<0.11881.157>,crash_report,
      [[{initial_call,{cs_usess,'-handle_info/2-fun-0-',[]}},
        {pid,<0.11881.157>},
        {registered_name,[]},
        {error_info,
            {exit,
                {noproc,
                    {gen_server,call,[cdb,{req,close_usess,34,[]},infinity]}},
                [{gen_server,call,3,[{file,"gen_server.erl"},{line,199}]},
                 {lists,foreach,2,[{file,"lists.erl"},{line,1323}]},
                 {cs_usess,cleanup_usess,3,[]},
                 {proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]}},
        {ancestors,[cs_usess,confd_second_sup,confd_sup,<0.29.0>]},
        {messages,[]},
        {links,[]},
        {dictionary,[]},
        {trap_exit,false},
        {status,running},
        {heap_size,376},
        {stack_size,27},
        {reductions,175}],
       []]}}}
{"17-Apr-2016::20:45:36.298",
{error_report,<0.28.0>,
     {<0.12740.157>,crash_report,
      [[{initial_call,{cs_usess,'-handle_info/2-fun-0-',[]}},
        {pid,<0.12740.157>},
        {registered_name,[]},
        {error_info,
            {exit,
                {noproc,
                    {gen_server,call,[cdb,{req,close_usess,16,[]},infinity]}},
                [{gen_server,call,3,[{file,"gen_server.erl"},{line,199}]},
                 {lists,foreach,2,[{file,"lists.erl"},{line,1323}]},
                 {cs_usess,cleanup_usess,3,[]},
                 {proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]}},
        {ancestors,[cs_usess,confd_second_sup,confd_sup,<0.29.0>]},
        {messages,[]},
        {links,[]},
        {dictionary,[]},
        {trap_exit,false},
        {status,running},
        {heap_size,376},
        {stack_size,27},
        {reductions,167}],
       []]}}}
17-Apr-2016::20:45:38.113 <0.28.0> <0.32.0> init:boot_msg: "Internal error: Supervision terminated\n"
17-Apr-2016::20:46:18.586 <0.288.0> <0.291.0> epam port program died
17-Apr-2016::20:46:41.110 <0.288.0> <0.291.0> epam port program died
… SNIP

 

 

BR,

 

Bostjan

 

 

khgrant
Cisco Employee

 

Ah, I forgot to ask the original question again. Where can I see what NSO is currently processing in the Java VM, and what is getting it stuck? Can I see the transaction backlog that formed during that time?

 

 

BR,

 

Bostjan

 

khgrant
Cisco Employee

 

Please do - it would be great if we could nail down this problem.

 

--Per

khgrant
Cisco Employee

 

Hi,

 

  

Ah, I forgot to ask the original question again. Where can I see what NSO is currently processing in the Java VM, and what is getting it stuck? Can I see the transaction backlog that formed during that time?

  

 

Have you turned on Java logging?

That can be done via the CLI.

The java-vm can be busy with a lot of tasks, depending on how many packages you have.

It would also be good to narrow it down to some APIs, for instance:

 

To debug-log CLI or generic NED calls:

 

set java-vm java-logging logger com.tailf.ned level level-debug

 

 

For services (or other Dp callbacks):

 

set java-vm java-logging logger com.tailf.dp level level-debug

 

 

For CDB-subscribing application components:

 

set java-vm java-logging logger com.tailf.cdb level level-debug
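 

As a rough illustration (not verified against a specific NSO release, and assuming the J-style CLI to match the "set" syntax above and the default admin user), the commands could be applied in one session like this:

$ ncs_cli -u admin                                              # assumption: J-style CLI, admin user
> configure                                                     # enter configuration mode
% set java-vm java-logging logger com.tailf.ned level level-debug
% set java-vm java-logging logger com.tailf.dp level level-debug
% set java-vm java-logging logger com.tailf.cdb level level-debug
% commit                                                        # activate the new log levels
% exit

The debug output then typically ends up in the Java VM log under the NSO logs directory.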

 

 

You can expect a lot of logging output, and if something in Java deadlocks or waits, it can still be more or less hard to pinpoint.

 

 

Another way is to connect a debugger to the java-vm (described in the development guide).

 

Then you can find individual threads and suspend some suspects to see what they do.
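 

If attaching a full debugger is too heavy, a read-only thread dump often answers the "what is the Java VM doing right now" question. This is a sketch using standard JDK tools, not anything NSO-specific:

$ jps -l                      # list running Java processes with their main classes
$ jstack <ncs-java-vm-pid>    # dump all thread stacks; blocked or waiting threads show up here

Taking two or three dumps a minute apart and comparing them makes long-blocked threads stand out.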

 

 

jvisualvm is a tool that you can connect to the VM to see individual threads and some information about what they are doing.

 

 

 

/Try

 

khgrant
Cisco Employee

 

Hi Bostjan, Per,

 

 

We have an SR for the same.

 

 

 

I think there are two issues here.

 

One is where we have "Internal error: PAM error - "epam died"" --> from reading the thread:

 

.. this seems to be a day-1 issue and has not been resolved.

 

.. epam is being used heavily (mainly for access by northbound devices, read portals)

 

Q1: Do we have a bug on this one (internal or CDETS)?

 

 

The other one is a race condition where NCS restarted because epam died.

 

The support team identified it as #14395.

 

This is what I got from them:

 

 

"This epam client died "exactly" when the request, initiated by the webui, was sent to the epam but it failed since it just terminated and hadn't been restarted yet. ³

 

 

Q2: I guess the request could come from anywhere to trigger this restart (not only webui), right?

 

 

FYI, in the logs I frequently see "epam died" messages:

 

 

<ERR> 24-Apr-2016::02:04:14.933 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 24-Apr-2016::08:43:50.033 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 25-Apr-2016::23:49:42.848 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 26-Apr-2016::18:14:35.177 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 27-Apr-2016::07:49:37.606 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 28-Apr-2016::12:17:06.487 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 29-Apr-2016::06:27:21.107 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 2-May-2016::02:29:28.093 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 3-May-2016::00:17:48.362 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 3-May-2016::00:17:48.364 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 3-May-2016::08:21:05.173 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"

 

 

 

 

 

Thanks,

 

 

Peter

 

khgrant
Cisco Employee

 

Hi Peter,

 

 

I opened a defect for epam dying (CSCuz29444).

 

 

Can't comment on your second Q.

 

 

BR,

 

Bostjan

 

khgrant
Cisco Employee

 

Hi Peter,

 

 

About your Q2: the "request from webui" explanation was probably due to an incorrect initial analysis by me. In fact, in the current code the issue in our ticket #14395 is *never* triggered by the webui, but only by SSH sessions, i.e. NETCONF or CLI access via the NSO-internal SSH server, and specifically when such a session terminates.

 

 

Regarding Q1, the "epam died" in general, this can be triggered by any NB agent using password authentication when PAM is enabled. Since we have not been able to reproduce this, we need help from someone that encounters it, as a first step trying to get hold of a core dump from "epam" when it dies. I believe the support team has discussed with you regarding how to go about this.
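 

A minimal sketch of preparing the host so a core dump is actually produced next time epam dies (standard Linux settings; the exact steps depend on how the NSO daemon is started, e.g. systemd versus an init script):

$ ulimit -c unlimited                  # in the shell that starts ncs, allow core files to be written
$ cat /proc/sys/kernel/core_pattern    # check where the kernel writes core files
# After the next "epam died", look for a core file at that location and attach it to the ticket.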

 

 

--Per

 
