I am looking for troubleshooting guidance on NSO. We are working on VMS packages, and NSO got stuck for 10-15 minutes while executing a command. After that the Java VM restarted and operations were restored. In the meantime, transactions kept coming in.
<ERR> 16-Apr-2016::22:44:07.994 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Internal error: PAM error - "epam died"
<INFO> 17-Apr-2016::20:45:35.449 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - The NCS Java VM terminated
<CRIT> 17-Apr-2016::20:45:35.459 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-1-cisco-ios:IOSDp2 died
<CRIT> 17-Apr-2016::20:45:35.461 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-2-cisco-ios:IOSDp died
<CRIT> 17-Apr-2016::20:45:35.466 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-3-cisco-pnp:pnp-rfs died
<CRIT> 17-Apr-2016::20:45:35.473 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-15-day0:escday0- died
<CRIT> 17-Apr-2016::20:45:35.479 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-16-ipaddress-allocator:Addressallocationvalidation died
<CRIT> 17-Apr-2016::20:45:35.483 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-17-perfdata:PerfDataDp died
<CRIT> 17-Apr-2016::20:45:35.486 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-18-tailf-hcc:tcmCmdDp died
<CRIT> 17-Apr-2016::20:45:35.487 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-19-tailf-hcc:tcmDp died
<CRIT> 17-Apr-2016::20:45:35.487 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-20-vm-manager-esc:ServiceforVMstarting died
<CRIT> 17-Apr-2016::20:45:35.494 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-32-cloudvpn:CloudvpnActionCommands died
<CRIT> 17-Apr-2016::20:45:35.494 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-33-cloudvpn:Easydevicelisting died
<CRIT> 17-Apr-2016::20:45:35.495 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-34-cloudvpn:CVPNreleaseinfo died
<CRIT> 17-Apr-2016::20:45:35.495 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-35-cloudvpn:IPSManagerStackedRFS died
<CRIT> 17-Apr-2016::20:45:35.497 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-36-cloudvpn:IPSSensorStackedRFS died
<CRIT> 17-Apr-2016::20:45:35.497 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Daemon ncs-dp-37-cloudvpn:TemplateStackedRFS died
<CRIT> 17-Apr-2016::20:45:37.112 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Internal error: Supervision terminated
The key is here:
<ERR> 16-Apr-2016::22:44:07.994 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Internal error: PAM error - "epam died"
Seems like a bug.
Check whether the file ncserr.log.1 is non-empty, and if so run:
$ ncs --printlog ncserr.log.1
Then check whether there is a message at the same time as the issue.
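A minimal sketch of that check, assuming the logs live under the NSO run directory (the path is an assumption; adjust to your installation):

$ cd /path/to/ncs-run-dir/logs              # wherever your ncs.conf points the log dir
$ ls -l ncserr.log.1                        # confirm the error log is non-empty
$ ncs --printlog ncserr.log.1 > /tmp/ncserr.txt
$ grep -n '17-Apr-2016::20:45' /tmp/ncserr.txt   # look around the time of the failure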
If so, you would need to write a ticket with support with your customer ID.
I'm not so sure about that.
During CVPN development (and deployment at DT) we saw this message over and over again, although we could never link it to any specific badness. Actually, we had PAM disabled in CVPN, so it was all a major mystery.
I (together with Per Hedeland) spent quite some time on that one at the time, but we could never figure out what was happening.
It wasn't possible to reproduce.
At the time I concluded it was Red Hat PAM bugs, but since we weren't using PAM (DT wanted explicit NSO users stored in CDB) we ignored it and moved on.
Anyways, the issue is still unresolved as far as I know.
/klacke
+1. And it's not really fatal: it "should" imply that an attempt to authenticate via PAM failed, but "epam" will be started again for the next attempt, and if the failure doesn't reproduce, that should be sufficient.
I guess it *could* be some bad stuff given in username/password that we haven't been able to figure out.
Also, in the log excerpt given by the OP, this occurred some 22 *hours* before the real badness started:
>>> <ERR> 16-Apr-2016::22:44:07.994 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - Internal error: PAM error - "epam died"
>>>
>>> <INFO> 17-Apr-2016::20:45:35.449 szg-dr-bdc-ncsf-1-vms102 ncs[39198]: - The NCS Java VM terminated
Not likely to be related.
--Per
Hmm, PAM is currently being used, and I can see this message logged 1600 times a day. Will log a defect for it.
Here is the interesting part from when the JVM got restarted:
17-Apr-2016::20:45:33.959 <0.288.0> <0.291.0> ** Generic server pam_server terminating
** Last message in was {close,99641}
** When Server state == {state,99654,#Port<0.1087625>,"AS",[],"common-auth",
[]}
** Reason for termination ==
** {badarg,[{erlang,port_command,[#Port<0.1087625>,[99,"99641",0]],[]},
{pam_server,handle_call,3,[]},
{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,601}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
{"17-Apr-2016::20:45:33.961",
{error_report,<0.288.0>,
{<0.291.0>,crash_report,
[[{initial_call,{pam_server,init,['Argument__1']}},
{pid,<0.291.0>},
{registered_name,pam_server},
{error_info,
{exit,
{badarg,
[{erlang,port_command,
[#Port<0.1087625>,[99,"99641",0]],
[]},
{pam_server,handle_call,3,[]},
{gen_server,handle_msg,5,
[{file,"gen_server.erl"},{line,601}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]},
[{gen_server,terminate,6,[{file,"gen_server.erl"},{line,760}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]}},
{ancestors,[pam_sup,<0.289.0>]},
{messages,[{'EXIT',#Port<0.1087625>,normal}]},
{links,[<0.290.0>]},
{dictionary,[]},
{trap_exit,true},
{status,running},
{heap_size,1598},
{stack_size,27},
{reductions,12995273}],
[]]}}}
{"17-Apr-2016::20:45:33.963",
{error_report,<0.288.0>,
{<0.290.0>,supervisor_report,
[{supervisor,{local,pam_sup}},
{errorContext,child_terminated},
{reason,
{badarg,
[{erlang,port_command,[#Port<0.1087625>,[99,"99641",0]],[]},
{pam_server,handle_call,3,[]},
{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,601}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]}},
{offender,
[{pid,<0.291.0>},
{name,pam_server},
{mfargs,{pam_server,start_link,[]}},
{restart_type,permanent},
{shutdown,2000},
{child_type,worker}]}]}}}
{"17-Apr-2016::20:45:33.964",
{error_report,<0.288.0>,
{<0.290.0>,supervisor_report,
[{supervisor,{local,pam_sup}},
{errorContext,shutdown},
{reason,reached_max_restart_intensity},
{offender,[{pid,<0.291.0>},
{name,pam_server},
{mfargs,{pam_server,start_link,[]}},
{restart_type,permanent},
{shutdown,2000},
{child_type,worker}]}]}}}
17-Apr-2016::20:45:35.240 <0.28.0> <0.121.0> confd_ia:494: Couldn't map 4 ("CLI") to server: {error,
undefined}
[{confd_ia,handle_connection,6,[]},
{confd_ia,acceptor,5,[]},
{proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]
17-Apr-2016::20:45:36.270 <0.9.0> <0.164.0> ha_vip: PID shutdown died restarting
{"17-Apr-2016::20:45:36.276",
{error_report,<0.28.0>,
{<0.11571.157>,crash_report,
[[{initial_call,{cs_usess,'-handle_info/2-fun-0-',[]}},
{pid,<0.11571.157>},
{registered_name,[]},
{error_info,
{exit,
{noproc,
{gen_server,call,[cdb,{req,close_usess,1,[]},infinity]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,199}]},
{lists,foreach,2,[{file,"lists.erl"},{line,1323}]},
{cs_usess,cleanup_usess,3,[]},
{proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]}},
{ancestors,[cs_usess,confd_second_sup,confd_sup,<0.29.0>]},
{messages,[]},
{links,[]},
{dictionary,[]},
{trap_exit,false},
{status,running},
{heap_size,610},
{stack_size,27},
{reductions,184}],
[]]}}}
17-Apr-2016::20:45:36.283 <0.9.0> <0.754.0> restarting pnp server
{"17-Apr-2016::20:45:36.288",
{error_report,<0.28.0>,
{<0.10924.157>,crash_report,
[[{initial_call,{cs_usess,'-handle_info/2-fun-0-',[]}},
{pid,<0.10924.157>},
{registered_name,[]},
{error_info,
{exit,
{noproc,
{gen_server,call,[cdb,{req,close_usess,35,[]},infinity]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,199}]},
{lists,foreach,2,[{file,"lists.erl"},{line,1323}]},
{cs_usess,cleanup_usess,3,[]},
{proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]}},
{ancestors,[cs_usess,confd_second_sup,confd_sup,<0.29.0>]},
{messages,[]},
{links,[]},
{dictionary,[]},
{trap_exit,false},
{status,running},
{heap_size,376},
{stack_size,27},
{reductions,167}],
[]]}}}
17-Apr-2016::20:45:36.293 <0.9.0> <0.10376.157> econfd_internal:79: econfd_internal: Client Id 3 not registered by confd_ia: {error,
undefined}
[{econfd_internal,connect,4,[]},
{econfd_daemon,ctl_cnct,3,[]},
{econfd_daemon,do_cnct,3,[]},
{econfd_daemon,init,1,[]},
{gen_server,init_it,6,[{file,"gen_server.erl"},{line,315}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]
{"17-Apr-2016::20:45:36.295",
{error_report,<0.9.0>,
{<0.10376.157>,crash_report,
[[{initial_call,{econfd_daemon,init,['Argument__1']}},
{pid,<0.10376.157>},
{registered_name,[]},
{error_info,
{exit,econnrefused,
[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,331}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]}},
{ancestors,[pnp,<0.754.0>,<0.173.0>,<0.172.0>]},
{messages,[]},
{links,[<0.11939.157>]},
{dictionary,[{sock_int,1}]},
{trap_exit,false},
{status,running},
{heap_size,376},
{stack_size,27},
{reductions,147}],
[{neighbour,
[{pid,<0.11939.157>},
{registered_name,pnp},
{initial_call,{pnp,init,['Argument__1']}},
{current_function,{proc_lib,sync_wait,2}},
{ancestors,[<0.754.0>,<0.173.0>,<0.172.0>]},
{messages,[{ack,<0.10376.157>,{error,econnrefused}}]},
{links,[<0.10376.157>]},
{dictionary,[]},
{trap_exit,false},
{status,runnable},
{heap_size,233},
{stack_size,19},
{reductions,92}]}]]}}}
{"17-Apr-2016::20:45:36.296",
{error_report,<0.9.0>,
{<0.754.0>,crash_report,
[[{initial_call,{pnp,'-start/0-fun-0-',[]}},
{pid,<0.754.0>},
{registered_name,[]},
{error_info,
{error,
{badmatch,{error,econnrefused}},
[{pnp,monitor_pnp,1,[{file,"pnp.erl"},{line,333}]},
{proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]}},
{ancestors,[<0.173.0>,<0.172.0>]},
{messages,[]},
{links,[]},
{dictionary,[]},
{trap_exit,false},
{status,running},
{heap_size,376},
{stack_size,27},
{reductions,113}],
[]]}}}
{"17-Apr-2016::20:45:36.297",
{error_report,<0.28.0>,
{<0.11881.157>,crash_report,
[[{initial_call,{cs_usess,'-handle_info/2-fun-0-',[]}},
{pid,<0.11881.157>},
{registered_name,[]},
{error_info,
{exit,
{noproc,
{gen_server,call,[cdb,{req,close_usess,34,[]},infinity]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,199}]},
{lists,foreach,2,[{file,"lists.erl"},{line,1323}]},
{cs_usess,cleanup_usess,3,[]},
{proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]}},
{ancestors,[cs_usess,confd_second_sup,confd_sup,<0.29.0>]},
{messages,[]},
{links,[]},
{dictionary,[]},
{trap_exit,false},
{status,running},
{heap_size,376},
{stack_size,27},
{reductions,175}],
[]]}}}
{"17-Apr-2016::20:45:36.298",
{error_report,<0.28.0>,
{<0.12740.157>,crash_report,
[[{initial_call,{cs_usess,'-handle_info/2-fun-0-',[]}},
{pid,<0.12740.157>},
{registered_name,[]},
{error_info,
{exit,
{noproc,
{gen_server,call,[cdb,{req,close_usess,16,[]},infinity]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,199}]},
{lists,foreach,2,[{file,"lists.erl"},{line,1323}]},
{cs_usess,cleanup_usess,3,[]},
{proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,224}]}]}},
{ancestors,[cs_usess,confd_second_sup,confd_sup,<0.29.0>]},
{messages,[]},
{links,[]},
{dictionary,[]},
{trap_exit,false},
{status,running},
{heap_size,376},
{stack_size,27},
{reductions,167}],
[]]}}}
17-Apr-2016::20:45:38.113 <0.28.0> <0.32.0> init:boot_msg: "Internal error: Supervision terminated\n"
17-Apr-2016::20:46:18.586 <0.288.0> <0.291.0> epam port program died
17-Apr-2016::20:46:41.110 <0.288.0> <0.291.0> epam port program died
… SNIP
BR,
Bostjan
Ah, forgot to ask the original question again. Where can I see what NSO is currently processing in the Java VM, i.e. what is getting it stuck? Can I see the transaction backlog that formed during that time?
BR,
Bostjan
Please do - it would be great if we could nail down this problem.
--Per
Hi,
> Ah, forgot to ask the original question again. Where can I see what NSO is currently processing in the Java VM, i.e. what is getting it stuck? Can I see the transaction backlog that formed during that time?
Have you turned on Java logging?
That can be done via the CLI.
Now, the java-vm can be occupied with a lot of tasks depending on how many packages you have.
Also, it would be good to narrow the logging down to specific APIs, for instance:
To debug-log CLI or generic NED calls:
set java-vm java-logging logger com.tailf.ned level level-debug
For services (or other Dp callbacks):
set java-vm java-logging logger com.tailf.dp level level-debug
For CDB-subscribing application components:
set java-vm java-logging logger com.tailf.cdb level level-debug
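The extra output from these loggers ends up in the Java VM log. A minimal usage sketch, assuming the default stdout-capture file name from ncs.conf (verify against your configuration):

$ tail -f logs/ncs-java-vm.log              # assumed default java-vm log file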
You can expect a lot of logging output, and if something in Java deadlocks or waits, it can be more or less hard to pinpoint.
Another way is to connect a debugger to the java-vm (described in the development guide).
Then you can find individual threads and suspend some suspects to see what they are doing.
jvisualvm is a tool that you can connect to the VM to see individual threads and some info about what they are doing.
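For a quick look without attaching a full debugger, a thread dump can also help. A minimal sketch, assuming a JDK with jstack on the host and that the NCS Java VM process command line contains NcsJVMLauncher (both assumptions; adjust the pgrep pattern to your setup):

$ NCS_JAVA_PID=$(pgrep -f NcsJVMLauncher)   # assumed way to find the NCS Java VM pid
$ jstack -l "$NCS_JAVA_PID" > /tmp/ncs-jvm-threads.txt
$ grep -c 'java.lang.Thread.State: BLOCKED' /tmp/ncs-jvm-threads.txt   # count stuck threads

For the transaction-backlog side of the question, ncs --status dumps internal daemon state (sessions, locks, etc.), which can hint at what is queued up:

$ ncs --status > /tmp/ncs-status.txt        # inspect the dump for lock and session state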
/Try
Hi Bostjan, Per,
We have an SR for the same.
I think there are two issues here.
One is where we have "Internal error: PAM error - "epam died"" -> from reading the thread:
.. this seems to be a day-1 issue and has not been resolved.
.. epam is being used heavily (mainly for access by northbound devices, read: portals)
Q1: Do we have a bug on this one (internal or CDETS)?
The other one is a race condition where NCS restarted because epam died.
The support team identified it as #14395.
This is what I got from them:
"This epam client died "exactly" when the request, initiated by the webui, was sent to the epam but it failed since it just terminated and hadn't been restarted yet. ³
Q2: I guess the request could come from anywhere to trigger this restart (not only webui), right?
FYI, in the logs I frequently see "epam died" messages:
<ERR> 24-Apr-2016::02:04:14.933 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 24-Apr-2016::08:43:50.033 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 25-Apr-2016::23:49:42.848 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 26-Apr-2016::18:14:35.177 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 27-Apr-2016::07:49:37.606 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 28-Apr-2016::12:17:06.487 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 29-Apr-2016::06:27:21.107 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 2-May-2016::02:29:28.093 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 3-May-2016::00:17:48.362 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 3-May-2016::00:17:48.364 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
<ERR> 3-May-2016::08:21:05.173 szg-dr-bdc-ncsf-1-vms102 ncs[28083]: - Internal error: PAM error - "epam died"
Thanks,
Peter
Hi Peter,
I opened a defect for epam dying (CSCuz29444).
Can't comment on your second Q.
BR,
Bostjan
Hi Peter,
About your Q2 - the "request from webui" explanation was probably due to an incorrect initial analysis on my part - in fact, in the current code, the issue in our ticket #14395 is *never* triggered by the webui, but only by SSH sessions, i.e. NETCONF or CLI access via the NSO-internal SSH server - and specifically when such a session terminates.
Regarding Q1, the "epam died" in general: this can be triggered by any NB agent using password authentication when PAM is enabled. Since we have not been able to reproduce this, we need help from someone who encounters it, as a first step trying to get hold of a core dump from "epam" when it dies. I believe the support team has discussed with you how to go about this.
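For reference, a minimal sketch of one common way to allow core dumps on Linux; the core pattern and paths are assumptions, and the limit must be raised in the shell that starts NSO so the spawned epam inherits it:

$ ulimit -c unlimited                        # in the shell that will start NSO
$ echo '/var/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
$ ncs                                        # start NSO from this shell
# ... after the next "epam died":
$ ls -l /var/tmp/core.epam.*                 # core file named per the pattern above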
--Per