ONEP TLS recv error and session exit

Hi,

We are seeing TLS disconnects: the ONEP TLS connection between the onep application running on a UCS and the router fails after some time, varying between minutes and hours. The output of the commands 'show onep error' and 'show onep statistics' on the router is included below. We see the problem with 2911 routers and code built with the 64-bit C SDK (see below). Is this a known issue? We did not see any reference to the problem in the release notes, and we do not have access to a bug database where this issue might have been reported.

[08/25/14 10:17:21.256 70D3] [tid: 30] [session id: 5900][ONEP][Session]: ONEP TLS recv error <-6992> for fd: 1, pid: 30 [onep_al_tls_recv:118]

[08/25/14 10:17:21.256 70D4] [tid: 30] [session id: 5900][ONEP][Session]: ONEP TLS recv error <-6992> for fd: 0, pid: 30 [onep_al_tls_recv:118]

[08/25/14 10:17:21.268 70D5] [tid: 30] [session id: 5900][ONEP][Session]: test.app-UCS-E-R2-5900: Element session_exit failed [onep_session_manager_session_exit:420]

[08/25/14 10:17:21.292 70D6] [tid: 30] [session id: 5900][ONEP][Session]: test.app-UCS-E-R2-5900: session_update fail [5900] [onep_session_manager_session_update:368]

R2#show onep statistics

Active sessions: 1

Established sessions: 86

Total session disconnects: 86

  Admin initiated disconnects: 0

  Remote disconnects: 2

  Error disconnects: 84

Total errors: 0

  Authentication errors: 0

  Duplicate application name error 0

  Memory errors 0

  Internal errors 0

Rate limiting:

  Total TCP connects: 172

  Rejected connects: 0

  Accepted connects: 0

  Unaffected connects: 172

onePK sdks, both:

onePK-sdk-c-rel-1.2.0.173.20140326-lnx-x86_64

onePK-sdk-c-rel-1.2.1.194.20140416-lnx-x86_64

R2#show version

Cisco IOS Software, C2900 Software (C2900-UNIVERSALK9-M), Version 15.4(2)T, RELEASE SOFTWARE (fc1)

Technical Support: http://www.cisco.com/techsupport

R2#show onep status

Status: enabled by: Config

Version: 1.2.0

Transport: tls; Status: running; Port: 15002; localcert: TP-self-signed-3937507470; client cert validation disabled

Certificate Fingerprint SHA1: 90F9692E 942D0DD4 274D7632 EDAC0467 5AE43F70

Transport: tipc; Status: disabled

Session Max Limit: 10

CPU Interval: 0 seconds

CPU Falling Threshold: 0%

CPU Rising Threshold: 0%

History Buffer: Enabled

History Buffer Purge: Oldest

History Buffer Size: 32768 bytes

History Syslog: Disabled

History Archived Session: 16

History Max Archive: 16

Trace buffer debugging level is info

Service Set: Base               State: Enabled     Version 1.2.0

Service Set: Vty                State: Disabled    Version 0.1.0

Service Set: Mediatrace         State: Disabled    Version 1.0.0

Best regards

Viktor

Re: ONEP TLS recv error and session exit

Hi Viktor,

I didn't find any other reports of this error internally.  Could you possibly post your code so we can try to replicate?  Also, as a workaround, you can use the onep_element_reconnect() function to get back to the same session after an abnormal disconnect.


Thanks,

Dave

Re: ONEP TLS recv error and session exit

Hi David,

From what I can see, the onep_element_connect function starts a new thread within the onep application library. The TLS disconnect seems to be initiated from the onep application library side, that is, not from the router side. From what I understand, the keep alive mechanism might tear down the connection, for example, if the router does not respond. However, I have tried to configure the settings via the onep_session_config_set_keepalive so that this should not happen. That does not seem to have much effect - I still experience TLS disconnects.

Under what circumstances will the thread handling the onep_element_connect exit? Is there any way to determine the cause of the disconnect / thread exit?

We did try to register a connection event listener, via the onep_element_set_connection_listener function, to trigger a reconnect to the network element (router) by using the onep_element_reconnect function (as suggested). When the TLS disconnect occurs, the event listener is invoked and the application re-connects to the network element.

However, that does not explain the cause of the abnormal TLS disconnects.

Note that we experience this problem for an application which uses the data path service set, but we might try to reproduce the failure for a simpler test case.

Best regards

Viktor

Re: ONEP TLS recv error and session exit

Hi Viktor,

Glad to hear your workaround works, and agree that it’s just a patch until the root cause is discovered.  I think it would be valuable for you to try to replicate with a simpler test case and/or different hardware, but if you do want us to investigate further on our end then we’ll need to see your code.  Also, the following dumps would be helpful:

#show version

#show running-config brief

Thanks,

Dave

Re: ONEP TLS recv error and session exit

Hi again David,

The TLS connection is terminated on the onep application side with the following error and log messages:

Wed Sep  3 13:26:15 2014:[ONEP][network_element][ERROR]: test.app-UCS-E-R1-2394: [EventReceiver_thread]: 1337[2104850176]: Error -1 in select call 4.

Wed Sep  3 13:26:15 2014:[ONEP][network_element][DEBUG]: test.app-UCS-E-R1-2394: [network_element_disconnect_by_receiver]: 1315[2104850176]: Disconnect by receiver

Wed Sep  3 13:26:15 2014:[ONEP][network_element][DEBUG]: test.app-UCS-E-R1-2394: [ne_fsm_disconnected]: 1415[2104850176]: FSM: [Connected] ==> [Disconnected]

Wed Sep  3 13:26:15 2014:[ONEP][network_element][DEBUG]: test.app-UCS-E-R1-2394: [ne_fsm_disconnected]: 1451[2104850176]: Disconnected with handle [2394] at Wed Sep  3 13:26:15 2014

An strace of the process shows the following for the select call:

[pid  3076] 13:26:15.315737 select(9, [8], NULL, NULL, {1, 0}) = ? ERESTARTNOHAND (To be restarted)

File descriptor 9 seems to be for select / epoll used by the EventReceiver_thread.

It seems like profiling is triggering the ERESTARTNOHAND behavior. Is this something that the EventReceiver_thread should handle by restarting (implicitly) the select?

Best regards

Viktor

Re: ONEP TLS recv error and session exit

Hi Viktor,

From the one-line trace snippet, I don't see anything wrong with the select call itself. The 9 is not a file descriptor but the nfds argument: the highest-numbered fd in the following read/write/except sets, plus 1. So it's waiting for a single descriptor (8) to become readable, and that wait is interrupted by a signal. ERESTARTNOHAND is a kernel-internal code that strace can see; the kernel should convert it to EINTR, which is what normally gets returned to user land. I don't have access to the source code (yet), but judging from the error line in the log I suspect that is what happens: the -1 is select()'s return value and the 4 is errno, which is EINTR on Linux.

And by saying it seems like the profiling is triggering the ERESTARTNOHAND behavior, are you saying the behavior is different when you don’t run it via strace?  If not, I would think this is a bug.
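
To illustrate the EINTR point: on Linux, select() is not restarted automatically after a signal handler runs, even when the handler was installed with SA_RESTART, so the caller has to retry it. The following is a minimal sketch of such a retry loop. It is not the onePK library's EventReceiver_thread code, whose source is not shown in this thread. The names wait_readable, demo_wait_under_timer, on_tick and ticks are made up for this example. The demo uses SIGALRM with a wall-clock timer rather than SIGPROF, because ITIMER_PROF only counts CPU time and a single thread blocked in select() consumes none.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Counts timer signals so the demo can confirm select() was interrupted. */
static volatile sig_atomic_t ticks = 0;
static void on_tick(int signo) { (void)signo; ticks++; }

/* Monotonic clock in milliseconds. */
static long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long)ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/* Wait until fd is readable or timeout_ms has elapsed.
 * Returns 1 if readable, 0 on timeout, -1 on a real error (errno set).
 * EINTR is treated as "retry with the remaining time", not as a failure. */
int wait_readable(int fd, long timeout_ms)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    long deadline = now_ms() + timeout_ms;
    for (;;) {
        long left = deadline - now_ms();
        if (left < 0)
            left = 0;

        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        struct timeval tv;
        tv.tv_sec = left / 1000;
        tv.tv_usec = (left % 1000) * 1000;

        /* First argument is nfds: highest fd in any set, plus 1. */
        int rc = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (rc > 0)
            return 1;
        if (rc == 0)
            return 0;
        if (errno != EINTR)
            return -1;
        /* Interrupted by a signal: loop and call select() again. */
    }
}

/* Demo: a 10 ms interval timer keeps interrupting a 200 ms wait on an
 * empty pipe. Returns whatever wait_readable() returned. */
int demo_wait_under_timer(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_tick;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval every_10ms = { {0, 10000}, {0, 10000} };
    struct itimerval off = { {0, 0}, {0, 0} };
    setitimer(ITIMER_REAL, &every_10ms, NULL);
    int rc = wait_readable(p[0], 200);
    setitimer(ITIMER_REAL, &off, NULL);

    close(p[0]);
    close(p[1]);
    return rc;
}
```

Calling demo_wait_under_timer() from main should report a plain timeout even though the timer interrupts the wait repeatedly. A receiver loop that instead treats every -1 from select() as fatal would disconnect on the first signal, which matches the "Error -1 in select call 4" log line followed by "Disconnect by receiver".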

Anyway, what would be interesting to see now is:

  • Any strace lines that follow the select. These should show the signal handler that was called during the interrupt.
  • The model/series of UCS you’re running on.  And does it have the latest updates?

Also, did you ever get around to reproducing with a simpler test case, and/or different hardware?

Thanks,

Dave

Re: ONEP TLS recv error and session exit

Hi David,

Sorry, I meant to write file descriptor 8. Yes, as far as I know, ERESTARTNOHAND should not be returned to user land either.

I did attach strace to the running process, but that did not seem to change anything. That is, the "Error -1 in select call 4" happens both with and without strace running, when profiling is enabled.

The strace line that follows the select is from profiling:

[pid  3076] 13:26:15.315737 select(9, [8], NULL, NULL, {1, 0}) = ? ERESTARTNOHAND (To be restarted)

[pid  3076] 13:26:15.316180 --- SIGPROF (Profiling timer expired) @ 0 (0) ---

I have now been running the system with profiling disabled overnight, since I left work yesterday. Without profiling, I have not seen the error so far.
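
Since the disconnects only appear with profiling enabled, one possible stopgap is to keep SIGPROF away from the thread that sits in select(). This is a sketch under assumptions, not a confirmed fix. A new thread inherits the signal mask of the thread that creates it, so blocking SIGPROF around the call that spawns the thread is enough. In the sketch below, pthread_create stands in for that call; in the real application it would presumably be onep_element_connect, which is reported earlier in this thread to start the library's thread, but that has not been tested here. The names spawn_without_sigprof, report_mask and demo_worker_blocks_sigprof are made up for this example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <signal.h>

/* Create a thread that starts with SIGPROF blocked, then restore the
 * caller's own signal mask. Returns 0 on success or an error number. */
int spawn_without_sigprof(pthread_t *tid, void *(*fn)(void *), void *arg)
{
    sigset_t block, saved;
    sigemptyset(&block);
    sigaddset(&block, SIGPROF);

    int rc = pthread_sigmask(SIG_BLOCK, &block, &saved);
    if (rc != 0)
        return rc;

    /* Stand-in for the call that creates the receiver thread. */
    rc = pthread_create(tid, NULL, fn, arg);

    /* Restore the caller's mask so it can still receive SIGPROF. */
    pthread_sigmask(SIG_SETMASK, &saved, NULL);
    return rc;
}

/* Worker: store 1 in *out if SIGPROF is blocked in this thread, else 0. */
static void *report_mask(void *out)
{
    sigset_t cur;
    pthread_sigmask(SIG_BLOCK, NULL, &cur); /* NULL set: query only */
    *(int *)out = sigismember(&cur, SIGPROF);
    return NULL;
}

/* Demo: returns 1 if the spawned thread had SIGPROF blocked. */
int demo_worker_blocks_sigprof(void)
{
    pthread_t tid;
    int blocked = -1;
    if (spawn_without_sigprof(&tid, report_mask, &blocked) != 0)
        return -1;
    pthread_join(tid, NULL);
    return blocked;
}
```

Build with -pthread. The trade-off is that the profiling signal is then handled by other threads only, so the blocked thread's CPU time will likely be missing from the profile. This only hides the symptom; a library thread that retries select() on EINTR would not need it.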

So far, we have not tried with different hardware or reproduced with a simpler test case.

The system setup is as follows:

We are using onepk sdk-c64-1.2.1.194

2911 router:

R1#show version

Cisco IOS Software, C2900 Software (C2900-UNIVERSALK9-M), Version 15.4(2)T, RELEASE SOFTWARE (fc1)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 1986-2014 by Cisco Systems, Inc.

Compiled Wed 26-Mar-14 14:14 by prod_rel_team

ROM: System Bootstrap, Version 15.0(1r)M16, RELEASE SOFTWARE (fc1)

UCS-E installed in 2911 router, running Ubuntu 12.04:

From lshw:

ucs-e-r1

    description: Expansion Chassis

    product: UCS-E140S-M1/K9 (UCS-E140S-M1/K9)

    vendor: Cisco Systems, Inc.

    version: M1

    serial: FOC17494WVD

    width: 64 bits

    capabilities: smbios-2.7 dmi-2.7 vsyscall32

    configuration: administrator_password=disabled boot=normal chassis=expansion family=UCS E-Series frontpanel_password=enabled keyboard_password=disabled power-on_password=disabled sku=UCS-E140S-M1/K9 uuid=E02F6DE0-E980-0000-3B2E-286873DDCF62

  *-core

       description: Motherboard

       product: UCS-E140S-M1/K9

       vendor: Cisco Systems, Inc.

       physical id: 0

       version: M1

       serial: FOC17494WVD

       slot: Unknown

     *-firmware

          description: BIOS

          vendor: Cisco Systems, Inc.

          physical id: 0

          version: UCSES.1.5.0.2.051520131758

          date: 05/15/2013

          size: 64KiB

          capacity: 4032KiB

          capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification

From uname -a:

Linux UCS-E-R1 3.5.0-54-generic #81~precise1-Ubuntu SMP Tue Jul 15 04:02:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Note: We have tried to use the newest kernel for Ubuntu 12.04 for the UCS, but the system crashes / freezes, usually within an hour with the 3.13 kernel, including the latest one:

Linux UCS-E-R1 3.13.0-35-generic #62~precise1-Ubuntu SMP Mon Aug 18 14:52:04 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Best regards

Viktor
