Hi,
We have a onep application which usually works just fine. Now, we are having a problem connecting to the router. It seems like the router does not clean up the state sufficiently. As can be seen below, there has been an error for a connection and now there are many TCP connections (port 15002 for TLS) in the CLOSEWAIT state.
It seems like a reboot of the router is necessary to get back to a normal state? Is this a known problem?
R2#show onep session all
R2#show onep statistics
Active sessions: 0
Established sessions: 18
Total session disconnects: 18
Admin initiated disconnects: 0
Remote disconnects: 0
Error disconnects: 18
Total errors: 1
Authentication errors: 0
Duplicate application name error 1
Memory errors 0
Internal errors 0
Rate limiting:
Total TCP connects: 37
Rejected connects: 0
Accepted connects: 0
Unaffected connects: 37
Most recent failed connection attempts:
Connection #1 attempted Sun Sep 21 08:48:49 2014
Remote host: 20.5.2.242
Reason: Internal system error, API Channel failed to transition to Connecting state for session test.app-UCS-E-R2-9454
Reason code: 0
Connection sequence number: 37
R2#
R2#show tcp brief
TCB Local Address Foreign Address (state)
21DD9EC8 20.5.2.241.15002 20.5.2.242.45802 CLOSEWAIT
C195FFDC 20.5.2.241.23 20.5.2.242.58036 ESTAB
3DD524E8 20.5.2.241.15002 20.5.2.242.45803 CLOSEWAIT
21E3D0E4 20.5.2.241.15002 20.5.2.242.45805 CLOSEWAIT
41158A64 20.5.2.241.15002 20.5.2.242.45804 CLOSEWAIT
40CD3424 20.5.2.241.15002 20.5.2.242.45800 CLOSEWAIT
C01E14A8 20.5.2.241.15002 20.5.2.242.45806 CLOSEWAIT
R2#
R2#show onep status
Status: enabled by: Config
Version: 1.2.0
Transport: tls; Status: running; Port: 15002; localcert: TP-self-signed-3937507470; client cert validation disabled
Certificate Fingerprint SHA1: 90F9692E 942D0DD4 274D7632 EDAC0467 5AE43F70
Transport: tipc; Status: disabled
Session Max Limit: 10
CPU Interval: 0 seconds
CPU Falling Threshold: 0%
CPU Rising Threshold: 0%
History Buffer: Enabled
History Buffer Purge: Oldest
History Buffer Size: 32768 bytes
History Syslog: Disabled
History Archived Session: 16
History Max Archive: 16
Trace buffer debugging level is info
Service Set: Base State: Enabled Version 1.2.0
Service Set: Vty State: Disabled Version 0.1.0
Service Set: Mediatrace State: Disabled Version 1.0.0
R2#
R2#show version
Cisco IOS Software, C2900 Software (C2900-UNIVERSALK9-M), Version 15.4(2)T, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2014 by Cisco Systems, Inc.
Compiled Wed 26-Mar-14 14:14 by prod_rel_team
ROM: System Bootstrap, Version 15.0(1r)M16, RELEASE SOFTWARE (fc1)
R2 uptime is 2 weeks, 4 days, 18 hours, 14 minutes
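To keep an eye on whether the stale connections are accumulating, the `show tcp brief` output above can be tallied per state. The following is a minimal Python sketch, assuming the output has been captured as text in the format shown in this post; `count_tcp_states` and `SAMPLE` are names made up for illustration and are not part of any Cisco tooling.

```python
from collections import Counter


def count_tcp_states(show_tcp_brief: str, local_port: int) -> Counter:
    """Count TCP states for connections whose local port equals local_port."""
    states = Counter()
    for line in show_tcp_brief.splitlines():
        fields = line.split()
        # Data rows have exactly four fields: TCB, local, foreign, state.
        # The header row and prompt lines split differently and are skipped.
        if len(fields) != 4:
            continue
        _tcb, local, _foreign, state = fields
        # IOS prints addresses as a.b.c.d.port, so the port follows the last dot.
        port = local.rsplit(".", 1)[-1]
        if port == str(local_port):
            states[state] += 1
    return states


# Excerpt copied from the output in this post.
SAMPLE = """\
TCB Local Address Foreign Address (state)
21DD9EC8 20.5.2.241.15002 20.5.2.242.45802 CLOSEWAIT
C195FFDC 20.5.2.241.23 20.5.2.242.58036 ESTAB
3DD524E8 20.5.2.241.15002 20.5.2.242.45803 CLOSEWAIT
"""

print(count_tcp_states(SAMPLE, 15002))
```

Run against the full output, a CLOSEWAIT count on port 15002 that grows over time while `Active sessions` stays at 0 would match the behaviour reported here.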
You mention that this is only one application. Do you have other applications not exhibiting this problem? How do you close the onePK application session in your code?
Hi Joseph,
The router has not been rebooted yet, and in the current state no application can connect to the router, including tutorial applications. When an application is now started, the TCP connection is established (by TLS) as I can see both on the PC (netstat) and the router (show tcp), but nothing more happens. When the application is then terminated, the connection is cleaned up on the PC as it should, while on the router the TCP connection enters the CLOSEWAIT state, where it remains.
The onePK application session is terminated in different ways, including externally with Ctrl-C. It might be that some edge case has been triggered somehow. However, one would expect that no matter how the session is terminated, the router would clean up the resources and get back into normal operation?
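As background on the state described above: a TCP socket enters CLOSE_WAIT when the peer has closed its side, and it stays there until the local application closes its own socket. The following minimal Python sketch shows that mechanism on an ordinary host. It is not the router's code, and `serve_one` and `demo` are names made up for illustration.

```python
import socket


def serve_one(listener: socket.socket) -> bytes:
    """Accept one connection, read until the peer closes, then close our side."""
    conn, _addr = listener.accept()
    received = b""
    try:
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                # Empty read: the peer has sent FIN. From this point the
                # socket sits in CLOSE_WAIT until this process closes it.
                break
            received += chunk
    finally:
        # Closing sends our FIN, so the connection leaves CLOSE_WAIT.
        conn.close()
    return received


def demo() -> bytes:
    """Connect a client, let it go away, and let the server clean up."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # Port 0: let the OS pick a free port.
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    client.sendall(b"hello")
    client.close()  # The client disappears, as it would after Ctrl-C.
    try:
        return serve_one(listener)
    finally:
        listener.close()
```

If the `conn.close()` call were skipped on some path, the server-side socket would remain in CLOSE_WAIT indefinitely, which is the symptom seen in `show tcp brief`.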
Best regards
Viktor
Yes, I agree. The router should not be seeing stale TCP sessions. However, if you can determine what type of disconnect causes this, it will be easier to reproduce and file a bug.
Hi Joseph,
I think it would be reasonable for someone to have a look at the router code relevant to the error message posted in the first email, that is:
Most recent failed connection attempts:
Connection #1 attempted Sun Sep 21 08:48:49 2014
Remote host: 20.5.2.242
Reason: Internal system error, API Channel failed to transition to Connecting state for session test.app-UCS-E-R2-9454
Reason code: 0
Connection sequence number: 37
More specifically, to see what happens in the code when the API Channel tries to transition to Connecting state, but fails. In other words, whether the required cleanup is performed in this case or not.
The error message seems to indicate a failure in between TCP connection establishment and higher level (TLS / internal) connection establishment. That failure might be what subsequently causes the router to stop accepting connections from onep applications.
Best regards
Viktor
It does appear that the same cleanup may not be done in all cases when this error occurs. Can you confirm that the stale sessions always come about with this error?
I am not sure whether the stale TCP sessions always come about with this error or not. Based on the time of the error, it seems plausible that they are related. We will follow up if and when something similar shows up again. Please let us know if this is somehow identified in the code and fixed in an upcoming version.
Best regards
Viktor
Please also collect 'show onep trace/error' and/or 'debug onep server session level debug' log for more clues.
It seems the client socket is closed abruptly, leaving sockets in CLOSEWAIT on the server. The number of outstanding CLOSEWAIT sockets seems small, though, so it is a little surprising that a limit has been exhausted and the system can no longer spawn ephemeral ports. We need to check whether shutdown is used instead of close when session I/O fails. Please raise a bug to track this.
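To illustrate the shutdown/close point: one common pattern is to put both calls in a cleanup path that runs however the session ends. This is a minimal Python sketch under that assumption, not the onePK server implementation; `run_session` and its `session` callback are hypothetical names.

```python
import socket
from typing import Callable


def run_session(conn: socket.socket,
                session: Callable[[socket.socket], None]) -> None:
    """Run session on conn and always release the socket afterwards."""
    try:
        session(conn)
    finally:
        try:
            # shutdown() signals end of stream to the peer (sends FIN).
            conn.shutdown(socket.SHUT_RDWR)
        except OSError:
            # The peer may already have reset or dropped the connection.
            pass
        # close() releases the descriptor. Without it the socket can
        # linger, for example in CLOSE_WAIT if the peer has closed.
        conn.close()
```

Because the cleanup sits in `finally`, a failure partway through session setup still releases the socket, and the exception propagates to the caller for logging.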
I filed CSCur07539 to track this pending the additional information from Viktor.
OK, fine. Thanks Joseph.
Best regards
Viktor
Viktor, can you post the show and debug output Atul requested when you see this problem? Thanks.
My hunch is that CLOSEWAIT might not be the real issue here. From the error message, it seems to be complaining about a oneP state machine level error.
I would like to understand in more detail how this problem state was reached, and whether the scenario is repeatable.
What is intriguing is why no other app would be able to connect.
Debug logs will surely help to get more clues, please.
Btw, you might want to try "conf t; no onep" instead of rebooting the device to get out of the problematic state.
Yes, we will try to collect more information. We have not seen this error since we reported it. The router was rebooted, which as expected resolved the problem. Next time we will try "conf t; no onep" as suggested.
To speculate, it might be possible to provoke the failure by terminating the app/session after TCP connection establishment, but before TLS reaches the connected state, but we have not tried that so far.
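The scenario speculated about above could be tried with something like the following minimal Python sketch. It covers only the simplest variant, where the client completes the TCP handshake and then closes without sending any TLS bytes. `connect_and_abort` and `hold_seconds` are names made up for illustration, and whether this actually triggers the router error is untested.

```python
import socket
import time


def connect_and_abort(host: str, port: int, hold_seconds: float = 0.0) -> None:
    """Open a TCP connection and close it without starting a TLS handshake."""
    sock = socket.create_connection((host, port), timeout=5.0)
    try:
        # Optionally keep the bare TCP connection open for a while first.
        time.sleep(hold_seconds)
    finally:
        sock.close()
```

A call such as `connect_and_abort("20.5.2.241", 15002)`, using the router address and TLS port from this thread, followed by `show tcp brief` and `show onep statistics` on the router, would show whether the connection is cleaned up.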
Best regards
Viktor
Thanks Viktor. Could you please tell me which image version and SDK version you have?
Regards,
Atul.
The SDK for development was sdk-c64-1.2.1.194. The router image was listed in the initial post for this thread:
R2#show version
Cisco IOS Software, C2900 Software (C2900-UNIVERSALK9-M), Version 15.4(2)T, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2014 by Cisco Systems, Inc.
Compiled Wed 26-Mar-14 14:14 by prod_rel_team
ROM: System Bootstrap, Version 15.0(1r)M16, RELEASE SOFTWARE (fc1)
R2 uptime is 2 weeks, 4 days, 18 hours, 14 minutes
Best regards
Viktor