on 11-13-2024 12:05 PM
When interacting with devices, NSO implements a timeout mechanism to handle communication between different components and network elements.
The primary components involved when communicating with devices are:
Every communication path needs a timeout, so that any element taking part in the communication can be handled if it becomes unresponsive or is slower than acceptable.
As such, NSO implements timeouts between:
These timeouts are mainly based on three well-known settings, available as device configuration under /devices/device/ned-settings and /devices/global-settings/ned-settings: connect-timeout, read-timeout, and write-timeout.
These three settings are available both as global settings and per device, with a specific-first approach: a per-device setting takes precedence over a global setting.
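The specific-first lookup can be sketched as follows. This is only an illustration of the precedence rule, not NSO code: the function name, the dictionaries, and the fallback value are all hypothetical.

```python
# Hypothetical fallback, used in this sketch only when neither level sets a value.
FALLBACK_TIMEOUT_S = 20

TIMEOUT_NAMES = ("connect-timeout", "read-timeout", "write-timeout")


def resolve_timeout(name, device_settings, global_settings,
                    fallback=FALLBACK_TIMEOUT_S):
    """Return the effective value of one timeout, preferring the per-device value."""
    if device_settings.get(name) is not None:
        return device_settings[name]
    if global_settings.get(name) is not None:
        return global_settings[name]
    return fallback


# Example values in seconds (made up for illustration).
global_settings = {"connect-timeout": 20, "read-timeout": 20, "write-timeout": 20}
device_settings = {"read-timeout": 60}  # only read-timeout is overridden per device

effective = {
    name: resolve_timeout(name, device_settings, global_settings)
    for name in TIMEOUT_NAMES
}
```

Here the device ends up with a read-timeout of 60 and inherits the global value for the other two timeouts.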
The easiest way to understand which timeout values are used when interacting with a device is to look at a device trace file. Specifically, when the NED initially tries to connect to a device (in the CONNECT phase), a line like the following one is logged in the trace file:
>> 29-Aug-2024::10:33:12.784 CLI CONNECT to ios-127.0.0.1:22 as cisco (Trace=true)
...
-- connect-timeout 20000 read-timeout 20000 write-timeout 20000
In the example above, all three timeouts are set to 20 seconds (the values in the trace are expressed in milliseconds).
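If you need to pull these values out of a trace file programmatically, a small parser is enough. The sketch below assumes the line has exactly the layout shown in the example above; the function name is hypothetical, and other NED versions may log the line differently.

```python
import re

# Matches the timeout summary line as it appears in the example trace.
TIMEOUT_LINE = re.compile(
    r"connect-timeout\s+(?P<connect>\d+)\s+"
    r"read-timeout\s+(?P<read>\d+)\s+"
    r"write-timeout\s+(?P<write>\d+)"
)


def parse_timeouts(trace_text):
    """Return the first timeout triple found in the trace, in seconds, or None."""
    match = TIMEOUT_LINE.search(trace_text)
    if match is None:
        return None
    # The trace reports milliseconds; convert to seconds for readability.
    return {name: int(ms) / 1000 for name, ms in match.groupdict().items()}


sample = "-- connect-timeout 20000 read-timeout 20000 write-timeout 20000"
timeouts = parse_timeouts(sample)
```

For the sample line, `timeouts` holds 20.0 seconds for each of `connect`, `read`, and `write`.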
How these timeouts are used depends on the implementation, and which one is being counted at any given moment depends on the current phase of the device interaction.
There are three phases, which are closely related to the three timeouts:
It is important to understand that these timeouts are not absolute: each one applies to an individual atomic operation, not to the phase as a whole. When connecting to, reading from, or writing to a device, the whole operation can therefore take longer than the corresponding connect/read/write timeout, as long as each atomic operation completes within that timeout.
As an example, suppose the NED is connecting to the device and needs to set up an SSH connection, with the connect-timeout set to 20 seconds. Consider three scenarios:
Of these three scenarios, a and b were successful. Scenario c was unsuccessful even though it took the same absolute time as scenario b. This is why these timeouts should never be treated as "absolute" timeouts. Users are often surprised to find that device sessions do not time out when they expect them to, but this is how the timeouts work.
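The difference between a per-operation timeout and an absolute one can be sketched in a few lines. The step durations below are hypothetical values chosen for illustration; they are not the scenarios above, and the function is a simplified model rather than NED code.

```python
def phase_succeeds(step_durations_s, timeout_s):
    """A phase succeeds when every atomic step finishes within the timeout.

    The total duration of the phase is deliberately not compared to the timeout.
    """
    return all(step < timeout_s for step in step_durations_s)


CONNECT_TIMEOUT_S = 20

# Hypothetical step durations, in seconds.
two_medium_steps = [15, 15]  # 30 s in total, each step under 20 s
one_slow_step = [5, 25]      # 30 s in total, but one step exceeds 20 s

medium_ok = phase_succeeds(two_medium_steps, CONNECT_TIMEOUT_S)
slow_ok = phase_succeeds(one_slow_step, CONNECT_TIMEOUT_S)
```

Both sequences take 30 seconds overall, yet only the first succeeds, because the timeout is checked per step.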
Another important aspect of understanding how these timeouts work is "when they are reset". This is a bit more complex to explain, but to simplify:
The device's prompt is whatever pattern the NED is instructed to match as a prompt. Every device has its own specifics but, generally speaking, the prompt is what the device displays as a prompt in its own CLI implementation.
For example, a prompt might look like:
*A:A_DEVICE_HOSTNAME>
or
AN_IOS_DEVICE#
As mentioned, these specifics are built into the NED itself, which is programmed with the prompts expected to appear on the CLI of a given device family for a given CLI state (e.g. an exec prompt, a config mode prompt, etc.).
With that in mind, a simplified example shows when we start counting towards a timeout and when the read- and write-timeouts may be reset.
Let's suppose our NED wants to apply some config changes on a device as a result of a commit on NSO. Regardless of what the changes are from the NSO perspective, we know that the config changes we want to push towards the device are:
section_1
change_here
configuration_node ACME
exit
exit
This is an example of some arbitrary native-format configuration that our NED wants to push to the device.
Assuming the NED has connected to the device and is in config mode, the first thing it does is send the commands on the CLI session all at once, in so-called 'chunks' (per ned-settings).
At this point in time we start counting against the write-timeout.
After all the commands in the current chunk are sent, the NED starts reading data from stdout. It reads the characters appearing on the CLI line by line, that is, it reads characters until a newline character is encountered.
Every time a line is read, the NED will try to match the content of that line against a pattern equal to what was sent by the NED.
Every time such a match is made between that line and the expected pattern, the write-timeout is reset.
In the proposed example, the series of events will be as follows (assuming the NED has already connected and sent the commands necessary to enter config mode on the device):
This process basically ensures that the commands that we send are echoed back on the CLI.
Assuming all commands were sent correctly and accepted by the device, the following would have appeared on the CLI:
device-prompt(config)# section_1
device-prompt(config-section)# change_here
device-prompt(config-section)# configuration_node ACME
device-prompt(config-section)# exit
device-prompt(config)# exit
device-prompt#
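The echo matching and the write-timeout reset can be modelled with a small simulation. This is a simplified sketch, not the NED implementation: the function name, the `(delay, text)` input format, and the suffix-based matching are assumptions made for illustration.

```python
def chunk_is_echoed(commands, echoed_lines, write_timeout_s):
    """Simulate echo matching for one chunk of commands.

    echoed_lines is a list of (delay_s, text) pairs, where delay_s is the time
    elapsed since the previous line was read. Returns True if every command is
    echoed back before the write-timeout expires.
    """
    pending = list(commands)
    waited_s = 0.0  # time counted against the write-timeout since the last reset
    for delay_s, text in echoed_lines:
        waited_s += delay_s
        if waited_s > write_timeout_s:
            return False  # the write-timeout expired before the next match
        if pending and text.endswith(pending[0]):
            pending.pop(0)
            waited_s = 0.0  # a matched echo resets the write-timeout
    return not pending


commands = ["section_1", "change_here", "configuration_node ACME", "exit", "exit"]
cli_output = [
    "device-prompt(config)# section_1",
    "device-prompt(config-section)# change_here",
    "device-prompt(config-section)# configuration_node ACME",
    "device-prompt(config-section)# exit",
    "device-prompt(config)# exit",
    "device-prompt#",
]

# Hypothetical delays: every line arrives 15 s after the previous one.
steady = [(15, line) for line in cli_output]
# Same output, but the third echo takes 25 s to appear.
stalled = [(25 if i == 2 else 15, line) for i, line in enumerate(cli_output)]

steady_ok = chunk_is_echoed(commands, steady, write_timeout_s=20)
stalled_ok = chunk_is_echoed(commands, stalled, write_timeout_s=20)
```

In the steady case the exchange takes well over 20 seconds in total and still succeeds, because each matched echo resets the counter. The stalled case fails on the single 25-second gap.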
NSO also uses timeouts to manage its communication with the NED (between Erlang VM ↔ Java VM). This timeout does not have a fixed value, but it generally follows this formula:
2 * (max(connect-timeout, read-timeout) + 3s)
Based on this formula, and assuming the default timeout values of 20 seconds, the result is 2 * (20s + 3s) = 46s. If the NED doesn't respond to NSO within 46s, NSO considers that something is wrong with the NED and will, among other things, close the socket.
Keep in mind that the NED can extend this timeout, so the formula above only applies to the "first round", that is, until the NED requests a new timeout for the first time.
As soon as we see a reset of the NED worker timeout, we should assume that the formula used for this timeout becomes:
2 * NewTimeout
Here, NewTimeout is the value of the timeout sent by the NED. The NED will use read-timeout and write-timeout interchangeably, depending on the operation being performed.
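The two formulas can be combined into one small helper. This is only a sketch of the arithmetic described above; the function and parameter names are hypothetical and do not correspond to any NSO API.

```python
def ned_worker_timeout_s(connect_timeout_s, read_timeout_s, new_timeout_s=None):
    """Approximate how long NSO waits for the NED, in seconds.

    Until the NED requests a new timeout, the initial formula applies.
    Afterwards, the wait is twice the most recently requested value.
    """
    if new_timeout_s is not None:
        return 2 * new_timeout_s
    return 2 * (max(connect_timeout_s, read_timeout_s) + 3)


# First round, with connect-timeout and read-timeout both at 20 seconds.
initial = ned_worker_timeout_s(20, 20)

# After the NED requests a new timeout of 60 seconds (hypothetical value).
extended = ned_worker_timeout_s(20, 20, new_timeout_s=60)
```

With 20-second timeouts, `initial` is 46 seconds, matching the figure above. After a 60-second request, `extended` is 120 seconds.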
The request for a new timeout from the NED towards NSO can be easily identified in a device trace, with a log line such as:
<< 29-Aug-2024::10:33:12.989 user: admin/123456 thandle 123456 hostname nso device device_name trace-id=- SET_TIMEOUT
This timeout is at least twice the requested timeout to give the NED enough headroom to perform the necessary operations, while still keeping the system under control if the NED becomes unresponsive or anything else goes wrong.
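To find where the NED extended its timeout, you can scan a device trace for these requests. The sketch below assumes each request is logged on one line that starts with `<<` and ends with `SET_TIMEOUT`, as in the example above. The function name is hypothetical, and real traces may use a different layout.

```python
def find_set_timeout_events(trace_lines):
    """Return the timestamps of all SET_TIMEOUT lines found in the trace."""
    timestamps = []
    for line in trace_lines:
        fields = line.split()
        # Expected layout: "<<", timestamp, ..., "SET_TIMEOUT"
        if len(fields) >= 3 and fields[0] == "<<" and fields[-1] == "SET_TIMEOUT":
            timestamps.append(fields[1])
    return timestamps


trace_lines = [
    ">> 29-Aug-2024::10:33:12.784 CLI CONNECT to ios-127.0.0.1:22 as cisco (Trace=true)",
    "-- connect-timeout 20000 read-timeout 20000 write-timeout 20000",
    "<< 29-Aug-2024::10:33:12.989 user: admin/123456 thandle 123456 "
    "hostname nso device device_name trace-id=- SET_TIMEOUT",
]

events = find_set_timeout_events(trace_lines)
```

From the first timestamp in `events` onwards, the `2 * NewTimeout` formula applies instead of the initial one.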