NSO device timeouts demystified

mafilipp · 11-13-2024

Some of the concepts in this article apply to most NSO device types, but the article is specifically tailored to CLI devices.

When interacting with devices, NSO uses timeouts to handle communication between its components and the network elements.
The main components involved in device communication are:

NEDs
Devices
NedMux

Every communication path needs a timeout, so that the system can react when an element on that path becomes unresponsive or slower than acceptable.
As such, NSO implements timeouts between:

DEVICE <-> NED
NED <-> NedMux

These timeouts are mainly based on three well-known settings available as device configs under /devices/device/ned-settings and /devices/global-settings/ned-settings:

connect-timeout
read-timeout
write-timeout

These three settings are available both globally and per device, with a specific-first approach: a per-device setting takes precedence over a global setting.
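
The specific-first rule can be sketched in a few lines of Python. This is only an illustration of the precedence logic, not NSO code: the function name, the dictionaries, and the example values (in seconds) are assumptions made for this sketch.

```python
# Illustrative only: models the "per-device wins over global" rule.
TIMEOUT_NAMES = ("connect-timeout", "read-timeout", "write-timeout")

def effective_timeouts(global_settings, device_settings):
    """Return the value that wins for each timeout, per-device first."""
    resolved = {}
    for name in TIMEOUT_NAMES:
        if name in device_settings:
            # A per-device value, when present, is preferred.
            resolved[name] = device_settings[name]
        else:
            # Otherwise fall back to the global value (None if unset).
            resolved[name] = global_settings.get(name)
    return resolved

# Hypothetical values: only read-timeout is overridden on the device.
global_settings = {"connect-timeout": 20, "read-timeout": 20, "write-timeout": 20}
device_settings = {"read-timeout": 60}
print(effective_timeouts(global_settings, device_settings))
```

In this example only read-timeout is taken from the device; the other two fall back to the global values.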

The easiest way to understand which timeout values are used when interacting with a device is to look at a device trace file. Specifically, when the NED initially tries to connect to a device (in the CONNECT phase), a line like the following one is logged in the trace file:

>> 29-Aug-2024::10:33:12.784 CLI CONNECT to ios-127.0.0.1:22 as cisco (Trace=true)
...
-- connect-timeout 20000 read-timeout 20000 write-timeout 20000

In the example above, all three timeouts are set to 20 seconds (the trace expresses the values in milliseconds).
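
When checking many trace files, a small script can pull these values out. The sketch below assumes the timeout line has exactly the layout shown in the sample above; the regular expression and function name are made up for this illustration, and real traces may differ between NSO and NED versions.

```python
import re

# Matches "<kind>-timeout <digits>" pairs as they appear in the sample line.
TIMEOUT_RE = re.compile(r"(connect|read|write)-timeout (\d+)")

def parse_timeouts_ms(trace_line):
    """Extract the timeout values (in milliseconds) from one trace line."""
    return {
        f"{kind}-timeout": int(value)
        for kind, value in TIMEOUT_RE.findall(trace_line)
    }

line = "-- connect-timeout 20000 read-timeout 20000 write-timeout 20000"
timeouts_ms = parse_timeouts_ms(line)
# Convert to seconds for readability.
print({name: ms / 1000 for name, ms in timeouts_ms.items()})
```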

How these timeouts are used depends on the implementation, and which one is actively counted against depends on which phase of a device interaction we are in.

There are three phases, which are closely related to the three timeouts:

Connect - we are actively trying to set up a connection to the device
Read - the NED is trying to read something from the device
Write - the NED is trying to write something to the device

Importantly, these are not absolute timeouts: each one applies to a single atomic operation, not to the whole interaction. A connect, read, or write sequence as a whole can therefore take longer than the corresponding timeout, as long as each atomic operation within it completes within that timeout.

As an example, suppose the NED is connecting to a device and needs to set up an SSH connection, with connect-timeout set to 20 seconds. Consider three scenarios:

a. Connecting to the device takes 10 seconds in total: 9 seconds to exchange keys, 1 second for everything else
b. Connecting to the device takes 30 seconds in total: 19 seconds to exchange keys, 11 seconds for everything else
c. Connecting to the device takes 30 seconds in total: 25 seconds to exchange keys, 5 seconds for everything else

Scenarios a and b both succeed, because no single step takes longer than 20 seconds. Scenario c fails even though it takes the same total time as scenario b, because the key exchange alone (25 seconds) exceeds the connect-timeout. This is why these timeouts should never be treated as "absolute" timeouts. Users are often surprised that device sessions do not time out when they expect them to, but this is how the timeouts are designed to work.
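
The three scenarios can be checked with a short Python sketch. It is a simplified model, not NED code: it assumes that each listed duration is one atomic step, which means treating "everything else" as a single step, and the function name is made up for this illustration.

```python
def connect_succeeds(step_durations_s, connect_timeout_s):
    """Each step is checked on its own; the total time is never compared."""
    return all(duration <= connect_timeout_s for duration in step_durations_s)

# [key exchange, everything else], in seconds, as in the three scenarios.
scenarios = {"a": [9, 1], "b": [19, 11], "c": [25, 5]}

for name, steps in scenarios.items():
    print(name, "total:", sum(steps), "success:", connect_succeeds(steps, 20))
```

Scenarios b and c have the same total of 30 seconds, but only c contains a single step longer than the 20-second timeout.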

Another important aspect of understanding how these timeouts work is "when they are reset". This is a bit more complex to explain, but to simplify:

The read-timeout is reset when the NED "reads" the device's prompt
The write-timeout is reset when the NED "reads" an echo on the CLI

The device's prompt is whatever pattern the NED has been instructed to match as a prompt. Every device family has its own specifics, but generally speaking it is the prompt that the device displays in its own CLI implementation.

For example, a prompt might look like:

*A:A_DEVICE_HOSTNAME>

or

AN_IOS_DEVICE#

As mentioned, this knowledge is built into the NED itself, which defines the prompts expected to appear on the CLI of a given device family for a given CLI state (e.g. an exec prompt, a config mode prompt, etc.).
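
Prompt matching of this kind can be pictured with regular expressions. The patterns below are hypothetical IOS-style patterns written for this sketch; they are not taken from any real NED, which will have its own, more precise patterns.

```python
import re

# Hypothetical patterns; the config pattern is tried first because an
# IOS-style config prompt also ends in "#".
PROMPT_PATTERNS = [
    ("config", re.compile(r"^\S+\(config[^)]*\)#\s*$")),
    ("exec", re.compile(r"^\S+#\s*$")),
]

def classify_prompt(line):
    """Return the CLI state a line represents, or None if it is not a prompt."""
    for state, pattern in PROMPT_PATTERNS:
        if pattern.match(line):
            return state
    return None

for sample in ("AN_IOS_DEVICE#", "device-prompt(config-section)#", "*A:A_DEVICE_HOSTNAME>"):
    print(sample, "->", classify_prompt(sample))
```

The last sample is not recognized by these IOS-style patterns, which illustrates why the prompt definitions have to be specific to each device family.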

Having said that, we can easily make a simplified example about when we would start counting towards a timeout, and when we might reset the read- and write-timeouts.

Let's suppose our NED wants to apply some config changes on a device as a result of a commit on NSO. Regardless of what the changes are from the NSO perspective, we know that the config changes we want to push towards the device are:

section_1
 change_here
  configuration_node ACME
 exit
exit

This is an example of some arbitrary native-format configuration that our NED wants to push to the device.

Assuming the NED is connected to the device and in config mode, the first thing it does is send the commands on the CLI session all at once, in so-called 'chunks' (per ned-settings).
At this point, the NED starts counting against the write-timeout.
After all the commands in the current chunk are sent, the NED starts reading the session output. It reads the characters that "appear" on the CLI line by line (that is, it reads characters until a newline character is encountered).
Every time a line is read, the NED will try to match the content of that line against a pattern equal to what was sent by the NED.
Every time such a match is made between that line and the expected pattern, the write-timeout is reset.

In the proposed example, the series of events will be as follows (assuming the NED has already connected and sent the commands necessary to enter config mode on the device):

The NED will send the first chunk of data (in our case it's all of the configuration, but the concept remains the same with multiple chunks)
Expect the device config prompt to appear
Expect the line "section_1" to appear
Expect the device config prompt to appear
Expect the line "change_here" to appear
Expect the device config prompt to appear
Expect the line "configuration_node ACME" to appear
Expect the device config prompt to appear
Expect the line "exit" to appear
Expect the device config prompt to appear
Expect the line "exit" to appear
Expect the device exec prompt to appear

This process basically ensures that the commands that we send are echoed back on the CLI.
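
The echo-matching loop and the write-timeout reset can be modelled as follows. This is a simplified sketch, not the NED implementation: arrival times are simulated instead of read from a real session, a line counts as an echo when it ends with the next pending command, and all names are made up for the illustration.

```python
def echoes_within_timeout(sent_commands, echoed_lines, write_timeout_s):
    """echoed_lines is a list of (arrival_time_s, text) read from the CLI.

    Returns True if every sent command is echoed back before the
    write-timeout expires, resetting the timer on each matched echo.
    """
    timer_start = 0.0  # counting starts when the chunk is sent
    pending = list(sent_commands)
    for arrival, text in echoed_lines:
        if arrival - timer_start > write_timeout_s:
            return False  # no matching echo arrived in time
        if pending and text.strip().endswith(pending[0].strip()):
            pending.pop(0)
            timer_start = arrival  # a matched echo resets the write-timeout
    return not pending

commands = ["section_1", " change_here", "  configuration_node ACME", " exit", "exit"]
cli_lines = [
    "device-prompt(config)# section_1",
    "device-prompt(config-section)# change_here",
    "device-prompt(config-section)# configuration_node ACME",
    "device-prompt(config-section)# exit",
    "device-prompt(config)# exit",
]

# Slow device: 75 seconds in total, but never more than 15 seconds between echoes.
slow = list(zip([15, 30, 45, 60, 75], cli_lines))
# Stalled device: 28 seconds pass between the second and third echo.
stalled = list(zip([1, 2, 30, 31, 32], cli_lines))

print(echoes_within_timeout(commands, slow, 20))     # True
print(echoes_within_timeout(commands, stalled, 20))  # False
```

The slow run takes far longer than the 20-second write-timeout overall and still succeeds, while the stalled run finishes sooner overall but fails on a single gap.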

Assuming all commands were sent correctly and accepted by the device, the following would have appeared on the CLI:

device-prompt(config)# section_1
device-prompt(config-section)# change_here
device-prompt(config-section)# configuration_node ACME
device-prompt(config-section)# exit
device-prompt(config)# exit
device-prompt#

NSO also uses a timeout to manage its communication with the NED (between Erlang VM ↔ Java VM). This timeout is not a fixed value, but it can generally be assumed to follow this formula:

2 * (max(connect-timeout, read-timeout) + 3s)

Based on this formula, and assuming the default timeout values of 20 seconds, if the NED doesn't respond to NSO within 2 * (20s + 3s) = 46s, NSO considers that something has gone wrong with the NED and will, among other things, close the socket.
Keep in mind that the NED can extend this timeout, so the formula above only holds for the "first round", that is, until the NED requests a new timeout for the first time.
As soon as we see a reset of the NED worker timeout, we should assume that the formula used for this timeout becomes:

2 * NewTimeout

Where NewTimeout is the value of the timeout sent by the NED. The NED will use read-timeout and write-timeout interchangeably, depending on the operation that is being performed.
The request for a new timeout from the NED towards NSO can be easily identified in a device trace, with a log line such as:

<< 29-Aug-2024::10:33:12.989 user: admin/123456 thandle 123456 hostname nso device device_name trace-id=- SET_TIMEOUT
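
To find these requests in a long trace, a simple filter is usually enough. The sketch below assumes that such entries start with "<<" and end with SET_TIMEOUT, as in the sample line above; this is an assumption based on that single sample, and real traces may carry additional fields.

```python
def find_set_timeout_requests(trace_lines):
    """Return (line_number, line) for every entry that looks like SET_TIMEOUT."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(trace_lines, start=1)
        if line.startswith("<<") and line.rstrip().endswith("SET_TIMEOUT")
    ]

trace = [
    ">> 29-Aug-2024::10:33:12.784 CLI CONNECT to ios-127.0.0.1:22 as cisco (Trace=true)",
    "<< 29-Aug-2024::10:33:12.989 user: admin/123456 thandle 123456 hostname nso "
    "device device_name trace-id=- SET_TIMEOUT",
]

for number, line in find_set_timeout_requests(trace):
    print(number, line)
```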

This timeout is at least twice the requested timeout in order to give the NED enough headroom to perform the necessary operations, while still keeping the system under control if the NED becomes unresponsive or anything else goes wrong.
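
The two formulas above can be written down as a small worked example. The function names are made up for this sketch, and the 60-second extension is a hypothetical value chosen only to show the arithmetic.

```python
def initial_worker_timeout_s(connect_timeout_s, read_timeout_s):
    """Timeout NSO applies before the NED has requested a new one."""
    return 2 * (max(connect_timeout_s, read_timeout_s) + 3)

def extended_worker_timeout_s(new_timeout_s):
    """Timeout NSO applies after the NED has sent a new timeout value."""
    return 2 * new_timeout_s

# Default values of 20 seconds for connect-timeout and read-timeout.
print(initial_worker_timeout_s(20, 20))  # 46

# Hypothetical case: the NED requests a new timeout of 60 seconds.
print(extended_worker_timeout_s(60))  # 120
```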