Solved: NSO Configuration Errors and Rollbacks Due to Device Prompt Following Write Command

jmaruschak · 03-09-2020

Greetings,

We're experiencing issues with Cisco ASR920 devices when applying configuration changes through NSO. NSO appears to get stuck with a device prompt during the commit phase. The configuration is then rolled back. This happens most frequently with automated scripts that make changes to customer configurations on a certain schedule, but users occasionally report that they receive the error on the CLI when making configuration changes. The following error appears when this occurs:

External error in the NED implementation for device -----:

read timeout after 20 seconds, blocked on "\\r\\n" when waiting for "Overwrite the previous NVRAM configuration\\\\?\\\\[confirm\\\\]" | "Warning: Saving this config to nvram may corrupt any network" | "Destination filename \\\\[\\\\S+\\\\][\\\\?]?\\\\s*$" | "\\\\A[^\\\\# ]+#[ ]?$"

If I'm reading the error message correctly, NSO was expecting to receive a line break from the device, but was instead prompted with the NVRAM question. I expect that this is desired NSO behavior, especially given the chance of corrupting any network as the message indicates. I wouldn't appreciate NSO automatically responding to this prompt and potentially causing an issue with the device configuration.

My only other concern is that during the rollback procedure, NSO appears to get the NVRAM prompt again but silently ignores it. I don't think it writes the config at that point, but that may be acceptable: the previous config should already have been saved, so the device was simply rolled back to it without a new write.

We had submitted a TAC case to Cisco and were told to enable the service configuration-compress command on the devices to prevent the NVRAM prompt, but this does not appear to have had an effect. We're re-opening the case to see if we can get any other suggestions.

The reason I wanted to post on here is to see if anyone has dealt with a similar issue, start a discussion, and maybe get some other recommendations.

tsiemers1 · 08-13-2020

Bumping this up. Was this ever figured out? We are seeing the same problem on ASR920s. A couple of times it has caused the device CPU to spike, and in one case it crashed the box: the error triggered a rollback while we were making routing changes, and the device went down.

Did the 'service configuration-compress' ever end up working?

jmaruschak · 08-14-2020

This did end up being resolved. We actually involved Cisco TAC to investigate.

What was found is that the ASR920s are notorious for taking a very long time to write their configuration. The "read timeout after 20 seconds" in the error indicates that NSO was waiting to read output coming from the device after issuing the "write memory" command. After waiting 20 seconds NSO considered it a timeout.

TAC's recommendation was to increase the timeout. To resolve, we added the following configuration to all ASR920 devices in NSO:

jmaruschak@ncs# show running-config devices device asr920 read-timeout
devices device asr920
 read-timeout 600
!

This increases the timeout to 600 seconds. Without this setting, NSO uses the default value of 20 seconds.
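For automated scripts that need to tell this failure apart from other NED errors, a minimal sketch along these lines might help. It is not part of NSO; the function name and the regular expression are assumptions of mine, based only on the wording of the error quoted in this thread, and other NED versions may phrase the message differently.

```python
import re
from typing import Optional

# Pattern derived only from the error text quoted in this thread (assumption:
# the wording "read timeout after N seconds" is stable for this NED).
READ_TIMEOUT_RE = re.compile(r"read timeout after (\d+) seconds")


def read_timeout_seconds(error_text: str) -> Optional[int]:
    """Return the timeout in seconds if the text looks like a read timeout, else None."""
    match = READ_TIMEOUT_RE.search(error_text)
    return int(match.group(1)) if match else None


# Shortened, illustrative sample of the error message.
sample = 'read timeout after 20 seconds, blocked on "\\r\\n" when waiting for ...'
print(read_timeout_seconds(sample))                # 20
print(read_timeout_seconds("connection refused"))  # None
```

A script could use the returned value to flag devices that are still running with the 20-second default and retry them only after read-timeout has been raised.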

tsiemers1 · 08-14-2020

Nice. Thanks for the reply. So is the NVRAM message a result of NSO rolling back the config? I will change the timeout and see how that goes.

jmaruschak · 08-14-2020

The timeout would have occurred first, and was the source of the error. The rollback would have been attempted following the timeout.

In fact, we could see in the trace logs that many times the error would occur again when issuing the write memory command during the rollback, but the second error was not presented to the user as far as we could tell.

Our original interpretation of the error message was incorrect. The device is not prompting for the NVRAM confirmation. It is not prompting at all because it takes longer than the default 20 seconds to finish the write memory. The lack of response is what is considered the timeout and causes the error and rollback attempt.
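To illustrate why the error lists prompts that never appeared, here is a toy simulation of the behavior described above. It is only a sketch, not NSO or NED code: the SlowDevice class, the wait_for_prompt function, the simulated one-second clock, and the 45-second write time are all made up for illustration. The waiting side has a list of patterns it would accept, receives only a line break, and gives up when the deadline passes.

```python
import re


class ReadTimeout(Exception):
    """Raised when no expected pattern shows up before the deadline."""


class SlowDevice:
    """Fake device: echoes a line break at once, prints its prompt only after write_seconds."""

    def __init__(self, write_seconds):
        self.write_seconds = write_seconds
        self.now = 0  # simulated clock, in seconds
        self.sent_newline = False

    def read(self):
        """Advance the simulated clock by one second and return any new output."""
        self.now += 1
        if not self.sent_newline:
            self.sent_newline = True
            return "\r\n"
        if self.now >= self.write_seconds:
            return "[OK]\r\nrouter#"  # made-up output for the fake device
        return ""


def wait_for_prompt(device, patterns, timeout):
    """Read until any pattern matches the buffered output or the timeout expires."""
    compiled = [re.compile(p, re.MULTILINE) for p in patterns]
    buffer = ""
    while device.now < timeout:
        buffer += device.read()
        if any(rx.search(buffer) for rx in compiled):
            return buffer
    raise ReadTimeout(f"read timeout after {timeout} seconds, blocked on {buffer!r}")


# Hypothetical accepted patterns: a confirmation question or the exec prompt.
EXPECTED = [r"\[confirm\]", r"[^# ]+#[ ]?$"]

try:
    wait_for_prompt(SlowDevice(write_seconds=45), EXPECTED, timeout=20)
except ReadTimeout as err:
    print(err)  # the device was still writing; nothing but the line break arrived

print(repr(wait_for_prompt(SlowDevice(write_seconds=45), EXPECTED, timeout=600)))
```

With the 20-second limit the simulated write never finishes in time, so the only data seen is the line break; with 600 seconds the same device returns its prompt and the wait succeeds. The patterns in the real error are simply the responses the NED would have accepted, not something the device sent.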