Question on NSO from potential customer

previousqna · ‎05-10-2017

Hello NSO experts,

My customer had a question regarding NSO and rollbacks that I could not answer definitively. I would appreciate your help.

The scenario I asked about earlier today concerns how the rollback feature works. If a change is made and a rollback is then needed, does NSO issue the equivalent “no” commands to remove the config, or does it take a pre-change snapshot of the config and “merge” it with, or “replace”, the config that now contains the change?

previousqna · ‎05-10-2017

The way I understand NSO, “rollback” is handled by the FASTMAP algorithm, as it is a very specific kind of “change”: one that deletes the service. Basically, NSO compares two models, the CURRENT one and the one that represents the TO BE state, computes a diff, and sends only the “delta” to the device.
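To make the “compare two models and send only the delta” idea concrete, here is a minimal sketch in Python. It is not NSO code: the flattened path-to-value dictionaries and the `compute_delta` helper are assumptions made purely for illustration.

```python
from typing import Dict, Optional

# Hypothetical flattened config model: path -> value.
Config = Dict[str, str]


def compute_delta(current: Config, to_be: Config) -> Dict[str, Optional[str]]:
    """Return only the paths that differ; None marks a path to delete."""
    delta: Dict[str, Optional[str]] = {}
    for path, value in to_be.items():
        if current.get(path) != value:
            delta[path] = value      # create or modify
    for path in current:
        if path not in to_be:
            delta[path] = None       # present now, absent in the target: delete
    return delta


current = {"interface/I/description": "second test", "snmp/community": "public"}
to_be = {"interface/I/description": "test", "snmp/community": "public"}
print(compute_delta(current, to_be))  # {'interface/I/description': 'test'}
```

Only the description appears in the result; the unchanged community string is never sent.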

If you’re not familiar with how FASTMAP operates, please take a look at the NSO Development Guide, where you can find a very good explanation.

Here’s a snippet from that document:

previousqna · ‎05-10-2017

In simple words, a rollback brings the device back to its pre-commit state. Below is a simple example:

1. Device “D” has interface “I” with description “test”.

2. Using NSO you changed this description to “second test”.

3. Once you roll back your commit, NSO brings that interface description back to “test”.

previousqna · ‎05-10-2017

Thank you. I had noticed that the IOS NED seems to understand defaults that may not show up in configs, such as the default "no ip domain-lookup". I assume that's part of the NED?

previousqna · ‎05-10-2017

Probably the best way to understand rollbacks is to set up an NCS local install with an IOS netsim and play around with it.

Set something, then rollback… check what the dry-run output is.

previousqna · ‎05-10-2017

From an internal perspective, NSO will record the rollback action before it attempts to apply the configuration changes.

As an example consider the following:

A very simple service needs to change the SNMP community string.
- Service requests the string to be set to ‘New Value’
- NSO will check the device configuration as stored in its internal CDB
- NSO will record the rollback action as follows:
  - If no current string is set, NSO will record a ‘no’ command
  - If a string is currently set, NSO will record the current value on the device, as the rollback action
- NSO applies the configuration changes (setting the string to ‘New Value’). The rollback action is, as stated, stored in the CDB based on the previous string setting.

In essence, before NSO makes any changes to any device, it compares the requested change with the current device configuration and derives the rollback configuration. This is performed by NSO’s FASTMAP module / algorithms.
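The “record the rollback action before applying the change” step can be sketched as follows. This is a toy model under stated assumptions, not NSO's implementation: the config is a plain dictionary, `apply_with_rollback` is a made-up helper, and `None` stands in for the ‘no’ command.

```python
from typing import Dict, Optional, Tuple

Config = Dict[str, str]
Change = Dict[str, Optional[str]]  # None means "delete this path"


def apply_with_rollback(config: Config, change: Change) -> Tuple[Config, Change]:
    """Apply `change` to a copy of `config`; return the new config and its inverse."""
    rollback: Change = {}
    for path in change:
        # The old value if one exists, otherwise None (the 'no' command case).
        rollback[path] = config.get(path)
    new_config = dict(config)
    for path, value in change.items():
        if value is None:
            new_config.pop(path, None)
        else:
            new_config[path] = value
    return new_config, rollback


cdb = {"snmp/community": "Old Value"}
cdb, undo = apply_with_rollback(cdb, {"snmp/community": "New Value"})
print(undo)  # {'snmp/community': 'Old Value'}
```

Applying `undo` with the same function restores the earlier state, which is the essence of the rollback described above.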

Hope this helps.

previousqna · ‎05-10-2017

Folks

Trying to understand the power of NSO rollback and also refcount for example. One for the weekend…

How does NSO behave in a scenario like below:

1. Service A is provisioned, which sets a string X to “active”. Before that, the value of X is “none”.
2. Service B is provisioned, which sets a string X to “active”. Before that, the value of X is already “active” due to (1) above.
3. Rollback Service A

What will be the value of X? Will it stay as “active” or go back to “none”?

4. Rollback Service B

What will be the value of X? Will it stay as “active” or go back to “none”?

previousqna · ‎05-10-2017

After Service A is rolled back, the value of X will stay as “active”, because NSO maintains a ref count. By default, all set operations in an RFS context are “shared”, and NSO maintains ref counts for them. If Service B is then rolled back as well, X will return to its prior state of “none”.
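The ref count behaviour can be illustrated with a small toy model. The `SharedLeaf` class and its method names are hypothetical and only mimic the outcome described in this thread; NSO's own bookkeeping is more involved.

```python
from typing import Optional, Set


class SharedLeaf:
    """Toy model of one config leaf written by several services."""

    def __init__(self, original: Optional[str]) -> None:
        self.original = original        # value before any service touched it
        self.value = original
        self.owners: Set[str] = set()   # services currently referencing the leaf

    @property
    def refcount(self) -> int:
        return len(self.owners)

    def shared_set(self, service: str, value: str) -> None:
        self.owners.add(service)
        self.value = value

    def remove(self, service: str) -> None:
        self.owners.discard(service)
        if not self.owners:
            self.value = self.original  # last owner gone: restore the prior value


x = SharedLeaf("none")
x.shared_set("A", "active")
x.shared_set("B", "active")
x.remove("A")
print(x.value, x.refcount)  # active 1
x.remove("B")
print(x.value, x.refcount)  # none 0
```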

previousqna · ‎05-10-2017

Thanks, [NAME]. That is so COOL.

previousqna · ‎05-10-2017

I think it’s helpful for people to understand there are a few different concepts at play here, because we tend to use “rollback” generically to mean a whole number of different things.

The below is accurate to the best of my understanding as a non-engineering NSO user – there are some more factors at play here, and I’d appreciate any corrections/suggestions for clarifications!

Fundamental: Transactions

NSO transactions are multi-phase or distributed. An analogy would be arranging a meeting. Instead of just saying straight away “come to this meeting” (and then you’re the only one that shows up), it’s more like asking everyone “can you make this meeting?” (prepare), and only once everyone responds “yes” does NSO say “come to this meeting” (apply).
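The meeting analogy maps onto a generic two-phase pattern, sketched below. The `Participant` class and `run_transaction` function are illustrative assumptions, not NSO APIs.

```python
from typing import List


class Participant:
    """Hypothetical participant that can accept or refuse a change."""

    def __init__(self, name: str, will_accept: bool = True) -> None:
        self.name = name
        self.will_accept = will_accept
        self.state = "idle"

    def prepare(self) -> bool:
        self.state = "prepared" if self.will_accept else "refused"
        return self.will_accept

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def run_transaction(participants: List[Participant]) -> bool:
    """Ask everyone first; commit only if every participant said yes."""
    prepared: List[Participant] = []
    for p in participants:
        if p.prepare():
            prepared.append(p)
        else:
            for q in prepared:   # someone said no: release those who said yes
                q.abort()
            return False
    for p in participants:
        p.commit()
    return True


a, b = Participant("a"), Participant("b", will_accept=False)
print(run_transaction([a, b]), a.state, b.state)  # False aborted refused
```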

CDB Rollback

To keep this example simple, let’s pick something that doesn’t result in a change anywhere outside of the CDB: a NED setting.

When we request a change to a NED setting via the CLI, a transaction is created containing the change, which we eventually commit. When we commit, the CDB validates that it is an acceptable transaction (“prepares” it). If it’s not acceptable, the change gets thrown away: the “running” CDB was never touched (just your view was), and you get an error (e.g. you might have specified an invalid value). If it’s acceptable, it gets applied. During that apply, the reverse action is calculated and stored in a rollback file on the NSO filesystem. If, at some later time, you want to revert the change, you can identify the commit and request the rollback. By default, NSO stores 500 of these rollback files.

The CDB rollback should be used carefully by an end user, because subsequent transactions since the rollback was saved may have changed values that are not meant to be rolled back. Hence the existence of the Service Manager.
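A toy model may help show why replaying an old rollback needs care. The `ToyCdb` class, its methods and the `some/setting` path are all made up for illustration; it only mimics a bounded history of stored inverses and is not how NSO's rollback files are actually formatted or applied.

```python
from collections import OrderedDict
from typing import Dict, Optional

Config = Dict[str, str]
Change = Dict[str, Optional[str]]  # None means "delete this path"

HISTORY_SIZE = 500  # the default number of rollback files mentioned above


class ToyCdb:
    def __init__(self) -> None:
        self.running: Config = {}
        self.rollbacks: Dict[int, Change] = OrderedDict()
        self.next_id = 0

    def _apply(self, change: Change) -> None:
        for path, value in change.items():
            if value is None:
                self.running.pop(path, None)
            else:
                self.running[path] = value

    def commit(self, change: Change) -> int:
        """Store the inverse, apply the change, return the commit id."""
        commit_id = self.next_id
        self.next_id += 1
        self.rollbacks[commit_id] = {p: self.running.get(p) for p in change}
        if len(self.rollbacks) > HISTORY_SIZE:
            self.rollbacks.popitem(last=False)  # drop the oldest inverse
        self._apply(change)
        return commit_id

    def rollback_only(self, commit_id: int) -> None:
        """Replay one stored inverse, ignoring anything committed since."""
        self._apply(self.rollbacks[commit_id])


cdb = ToyCdb()
first = cdb.commit({"some/setting": "30"})
cdb.commit({"some/setting": "60"})
cdb.rollback_only(first)
print(cdb.running)  # {}
```

In this sketch, reverting the first commit also discards the later value “60”, which is the kind of surprise the caution above is about.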

NED “Rollback”

Let’s make this example a bit more complex. Let’s make three changes to the CDB, all changes to the devices tree for three separate devices. The type of change is not really relevant. The changes are made via the CLI, and a transaction is created containing the changes, as before. However, changes against the devices tree are actually treated as transactions managed by the device manager. The device manager creates a sub (?) transaction for each device being changed, and when the “commit” is issued in the CLI, that CDB transaction now also includes these sub transactions, each of which needs to be successfully “prepared”. It falls to the NEDs to actually carry out these transaction phases, and we have a few different types of NED.

In this example, Device 1 is a NETCONF device, Device 2 uses a CLI NED for an OS that supports single-phase transactions, and Device 3 uses a non-transactional CLI NED. During prepare, Device 1 is sent the new config and issued a prepare command. In this instance, prepare works just fine. Device 2 is given the commands, and a commit is issued (no prepare is supported for single-phase transactions); that commit works OK (but unlike Device 1, the changes could now be affecting network traffic). Device 3 is sent the commands one by one, but on this device something fails in one of the final commands.

Device 3’s NED will start sending the inverse commands for the previously sent commands, in reverse sequence. This relies on the NED understanding the device specifics. Device 2’s NED would request a rollback of the transaction that was previously committed. Device 1 would receive an “abort” for the current pending transaction from the NED, and nothing would actually be saved to the device’s configuration.
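The Device 3 behaviour (undo what was already sent, newest first) can be sketched like this. The `push_with_revert` helper, the `send` callback and the command strings are assumptions for illustration; a real NED derives the inverse commands from its knowledge of the device, and they are not always a simple “no” prefix.

```python
from typing import Callable, List, Tuple

# Each step pairs a command with the command that undoes it (both hypothetical).
Step = Tuple[str, str]


def push_with_revert(steps: List[Step], send: Callable[[str], bool]) -> List[str]:
    """Send commands in order; on failure, send the inverses of the successful
    ones in reverse order. Returns every command handed to `send`, including
    the one that failed."""
    sent_log: List[str] = []
    done: List[Step] = []
    for command, inverse in steps:
        sent_log.append(command)
        if not send(command):
            for _, undo in reversed(done):
                sent_log.append(undo)
                send(undo)
            break
        done.append((command, inverse))
    return sent_log


steps = [
    ("snmp-server community foo RO", "no snmp-server community foo RO"),
    ("ip domain-lookup", "no ip domain-lookup"),
    ("bad command", "no bad command"),
]
# Pretend the device rejects the last command.
for line in push_with_revert(steps, lambda cmd: cmd != "bad command"):
    print(line)
```

The two accepted commands are undone in the opposite order to the one in which they were sent.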

Special case: commit dry-run

A commit dry-run essentially issues the prepare request to the NED. As we can see, in the prepare phase on CLI devices, we typically push the commands – that’s not exactly a dry run! Older NEDs actually did just that, and pushed commands in the case of a dry-run(!).

However, I believe most (if not all) of the CLI NEDs have now been changed to respect the dry-run flag, and will act without pushing device configuration for a dry-run.

Special case: commit commit-queue

When the commit queue is used, things get a bit tricky. Because the transaction is committed immediately, by the time the transaction finally gets processed by the NED, the NED is not able to see both the current “running” view of the CDB and the “candidate” view from the transaction.

However, the code used to calculate the reverse sequence of commands to apply in the event of a failure, as with Device 3 above, currently needs access to both of these. If a commit queue was used in the above example, the commands sent to Device 3 would not be correctly reverted, potentially requiring manual intervention to revert the changes to the device (!).

(NED developers – feedback/corrections on this greatly appreciated – I’d love to know more about the commit queue challenges).

Services “Rollback”

Thirdly, we have Services. Services manage their own “rollbacks” via Fastmap (I’m not going to explain the whole of Fastmap here). No rollback files are used in the below (unless you decide to invoke a rollback of a commit that added a service, which would in fact issue a delete).

However, it would be helpful if we were all a bit more accurate when discussing “rollback” in the context of services, because there are a few discrete pieces:

Create failure “rollback”
- The multi-phase transaction behavior, described in NED Rollback above, will happen.
- No Fastmap Reverse Diff used in this rollback.
Modify failure “rollback”
- The minimal diff is calculated. The complete Fastmap Reverse Diff is recalculated.
- The Service pushes only a minimal diff to the CDB.
- If the minimal diff changes are pushed but a failure occurs, the NEDs will roll back just the minimal diff, as per above, and the current NSO CDB transaction will be aborted, automatically reverting the service config data (including the Fastmap Reverse Diff) to the state before the change.
- No Fastmap Reverse Diff used in this rollback.
Redeploy failure “rollback”
- As per modify failure “rollback”
Undeploy
- The Fastmap Reverse Diff is applied to everything, except for the actual service.
Delete
- The Fastmap Reverse Diff is applied to everything.
Delete No-Networking
- The Fastmap Reverse Diff is applied to everything. Device model changes do not invoke the NEDs.

Reactive “Rollback”

Finally, this all gets a bit more complex with Reactive Fastmap, which might need to put in place its own delete “workflow” to ensure that the ordering of the “deletes” issued to the underlying pieces is correct.

I hope that helps – I’m planning to put together some slides on these concepts in the near future.