How does NSO handle rollback?

Mark Swanborough
Cisco Employee

A common query from NSO developers and users is about rollback, as different systems handle it differently.

For example, if a change is made and a rollback is then needed, does NSO perform the equivalent “no” commands to remove the config, or is a pre-change snapshot of the config “merged with” the new config, or does it “replace” the new config that has the initial change implemented?

Accepted Solutions

Mark Swanborough
Cisco Employee

When answering this question, I think it’s helpful to understand there are a few different concepts at work in NSO. Unfortunately “rollback” gets used generically to mean a whole number of different things.

Fundamental: Transactions

NSO transactions are distributed (or multi-phase). An analogy would be in arranging a meeting. Instead of just saying straight away “come to this meeting” (and then you’re the only one that shows up - let's try again!), it’s more like asking everyone “can you make this meeting?” (prepare) and only once everyone responds “yes”, does NSO say “come to this meeting” (apply).
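The meeting analogy can be sketched in a few lines of Python. This is only an illustration of the prepare-then-apply idea, not NSO code: the `Participant` class, its methods, and `run_transaction` are hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical transaction participant (e.g. one device)."""
    name: str
    can_prepare: bool = True
    log: list = field(default_factory=list)

    def prepare(self, change):
        # "Can you make this meeting?"
        self.log.append("prepare")
        return self.can_prepare

    def commit(self, change):
        # "Come to this meeting."
        self.log.append("commit")

    def abort(self, change):
        self.log.append("abort")


def run_transaction(participants, change):
    """Ask everyone to prepare; commit only if every participant says yes."""
    prepared = []
    for p in participants:
        if p.prepare(change):
            prepared.append(p)
        else:
            # A single "no" aborts everyone that had already prepared.
            for q in prepared:
                q.abort(change)
            return False
    for p in participants:
        p.commit(change)
    return True
```

With one participant that refuses to prepare, nobody ever reaches the commit step, which is the point of splitting the transaction into phases.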

CDB Rollback

To keep this example simple, let’s pick something that doesn’t result in a change anywhere outside of the CDB: a NED setting.

When we request a change to a NED setting via the CLI, a transaction is created containing the change that we eventually commit. Your CLI session now has something like a "candidate" view of the CDB, as if your change had already been applied. When we commit, the CDB validates that it is an acceptable change set (“prepare”s it). If it’s not valid, you get an error (e.g. you might have specified an invalid value), and the change is not applied: the “running” CDB was never touched (just your "candidate" view was). If it’s valid, it gets applied to the running CDB.

During that apply, the reverse action is calculated and stored in a rollback file on the NSO filesystem. If at some later time you want to revert the change, you can identify the commit and request either a selective rollback of just that commit, or a rollback of that commit and ALL changes made since. By default, NSO stores 500 of these rollback files. (See: NSO User Guide - Modifying the configuration - ~pg11 for details of viewing and applying rollbacks).

The CDB rollback should be used carefully by an end user, because subsequent transactions since the rollback was saved may have changed values that are not meant to be rolled back. Hence the existence of the Service Manager. However, rollbacks are still important to revert a change to a service.
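To make the difference between the two kinds of rollback concrete, here is a toy model in Python. It is a sketch under heavy assumptions: the config is a flat dict of path to value, a rollback "file" is just a dict of the previous values, and the function names are invented for this example. It does not reflect the real rollback file format.

```python
def apply_with_rollback(config, changes):
    """Apply `changes` to `config` in place and return the reverse change set.

    `changes` maps a path to its new value; None means "delete this path".
    """
    reverse = {}
    for path, new_value in changes.items():
        reverse[path] = config.get(path)  # None if the path did not exist
        if new_value is None:
            config.pop(path, None)
        else:
            config[path] = new_value
    return reverse


def roll_back(config, rollbacks, index, selective=False):
    """Undo only commit `index` (selective), or commit `index` and all later ones.

    `rollbacks` is ordered oldest first, one reverse change set per commit.
    """
    targets = [rollbacks[index]] if selective else reversed(rollbacks[index:])
    for reverse in targets:
        apply_with_rollback(config, reverse)
```

In this model, selectively rolling back an old commit restores the old value of every path that commit touched, even if a later commit changed the same path again. That is the kind of surprise the caution above is about.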

NED “Rollback” (a.k.a. revert changes in a device)

Let’s make this example a bit more complex. Let’s make three changes to the CDB, all of them changes to the devices tree, for three separate devices. The type of change is not really relevant. The changes are made via the NSO CLI, and a transaction is created containing the changes, as before. When you commit, the device manager creates a nested transaction for each device being changed within the device tree. Your CDB transaction includes these sub-transactions, and each one now needs to be successfully “prepared”. Whilst the device manager coordinates these nested transactions, it falls to the NEDs to actually carry out the transaction phases, and we have a few different types of NED.

In this example, Device 1 is a NETCONF device, Device 2 uses a CLI NED for an OS that supports single-phase transactions, and Device 3 uses a non-transactional CLI NED (for more information, see the NED Developer Guide, which outlines the mechanics below in more detail).

During prepare:

  • Device 1 will be issued the new config, and a prepare command issued to the device. In this instance, prepare works just fine, and the NED can return a successful prepare.
  • Device 2 is given the commands, and a commit is issued (no prepare is supported for single-phase transactions). That commit works OK (yes, a commit: unlike Device 1, the changes could now be affecting network traffic!)
  • Device 3 is sent the CLI commands for change, one by one.

However, on Device 3, something fails in one of the final commands - what happens now?

  • Device 3’s NED doesn't really know what failed or what caused the failure! This is where NSO differs from other network management systems: it doesn't bluntly apply a pre-change snapshot of the device config. The normal behavior is for NSO to pull the running configuration from the device (as-is), take the NSO view of the device before the commit (the as-was view, from the running CDB), and calculate the commands/changes required to bring the configuration back to its pre-change (as-was) state. Ultimately this change should be the reverse diff of all the commands that were successfully applied on the device. Is this the same as “executing the no commands in reverse”? Not quite, but if the NED developer has done their job, it cleanly reverts the config.
  • Device 2’s NED would request a rollback of the transaction that was previously committed, reverting the change.
  • Device 1 would receive an “abort” for the current pending transaction from the NED, and nothing would actually be saved to the device’s configuration.
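The Device 3 case, comparing what is on the device now with what NSO recorded before the commit, can be sketched as a plain dictionary diff. This is a simplified model, not NED code: configs are assumed to be flat dicts of path to value, and `revert_diff` plus the `("set" | "delete", path, value)` tuples are hypothetical.

```python
def revert_diff(as_is, as_was):
    """Return the operations that turn the device's current config (as-is)
    back into the config NSO recorded before the commit (as-was)."""
    ops = []
    # Anything present now but absent before must be removed.
    for path in sorted(as_is.keys() - as_was.keys()):
        ops.append(("delete", path, None))
    # Anything that was there before but is now missing or different is restored.
    for path in sorted(as_was.keys()):
        if as_is.get(path) != as_was[path]:
            ops.append(("set", path, as_was[path]))
    return ops
```

Because the diff is computed from the device's actual state, it only covers commands that really took effect. A real NED still has to turn such a diff into valid CLI commands in a workable order, which is where the NED developer's work comes in.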

Special case: commit dry-run

A commit dry-run issues the prepare request... as we can see, in the prepare phase on CLI devices, we would normally push the commands – that’s not exactly a dry run!

The prepare request, issued to the Device Manager, for a dry-run is special, and the Device Manager prepares the expected changes exactly as it normally would, but doesn't "prepare" via the NED. The NED framework issues a special prepareDry command to the NED, containing the expected CLI commands, and it is up to the NED code to determine if any manipulation needs to be done to the CLI generated by the Device Manager, to produce accurate dry-run results.

Special case: commit commit-queue
When the commit queue is used, things get a bit tricky with reverting device level issues.

In NSO pre-4.4:

The challenge in NSO pre-4.4 is that, as described above, the Device Manager/NED needs access to the NSO view of the device before the commit (as-was) to calculate the reverse diff (as for Device 3 above). When a commit queue is used, the current "running" view in the CDB no longer represents that state: the "candidate" view (the to-be config) has already been written to it, so it now represents the to-be view, possibly with other changes applied on top.

If a commit queue was used in the above example, the commands to Device 3 would not correctly be reverted - potentially requiring a manual intervention to revert the changes to the device.

In NSO 4.4 and later:

A new mechanism has been added that means this is no longer such a special case. Please see the NSO CHANGE log for 4.4, and search for snapshot for full details. Essentially the CDB keeps a snapshot view of the device before each transaction entry in the queue (as-was), and uses that as-was view to calculate the rollback.
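The snapshot idea can be illustrated roughly as follows. This is a conceptual sketch only: the `CommitQueue` class and its fields are invented for illustration and say nothing about how NSO stores the snapshots internally.

```python
import copy


class CommitQueue:
    """Toy commit queue that records an as-was snapshot per queue entry."""

    def __init__(self, running):
        self.running = running  # NSO's view, updated as soon as an item is queued
        self.items = []

    def enqueue(self, device, changes):
        # Capture the device's as-was view before "running" is overwritten.
        snapshot = copy.deepcopy(self.running.get(device, {}))
        self.running.setdefault(device, {}).update(changes)
        item = {"device": device, "changes": changes, "as_was": snapshot}
        self.items.append(item)
        return item
```

If a queued item later fails on the device, the revert can be calculated against that item's `as_was` snapshot instead of the running view, which by then already contains the to-be config and possibly later queue entries as well.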

Services “Rollback”

Thirdly, we have Services. Services manage their own “rollback”s, via Fastmap. No rollback files are used in the cases below (unless you decide to invoke a rollback of a commit that added a service, which would in fact issue a service delete).

It's helpful if we were all a bit more accurate when discussing “rollback” in the context of services, because there are a few discrete pieces:

  • Create failure “rollback”
    • The multi-phase transaction behavior, described in NED Rollback above, will happen.
      • No Fastmap Reverse Diff is used in this rollback.
  • Modify failure “rollback”
    • Scenario:
      • The minimal diff is calculated. The complete Fastmap Reverse Diff is recalculated.
      • The Service pushes only the minimal diff to the CDB.
      • The minimal diff changes are pushed, but a failure occurs.
    • The NEDs will roll back just the minimal diff, as described in NED Rollback above, and the current NSO CDB transaction will be aborted, automatically reverting the service config data (including the Fastmap Reverse Diff) to the state before the change.
    • No Fastmap Reverse Diff is used in this rollback.
  • Redeploy failure “rollback”
    • As per modify failure “rollback”
  • Undeploy
    • The Fastmap Reverse Diff is applied to everything, except for the actual service.
  • Delete
    • The Fastmap Reverse Diff is applied to everything.
  • Delete No-Networking
    • The Fastmap Reverse Diff is applied to everything. Device model changes do not invoke the NEDs.
  • Rolling back a successful Modify
    • (e.g. a modify action has resulted in an undesirable change in the network, and the config should be reverted).
    • This can be achieved by applying the CDB Rollback for the modify, which would prompt a re-deploy of the services with their prior parameters.
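The relationship between the minimal diff and the Fastmap Reverse Diff in the list above can be modeled roughly like this. It is a deliberately simplified sketch, not how the Service Manager is implemented: configs are flat dicts, the service is a dict holding `params` and `reverse_diff`, and `apply_changes`, `deploy`, `delete` and the `create` callback are hypothetical names.

```python
def apply_changes(cfg, changes):
    """Return a copy of `cfg` with `changes` applied (None means delete)."""
    out = dict(cfg)
    for path, value in changes.items():
        if value is None:
            out.pop(path, None)
        else:
            out[path] = value
    return out


def deploy(cfg, service, create):
    """Create or modify a service; return (new_cfg, minimal_diff)."""
    # Undo the service's previous output to get the config without the service.
    base = apply_changes(cfg, service.get("reverse_diff", {}))
    wanted = create(service["params"])
    new_cfg = apply_changes(base, wanted)
    # The complete reverse diff is recalculated on every deploy.
    service["reverse_diff"] = {path: base.get(path) for path in wanted}
    # Only the paths that actually differ from the current config are pushed.
    paths = cfg.keys() | new_cfg.keys()
    minimal = {p: new_cfg.get(p) for p in paths if cfg.get(p) != new_cfg.get(p)}
    return new_cfg, minimal


def delete(cfg, service):
    """Delete the service by applying its stored reverse diff."""
    return apply_changes(cfg, service.pop("reverse_diff", {}))
```

In this model a modify that changes one parameter yields a minimal diff with a single path, while the stored reverse diff still covers everything the service created. That is why a delete can remove the whole service footprint even after many modifies.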

Reactive “Rollback”
Finally, this all gets a bit more complex with Reactive Fastmap, which might need to put in place its own delete “workflow” to ensure that the “delete”s issued to the underlying pieces happen in the correct order. This could be done via a CDB subscriber/kicker, or some sort of Nano Service.

2 Replies

It is worth noting that for a CLI NED, some NED-specific settings make it possible to move the confirmation out of the prepare phase and into the commit phase. This was introduced, for example, in IOS-XR NED 6.0.1, which added support for the confirmed-in-commit method via the cisco-iosxr-commit-method ned-setting. What it does is the following:

    enum confirmed-in-commit {
      tailf:info "Same as confirmed method except that 'commit confirmed' is"
        +" called in NSO commit phase instead of prepare phase. Hence later"
        +", which results in a shorter time until the commit is confirmed";
    }