Re: Using NSO to make changes involving loss of management connectivity

snovello · ‎10-05-2022

Recently I had to use NSO to make configuration changes on a device that would cause connectivity to its management IP to be lost. These notes document how to deal with that case. I was using an IOS-XR based device and the CLI NED; of course, a lot of the detail depends on the device type.

Normally you would always try to avoid losing management connectivity, because if things go wrong you are left with an unmanageable device. However, there are some cases where there is really no way to avoid it.

In this case we had an access device and its neighbor. Management connectivity to the access device was in-band via the neighbor. Initially there are static routes, and you want to switch to using an IGP while keeping the same management IP. I did not look deeply into ways of avoiding connectivity loss, because the whole point of the tests was to see how NSO can handle this loss-of-connectivity use case.

The first thing to do in NSO was to design a service that would configure the IP link using the IGP. The templates of this service also contained the instructions to delete the pre-existing static routes. NSO's default behavior is to configure both devices in parallel in a single transaction. In our case we wanted a precise sequence: first configure the access device, losing connectivity, then configure its neighbor to recreate connectivity. For this purpose I added a 'stage' leaf to the service. The template configures nothing in the default stage 'init', configures only the access device in stage 'access-only', and configures both access and neighbor in stage 'full'. With that I could test the templates with dry runs to ensure the correct config was being generated.
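The stage logic amounts to a small lookup from stage to the devices the template touches. Here is a minimal sketch of that idea in plain Python, with no NSO dependency. Only the three stage values come from the service described above; the role names 'access' and 'neighbor' and both function names are made up for illustration.

```python
# Illustrative only: which device roles receive config in each stage.
STAGE_ROLES = {
    'init': frozenset(),
    'access-only': frozenset({'access'}),
    'full': frozenset({'access', 'neighbor'}),
}


def roles_for_stage(stage):
    """Return the set of device roles configured in the given stage."""
    try:
        return STAGE_ROLES[stage]
    except KeyError:
        raise ValueError(f'unknown stage: {stage!r}') from None


def newly_configured(old_stage, new_stage):
    """Roles that start receiving config when moving between two stages."""
    return roles_for_stage(new_stage) - roles_for_stage(old_stage)
```

Going from 'init' to 'access-only' touches only the access device, and going from 'access-only' to 'full' touches only the neighbor, which is the sequence we want.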

I defined a Python action to take the service through the various stages. NSO has the nano service construct for declaratively defining a service that has to go through various stages of configuration, but here I knew we would need different options for different commits, so I decided to use an action to get that precise control.

With that in place we could start determining the settings needed to commit on the access device while disabling the rollbacks that happen on any error. The first issue is that NSO uses a 'confirmed commit' by default. So when committing, NSO sends the command 'commit confirmed <timeout seconds>', which makes the changes on the device but needs to be confirmed within the timeout limit. If that confirmation does not come, the device rolls back the changes. When the changes are applied you of course lose management connectivity and are not able to confirm them. For this there is a ‘ned-setting cisco-iosxr write commit-method’ which can take the value ‘confirmed’ or ‘direct’. By setting it to ‘direct’ the NED simply commits with ‘commit’ with no options. In that mode it rolls back any failed changes using an explicit rollback command. In our case, if the changes are accepted the ssh session ends immediately due to the loss of IP connectivity.
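Since ‘direct’ gives up the device-side safety net, it makes sense to switch to it only for as long as needed and to put the old value back whatever happens. A minimal sketch of that pattern in plain Python follows. The helper name `temporary_value` and the `get`/`set_` callables are hypothetical; in a real action each of them would have to wrap its own NSO transaction, which is not shown here.

```python
from contextlib import contextmanager


@contextmanager
def temporary_value(get, set_, value):
    """Set a value for the duration of a block, then restore the old one.

    get and set_ are caller-supplied callables (hypothetical here) that
    read and write the setting, e.g. the commit-method of one device.
    """
    previous = get()
    set_(value)
    try:
        yield previous
    finally:
        # Runs on success and on exception, so the setting is never
        # left at the temporary value.
        set_(previous)
```

With a plain dict standing in for the device settings, `with temporary_value(lambda: s['commit-method'], lambda v: s.__setitem__('commit-method', v), 'direct'):` leaves `s['commit-method']` at its original value after the block, even if the block raises.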

The next issue is that a device session ending would normally cause NSO to attempt to roll back the whole transaction. This is avoided by using the commit-queue feature with the continue-on-error option. With the commit queue, the transaction is written to the NSO CDB first and the changes are made on the device afterwards. If there is an error, the transaction is kept as is, without rolling back.

After committing we must wait for the commit queue to be processed. We can wait for the queue item to complete with the ‘devices commit-queue queue-item <id> wait-until-completed’ action. It is not enough to just use ‘commit commit-queue sync’, since this returns once the device interaction is over, but the commit-queue item is kept for a while after that and can cause subsequent changes involving the device to fail. We did not check why, but the queue item is probably kept in order to attempt to reconnect and roll back the change. An alternative that we did not try, but which probably also works, is to clear the queue item after ‘commit commit-queue sync’ returns.

Once the access device is configured you can configure the neighbor by setting the ‘stage’ to ‘full’. This is just a normal commit and will only make changes on the neighbor. After that commit you need to wait for the IGP to come up on the link before you have management connectivity again. You can call the ‘devices device <name> ping’ action repeatedly until connectivity is re-established.
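The wait itself is a simple retry loop around the ping action. A minimal sketch, kept independent of NSO so it can be tried on its own: `ping` is any callable returning the ping output as a string (in the action it would wrap the device ping action), and the function name and the `pause` parameter are made up for illustration. The '1 received' marker is what my action looks for; the exact ping output depends on the host running NSO, so treat it as an assumption.

```python
import time


def wait_for_reachability(ping, attempts=20, marker='1 received', pause=0.0):
    """Call ping() until its output contains marker.

    Returns True as soon as one reply is seen, False if all attempts
    are used up. pause is an optional delay in seconds between attempts.
    """
    for _ in range(attempts):
        result = ping()
        if marker in result:
            return True
        if pause:
            time.sleep(pause)
    return False
```

If each ping call blocks for its own timeout when the device is unreachable, the loop already paces itself and `pause` can stay at 0.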

Finally the access device needs to be synced using the ‘sync-from’ action. This will find no differences but will update the commit id.

Below is the python code for the action.

import ncs
from ncs.application import Service
from ncs.dp import Action
import traceback

def set_commit_method(trans, devicename, method):
    root = ncs.maagic.get_root(trans)
    root.devices.device[devicename].ned_settings.cisco_iosxr.write.commit_method = method

def set_admin_state(trans, devicename, state):
    root = ncs.maagic.get_root(trans)
    root.devices.device[devicename].state.admin_state = state

def set_stage(trans, kp, stage):
    service = ncs.maagic.get_node(trans, kp)
    service.stage = stage
# ---------------
# ACTION CALLBACK
# ---------------
class StaticToIsis(Action):
    @Action.action
    def cb_action(self, uinfo, name, kp, input, output, trans):
        self.log.info('action name: ', name)
        ping_worked = False
        try:
            with ncs.maapi.single_read_trans('admin', 'python') as trans:
                service = ncs.maagic.get_node(trans, kp)
                assert service.stage == 'init', f'Action {name}: service needs to be in init stage'
                access_device = service.access_device.hostname
                neighbor_device = service.neighbor_device.hostname
        except Exception as e:
            output.success = False
            output.result = str(e)
            self.log.info(traceback.format_exc())
            return
        try:
            with ncs.maapi.single_read_trans('admin', 'python') as trans:
                root = ncs.maagic.get_root(trans)
                check_sync_output = root.devices.device[access_device].check_sync()
                assert check_sync_output.result == 'in-sync', (
                    f'{name}:{access_device} Access device is out of sync' )
                check_sync_output = root.devices.device[neighbor_device].check_sync()
                assert check_sync_output.result == 'in-sync', ( 
                    f'{name}:{neighbor_device} Neighbor device is out of sync' )
            with ncs.maapi.single_write_trans('admin', 'python') as trans:
                set_commit_method(trans, access_device, 'direct')
                set_stage(trans, kp, 'access-only')
                params = trans.get_params()
                params.commit_queue_sync(timeout = 30)
                params.commit_queue_error_option('continue-on-error')
                params.commit_queue_tag(f'{name}:access_device={access_device}')
                trans_result = trans.apply_params(keep_open=False, params=params)
                self.log.info(f'{name}:{access_device} Commit Queue Transaction Result:{repr(trans_result)}')
            with ncs.maapi.single_write_trans('admin', 'python') as trans:
                set_admin_state(trans, access_device, 'southbound-locked')
                set_stage(trans, kp, 'full')
                trans.apply()
                self.log.info(f'{name}:{access_device} Neighbor device configured')
            with ncs.maapi.single_write_trans('admin', 'python') as trans:
                set_admin_state(trans, access_device, 'unlocked')
                trans.apply()
                self.log.info(f'{name}:{access_device} Access device unlocked')
            with ncs.maapi.single_read_trans('admin', 'python') as trans:
                root = ncs.maagic.get_root(trans)
                pings = 20              
                for i in range(pings):
                    # each ping has 10s timeout
                    result = root.devices.device[access_device].ping().result
                    self.log.info(f'{name}:{access_device} {result}')
                    if -1 != result.find('1 received'):
                        ping_worked = True
                        break
                assert ping_worked, f'Could not reach Access device after {pings} 10s pings'
                sync_from = root.devices.device[access_device].sync_from
                sync_from_input = sync_from.get_input()
                sync_from_input.dry_run.create()
                sync_from_input.dry_run.outformat = 'cli'
                sync_from_output = sync_from(sync_from_input)
                assert sync_from_output.cli == '', (
                    f'Action {name}:Access Device={access_device}'
                    f' sync_from dry_run had unexpected differences\n{sync_from_output.cli}' )
                sync_from_output = sync_from()
                assert sync_from_output.result, ( 
                    f'Action {name}:Access Device={access_device}'
                    f' sync_from failed: {sync_from_output.info}' )
                self.log.info(f'{name}:{access_device} access_device sync from done')
        except Exception as e:
            output.success = False
            output.result = str(e)
            self.log.info(traceback.format_exc())
            return
        finally:
            with ncs.maapi.single_write_trans('admin', 'python') as trans:
                set_commit_method(trans, access_device, 'confirmed')
                set_admin_state(trans, access_device, 'unlocked')
                service = ncs.maagic.get_node(trans, kp)
                if not ping_worked and service.stage == 'full':
                    # Connectivity never came back: step the service back to 'access-only'
                    service.stage = 'access-only'
                trans.apply()
        output.success = True
        return

In addition to what is described above, we have a few checks to ensure robustness, and we place the access device into the ‘southbound-locked’ admin state while we know it is unreachable. This is not strictly necessary, but it avoids NSO blocking in some unforeseen circumstance if it attempts to access that device.

Alexander Stevenson · ‎11-01-2022

Nice post; useful and well-written

snovello · ‎11-08-2022

Thanks Alex!
