NSO deployment with HA and DR considered

Johan Nemitz · ‎11-17-2017

I am looking for recommendations on how to deploy NSO with the following requirements:

< 30k total devices

1-8 globally located clusters of devices (co-location facilities)

Deployment of NSO in one or more of the 1-8 locations

Single API point for deploying services across all devices/locations

Ability to still deploy/modify services even if 1 of the facilities used as a potential control point (NSO) goes offline.

What service designs are supported by the recommended deployment?

I have not been able to find documentation on which to base my decisions.

Thanks in advance.

alam.bilal · ‎11-20-2017

Hi there,

Check out the LSA documentation that comes with the NSO installation (for example nso_lsa-4.5.pdf). Also check out "deployment_guide.pdf", which comes bundled with the tailf-hcc package and describes the NSO HA framework.

Thanks,

Bilal.

Johan Nemitz · ‎11-21-2017

Thanks Bilal,

I have read both of those documents, which cover how to deploy NSO, how to take advantage of LSA, and the deployment model needed when using LSA. What I think is missing is guidance on when to use the various deployment options.

Questions I need to answer are:

At what point are there too many devices (NSO scale), or devices that are too geographically dispersed (timeouts from NSO to devices), for a single non-LSA NSO installation to remain effective?
If an LSA-based deployment is necessary, how does one handle the situation where the location hosting the Service Layer NSO HA pair goes offline?
Would it ever be sensible to split up an NSO HA pair between locations?
Should NSO clustering ever be used? If so, when?

Thanks,

Johan

alam.bilal · ‎11-24-2017

Hi,

All good questions. I know various folks have discussed these, but I don't think there is a guide as such. The information is available but dispersed. Perhaps we can ask the Cisco-AS team to write up a "guide", given that they have done the majority of the real-life deployments. Engineering has done some simulations too.

I've added my thoughts below.

When it comes to scale, the dimensions to consider are:

1. Memory, as all configs are stored in the in-memory CDB

- The number of devices and the average size of the device configuration

- The number of service instances and the average size of the service configuration

- Mark this up to account for metadata (back-pointers, etc). That should give the required memory.
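The memory bullets above can be turned into a back-of-the-envelope calculation. This is only a rough sketch: the function name, the example numbers, and in particular the metadata markup factor are my own assumptions, not figures from any NSO documentation, so measure on your own system before relying on them.

```python
def estimate_cdb_memory_bytes(num_devices, avg_device_config_bytes,
                              num_services, avg_service_config_bytes,
                              metadata_markup=1.5):
    """Rough in-memory CDB size estimate.

    metadata_markup is a hypothetical multiplier covering metadata such as
    back-pointers; 1.5 is a placeholder, not a measured value.
    """
    device_bytes = num_devices * avg_device_config_bytes
    service_bytes = num_services * avg_service_config_bytes
    return (device_bytes + service_bytes) * metadata_markup


if __name__ == "__main__":
    # Hypothetical inputs: 30k devices at 200 kB each, 100k services at 5 kB each.
    total = estimate_cdb_memory_bytes(30_000, 200_000, 100_000, 5_000)
    print(f"Estimated CDB memory: {total / 1e9:.1f} GB")
```

The result is only the CDB footprint; the host also needs headroom for the JVM, Python VMs and the OS.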

2. CPU processing

- The incoming rate of Move-Add-Change (MAC) requests

- The complexity of service mapping logic for each of the service-types

- Transactional approach or async commit-queues

- I've seen some testing done 3+ years back with NETSIMs and commit-queue. With the CPUs of that time, the throughput was around 14 changes per second for a VPN-like service

- The service types could be split across multiple NSOs (CFS-NODEs) to spread the incoming MACs.
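To get a feel for what a throughput figure like that means, here is a small sketch that estimates how long a backlog of MAC requests would take to process. The 14 changes per second default comes from the old NETSIM test mentioned above. The function name is hypothetical, and the assumption that load spreads evenly and scales linearly across CFS nodes is an idealization; real numbers depend heavily on the service mapping logic and the devices.

```python
def hours_to_process(pending_changes, changes_per_second=14.0, cfs_nodes=1):
    """Hours needed to work through a backlog of MAC requests.

    Assumes (optimistically) that requests are spread evenly over the
    CFS nodes and that each node sustains changes_per_second.
    """
    if changes_per_second <= 0 or cfs_nodes < 1:
        raise ValueError("need a positive rate and at least one CFS node")
    seconds = pending_changes / (changes_per_second * cfs_nodes)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical backlog of 100k service changes.
    for nodes in (1, 2, 4):
        print(nodes, "CFS node(s):", round(hours_to_process(100_000, cfs_nodes=nodes), 2), "h")
```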

I guess every scenario is different, so one sizing may not work for all. BTW, I've heard a rule of thumb for scale-out with LSA of 20k-30k devices per NSO RFS instance.
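Applied to a device count, that rule of thumb is just a ceiling division. A minimal sketch, with a hypothetical function name, and treating the 20k-30k range as hearsay rather than a documented limit:

```python
import math


def rfs_nodes_needed(total_devices, devices_per_rfs_node):
    """Number of RFS nodes if each one manages at most devices_per_rfs_node."""
    if devices_per_rfs_node < 1:
        raise ValueError("devices_per_rfs_node must be at least 1")
    return math.ceil(total_devices / devices_per_rfs_node)


if __name__ == "__main__":
    # Both ends of the 20k-30k range, for a 30k-device network.
    for per_node in (20_000, 30_000):
        print(per_node, "devices per RFS node ->", rfs_nodes_needed(30_000, per_node), "node(s)")
```

For a network just under 30k devices this lands on one or two RFS nodes, depending on which end of the range you trust. Each node would then normally be an HA pair, which this count does not include.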