Apologies if this question has already been answered...
Recently I started noticing the proliferation of YANG models for assurance-type information - particularly operational data. For instance, the work in the BBF on G.fast NETCONF/YANG support (WT-355).
Is there a plan to have NSO capture and manipulate this type of data? I understand the use case of "monitoring" a couple of KPIs for reactive FASTMAP. However, how does NCS accommodate large-scale data collection?
Thanks for sharing your thoughts,
None! New territory to explore! Funny enough, I discussed the
same topic earlier with Benoit Claise and Carl Moberg on a separate call!
Actually, some work has been done in this area with NSO. Not a complete service assurance solution in the broad sense of the term, but rather a lean, service-centric analytics approach that leverages NSO’s real-time knowledge of which services have been provisioned on the platform. Watch the video presentation of a recent ETSI POC in the attached email. I think there is a lot of potential here.
There was an interesting related presentation at the @Scale conference posted earlier in 2015, entitled "Extending SDN to the Management Plane", presented by Anees Shaikh of Google. In it, he talks specifically about the benefits of using YANG models for both infrastructure configuration and service assurance systems. He says that Google sees advantages in using the same schema across both provisioning and assurance systems, because it becomes easier to correlate intended configuration state with monitoring state.
Here’s the video of the presentation, starting at the slide on service assurance using YANG models:
He goes on to talk about the collection of streaming telemetry data, which in my mind made the idea of a “Prime Analytics engine + YANG assurance model” system quite appealing.
The basic idea is to "orchestrate" some probes/monitors/xxx on a per service (vpn) basis.
In CVPN we're planning to do a really light variation on this very same theme.
We plan to fire up a small VM per VPN which runs ConfD and some simple monitoring sw. This VM will run on the inside network, and thus has access to - well - the inside.
It can just do some easy things like pinging the inside IP on all CPEs, trying an HTTP GET to a site that should be filtered and making sure it's redirected - or whatever.
Very pragmatic, and very easy.
Probably much more correct than continuously doing SNMP sweeps and trying to interpret the results. Also more scalable, since it's automatically distributed. The ConfD instance can send NETCONF notifications when bad stuff happens - i.e. events, not polling, from the p.o.v. of NSO.
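The per-VPN probe VM described above could be sketched roughly like this. This is a minimal sketch under stated assumptions: the CPE inventory, the notification payload, and the `cpe-unreachable` event name are all invented placeholders for illustration, not the actual CVPN implementation or any real ConfD model.

```python
import subprocess
from datetime import datetime, timezone

# Hypothetical inventory: inside IPs of the CPEs in one VPN (placeholder data).
CPE_INSIDE_IPS = ["10.0.1.1", "10.0.2.1"]

def cpe_reachable(ip: str, timeout_s: int = 2) -> bool:
    """Single ICMP echo to the CPE's inside address (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def build_notification(ip: str) -> str:
    """Body of a NETCONF <notification> the ConfD instance could stream
    northbound; 'cpe-unreachable' is an illustrative event name."""
    ts = datetime.now(timezone.utc).isoformat()
    return (
        f"<notification><eventTime>{ts}</eventTime>"
        f"<cpe-unreachable><cpe-ip>{ip}</cpe-ip></cpe-unreachable>"
        f"</notification>"
    )

def sweep() -> list:
    """One probe pass: a notification for every unreachable CPE.
    Event-driven from NSO's point of view - it only hears about failures."""
    return [build_notification(ip)
            for ip in CPE_INSIDE_IPS if not cpe_reachable(ip)]
```

The point of the design is visible in `sweep()`: the VM decides locally what "bad" means and pushes events, instead of NSO pulling and interpreting raw state.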
I think these ideas originate from Stefan Wallin, and I like them.
And as more and more SPs drive to minimize cost out of the network, one of the big targets is the "probe vendors". SPs are now requesting that the NFs themselves provide the probe functionality "for free", thus eliminating the extra cost of purchasing probes and, more importantly, reducing the complexity of managing an ever-larger number of expensive probes.
The CVPN approach explained below goes in the same direction by using "light probes". In addition, the point you mentioned about the scalability and efficiency of existing methods such as SNMP needs some attention as well. Besides the use of NETCONF notifications, there is emerging work on defining YANG versions of SNMP MIBs, including IP/ICMP stats (https://tools.ietf.org/html/draft-baill-netmod-yang-ip-stats-01). As you are aware, SNMP is not very efficient for collecting data, and bulk data transfer mechanisms could make use of a "standard" representation schema beyond the traditional XML tags when available.
Does that make sense?
I think the assurance (= fault + performance) in this context is deeper than sneaking under the VM container, doing a ping and reporting.
In the context of NFV, we have multiple layers of assurance: the VM/hypervisor health and environmentals we all know, then the intrinsic service OAM aspects within the NFV-to-NFV service path. Then correlate the whole cocktail. Not trivial by any means, due to multiple containment levels.
So creating a YANG model of all of this to distill this STATE data really is paramount. I don't know of anyone who cracked the code. A few startups are hacking their way into this. Cvg doing their thing, but I just do not know ...
“Render unto Caesar the things that are Caesar's” :)
The Service Assurance Orchestration concept under discussion came from the Service Assurance workshop, which indeed led to the current Service Assurance architecture.
The integration with orchestration is a key differentiator and it is exploited in the following way:
Sharing the service model to enable RCI and SIA (including changes to the model itself)
Use of the orchestration to make the service assurance-ready by correctly instrumenting the infrastructure (we call it Orchestration of Service Assurance)
Loop-back feedback from assurance to orchestration
Stefan and the orchestration team are active contributors in making those concepts a reality.
Hmmm, this sounds low-level to me. Isn't it better to just focus on the service - as seen from the p.o.v. of the customer - and try to assess that through, e.g., probes?
Once that fails, deeper analysis can start.
I.e. a customer buys a VPN with certain QoS aspects; as long as those QoS aspects are fulfilled from the customer network, we're good.
If not, we're bad.
All other types of analysis lead to a lot of false positives.
Fully agree. What matters most is assurance of the service being delivered to the end user (at least that part of the service within NSO scope). With the ever-increasing move toward all/part virtualization of service production, there will be a corresponding increase in self-healing infrastructure - making lower level events less and less relevant to the end user service.
Folks, this is a very timely and relevant topic for sure!
Where can I find more information on the NetRounds Probes I see referenced in the video? I assume these are all virtual.
* I see the probes in this environment also providing critical Location data to the service topology beyond just SLA KPI/KQI
* I see that these probes do active monitoring.
* I hope steps are taken to be sure probes act as a bump-in-the-wire and that the act of monitoring does not affect performance.
* I assume these probes can be multi-tenant? My concern in a large-scale DC service deployment is the additional compute requirements of the probes and the performance impact on the hypervisor.
I have designed numerous SP monitoring architectures in the past that included various probe technologies. If I may suggest, the northbound integration to the OSS/BSS systems will be critical. Each provider I dealt with starts by telling you, "I need my service monitored." Once you achieve that (e.g. "Your VPN service is violating SLA"), the next immediate question is "Why?" The next question after that will be, "How can I achieve RCA?"
So the northbound alerts need to be as intelligent as possible and export as much relevant Location/Service/Tenant/Provider information as possible. I would expect systems monitoring the overall architecture and infrastructure to pick up this information and attempt to correlate it to the logical RCA.
In an ideal demo you would pull the network connection in your DC switch or knock out a compute:
* Probes detect Service Anomaly on affected services at affected locations
* Infrastructure Assurance System detects Network/Compute Outage and Correlates Probe data on top
* Analytic Correlation Engine is able to provide heuristics that tie HW event to Service Anomaly.
* Customer weeps and writes big check.
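A toy version of that correlation step could look like the sketch below. Everything here is invented for illustration (the event/anomaly record shapes, the shared-location heuristic); it is not any real correlation engine, just the simplest possible way to tie a HW event to the service anomalies that share its location.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InfraEvent:
    node: str   # e.g. a DC switch or compute node
    kind: str   # e.g. "link-down", "compute-failure"

@dataclass(frozen=True)
class ServiceAnomaly:
    service: str  # customer-facing service instance, e.g. "vpn-A"
    node: str     # location/attachment point reported by the probe

def correlate(events, anomalies):
    """Map each infrastructure event to the services anomalous at the
    same location - the crudest heuristic that ties HW events to
    service anomalies."""
    by_node = {}
    for a in anomalies:
        by_node.setdefault(a.node, []).append(a.service)
    return {e: sorted(by_node.get(e.node, [])) for e in events}
```

For example, a link-down on `leaf-1` plus a probe anomaly for `vpn-A` at `leaf-1` correlates the two; an anomaly at `leaf-2` stays uncorrelated. A real engine would of course add the service-topology context (adjacencies, attachment stacks) discussed later in the thread, rather than matching on a single node name.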
How does one get tied more into this work?
On 7/9/15, 7:46 AM, "Hal Gurley (hgurley)" <email@example.com> wrote:
Let's definitely talk. Seems there are a number of disparate threads on this topic. I've been talking with the Netrounds guys for a few months now about possible integration on the Routing platforms but it ties back into this larger story.
I called out this specific line in your reply. Verizon has asked us to automate the 'service assurance adjacency definitions' in VMS because we already know the correlation of components that we have orchestrated. Today, we could only send them raw information from the platforms, nodes, systems, machines. They are okay with that, but the raw feeds have no context.
The service model actually has context east-west such as adjacencies between devices (e.g. ipsec tunnels) as well as north-south such as logical attachment stacks (e.g. vnf -> vm -> node).
Their basic question is how do WE provide contextual constructs in the service assurance model that allows them to 'auto-correlate' events.
They manage over 400,000 unique devices and it's not conceivable that our platform would receive all events with which we can correlate to our service model. Their current thinking is that we would use VMS to manage "CPE". They would provide us 'attachment points' to the existing network infrastructure (e.g. MPLS VPN) through their existing orchestration processes. Our VMS would manage the CPE attachment to the PE, build the overlay VPNs, and attach VNF to the overlay VPN. Our service models have all the contextual relationships for the overlay while their systems get all the raw feeds.
How do they avoid manually building correlation in their event collector engines?
Can we use an auto-generated YANG service assurance model such that:
a. their collectors can query it and 'discover' the correlation, or
b. we export the YANG service assurance model so that they can reference it to do auto-correlation?
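One way such an exported model might be shaped is sketched below. This is a hypothetical illustration only - the module name and every node name are invented, and it is not an actual NSO/VMS export - but it shows the two kinds of context mentioned above: east-west adjacencies and the north-south attachment stack.

```yang
module example-service-assurance {
  namespace "http://example.com/service-assurance";
  prefix sa;

  // One entry per orchestrated service instance, exported so an
  // external collector can discover which infrastructure events
  // map to which customer service.
  list service {
    key "name";
    leaf name { type string; }

    // East-west context: adjacencies the service depends on,
    // e.g. an ipsec tunnel between two endpoints.
    list adjacency {
      key "id";
      leaf id         { type string; }
      leaf endpoint-a { type string; }
      leaf endpoint-b { type string; }
    }

    // North-south context: the logical attachment stack,
    // e.g. vnf -> vm -> node, ordered top to bottom.
    leaf-list attachment-stack {
      type string;
      ordered-by user;
    }
  }
}
```

A collector that understands this schema could then auto-correlate: a raw event keyed on a node or VM name is looked up in `attachment-stack` or `adjacency` to find the affected services, with no manually built correlation rules.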
Thanks Scott, we were hit by the same request from Comcast and vMS. As per my earlier email, the various layer assurance params involved are tightly coupled, making it a challenge to provide SLAs.
One particularly interesting SLA KPI is availability.
In the context of VNF-to-VNF, it is not clear to me how you do this: a VNF (VM) chokes for whatever reason, then a backup one gets fired up to inherit the same "profile", and so on. At any rate, there is more work to be done in this area ... A few startups and 1/2 ones are looking at this ... the ramifications for cloud services are big. Amazon, FB and Google, I am certain, have developed their own assurance apps.