Switching From Email To Chat[Ops]

Hall of Fame Cisco Employee

Last year at Cisco Live Melbourne I was tasked with building some monitoring and automation to help support the network infrastructure for the Network Operations Center (NOC).  This involved, among other things, getting Prime Infrastructure installed and configured.  As part of my ritual, I started to configure the email settings in order to send out notifications about things like devices going down, wireless problems, etc.  But to whom should I send these emails?

That question got me to pause.  Why send emails at all?  We now have this Spark thing.  It has an API, Prime Infrastructure has an API.  Why not merge the two?  Turns out this was a great idea.  By creating a Python script that extracted device states from PI and posted up/down notifications to Spark, I was able to get a lot of eyes looking at device reachability issues in near real-time.
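As a rough illustration of that idea (a sketch, not the original script), the core loop is: compare the previous reachability snapshot with the current one, then post one line per change to a Spark room. The snapshot dictionaries, the state names, and the helpers `diff_reachability`, `build_spark_request`, and `post_notices` are hypothetical, and the step that pulls device states from the Prime Infrastructure API is deliberately left out. The Spark side uses the public messages endpoint with a bearer token.

```python
import json
import urllib.request

SPARK_MESSAGES_URL = "https://api.ciscospark.com/v1/messages"


def diff_reachability(previous, current):
    """Return notices for devices whose state changed between two polls.

    Both arguments map device name -> state string. The state names
    (e.g. "REACHABLE") are an assumption made for this illustration.
    """
    notices = []
    for device, state in sorted(current.items()):
        old = previous.get(device)
        # Only report devices seen before whose state is now different.
        if old is not None and old != state:
            notices.append(f"{device} changed from {old} to {state}")
    return notices


def build_spark_request(token, room_id, text):
    """Build (but do not send) a POST to the Spark messages API."""
    body = json.dumps({"roomId": room_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SPARK_MESSAGES_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def post_notices(token, room_id, notices):
    """Send one Spark message per notice."""
    for text in notices:
        request = build_spark_request(token, room_id, text)
        with urllib.request.urlopen(request) as response:
            response.read()
```

Keeping the diff separate from the posting makes it easy to run the comparison on a timer and to test it without touching either API.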

This was a good start, but I learned a couple of lessons from this.  First and foremost, don't use your own account.  At the time, Spark didn't offer a bot infrastructure, but I still should have created a new account for these notifications.  The other lesson learned is that these types of automated messages shouldn't be mixed in the same room that has humans trying to have conversations.  It can get rather chatty.

Fast-forward to Cisco Live Berlin 2017.  Knowing that the Melbourne team generally liked the bot idea, I decided to take it further.  First, let's apply what we learned from 2016.  Since then, Spark added the concept of bot integrations.  Bots are designed specifically to handle these automated tasks.  They are machines that can either "fire and forget" messages to Spark rooms or handle interrogation from humans.  I created a bot specifically to handle all automated postings to Spark.

For the other 2016 lesson, I wanted to make sure all bot alerts went to dedicated rooms.  Fortunately, there was another new feature in 2017, Spark Teams.  Teams allow for an umbrella group under which you can create as many rooms as you want.  We created a team for our Cisco Live Berlin NOC group, and I spun up a number of rooms for bot alerts.

Just as in 2016, I had a room that showed devices going down and coming back up (again, using the Prime Infrastructure API).

But now that I had a taste for bots and this notion of using ChatOps to bring automated alerts to our human operators and engineers, I decided to get a bit crazy.  You can see some of the various "Alarms" groups from the screenshot above.  I had a room for core network changes including routing table changes and critical interfaces that were seeing errors or discards.

I had a room that would alert us if DHCP scopes were running a bit low on available addresses (and we definitely saw this with some of our lab areas).
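The check behind an alert like this can be quite small. The sketch below assumes you have already collected, per scope, the total and leased address counts (for example from Prime Network Registrar). The input shape, the `low_scopes` name, and the 10% default threshold are all assumptions for illustration, not what was used in the NOC.

```python
def low_scopes(scopes, threshold_pct=10.0):
    """Return alert strings for scopes whose free space is below threshold_pct.

    `scopes` maps scope name -> (total_addresses, leased_addresses).
    This input format is hypothetical.
    """
    alerts = []
    for name, (total, leased) in sorted(scopes.items()):
        if total <= 0:
            continue  # skip empty or misreported scopes
        free = total - leased
        free_pct = 100.0 * free / total
        if free_pct < threshold_pct:
            alerts.append(
                f"Scope {name}: {free_pct:.1f}% free ({free} of {total} addresses)"
            )
    return alerts
```

Each returned string can then be posted to the dedicated alert room as its own message.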

I had a room for all things data center as well.  Not only did we use this room to report on VM health from vCenter, but our NetApp partners used this to send very detailed alerts about various issues related to storage and the SAN/NAS.

All of these rooms combined various API or integration points (in the case of vCenter) together with the Spark API for posting messages.  I used Python scripts and the Spark bot infrastructure to grab data from one system (e.g., Prime Infrastructure or Prime Network Registrar), and send it to Spark.

What was probably the most interesting integration, however, was the interactive bot.  One of the questions I received a lot was, "where is host so-and-so?"  That is, people were looking to see where a specific client was connected on the network, or get information about its DHCP lease.  In the case of wireless clients, they wanted to know a physical location within the venue.  In order to track all of this down, one would need to talk to Prime Network Registrar, Prime Infrastructure and Connected Mobile Experiences (CMX).

Fortunately, we have an API for that :-).  I built an interactive bot using the Spark webhooks framework that could take a question from one of our operators or engineers and do the necessary lookups across all of these applications.  Another colleague of mine wrote a bridge app to CMX that was able to provide an image of where the user was in the network.  Put all of this together, and we have a powerful tool at all of our NOC users' fingertips that gives them instant client visibility.
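To make the flow concrete, here is a minimal sketch of the pieces such a bot needs. It is not the bot described above. A Spark webhook for a new message carries the message ID rather than the text, so the handler first works out which message to fetch, then parses the question, then fans the lookup out to each backend. The "where is" command grammar and the helpers `message_url_from_webhook`, `parse_where_is`, and `answer_where_is` are hypothetical, and the backend lookups are passed in as plain callables rather than real Prime Network Registrar, Prime Infrastructure, or CMX clients.

```python
import json
import re

SPARK_MESSAGES_URL = "https://api.ciscospark.com/v1/messages"

# Hypothetical command grammar: "where is [host] <name>[?]"
WHERE_IS = re.compile(r"\bwhere\s+is\s+(?:host\s+)?(\S+?)\??\s*$", re.IGNORECASE)


def message_url_from_webhook(raw_body):
    """Return the URL of the full message for a messages/created webhook.

    Returns None for any other resource or event type.
    """
    payload = json.loads(raw_body)
    if payload.get("resource") != "messages" or payload.get("event") != "created":
        return None
    message_id = payload.get("data", {}).get("id")
    if not message_id:
        return None
    return f"{SPARK_MESSAGES_URL}/{message_id}"


def parse_where_is(text):
    """Extract the host from a 'where is ...' question, or return None."""
    match = WHERE_IS.search(text.strip())
    return match.group(1) if match else None


def answer_where_is(host, lookups):
    """Run each named lookup callable and merge the results into one reply.

    `lookups` maps a source label to a function taking the host and
    returning a string, or None when that source has nothing to report.
    """
    lines = [f"Results for {host}:"]
    for source, lookup in lookups.items():
        result = lookup(host)
        lines.append(f"- {source}: {result if result else 'no data'}")
    return "\n".join(lines)
```

In a real deployment the webhook handler would fetch the message at the returned URL with the bot's token, and should ignore messages sent by the bot itself so it does not reply to its own posts.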

Hopefully you're a little inspired by this in terms of what APIs can do and how you can use tools like Spark to think a bit differently about how you do your network alerts.  There is a lot of power in being able to put alerts directly in front of a collaborative group of users, allow them to discuss them in real time, and then work together to come up with a solution.  It sure beats sending out multiple emails and not really knowing who has seen which alert.

If you're interested in any of this code, I have it all posted in my SVN repository.  Feel free to play with it and, if you want, send back any suggestions or modifications.  Enjoy playing with and getting addicted to ChatOps!
