Howdy out there in automation land! And welcome to the first blog of 2020... I know it's been too long without one, but we have had so many things going on and there is lots I want to write about. And of course, make videos about them... so I just had to pick one thing and off we go. Today we are going to show a great example of collaboration across our Customer Experience (CX) organization! But first... a movie poster to go with this? Well, this is about teamwork, and one of my favorite movies and favorite teams was...
Love the Blues Brothers! And Blues Brothers 2000 is a highly underrated movie... not for the acting... but the music! Anyways, back on topic. What we want to talk about today is collaborating on a problem, designing a solution, and deploying it using Action Orchestrator (AO).
So this week we ran into an issue where our production CCS/AO system went fully down... not because of anything we did, but because of something in Kubernetes that we should have been monitoring for. The certificates for the client-side API calls expired, which left our microservices unable to talk to each other or to the Kubernetes cluster itself... well, that was not good! And yes, we did open a bug, CSCvt21378, on this so you guys can track it once it becomes customer facing! Anyways, so the whole cluster goes down... wow!
So I started digging through tech articles and Kubernetes settings, and working with other Cisco folks, to design a "fix", knowing that other customers with private-cloud CloudCenter Suite (CCS) systems would see the same issue. It was time to go into proactive mode and look for how we could detect this and get ahead of it! A little bit of downtime to implement a workaround/fix was OK... but a whole day or more is just unacceptable!
After a solid day of working on it, I designed a workaround/solution to fix the above issue. It will be coming out publicly as a knowledge article shortly, or you can get the information through TAC... however, that was not *GOOD* enough. We in CX can do better! We can innovate... we can AUTOMATE! When I talked with a fellow CX Engineer, Jason Davis, he had the idea to throw together a quick bash script to check the expiration date of each cert. What a novel and great idea!!! Here is a quick plug for his Twitter account - https://twitter.com/snmpguy , he's a great follow! Anyways, he designed this script:
# Print each certificate's expiration date (ISO 8601) and path, soonest first.
for crt in /etc/kubernetes/pki/*.crt; do
    printf '%s: %s\n' \
        "$(date --date="$(openssl x509 -enddate -noout -in "$crt" | cut -d= -f 2)" --iso-8601)" \
        "$crt"
done | sort
Nothing overly fancy, but highly useful! So he had something in place to start looking at certificate dates... but we needed someone to make this reusable and deployable to our customer base...
Solving it with Action Orchestrator
Knowing that Jason is a huge AO person as well... we decided to collaborate and team up on this issue. How can we take a small script, package it for usage, and pass it out to the masses? Easy... we can write a workflow to handle it all!! And then publish it... so how do we start? We start by determining today's date and how far away it is from each expiration date, which means running the script to get the certificates' dates. We can use the Unix/Linux target to connect into one of our masters and run this bash script... like this:
We then needed to use regular expressions to filter out the data we needed and make it usable. In the video I will reference a great testing site that you can use; it is here: https://regex101.com/. Make sure you pick the Golang flavor there when you build or test your regular expressions, as that is what AO uses.
Then we have to compare the dates to determine whether we are in an "ok" time period, we need to "warn" the administrators, or the certificate is flat out expired! (OH NO!) We can use a standard condition branch in this case and compare the dates in it to determine the outcome. To be fancy... we are going to create an HTML-based email to send to the administrator and run this workflow daily to let them know what is going on...
And lastly we are going to send that nice HTML email to the administrator (like we said above)... it would look something like this:
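To give a feel for that filtering step outside of AO, here is a minimal shell sketch. It is not the workflow's actual expression: the helper name `parse_cert_line` and the sample date are made up for illustration, and the pattern is just one that happens to be valid both in Go's regex flavor and in the POSIX extended syntax that `sed -E` understands. It splits one `DATE: PATH` line, in the format the script prints, into its two pieces.

```shell
# Hypothetical helper (not part of the AO workflow): split one "DATE: PATH"
# line from the script's output into "DATE PATH".
# The pattern avoids anything that differs between Go regex and POSIX ERE.
parse_cert_line() {
    printf '%s\n' "$1" |
        sed -nE 's/^([0-9]{4}-[0-9]{2}-[0-9]{2}): (.+)$/\1 \2/p'
}

# Sample input with a made-up date; lines that do not match print nothing.
parse_cert_line '2021-03-05: /etc/kubernetes/pki/apiserver.crt'
```

You can paste the same pattern into regex101 with the Golang flavor selected to see the two capture groups.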
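If you want to see that three-way decision as plain shell rather than as AO condition branches, a rough sketch could look like this. The function name `cert_status`, the 30-day warning window, and the choice to treat a cert expiring today as "warn" rather than "expired" are all assumptions for the example, not values taken from the workflow. It also assumes GNU `date`, the same flavor the script above relies on.

```shell
# Hypothetical helper: classify a certificate by days left until expiry.
# Args: expiry date, today's date (both YYYY-MM-DD), warning window in days.
# Assumes GNU date; -u keeps both timestamps at midnight UTC.
cert_status() {
    expiry_s=$(date -u -d "$1" +%s) || return 1
    today_s=$(date -u -d "$2" +%s) || return 1
    days_left=$(( (expiry_s - today_s) / 86400 ))
    if [ "$days_left" -lt 0 ]; then
        echo "expired"
    elif [ "$days_left" -le "$3" ]; then
        echo "warn"
    else
        echo "ok"
    fi
}

# Made-up dates: 22 days left with a 30-day window prints "warn".
cert_status 2020-04-01 2020-03-10 30
```

Passing today's date in as an argument, instead of reading the clock inside the function, makes the logic easy to test with fixed dates.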
So awesome... now you have a solution that went from a major problem to a simple script to find the certificate expiration dates to a sweet team effort to create a proactive monitoring workflow to keep your CCS system up and running! How neat is that?? You know what is just as neat? The video where I give you all the insight and talk you through the workflow... so now... ONTO THE VIDEO!
Just a quick ~17 minute video explanation for you. A reminder that the content above on GitHub is open source in nature and is *NOT* supported by TAC. It was written by some great folks in CX but is open for you to use, take, manipulate, etc. So please... enjoy it!
Again, special thanks to Jason Davis for the collaboration here and a wonderful CX Solution!
Standard End-O-Blog Disclaimer:
Thanks as always to all my wonderful readers and those who continue to stick with and use CPO and AO! I have always wanted to find good questions, scenarios, stories, etc... if you have a question, please ask, if you want to see more, please ask... if you have topic ideas that you want me to blog on, Please ask! I am happy to cater to the readers and make this the best blog you will find :)
AUTOMATION BLOG DISCLAIMER: As always, this is a blog and my (Shaun Roberts) thoughts on CPO, AO, CCS, orchestration, development, devops, and automation, my thoughts on best practices, and my experiences with the products and customers. The above views are in no way representative of Cisco or any of its partners, etc. None of these views, etc. are supported, and this is not a place to find standard product support. If you need standard product support, please get it via the current call-in numbers on Cisco.com or email email@example.com