09-18-2021 02:23 PM
Hi,
I am new to Python and have written a script, based on my experience so far, to automate the process of an IOS upgrade. I have over 500 devices within my organization that need upgrading to the latest IOS image. I have written my code in two stages: a pre-upgrade stage and an upgrade + post-upgrade stage.
The pre-upgrade stage is scripted to log in to the devices one after the other and collect data such as hostname, current version, model, and available flash space.
The code worked well in a lab environment, but testing it on the live devices throws the error below:
"Socket exception: An existing connection was forcibly closed by the remote host (10054)"
Logs on the Cisco router showed this: %SSH-4-SSH2_UNEXPECTED_MSG: Unexpected message type has arrived. Terminating the connection from x.x.x.x.
Code: (Truncated to improve readability)
import os
import subprocess, re, netmiko
from netmiko import ConnectHandler
from netmiko.ssh_exception import NetmikoTimeoutException
from netmiko.ssh_exception import SSHException
from netmiko.ssh_exception import AuthenticationException
from netmiko import SCPConn
from datetime import datetime, time
import time
import csv

#Cisco ISR Data
new_ios_ISR_size = "702197190"
#Cisco C800 Data
new_ios_c800_size = "97199776"
#Cisco C1900 Data
new_ios_c1900_size = "41735808"

###########################################################################
#Creating the CSV files for pre upgrade
#clearing the old data from the pre-upgrade CSV file and writing the headers
f = open("data/pre_upgrade.csv", "w+")
f.write("IP Address, Hostname, Uptime, Current_Version, Current_Image, Serial_Number, Device_Model, Device_Memory, Available_space, space")
f.write("\n")
f.close()

#clearing the old data from the logs file and writing the headers
f = open("data/logs.txt", "w+")
f.close()

now = datetime.now()
logs_time = now.strftime("%H:%M:%S")

#############################################################################################################################
def preupgrade():
    username = 'xxxxxxxxxx'
    password = 'xxxxxxxxx'
    with open('data/ip_address.csv', 'r') as file:
        reader = csv.reader(file)
        num = 0
        for row in reader:
            ip = row[0]
            num += 1
            print(ip, "Device " + str(num))
            device = {
                'device_type': 'cisco_ios',
                'host': ip,
                'username': username,
                'password': password
            }
            now = datetime.now()
            logs_time = now.strftime("%H:%M:%S")
            print ("" + logs_time + ": " + ip + " Logging in to device")
            f = open("data/logs.txt", "a")
            f.write("" + logs_time + ": " + ip + " Logging in to device " + "\n" )
            f.close()

            #handling exceptions errors
            try:
                net_connect = ConnectHandler(**device)
            except (NetmikoTimeoutException, AuthenticationException, SSHException, ValueError, TimeoutError, ConnectionError, OSError):
                now = datetime.now()
                logs_time = now.strftime("%H:%M:%S")
                print ("" + logs_time + ": " + ip + " device login issue ")
                f = open("data/logs.txt", "a")
                f.write("" + logs_time + ": " + ip + " device login issue " + "\n" + "\n" )
                f.close()
                continue

            #list where informations will be stored
            pre_upgrade_devices = []
            now = datetime.now()
            logs_time = now.strftime("%H:%M:%S")
            print ("" + logs_time + ": " + ip + " Log in Successfuul, Collecting device pre-upgrade report")
            f = open("data/logs.txt", "a")
            f.write("" + logs_time + ": " + ip + " Log in Successfuul, Collecting device pre-upgrade report " + "\n" )
            f.close()

            # execute show version on router and save output to output object
            sh_ver_output = net_connect.send_command('show version')
            #now = datetime.now()
            #logs_time = now.strftime("%H:%M:%S")
            #print ("" + logs_time + ": " + ip + " Checking the version ")

            #finding hostname in output using regular expressions
            regex_hostname = re.compile(r'(\S+)\suptime')
            hostname = regex_hostname.findall(sh_ver_output
Notes:
@Seb Rupik @Alexander Stevenson @dekwan
09-20-2021 07:07 AM - edited 09-20-2021 07:39 AM
Hi @Eseharrison88
Not sure how helpful this will be, or if it is what you want to hear, but in the spirit of setting expectations I wanted to share some lessons learned.
I've undertaken projects just like yours for some of my clients, including a global company with thousands of switches in each region, using both Python/Netmiko and Ansible.
I've never seen a 100% success rate (in a production environment of any significant size).
I have seen ~90% success for companies that are very rigorous about standard configurations, limit the number of models in their environment, and are extremely fastidious about code versions.
Even with all of that we still ran into issues:
Those are just off the top of my head.
Set up your scripts to log sessions and be ready to log a lot! Logging the sessions I had issues with usually led me to the culprit. Log a successful session too, so you have a "known working" log to compare against.
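For what it's worth, Netmiko can record the raw session for you through its `session_log` connection argument. Below is a rough sketch of building the connection dictionary with one log file per device. The helper name `build_device`, the `logs` directory, and the address and credentials are placeholders of mine, not anything from the original script.

```python
from pathlib import Path


def build_device(ip, username, password, log_dir="logs"):
    """Return Netmiko connection parameters with a per-device session log.

    `build_device` and `log_dir` are illustrative names, not part of the
    original script.
    """
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    return {
        "device_type": "cisco_ios",
        "host": ip,
        "username": username,
        "password": password,
        # Netmiko writes the session transcript to this file.
        "session_log": str(Path(log_dir) / f"{ip}_session.log"),
    }


# Placeholder address and credentials, for illustration only.
device = build_device("192.0.2.10", "user", "secret")
print(device["session_log"])
```

The dictionary is then passed on as before with `ConnectHandler(**device)`, so a failed device leaves behind a transcript you can compare against a known-good one.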
Eventually you will get tired of how long this takes and look at multi-threading so you can do more than one device at a time.
We kept it to 5-10 at a time because we were using an HTTP server on my laptop. If you can get a "real" HTTPS server you should be able to scale better.
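As a rough illustration of capping concurrency, here is a minimal sketch using the standard library's `ThreadPoolExecutor`. The worker `upgrade_device` is a hypothetical stand-in for whatever your per-device routine ends up being, and `max_workers=5` simply mirrors the low end of the 5-10 range mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upgrade_device(ip):
    """Placeholder for the per-device work (login, checks, transfer).

    Hypothetical stand-in: a real worker would open the Netmiko session here.
    """
    return f"{ip}: done"


def run_in_batches(ips, max_workers=5):
    """Run upgrade_device on every IP, at most max_workers at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upgrade_device, ip): ip for ip in ips}
        for future in as_completed(futures):
            ip = futures[future]
            try:
                results[ip] = future.result()
            except Exception as exc:  # one failed device should not stop the rest
                results[ip] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    print(run_in_batches(["192.0.2.1", "192.0.2.2", "192.0.2.3"]))
```

Collecting results per IP keeps the failures in one place, which helps when you go back to chase the devices that did not cooperate.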
I know I've mentioned lots of issues, but don't be discouraged. Even at an 80% success rate, that's 400 devices you didn't have to upgrade manually. I'd call that a big WIN and certainly worth your effort! Eventually you will find the root cause on the remaining 100 and be able to either fix it or address it in some way... and they are often things that needed fixing, so your network (and your next upgrade) is better off.
Expect that you will be spending more time on the code to check things than the code that actually does the file transfer and reload.
We both started in exactly the same way: do I have enough space on the flash to transfer the new file?
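A minimal sketch of that flash check, assuming the `dir flash:` output ends with the usual `... bytes total (N bytes free)` line. Output varies by platform, so treat the regex as a starting point. The function names and the sample line are made up for illustration; the two image sizes are the ones from the original script.

```python
import re


def flash_bytes_free(dir_output):
    """Extract the free-byte count from `dir flash:` output, or None."""
    match = re.search(r"\((\d+) bytes free\)", dir_output)
    return int(match.group(1)) if match else None


def has_room(dir_output, image_size, margin=0):
    """True if free flash covers the image size plus an optional margin."""
    free = flash_bytes_free(dir_output)
    return free is not None and free >= image_size + margin


# Illustrative sample line, not captured from a real device.
sample = "256487424 bytes total (120000000 bytes free)"
print(has_room(sample, 97199776))   # True
print(has_room(sample, 702197190))  # False
```

Returning `None` when the pattern is missing lets the caller treat "could not parse" as a failure instead of silently assuming there is space.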
If it were easy it would be boring, right?
Good luck & Happy Coding!
09-19-2021 07:14 AM
@Eseharrison88 if this works and then does not work, chances are it is not your code. See this thread here --> https://networkengineering.stackexchange.com/questions/45168/cisco-ssh-disconnect
Hope this helps.
09-20-2021 04:10 AM
Thanks @bigevilbeard for your reply.
I've checked the thread and it would seem it's more of a bug on the device, which could be fixed by upgrading the IOS. However, doing this manually defeats the purpose of the script itself.
09-20-2021 01:43 AM
Hi there,
I would agree with @bigevilbeard: if the contents of your script do not change from day to day, but the result of running it against a device does, then point the finger of blame at the device. Do the production devices and lab devices run the same software versions?
Also, I couldn't help noticing the three variables relating to image file size at the top of your script. If you want to validate an image after it is copied to a device, it would be better to check its MD5 or SHA hash.
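A rough sketch of the hash comparison: compute the MD5 of the local image with `hashlib` and compare it with the hash the device reports. It assumes you run `verify /md5` on the device and that the hash appears after an `=` sign in the output. All function names here are illustrative.

```python
import hashlib
import re


def local_md5(path, chunk_size=1024 * 1024):
    """MD5 of a local file, read in chunks so large images are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def device_md5(verify_output):
    """Pull the 32-hex-digit hash out of `verify /md5` output, or None."""
    match = re.search(r"=\s*([0-9a-fA-F]{32})\b", verify_output)
    return match.group(1).lower() if match else None


def image_matches(path, verify_output):
    """True only if the device reported a hash and it equals the local one."""
    remote = device_md5(verify_output)
    return remote is not None and remote == local_md5(path)
```

In a script you would feed `image_matches` the text returned by something like `net_connect.send_command("verify /md5 flash:<image>")`. That command can take a while on large images, so the read timeout may need raising.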
cheers,
Seb.
09-20-2021 04:16 AM - edited 09-20-2021 04:17 AM
Hi Seb,
The image-size variables are for the pre-upgrade stage, just to verify that the devices have enough available space on flash. I'll be sure to include the MD5 checksum in the post-upgrade script run. Thanks.
Both the production and test devices are running the same image, and both return the same error at intervals.