on 12-08-2015 01:00 PM
Introduction
The nV Edge feature logically clusters two ASR9K chassis of the same type so that they act as a single system with one control plane. (For more information on clusters, see asr9k-nv-edge-deployment-guide.)
This guide describes the available upgrade options and provides examples.
Upgrade Options
With a cluster there are a few options for upgrading:
1. Upgrading as normal
This treats the cluster as a single logical node: the new image is applied to both racks at the same time and the cluster is then reloaded. The whole cluster is therefore down during the reload and upgraded in one step.
2. Upgrading using the nV Script
This is an expect script that isolates one rack, upgrades it, brings it back online, and orchestrates a failover so that forwarding stays active with a minimal outage. The script orchestrates the link shutdowns and manages the upgrade, allowing near-zero packet loss and minimal disruption from topology loss.
3. Upgrading manually
This takes one rack down at a time for the upgrade, as in option 2, but every step is performed manually without the script.
We highly recommend using option 1 or 2.
SMUs follow the same pattern: a reload SMU can be applied via the script or as a normal upgrade, a process-restart SMU affects both chassis, and a hitless SMU remains hitless.
Special note: A cluster does not support any form of ISSU, including ISSU SMUs.
Note: ISSU SMUs on a single chassis were discontinued starting with XR 5.2.0.
Option 1: Upgrading as Normal
This is just like any other upgrade: install add, install activate, and install commit. The only difference is that when the router reloads, both chassis reload.
Option 2: Upgrading using the nV Script
This method uses an off-box script to minimize downtime. When combined with the mandatory cluster configurations (see Feature_configuration_caveats), the downtime is minimal.
It is highly recommended to test upgrades with the script before going live. Because of differences in terminal software and Linux versions, the script may need further tweaking.
The following document explains how to get the script and edit it for a basic run:
http://www.cisco.com/c/en/us/support/docs/routers/asr-9000-series-aggregation-services-routers/117643-config-nvedge-00.html
Common Script Issues
Although the script usually works off-box without changes, there are a few known cases where customization is necessary.
Issue #1: Timer Expires
Depending on the environment (for example, how many packages are loaded), an upgrade can take longer than the script expects. In these cases the timeouts may need to be increased.
In the logs the timer is seen reaching 0 seconds.
To remedy this, find the point where the timer expired, typically a ‘wait_for’ clause such as ‘wait_for 30 "CONFIG COMMIT"’, and increase its value.
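As a rough sketch of this kind of adjustment, the waits could be scaled from one place instead of being edited one by one. The wait_scale variable and the scaled_wait helper below are hypothetical names that do not exist in the original script; wait_for is the procedure defined in Appendix A.

```tcl
# Hypothetical tuning knob (not in the original script): 1.0 keeps the
# stock timers, 2.0 doubles every wait that goes through scaled_wait.
set wait_scale 2.0

# Return base_seconds multiplied by wait_scale, rounded up to a whole
# multiple of 10. wait_for counts down with "incr sleep_time -10", and
# incr only accepts integers, so the result must not be a float.
proc scaled_wait { base_seconds } {
    global wait_scale
    return [expr {int(ceil($base_seconds * $wait_scale / 10.0)) * 10}]
}

# Intended use inside the upgrade script:
#   wait_for [scaled_wait 30] "CONFIG COMMIT"
puts [scaled_wait 30]
```

With this approach a slow environment only needs wait_scale changed, while the individual wait_for calls keep their original base values.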
Issue #2: Terminal Server Disconnect
With this issue the script errors out and exits after one of the racks reloads, and the script is terminated completely. A quick fix is to anticipate the disconnect: disconnect manually, wait on a sleep timer, and reconnect later. This prevents the error from occurring.
Insert right after the install activate command (the example is for rack 0; in the rack 1 section use the rack 1 variables and set connected_rack to 1):
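# Leave the CLI and drop the telnet session before the rack reloads
exp_send "exit\r"
router_disconnect
# Wait 750 seconds for the reload, reporting progress every 10 seconds
set sleeptimer 750
while {$sleeptimer > 0} {
send_user "Waiting $sleeptimer seconds before login\n"
sleep 10
incr sleeptimer -10
}
# Reconnect to the reloaded rack and disable paging again
router_connect $rack0_addr $rack0_port $rack0_stby_addr $rack0_stby_port
set connected_rack 0
send -- "terminal length 0\r"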
Issue #3: Install Commit Fails
In some cases a line card (LC) takes longer to boot, and install commit will not work until the card has finished booting.
To remedy this, insert a simple loop that waits and tries again.
set failedcommit 0
send -- "admin install commit\r"
# Wait until the router reports the result of the first commit attempt
while {1} {
expect {
-exact "failed" {
set failedcommit 1
break
}
-exact "completed" {
break
}
}
}
# Retry once after 30 seconds if the first attempt failed
if {$failedcommit == 1} {
send_user "Wait 30s to try install commit again\n"
sleep 30
send -- "admin install commit\r"
while {1} {
expect {
-exact "failed" {
send_user "Install commit failed twice\n"
break
}
-exact "completed" {
break
}
}
}
}
sleep 5
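The wait-and-retry pattern above could also be factored into a small helper so that the number of attempts and the delay are set in one place. This is only a sketch: retry_command and fake_commit are hypothetical names that are not part of the original script, and fake_commit merely stands in for a real commit attempt so that the example runs without a router.

```tcl
# Hypothetical helper (not in the original script): run cmd up to
# max_tries times, pausing delay_seconds between attempts.
# cmd must return 1 on success and 0 on failure.
# Returns the attempt number that succeeded, or 0 if every attempt failed.
proc retry_command { max_tries delay_seconds cmd } {
    for {set attempt 1} {$attempt <= $max_tries} {incr attempt} {
        if {[uplevel #0 $cmd]} {
            return $attempt
        }
        if {$attempt < $max_tries} {
            # "after" takes integer milliseconds and blocks the script
            after [expr {int($delay_seconds * 1000)}]
        }
    }
    return 0
}

# Stand-in for a real commit attempt: fails once, then succeeds.
# In the upgrade script this procedure would instead send
# "admin install commit\r" and use expect to return 0 on "failed"
# and 1 on "completed".
set attempts 0
proc fake_commit { } {
    global attempts
    incr attempts
    return [expr {$attempts >= 2}]
}

puts "succeeded on attempt [retry_command 3 1 fake_commit]"
```

A return value of 0 tells the caller that every attempt failed, so the script can decide whether to stop or continue.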
Issue #4: Install Activate Fails
By default, when install activate fails, the script carries on because it does not check whether the activate succeeded. Adding the code below remedies this by returning (exiting the script) when the activate fails.
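set failedinstall 0
rack1_cmd "admin install activate $image_list parallel-reload prompt none"
# Wait until the router reports whether the activate failed or completed
while {1} {
expect {
-exact "failed" {
set failedinstall 1
break
}
-exact "completed" {
break
}
}
}
# Stop the script instead of continuing with a failed activate
if {$failedinstall == 1} { return }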
Appendix A
This appendix contains the updated version of the script with fixes for all of the previously mentioned issues.
Note that not all environments will need every fix.
#!/usr/bin/expect -f
# ---------------------------------------------------------------------
# nv_edge_upgrade.exp - Rack-by-rack upgrade method for ASR9k Cluster Systems
#
# Copyright (c) 2012-2013 by cisco Systems, Inc.
# All rights reserved.
#--------------------------------------------------------------------
#
# Usage Notes:
#
# This script is intended to run on a third-party system capable of CLI
# access via Telnet. This is a sample intended to show how a
# rack-by-rack upgrade of the cluster may be accomplished using the
# set of CLI commands available, and requires customization on a per
# router basis.
#
# It is HIGHLY RECOMMENDED that a backup of the router configuration is
# made before attempting the software upgrade.
#
# At a minimum the variables below must be customized to the target
# router.
# 1) rack0_addr, rack0_port - This is the telnet information for
# the Nv-Edge Rack 0 chassis.
# 2) rack1_addr, rack1_port - This is the telnet information for
# the Nv-Edge Rack 1 chassis.
# 3) router_username / router_password - The login information to
# be used to conduct the software upgrade
# 4) image_list - Space delimited list of the pre-loaded packages on
# the router to be activated. These may be pies of any sort,
# including full system images or SMU's.
# 5) irl_list - The list of configured IRL interfaces on the cluster.
# For Ironman (ASR-9001), don't set standby address/port info.
# The timing of the CLI commands injected by the script is based on a
# typical cluster system scale. These may need to be adjusted for larger
# scale systems. During any wait operation of the script, the user may
# manually extend the current timer wait via the "+" key.
#
set rack0_addr "172.18.230.244"
set rack0_port "2041"
set rack0_stby_addr ""
set rack0_stby_port ""
set rack1_addr "172.18.226.115"
set rack1_port "2067"
set rack1_stby_addr ""
set rack1_stby_port ""
set router_username "lab"
set router_password "lab"
set image_list "disk0:asr9k-mini-px-5.2.2 disk0:asr9k-9000v-nV-px-5.2.2 disk0:asr9k-mgbl-px-5.2.2 disk0:asr9k-mpls-px-5.2.2"
set irl_list {{TenGigE0/0/2/3} {TenGigE1/0/2/3} }
# ----------------------------- -----------------------------
set debug_mode 0
# ----------------------------- -----------------------------
set timeout 30
set connected_rack -1
set RSP "UNK"
set im_chassis 0
proc router_connect { router_addr router_port stby_addr stby_port } {
global debug_mode
global spawn_id
global router_username
global router_password
global im_chassis
if {$debug_mode == 1} { return }
spawn telnet $router_addr $router_port
match_max 100000
sleep 1
send -- "\r"
while {1} {
expect {
-exact "Username: " {
send_user "Matched Username\n"
send -- $router_username
send -- "\r"
}
-exact "Password: " {
send_user "Matched Password\n"
send -- $router_password
send -- "\r"
}
-exact "(config)" {
send -- "end\r"
}
-exact "(admin-config)" {
send -- "end\r"
}
-exact "(admin)" {
send -- "exit\r"
}
-exact "#" {
send_user "Matched Prompt\n"
send -- "terminal length 0\r"
break
}
-exact "not ready or active" {
if { $im_chassis == 0 } {
send_user "Connected to Stby\n"
send -- "\35"
sleep 1
expect -exact "telnet> "
send -- "quit\r"
expect eof
spawn telnet $stby_addr $stby_port
match_max 100000
sleep 1
send -- "\r"
} else {
send_user "Node not ready.\n"
send -- "\35"
sleep 1
expect -exact "telnet> "
send -- "quit\r"
expect eof
}
}
timeout {
send_user "Timeout"
send -- "\r"
}
}
}
}
proc router_disconnect { } {
global debug_mode
global connected_rack
if {$debug_mode == 1} { return }
send -- "\35"
sleep 1
expect -exact "telnet> "
send -- "quit\r"
expect eof
set connected_rack -1
sleep 5
}
proc send_command { inString } {
global debug_mode
if {$debug_mode == 1} {
send_user $inString
send_user "\n"
return
}
send -- "\r"
expect -exact "#"
send -- $inString
send -- "\r"
sleep 1
expect -exact "#"
}
proc rack0_cmd { inString } {
global debug_mode
global connected_rack
global rack0_addr
global rack0_port
global rack0_stby_addr
global rack0_stby_port
if {$debug_mode == 0} {
if {$connected_rack == 1} {
router_disconnect
}
if {$connected_rack == -1} {
router_connect $rack0_addr $rack0_port $rack0_stby_addr $rack0_stby_port
set connected_rack 0
}
}
send_command $inString
}
proc rack1_cmd { inString } {
global debug_mode
global connected_rack
global rack1_addr
global rack1_port
global rack1_stby_addr
global rack1_stby_port
if {$debug_mode == 0} {
if {$connected_rack == 0} {
router_disconnect
}
if {$connected_rack == -1} {
router_connect $rack1_addr $rack1_port $rack1_stby_addr $rack1_stby_port
set connected_rack 1
}
}
send_command $inString
}
proc wait_for { waitTime waitReason } {
global debug_mode
if {$debug_mode == 1} { return }
set sleep_time $waitTime
while {$sleep_time > 0} {
set time_out 10
send_user -- "--- WAITING FOR $waitReason $sleep_time SECONDS (~~ to abort / + to add time) ---\n"
interact {
timeout $time_out { incr sleep_time -10 ; return }
+ {incr sleep_time 30 ; return }
~~ {set sleep_time 0 ; break }
}
}
send_user -- "\n"
exp_send "exit\r"
exp_send "exit\r"
exp_send "exit\r"
router_disconnect
}
# ----------------------------- -----------------------------
# ----------------------------- -----------------------------
## USAGE WARNING
puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
puts "This CLI Script performs a software upgrade on"
puts "an ASR9k Nv Edge system, using a rack-by-rack"
puts "parallel reload method. This script will modify"
puts "the configuration of the router, and will incur"
puts "traffic loss."
puts "\nDo you wish to continue \[y/n\]"
set user_confirm 0
set user_input [gets stdin]
if {[regexp \[Yy\] $user_input] == 1} {
set user_confirm 1
}
# Abort unless the user confirmed
if {$user_confirm != 1} { exit }
### CONSOLE SETUP
# Disable logging / pauses to avoid bad script interactions
rack0_cmd "terminal length 0"
rack0_cmd "config"
rack0_cmd "log console disable"
rack0_cmd "commit"
rack0_cmd "exit"
rack0_cmd "admin config"
rack0_cmd "nv edge data"
rack0_cmd "minimum 1 specific rack 1"
rack0_cmd "commit"
rack0_cmd "exit"
rack0_cmd "exit"
rack0_cmd "exit"
rack0_cmd "exit"
## Determine RSP vs. RP card type
send -- "admin show dsc\r"
if {$debug_mode != 1} {
expect {
-exact "RSP" {
set RSP "RSP"
}
-exact "RP" {
set RSP "RP"
}
}
}
## Determine if this is Ironman
send -- "admin show inventory chassis\r"
if {$debug_mode != 1} {
expect {
-exact "ASR-9001" {
set im_chassis 1
}
}
}
rack0_cmd "show platform"
rack0_cmd [format "show nv edge data forwarding loc 0/%s0/CPU0" $RSP]
### RACK 1 SHUTDOWN
# Kill traffic on Rack 1
# Segment the Cluster into individual chassis
rack0_cmd "show install inactive summary"
rack0_cmd [format "show nv edge data forwarding loc 0/%s0/CPU0" $RSP]
# Disable IRL Links
rack0_cmd "config"
foreach irl $irl_list {
rack0_cmd "interface $irl"
rack0_cmd "shut"
}
rack0_cmd "commit"
rack0_cmd "end"
rack0_cmd "show error-disable"
rack0_cmd "admin config"
rack0_cmd [format "nv edge control control-link disable 0 loc 0/%s0/CPU0" $RSP]
rack0_cmd [format "nv edge control control-link disable 0 loc 1/%s0/CPU0" $RSP]
rack0_cmd [format "nv edge control control-link disable 1 loc 0/%s0/CPU0" $RSP]
rack0_cmd [format "nv edge control control-link disable 1 loc 1/%s0/CPU0" $RSP]
if { $im_chassis == 0 } {
rack0_cmd [format "nv edge control control-link disable 0 loc 0/%s1/CPU0" $RSP]
rack0_cmd [format "nv edge control control-link disable 1 loc 0/%s1/CPU0" $RSP]
rack0_cmd [format "nv edge control control-link disable 0 loc 1/%s1/CPU0" $RSP]
rack0_cmd [format "nv edge control control-link disable 1 loc 1/%s1/CPU0" $RSP]
}
rack0_cmd "commit"
rack0_cmd "exit"
rack0_cmd "admin show dsc"
rack0_cmd "exit"
rack1_cmd "admin show dsc"
rack1_cmd "exit"
## Splitting the cluster to single nodes can take anywhere from 3 minutes to
## 10 minutes for the inventory cleanup to occur. Before this, config changes will
## fail.
##
## Polling on any CLI that uses the "location" parameter confirms the
## inventory split is done.
while {1} {
wait_for 90 "CLUSTER SEGMENT"
router_connect $rack1_addr $rack1_port $rack1_stby_addr $rack1_stby_port
set connected_rack 1
send -- "terminal length 0\r"
send -- [format "show nv edge control control-link-protocol loc 1/%s0/CPU0\r" $RSP]
if {$debug_mode == 1} { break }
expect {
-exact "Active Priority" {
break
}
}
}
### RACK 1 ACTIVATE
rack1_cmd "show error-disable"
rack1_cmd "admin config"
rack1_cmd [format "no nv edge control control-link disable 0 loc 1/%s0/CPU0" $RSP]
rack1_cmd [format "no nv edge control control-link disable 1 loc 1/%s0/CPU0" $RSP]
if { $im_chassis == 0 } {
rack1_cmd [format "no nv edge control control-link disable 0 loc 1/%s1/CPU0" $RSP]
rack1_cmd [format "no nv edge control control-link disable 1 loc 1/%s1/CPU0" $RSP]
}
rack1_cmd "commit"
rack1_cmd "exit"
## Sometimes the commit doesn't clear the variables?
rack1_cmd [format "run on -f node1_%s0_CPU0 nvram_rommonvar CLUSTER_0_DISABLE 0" $RSP]
rack1_cmd [format "run on -f node1_%s0_CPU0 nvram_rommonvar CLUSTER_1_DISABLE 0" $RSP]
if { $im_chassis == 0 } {
rack1_cmd [format "run on -f node1_%s1_CPU0 nvram_rommonvar CLUSTER_0_DISABLE 0" $RSP]
rack1_cmd [format "run on -f node1_%s1_CPU0 nvram_rommonvar CLUSTER_1_DISABLE 0" $RSP]
}
rack1_cmd "exit"
wait_for 30 "CONFIG COMMIT"
set failedinstall 0
rack1_cmd "admin install activate $image_list parallel-reload prompt none"
while {1} {
expect {
-exact "failed" {
set failedinstall 1
break
}
-exact "completed" {
break
}
}
}
if {$failedinstall == 1} { return }
exp_send "exit\r"
router_disconnect
set sleeptimer 750
while {$sleeptimer > 0} {
send_user "Waiting $sleeptimer seconds before login\n"
sleep 10
incr sleeptimer -10
}
router_connect $rack1_addr $rack1_port $rack1_stby_addr $rack1_stby_port
set connected_rack 1
send -- "terminal length 0\r"
set failedcommit 0
send -- "admin install commit\r"
while {1} {
expect {
-exact "failed" {
set failedcommit 1
break
}
-exact "completed" {
break
}
}
}
if {$failedcommit == 1} {
send_user "Wait 30s to try install commit again\n"
sleep 30
send -- "admin install commit\r"
while {1} {
expect {
-exact "failed" {
send_user "Install commit failed twice\n"
break
}
-exact "completed" {
break
}
}
}
}
sleep 5
rack1_cmd "admin config"
rack1_cmd [format "no nv edge control control-link disable 0 loc 0/%s0/CPU0" $RSP]
rack1_cmd [format "no nv edge control control-link disable 1 loc 0/%s0/CPU0" $RSP]
if {$im_chassis == 0 } {
rack1_cmd [format "no nv edge control control-link disable 0 loc 0/%s1/CPU0" $RSP]
rack1_cmd [format "no nv edge control control-link disable 1 loc 0/%s1/CPU0" $RSP]
}
rack1_cmd "commit"
rack1_cmd "exit"
rack1_cmd "admin config"
rack1_cmd "nv edge control serial single"
rack1_cmd "commit"
rack1_cmd "exit"
exp_send "exit\r"
router_disconnect
### CRITICAL FAILOVER PHASE
rack0_cmd "admin config"
rack0_cmd "nv edge data"
rack0_cmd "minimum 1 specific rack 0"
rack0_cmd "commit"
rack0_cmd "exit"
rack0_cmd "exit"
rack0_cmd "exit"
rack0_cmd "exit"
exp_send "exit\r"
rack1_cmd "admin config"
rack1_cmd "nv edge data"
rack1_cmd "minimum 1 specific rack 0"
rack1_cmd "commit"
rack1_cmd "exit"
rack1_cmd "exit"
rack1_cmd "exit"
rack1_cmd "exit"
exp_send "exit\r"
router_disconnect
### RACK 0 ACTIVATION
rack0_cmd [format "run on -f node0_%s0_CPU0 nvram_rommonvar CLUSTER_0_DISABLE 0" $RSP]
rack0_cmd [format "run on -f node0_%s0_CPU0 nvram_rommonvar CLUSTER_1_DISABLE 0" $RSP]
if { $im_chassis == 0 } {
rack0_cmd [format "run on -f node0_%s1_CPU0 nvram_rommonvar CLUSTER_0_DISABLE 0" $RSP]
rack0_cmd [format "run on -f node0_%s1_CPU0 nvram_rommonvar CLUSTER_1_DISABLE 0" $RSP]
}
set failedinstall 0
rack0_cmd "admin install activate $image_list parallel-reload prompt none"
while {1} {
expect {
-exact "failed" {
set failedinstall 1
break
}
-exact "completed" {
break
}
}
}
if {$failedinstall == 1} { return }
exp_send "exit\r"
router_disconnect
set sleeptimer 750
while {$sleeptimer > 0} {
send_user "Waiting $sleeptimer seconds before login\n"
sleep 10
incr sleeptimer -10
}
router_connect $rack0_addr $rack0_port $rack0_stby_addr $rack0_stby_port
set connected_rack 0
send -- "terminal length 0\r"
## Code to Enable IRL links goes HERE
rack0_cmd "config"
foreach irl $irl_list {
rack0_cmd "interface $irl"
rack0_cmd "no shut"
}
rack0_cmd "commit"
rack0_cmd "end"
rack0_cmd "terminal length 0"
rack0_cmd "show error-disable"
rack0_cmd "admin show platform"
rack0_cmd "admin show dsc"
rack0_cmd "admin config"
rack0_cmd "no nv edge control serial single"
rack0_cmd "nv edge data"
rack0_cmd "minimum 1 backup"
rack0_cmd "exit"
rack0_cmd "commit"
rack0_cmd "exit"
rack0_cmd "exit"
rack0_cmd "exit"
rack0_cmd "admin show platform"
rack0_cmd "admin show dsc"
rack0_cmd "exit"
Disclaimer:
nV Edge, while still supported, is not the longer-term ASR9K strategy from a development perspective. This is because of the new XR and ISSU capabilities in 6.1.0 and above. High-density chassis can also be used to increase the number of ports, which is the preferred approach going forward.
The nV Edge upgrade script is supported by a few passionate people and is not officially driven by TAC.
However, we do care about the people who use it and like it. This forum is meant to provide assistance and share findings so that the upgrade script remains usable.
Contributed to by:
Sam Milstead - Customer Support Engineer (HTTS)
Xander Thuijs - Principal Engineer (HERO BU)
This part under the rack 1 activate has the incorrect router connect string:
if {$failedinstall == 1} { return }
exp_send "exit\r"
router_disconnect
set sleeptimer 750
while {$sleeptimer > 0} {
send_user "Waiting $sleeptimer seconds before login"
sleep 10
incr sleeptimer -10
}
router_connect $rack0_addr $rack0_port $rack0_stby_addr $rack0_stby_port
set connected_rack 1
send -- "terminal length 0\r"
This causes the script to log into rack 0 and run the commands and forces rack 1 to reboot.
It needs to be the rack 1 router connect string.
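One way to make this kind of copy-and-paste mistake less likely is to derive the connect arguments from the rack number instead of typing the variable names by hand. The rack_connect_args helper below is a hypothetical sketch that is not part of the original script; the addresses are the sample values from Appendix A, and the {*} expansion in the usage comment assumes Tcl 8.5 or later.

```tcl
# Sample values copied from Appendix A; replace with your own
# terminal-server details.
set rack0_addr "172.18.230.244"
set rack0_port "2041"
set rack0_stby_addr ""
set rack0_stby_port ""
set rack1_addr "172.18.226.115"
set rack1_port "2067"
set rack1_stby_addr ""
set rack1_stby_port ""

# Hypothetical helper (not in the original script): return the four
# router_connect arguments for the given rack number (0 or 1).
proc rack_connect_args { rack } {
    if {$rack ne "0" && $rack ne "1"} {
        error "rack must be 0 or 1, got \"$rack\""
    }
    set params {}
    foreach field {addr port stby_addr stby_port} {
        # Read the global rackN_<field> variable by its constructed name
        lappend params [set ::rack${rack}_$field]
    }
    return $params
}

# Intended use after a reload, with the same rack number in both lines:
#   router_connect {*}[rack_connect_args 1]
#   set connected_rack 1
puts [rack_connect_args 1]
```

Because the rack number appears as a single value, the connect string and connected_rack are easier to keep in step.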