ASR9006 upgrade IOS XR fail

nicolay1974 · ‎09-24-2013

Hi

After upgrading an ASR9006 from 4.2.0 to 4.3.2, the line cards do not boot up properly.

I tried to upgrade the FPD on the line card, but had no success. What can I do now?

RP/0/RSP0/CPU0:LAB-9k-440#sh plat
Tue Sep 24 17:53:20.083 UTC
Node            Type                      State            Config State
-----------------------------------------------------------------------------
0/RSP0/CPU0     A9K-RSP440-TR(Active)     IOS XR RUN       PWR,NSHUT,MON
0/0/CPU0        A9K-40GE-L                BRINGDOWN        PWR,NSHUT,MON
0/2/CPU0        A9K-MOD80-TR              IN-RESET         PWR,NSHUT,MON

RP/0/RSP0/CPU0:LAB-9k-440(admin)#upgrade hw-module fpd rommon location 0/0/CPU0

Tue Sep 24 17:57:56.274 UTC

***** UPGRADE WARNING MESSAGE: *****
* This upgrade operation has a maximum timout of 160 minutes. *
* If you are executing the cmd for one specific location and *
* card in that location reloads or goes down for some reason *
* you can press CTRL-C to get back the RP's prompt. *
* If you are executing the cmd for _all_ locations and a node *
* reloads or is down please allow other nodes to finish the *
* upgrade process before pressing CTRL-C. *

% RELOAD REMINDER:
- The upgrade operation of the target module will not interrupt its normal
  operation. However, for the changes to take effect, the target module
  will need to be manually reloaded after the upgrade operation. This can
  be accomplished with the use of "hw-module <target> reload" command.
- If automatic reload operation is desired after the upgrade, please use
  the "reload" option at the end of the upgrade command.
- The output of "show hw-module fpd location" command will not display
  correct version information after the upgrade if the target module is
  not reloaded.

NOTE: Chassis CLI will not be accessible while upgrade is in progress.

Continue? [confirm]

smilstea · ‎09-24-2013

Hi,

FPDs cannot be updated when a card is booting.

BRINGDOWN just means the card is reloading.

IN-RESET means the card has failed to boot too many times, so the system has disabled it. You can get out of this state via manual intervention, such as the hw-mod reload command.

What is the highest node state the card reaches before resetting? Do the cards hit present, rommon, mbi-boot, mbi-run, or xr-run?

This will help to determine what other commands to look at and why the cards do not boot up all the way.

Can you also send the output of 'show log' snipped for card related messages? Something like 'show log | i 0/2/CPU0'

Thanks,

Sam

nicolay1974 · ‎09-25-2013

Hi Sam,

Thank you for your attention.

Node 0/2/CPU0 passed through the following states:
MBI-BOOT => MBI-RUN => IOS XR PREP => ROMMON => MBI-BOOT => IN-RESET

Node 0/0/CPU0 passed through the following states:
PRESENT => ROMMON => BRINGDOWN

Log:

RP/0/RSP0/CPU0:Sep 26 15:22:08.813 : config[65744]: %MGBL-SYS-5-CONFIG_I : Configured from console by admin

RP/0/RSP0/CPU0:Sep 26 15:23:10.636 : shelfmgr[389]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/0/CPU0 A9K-40GE-L state:PRESENT

RP/0/RSP0/CPU0:Sep 26 15:23:10.692 : config[65861]: %MGBL-CONFIG-6-DB_COMMIT_ADMIN : Configuration committed by user 'admin'. Use 'show configuration commit changes 2000000016' to view the changes.

RP/0/RSP0/CPU0:Sep 26 15:23:12.855 : config[65861]: %MGBL-SYS-5-CONFIG_I : Configured from console by admin

RP/0/RSP0/CPU0:Sep 26 15:25:40.912 : shelfmgr[389]: %PLATFORM-SHELFMGR-3-FSMTIMEOUT_RESET : Node 0/0/CPU0 is reset due to failed bootup. Node state was: 1 Timeout ID: 10

RP/0/RSP0/CPU0:Sep 26 15:25:40.935 : canb-server[150]: %PLATFORM-CANB_SERVER-7-CBC_PRE_RESET_NOTIFICATION : Node 0/0/CPU0 , Power Cycle (0x05000000)

RP/0/RSP0/CPU0:Sep 26 15:25:40.935 : shelfmgr[389]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/0/CPU0 A9K-40GE-L state:ROMMON

RP/0/RSP0/CPU0:Sep 26 15:28:11.214 : shelfmgr[389]: %PLATFORM-SHELFMGR-3-FSMTIMEOUT_RESET : Node 0/0/CPU0 is reset due to failed bootup. Node state was: 3 Timeout ID: 10

RP/0/RSP0/CPU0:Sep 26 15:28:11.236 : canb-server[150]: %PLATFORM-CANB_SERVER-7-CBC_PRE_RESET_NOTIFICATION : Node 0/0/CPU0 , Power Cycle (0x05000000)

RP/0/RSP0/CPU0:Sep 26 15:28:11.237 : shelfmgr[389]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/0/CPU0 A9K-40GE-L state:BRINGDOWN

RP/0/RSP0/CPU0:Sep 26 15:28:11.238 : invmgr[255]: %PLATFORM-INV-6-NODE_STATE_CHANGE : Node: 0/0/CPU0, state: BRINGDOWN

RP/0/RSP0/CPU0:Sep 26 15:30:41.513 : shelfmgr[389]: %PLATFORM-SHELFMGR-3-FSMTIMEOUT_RESET : Node 0/0/CPU0 is reset due to failed bootup. Node state was: 7 Timeout ID: 10

RP/0/RSP0/CPU0:Sep 26 15:30:41.537 : canb-server[150]: %PLATFORM-CANB_SERVER-7-CBC_PRE_RESET_NOTIFICATION : Node 0/0/CPU0 , Power Cycle (0x05000000)
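For anyone who wants to answer the "highest state reached" question from a saved log rather than by eye, here is a minimal Python sketch. It assumes the log was saved to text in the same format as the shelfmgr lines above; the function name `state_timeline` is just an illustration, not a Cisco tool.

```python
import re

# Matches shelfmgr NODE_STATE_CHANGE lines in the format shown in this thread.
STATE_RE = re.compile(
    r"(?P<ts>\w{3} +\d+ [\d:.]+) : shelfmgr\[\d+\]: "
    r"%PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : "
    r"(?P<node>\S+) (?P<card>\S+) state:(?P<state>.+)$"
)

def state_timeline(log_text, node):
    """Return [(timestamp, state), ...] for one node, in log order."""
    timeline = []
    for line in log_text.splitlines():
        m = STATE_RE.search(line.strip())
        if m and m.group("node") == node:
            timeline.append((m.group("ts"), m.group("state").strip()))
    return timeline

# Three lines copied from the log above.
SAMPLE = """\
RP/0/RSP0/CPU0:Sep 26 15:23:10.636 : shelfmgr[389]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/0/CPU0 A9K-40GE-L state:PRESENT
RP/0/RSP0/CPU0:Sep 26 15:25:40.935 : shelfmgr[389]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/0/CPU0 A9K-40GE-L state:ROMMON
RP/0/RSP0/CPU0:Sep 26 15:28:11.237 : shelfmgr[389]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/0/CPU0 A9K-40GE-L state:BRINGDOWN
"""

for ts, state in state_timeline(SAMPLE, "0/0/CPU0"):
    print(ts, state)
```

The sketch only looks at shelfmgr state changes; the FSMTIMEOUT_RESET lines in between would need their own pattern.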

I attached the log file for node 0/2/CPU0.

Nicolay.

PS: I did my upgrade following this link:

http://www.cisco.com/web/Cisco_IOS_XR_Software/pdf/ASR9000_Upgrade_Procedure_432.pdf

smilstea · ‎10-02-2013

Hi Nicolay,

Sorry for the delay, I was on vacation until today.

The following logs are of interest to me, mostly the NP init failure.

lda_server[65]: %L2-SPA-5-STATE_CHANGE : SPA in bay 0 type A9K-MPA-4x10GE Initing

LC/0/2/CPU0:Sep 25 15:33:37.422 : prm_server_ty[303]: %PLATFORM-NP-0-INIT_ERR : In spite of 3 Cold restarts, NP init unsuccessful...exitting!!

LC/0/2/CPU0:Sep 25 15:33:38.502 : sysmgr[91]: %OS-SYSMGR-3-ERROR : prm_server_ty(1) (jid 303) exited, will be respawned with a delay (slow-restart)

LC/0/2/CPU0:Sep 25 15:33:38.501 : sysmgr[91]: prm_server_ty(1) (jid 303) (pid 524413) (fail_count 2) abnormally terminated, restart scheduled

LC/0/2/CPU0:Sep 25 15:33:38.504 : sysmgr[91]: %OS-SYSMGR-3-ERROR : prm_server_ty(303) (fail count 2) will be respawned in 10 seconds

LC/0/2/CPU0:Sep 25 15:33:38.504 : sysmgr[91]: %OS-SYSMGR-7-DEBUG : prm_server_ty[303] (pid 524413) has not sent proc-ready within 45 seconds

LC/0/2/CPU0:Sep 25 15:33:48.484 : pifibm_server_lc[292]: %OS-PLATFORM_LPTS_PIFIB-7-ERR_CONN_INIT : Failed to connect to PRM sever: Improper link

LC/0/2/CPU0:Sep 25 15:33:48.655 : sysmgr[91]: %OS-SYSMGR-3-ERROR : inline_service_proc(1) (jid 209) exited, will be respawned with a delay (slow-restart)

LC/0/2/CPU0:Sep 25 15:33:48.659 : sysmgr[91]: %OS-SYSMGR-3-ERROR : inline_service_proc(209) (fail count 1) will be respawned in 10 seconds

LC/0/2/CPU0:Sep 25 15:33:48.651 : dumper[56]: %OS-DUMPER-7-DUMP_REQUEST : Dump request for process pkg/bin/pifibm_server_lc

LC/0/2/CPU0:Sep 25 15:33:48.662 : sysmgr[91]: %OS-SYSMGR-7-DEBUG : inline_service_proc(1) (jid 209) did not signal end of initialization

LC/0/2/CPU0:Sep 25 15:33:48.653 : sysmgr[91]: inline_service_proc(1) (jid 209) (pid 524400) (fail_count 1) abnormally terminated, restart scheduled

LC/0/2/CPU0:Sep 25 15:33:48.727 : pm[294]: %PLATFORM-VKG_PM-3-ERROR_INIT : PM: initialization error encountered, reason=failed to initialize prm stats, pm exits!

LC/0/2/CPU0:Sep 25 15:33:48.941 : sysmgr[91]: pm(1) (jid 294) (pid 524371) (fail_count 1) abnormally terminated, restart scheduled

LC/0/2/CPU0:Sep 25 15:33:48.941 : sysmgr[91]: %OS-SYSMGR-3-ERROR : pm(1) (jid 294) exited, will be respawned with a delay (slow-restart)

LC/0/2/CPU0:Sep 25 15:33:48.942 : sysmgr[91]: %OS-SYSMGR-3-ERROR : pm(294) (fail count 1) will be respawned in 10 seconds

LC/0/2/CPU0:Sep 25 15:33:48.998 : fib_mgr[176]: %PLATFORM-PLAT_FIB_HAL-3-ERR_INFO : fib HAL failed to initialize engine hardware : 18 : pkg/bin/fib_mgr : (PID=524398) : -Traceback= 4db19210 4d8f2b4c 40003f38 40001da4 4ba73a44 4ba71554 400003f0 4000211c 40003078 400000e4 40172470

LC/0/2/CPU0:Sep 25 15:33:49.003 : fib_mgr[176]: %ROUTING-FIB-2-INIT : FIB initialization failed on this node. Reason: Platform init returned hard error. Decoded error reason: Improper link

LC/0/2/CPU0:Sep 25 15:33:49.162 : sysmgr[91]: fib_mgr(1) (jid 176) (pid 524398) (fail_count 1) abnormally terminated, restart scheduled

LC/0/2/CPU0:Sep 25 15:33:49.163 : sysmgr[91]: %OS-SYSMGR-3-ERROR : fib_mgr(1) (jid 176) exited, will be respawned with a delay (slow-restart)

LC/0/2/CPU0:Sep 25 15:33:49.164 : sysmgr[91]: %OS-SYSMGR-3-ERROR : fib_mgr(176) (fail count 1) will be respawned in 10 seconds

LC/0/2/CPU0:Sep 25 15:33:49.164 : sysmgr[91]: %OS-SYSMGR-7-DEBUG : fib_mgr(1) (jid 176) did not signal end of initialization

LC/0/2/CPU0:Sep 25 15:33:49.324 : prm_server_ty[303]: %PLATFORM-NP-0-INIT_ERR : In spite of 3 Cold restarts, NP init unsuccessful...exitting!!

LC/0/2/CPU0:Sep 25 15:33:49.655 : ipv6_mfwd_partner[245]: %ROUTING-IPV4_MFWD-3-ERR_MLIB_INIT : Failed to initialize Multicast Library Improper link
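A rough way to see why the NP init failure stands out among the messages above: IOS XR message codes carry a syslog severity digit (0 is the most severe, 7 is debug), so sorting by that digit brings `%PLATFORM-NP-0-INIT_ERR` to the top. The Python sketch below assumes the log has been saved as text; `most_severe` is a hypothetical helper name, not part of any Cisco tooling.

```python
import re

# %CATEGORY-GROUP-SEVERITY-CODE, e.g. %PLATFORM-NP-0-INIT_ERR
MSG_RE = re.compile(
    r"%(?P<category>[A-Z0-9_]+)-(?P<group>[A-Z0-9_]+)-(?P<sev>[0-7])-(?P<code>[A-Z0-9_]+)"
)

def most_severe(log_text):
    """Return (severity, message code, line) tuples, most severe first.

    sorted() is stable, so messages of equal severity keep their log order.
    """
    found = []
    for line in log_text.splitlines():
        m = MSG_RE.search(line)
        if m:
            found.append((int(m.group("sev")), m.group(0), line.strip()))
    return sorted(found, key=lambda item: item[0])

# Four lines copied from the log above.
SAMPLE = """\
LC/0/2/CPU0:Sep 25 15:33:37.422 : prm_server_ty[303]: %PLATFORM-NP-0-INIT_ERR : In spite of 3 Cold restarts, NP init unsuccessful...exitting!!
LC/0/2/CPU0:Sep 25 15:33:38.502 : sysmgr[91]: %OS-SYSMGR-3-ERROR : prm_server_ty(1) (jid 303) exited, will be respawned with a delay (slow-restart)
LC/0/2/CPU0:Sep 25 15:33:48.484 : pifibm_server_lc[292]: %OS-PLATFORM_LPTS_PIFIB-7-ERR_CONN_INIT : Failed to connect to PRM sever: Improper link
LC/0/2/CPU0:Sep 25 15:33:49.003 : fib_mgr[176]: %ROUTING-FIB-2-INIT : FIB initialization failed on this node. Reason: Platform init returned hard error. Decoded error reason: Improper link
"""

for sev, code, _ in most_severe(SAMPLE):
    print(sev, code)
```

Severity alone does not prove root cause, but here the later sysmgr, FIB and multicast errors all follow the NP failing to initialize.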

Can you open a TAC case for this?

This typically indicates a HW failure.

Thanks,

Sam

nicolay1974 · ‎10-09-2013

Hi Sam,

I have opened a TAC case and initiated the RMA procedure, but I still don't understand why this happened.

Did I need to remove the line card before the upgrade?

Thank you.

Nicolay.

smilstea · ‎10-15-2013

Hi Nicolay,

This basically means faulty HW. Based on the above logs, nothing you did caused it.

Thanks,

Sam

smailmilak · ‎01-16-2015

Hi,

We had a failure on one LC. Is this a SW or HW failure?

We are running ASR9010 with 4.3.1 and LC is A9K-8T-L.

Here are the logs:

LC/0/0/CPU0:Jan 15 03:56:40.775 : prm_server_tr[292]: %PLATFORM-NP-4-FAULT : prm_process_parity_tm_cluster: 1 Unrecoverable error(s) found. Reset NP4 now
LC/0/0/CPU0:Jan 15 03:56:42.858 : ipv4_mfwd_partner[230]: %ROUTING-IPV4_MFWD-4-FROM_MRIB_UPDATE : MFIB couldn't process update from MRIB : failed to create route 0xe0000000:(10.120.3.77,239.192.4.40/32) - 'asr9k-ipmcast' detected the 'warning' condition 'Platform MFIB: Platform Lib not ready; NP Not running'
LC/0/0/CPU0:Jan 15 03:56:52.185 : pfm_node_lc[282]: %PLATFORM-NP-0-TMB_CLUSTER_PARITY : Set|prm_server_tr[155731]|Network Processor Unit(0x1008004)|TMb cluster parity interrupt. Indicates an internal SRAM problem in TMb cluster, NP=4 memId=6, mask=0x2000000, PMask=0x2000000 SRAMLine=166 Rec=1 Rewr=1
LC/0/0/CPU0:Jan 15 03:56:52.187 : pfm_node_lc[282]: %PLATFORM-PFM-0-CARD_RESET_REQ : pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID: 155731 (prm_server_tr), Fault Sev: 0, Target node: 0/0/CPU0, CompId: 0x1f, Device Handle: 0x1008004, CondID: 1008, Fault Reason: TMb cluster parity interrupt. Indicates an internal SRAM problem in TMb cluster, NP=4 memId=6, mask=0x2000000, PMask=0x2000000 SRAMLine=166 Rec=1 Rewr=1
RP/0/RSP1/CPU0:Jan 15 03:56:52.380 : shelfmgr[394]: %PLATFORM-SHELFMGR-6-NODE_KERNEL_DUMP_EVENT : Node 0/0/CPU0 indicates it is doing a kernel dump.
RP/0/RSP1/CPU0:Jan 15 03:56:52.381 : shelfmgr[394]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/0/CPU0 A9K-8T-L state:IOS XR FAILURE
RP/0/RSP1/CPU0:Jan 15 03:56:52.384 : ospf[1011]: %ROUTING-OSPF-5-ADJCHG : Process 8000, Nbr 10.100.96.204 on TenGigE0/0/0/1 in area 0 from FULL to DOWN, Neighbor Down: BFD session down, vrf default vrfid 0x60000000
RP/0/RSP1/CPU0:Jan 15 03:56:52.397 : shelfmgr[394]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/0/CPU0 A9K-8T-L state:BRINGDOWN

xthuijs · ‎01-16-2015

To "close" on this and make sure that it is addressed, I am pasting comments from the other discussion on the same item:

Ah, this: PLATFORM-NP-4-FAULT : prm_process_parity_tm_cluster: 1 Unrecoverable error(s) found.

It means that NP number 4 on the line card in slot 0 incurred a memory parity error on the traffic manager portion of the NPU (the portion that handles queuing and scheduling). It could not correct that error and therefore decided to reinit and crash.

Generally with memory parity errors we always advise to catch it once, monitor it, and if it happens again, replace the card.

If you are uncomfortable "waiting" until a next event, you could decide to replace it now, but many times parity errors are transient and caused by what we used to call "cosmic radiation", which is merely an assembly of uncommon, unlikely events such as a power spike or drop, or other intangible events.
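The "catch it once, monitor it, replace on a repeat" rule of thumb can be sketched as a small Python script that counts TMb parity alarms per line card and NP in a saved log. This is only an illustration under assumptions: the function names are hypothetical, the pattern is derived from the single log line pasted above, and the "one event = monitor, more = replace" threshold is just the advice from this post, not an official policy.

```python
import re
from collections import Counter

# Derived from the TMB_CLUSTER_PARITY line pasted earlier in this thread.
PARITY_RE = re.compile(
    r"^LC/(?P<node>\d+/\d+/CPU\d+):"
    r".*%PLATFORM-NP-0-TMB_CLUSTER_PARITY\b"
    r".*?\bNP=(?P<np>\d+)"
)

def parity_events(log_text):
    """Count TMb cluster parity alarms per (node, NP number)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = PARITY_RE.search(line.strip())
        if m:
            counts[(m.group("node"), int(m.group("np")))] += 1
    return counts

def advice(count):
    """Rule of thumb from this thread: first event -> monitor, repeat -> replace."""
    return "monitor" if count == 1 else "replace"

# Line copied from the log above.
SAMPLE = (
    "LC/0/0/CPU0:Jan 15 03:56:52.185 : pfm_node_lc[282]: "
    "%PLATFORM-NP-0-TMB_CLUSTER_PARITY : Set|prm_server_tr[155731]|"
    "Network Processor Unit(0x1008004)|TMb cluster parity interrupt. "
    "Indicates an internal SRAM problem in TMb cluster, NP=4 memId=6, "
    "mask=0x2000000, PMask=0x2000000 SRAMLine=166 Rec=1 Rewr=1"
)

for (node, np_id), count in parity_events(SAMPLE).items():
    print(node, "NP", np_id, "events:", count, "->", advice(count))
```

Running it over logs collected across several weeks would show whether the same NP keeps reporting parity errors.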

cheers

xander

jmartinez20 · ‎03-22-2016

Hello All,

I get the following messages, and the A9K-40GE-B card keeps cycling through IOS XR PREP, MBI-BOOTING and MBI-RUNNING until the system finally puts it in the IN_RESET state. Any help is truly appreciated.

0/1/CPU0 A9K-40GE-B MBI-BOOTING PWR,NSHUT,MON

RP/0/RSP0/CPU0:Router(admin)#LC/0/1/CPU0:Mar 22 17:31:53.057 : prm_server_tr[305]: %PLATFORM-NP-0-INIT_ERR : (0x8000B002) : Setting up NP0 Failed
LC/0/1/CPU0:Mar 22 17:32:50.031 : pfm_node_lc[293]: %PLATFORM-NP-0-NP_INIT_FAILURE : Set|prm_server_tr[151634]|Network Processor Unit(0x1008000)|Persistent Initialization Failure.
LC/0/1/CPU0:Mar 22 17:32:50.036 : pfm_node_lc[293]: %PLATFORM-PFM-0-CARD_RESET_REQ : pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID: 151634 (prm_server_tr), Fault Sev: 0, Target node: 0/1/CPU0, CompId: 0x1f, Device Handle: 0x1008000, CondID: 1027, Fault Reason: Persistent Initialization Failure.
--------------------------------------------------------------------------------
RP/0/RSP0/CPU0:Router(admin)#RP/0/RSP0/CPU0:Mar 22 17:42:43.665 : shelfmgr[410]: %PLATFORM-SHELFMGR-0-MAX_BOOTREQ_BRINGDOWN : Node 0/1/CPU0 A9K-40GE-B has reset itself in multiple (11) unsuccessful boot attempts, putting it IN_RESET state. The probable cause is an unexpected event on the node. Please Refer to the Cisco ASR 9000 System Error Message Reference Guide for further information if needed.

RP/0/RSP0/CPU0:Router(admin)#sh ver
Tue Mar 22 17:44:02.089 UTC

Cisco IOS XR Software, Version 5.1.0[Default]
Copyright (c) 2013 by Cisco Systems, Inc.

ROM: System Bootstrap, Version 1.06(20120210:003513) [ASR9K ROMMON],

Router uptime is 55 minutes
System image file is "bootflash:disk0/asr9k-os-mbi-5.1.0/0x100000/mbiasr9k-rp.vm"

cisco ASR9K Series (MPC8641D) processor with 4194304K bytes of memory.
MPC8641D processor at 1333MHz, Revision 2.2
ASR 9006 AC Chassis with PEM Version 2

2 Management Ethernet
219k bytes of non-volatile configuration memory.
975M bytes of compact flash card.
67988M bytes of hard disk.
1605616k bytes of disk0: (Sector size 512 bytes).
1605616k bytes of disk1: (Sector size 512 bytes).

Configuration register on node 0/RSP0/CPU0 is 0x2102

xthuijs · ‎03-22-2016

Hi!!

The card has an NP init problem: it was trying to set itself up and test its attached memory, and that failed. After a few tries it gave up and put itself in IN-RESET.

You would want to RMA this board and have it replaced.
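The CARD_RESET_REQ message in the log above already says which process asked for the reset and why. As a hedged illustration, this Python sketch pulls those fields out of a saved log line; the field layout is assumed from the two CARD_RESET_REQ lines in this thread, and `parse_reset_request` is a made-up helper name.

```python
import re

# Field layout assumed from the CARD_RESET_REQ lines pasted in this thread.
RESET_RE = re.compile(
    r"%PLATFORM-PFM-0-CARD_RESET_REQ : .*?"
    r"Process ID: (?P<pid>\d+) \((?P<process>[^)]+)\), "
    r"Fault Sev: (?P<sev>\d+), "
    r"Target node: (?P<node>[^,]+), "
    r".*?Fault Reason: (?P<reason>.+)$"
)

def parse_reset_request(line):
    """Return a dict describing a card reset request, or None if no match."""
    m = RESET_RE.search(line.strip())
    if not m:
        return None
    return {
        "pid": int(m.group("pid")),
        "process": m.group("process"),
        "severity": int(m.group("sev")),
        "node": m.group("node"),
        "reason": m.group("reason"),
    }

# Line copied from the log above.
SAMPLE = (
    "LC/0/1/CPU0:Mar 22 17:32:50.036 : pfm_node_lc[293]: "
    "%PLATFORM-PFM-0-CARD_RESET_REQ : pfm_dev_sm_perform_recovery_action, "
    "Card reset requested by: Process ID: 151634 (prm_server_tr), "
    "Fault Sev: 0, Target node: 0/1/CPU0, CompId: 0x1f, "
    "Device Handle: 0x1008000, CondID: 1027, "
    "Fault Reason: Persistent Initialization Failure."
)

info = parse_reset_request(SAMPLE)
print(info["process"], "reset", info["node"], "because:", info["reason"])
```

Here the requesting process is prm_server_tr, which matches the NP init errors logged just before the reset.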

xander

jmartinez20 · ‎03-22-2016

NP init problem = no power initialization problem?

I apologize for not knowing this; I'm new to the ASR line.

Thanks for your response.

xthuijs · ‎03-22-2016

Oh, it means an NP initialization error. When the NP boots, it tests its memory for search, stats, TCAM and packet buffers; if these tests fail, it is called an NP init error.

From the logs you provided I can't tell precisely which memory failed, but regardless, it can't be repaired or salvaged without a HW replacement.

And oh, if you're new and want to see some more, check out Cisco Live ID 2904 from Orlando, San Francisco and San Diego. Possibly also the BRKARC ID 2003 for some good stuff on the A9K. And of course here on the forums! :)

cheers

xander

smailmilak · ‎03-22-2016

Hi,

NP stands for Network Processor, and your Trident-based line card probably has four NPs.

Correct me if I am wrong about the number of NPs.

rortizal · ‎04-13-2016

Hi Alexander

How can I run install add with a tar file from 0/RSP1/CPU0?

Or do I need to reload the RP to install the tar file?

smailmilak · ‎04-14-2016

Hi,

Do you want to upgrade to a newer IOS-XR version?