Fatal fabric fault on standby Route Processor (RP2)

Marks Maslovs
Level 1

Hello support community,

We have encountered unexpected behavior after performing an RP switchover on an ASR9922 chassis.

The switchover itself works perfectly, but when the previously active RP comes back up in standby state, it boots IOS-XR, starts to synchronize state with the active RP, then suddenly fails and reboots again.

Reseating the RP does not help. We get the same errors.

RP/0/RP0/CPU0:Mar 1 04:45:48.760 : pfm_node_rp[370]: %PLATFORM-PUNT-0-FATAL_FAULT : Set|fiarsp[213107]|0x101b000|FATAL FPGA error: Module 0 Lane 0: toTgrShPktFifoFull
RP/0/RP0/CPU0:Mar 1 04:45:49.762 : pfm_node_rp[370]: %FABRIC-FIA-0-FATAL_INTERRUPT_ERROR : Set|fiarsp[213107]|0x1071000|FIA fatal error interrupt on FIA 0: ^D
RP/0/RP1/CPU0:Mar 1 04:45:49.769 : FABMGR[229]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT : 0/RP0/CPU0 (slot 0) encountered fatal fabric fault.The card would undergo reload.
RP/0/RP1/CPU0:Mar 1 04:45:51.771 : pfm_node_rp[370]: %PLATFORM-FABMGR-0-ASIC_ERROR : Set|fabmgr[213101]|0x1033000|Fabmgr encountered fault on standby RSP|Target Node:0/RP0/CPU0
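
For anyone comparing many such messages across chassis, the lines above can be split into fields with a short script. This is only a minimal sketch: the layout (node, timestamp, process, %FACILITY-severity-MNEMONIC, message) and the "Set|process[pid]|code|text" payload are inferred from the log lines quoted in this thread, not from any official format description, and the names `parse_line`, `LINE_RE` and the `pfm_*` keys are made up for illustration.

```python
import re

# Assumed layout, inferred only from the log lines quoted in this thread:
#   <node>:<timestamp> : <process>[<pid>]: %<FACILITY>-<severity>-<MNEMONIC> : <message>
LINE_RE = re.compile(
    r"^(?P<node>\S+?):(?P<timestamp>\w{3}\s+\d+\s+[\d:.]+)\s+:\s+"
    r"(?P<process>[^\[\s]+)\[(?P<pid>\d+)\]:\s+"
    r"%(?P<facility>[A-Z0-9_-]+?)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+)\s+:\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["severity"] = int(rec["severity"])
    rec["pid"] = int(rec["pid"])
    # pfm_node_rp messages in this thread look like "Set|proc[pid]|0xCODE|text";
    # the pfm_* key names are hypothetical labels, not Cisco terminology.
    parts = rec["message"].split("|", 3)
    if len(parts) == 4 and parts[2].startswith("0x"):
        rec["pfm_state"] = parts[0]
        rec["pfm_source"] = parts[1]
        rec["pfm_code"] = int(parts[2], 16)
        rec["pfm_text"] = parts[3]
    return rec


if __name__ == "__main__":
    sample = (
        "RP/0/RP0/CPU0:Mar 1 04:45:48.760 : pfm_node_rp[370]: "
        "%PLATFORM-PUNT-0-FATAL_FAULT : Set|fiarsp[213107]|0x101b000|"
        "FATAL FPGA error: Module 0 Lane 0: toTgrShPktFifoFull"
    )
    rec = parse_line(sample)
    print(rec["node"], rec["facility"], rec["severity"], hex(rec["pfm_code"]))
```

Grouping the parsed records by `pfm_code` or by `node` makes it easier to see whether two chassis report the same fault codes, which is the kind of comparison TAC is likely to ask about.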

Has anyone experienced similar behavior with this Route Processor (A99-RP2-TR)?

Does anyone have an idea what could be causing this fault?

Thanks in advance!


Aleksandar Vidakovic
Cisco Employee

hi Marks,

which XR release are you running? There might be an issue with the RP or with some of the fabric cards. The troubleshooting may not be easy, so it would be best to open a TAC SR. Please provide "sh tech fabric" when opening the SR. A console log from the standby RP would also help to see whether any other issues are reported.

/Aleksandar

xthuijs
Cisco Employee

I think your RP0 is having an FPGA issue on the FIA ASIC (fabric interface ASIC) that is used by the RP's punt/inject ASIC to inject packets towards the fabric.

It is raising the error toTgrShPktFifoFull on the punt ASIC because it can't get rid of its packets towards the FIA anymore (or the FIA is not draining them correctly).

The FIA also raises an interrupt, and the code it generates, 0x1071000, is I believe related to memory issues.

This is likely a hw issue and it is probably best to replace this RP0.

xander

Marks Maslovs
Level 1

Hello,

Thank you for the explanation and advice!

We opened a TAC case in November last year. For a couple of months the case was put on hold, as we were restricted from doing any work or making changes in the network. During that time TAC was looking for similar problems, and some testing was done to try to recreate the issue, but with no luck. Well, something more or less similar to our problem was found, but it affects another release, 6.0.1 (if I remember correctly).

Recently we started working on this case again. We were provided with a new RP, but unfortunately we made no progress with it, because the new RP was hit by the same problem.

Moreover, we have two 9922 chassis with the same configuration and the same SW (5.3.4), both experiencing the same fault.

A SW bug, I am wondering...?

hi marks,

hmm yeah, if there are 2 devices showing the same thing and it also shows after a hw swap, then it is very unlikely to be hw related.

I was zooming in on the interrupt code for the FIA, 0x1071000, whereby that bold "1" signifies a CRC error, but I have now noticed in the drivers that that bit is remapped.

I found your TAC case also, let me connect with herve and have this unraveled.

It could also be that the link between two FPDs is not training right, which causes issues with these symptoms as a result and makes us believe there are ECC/CRC errors, while it is really a link training issue.

let me confirm that and I will report back.

xander

Hello Alexander, 

Much appreciated! Looking forward to hearing from you.

Hi Xander,

I wonder what the root cause of this issue turned out to be?
We have a very similar issue on an RSP440-TR. Here are some logs.


RP/0/RSP1/CPU0:Oct 11 21:13:07.947 : pfm_node_rp[363]: %PLATFORM-PUNT-2-ILK_IF_FAIL : Set|fiarsp[209000]|0x101b000|interlaken to skytrain 0
RP/0/RSP0/CPU0:Oct 11 21:13:07.971 : FABMGR[220]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT : 0/RSP1/CPU0 (slot 5) encountered fatal fabric fault.The card would undergo reload.
RP/0/RSP0/CPU0:Oct 11 21:13:09.977 : pfm_node_rp[363]: %PLATFORM-FABMGR-0-ASIC_ERROR : Set|fabmgr[213093]|0x1033000|Fabmgr encountered fault on standby RSP|Target Node:0/RSP1/CPU0
RP/0/RSP0/CPU0:Oct 11 21:13:10.046 : shelfmgr[410]: %PLATFORM-SHELFMGR-6-NODE_CPU_RESET : Node 0/RSP1/CPU0 CPU reset detected.
RP/0/RSP0/CPU0:Oct 11 21:13:10.047 : shelfmgr[410]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/RSP1/CPU0 A9K-RSP440-TR state:BRINGDOWN

Under "show redundancy" we have these messages:

Active node reload "Cause: dSC node reload is required by install operation"
Standby node reload "Crash Reason: NMI with status=0x70

We are opening a TAC case for this, but maybe you know the reason for this failure.