Help Troubleshooting Fabric errors on an ASR 9906

ThomasD86
Level 1

Hi,
as the title says, since yesterday we've been experiencing a lot of fabric-related errors.
This is our router's hardware inventory:

0/RSP0/CPU0       A9K-RSP5-SE(Standby)       IOS XR RUN        NSHUT
0/RSP1/CPU0       A9K-RSP5-SE(Active)        IOS XR RUN        NSHUT
0/FT0             ASR-9906-FAN               OPERATIONAL       NSHUT
0/FT1             ASR-9906-FAN               OPERATIONAL       NSHUT
0/0/CPU0          A9K-48X10GE-1G-SE          IOS XR RUN        NSHUT
0/1/CPU0          A9K-8HG-FLEX-SE            IOS XR RUN        NSHUT
0/FC0             A99-SFC3-T                 OPERATIONAL       NSHUT
0/FC2             A99-SFC3-T                 OPERATIONAL       NSHUT
0/FC4             A99-SFC3-T                 POWERED_OFF       NSHUT
0/PT0             A9K-DC-PEM-V3              OPERATIONAL       NSHUT

Ever since yesterday we've been getting this kind of output in the logs:

 

RP/0/RSP1/CPU0:2021 Dec 26 07:47:13.009 : fab_xbar_sp4[374]: %PLATFORM-CIH-5-ASIC_ERROR_THRESHOLD : fc4xbar[0]: An interface-err error has occurred causing  packet drop transient. ibbReg5.ibbExceptionHier.ibbReg5.ibbExceptionLeaf0.intIpcFnc1McRuntErr  Threshold has been exceeded 
RP/0/RSP1/CPU0:2021 Dec 26 07:47:17.729 : fab_xbar_sp4[374]: %PLATFORM-CIH-3-ASIC_ERROR_SPECIAL_HANDLE_THRESH : fc4xbar[0]: A link-err error has occurred causing  packet drop transient. cflReg5.cflExceptionHier.cflReg5.cflExceptionLeaf0.intCflPiDemuxP0SopEbb  Threshold has been exceeded 
LC/0/1/CPU0:2021 Dec 26 07:47:17.942 : pfm_node_lc[132]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK0 : Clear|fab_xbar[5247]|0x1017024|Slot_10_XBAR_0 
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.266 : fab_xbar_sp4[374]: %PLATFORM-CIH-0-ASIC_ERROR_RESET_THRESH_CROSS : fc4xbar[0]: Reset threshold is crossed 
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.266 : fab_xbar_sp4[374]: %PLATFORM-CIH-5-ASIC_ERROR_HARD_RESET_START : fc4xbar[0]: HARD_RESET needed 0x36002064 
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.266 : fab_xbar_sp4[374]: %PLATFORM-CIH-5-ASIC_ERROR_RESET_THRESH_CROSS_NOTIFICATION : fc4xbar[0]: notification of exceeded reset threshold is sent to the driver 
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.266 : pfm_node_rp[222]: %PLATFORM-CROSSBAR-2-ASIC_FATAL_ERR : Set|fab_xbar_sp4[5187]|0x1017000|0 on XBAR_0_Slot_10 
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.268 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT :  0/FC4 (slot 10) encountered fabric fault. Fabric on this card is deactivated.  
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.268 : pfm_node_rp[222]: %PLATFORM-FABMGR-0-SPINE_SHUTDOWN : Set|fabmgr[5162]|0x1034000|Fabmgr encountered fault on Fabric card. Spine reloaded|Target Node:0/FC4 
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.272 : pfm_node_rp[222]: %PLATFORM-FABMGR-0-SPINE_SHUTDOWN : Clear|fabmgr[5162]|0x1034000|Fabmgr encountered fault on Fabric card. Spine reloaded|Target Node:0/FC4 
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.272 : pfm_node_rp[222]: %PLATFORM-CROSSBAR-2-ASIC_FATAL_ERR : Clear|fab_xbar_sp4[5187]|0x1017000|0 on XBAR_0_Slot_10 
0/RSP0/ADMIN0:2021 Dec 26 07:47:20.267 : shelf_mgr[4094]: %INFRA-SHELF_MGR-3-FAULT_ACTION_CARD_RELOAD : Graceful reload requested for card 0/FC4. Reason: Card reset requested by XR: Process ID: 5162 (fabmgr), Target node: 0/FC4, CondID: 8705  
RP/0/RSP1/CPU0:2021 Dec 26 07:47:21.673 : pfm_node_rp[222]: %PLATFORM-CROSSBAR-2-ACCESS_FAILURE : Set|fab_xbar_sp4[5187]|0x1017000|0 on XBAR_0_Slot_10 
RP/0/RSP1/CPU0:2021 Dec 26 07:47:25.914 : pfm_node_rp[222]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK0 : Clear|fab_xbar_sp4[5187]|0x101700a|XBAR_0_Slot_3 
0/RSP0/ADMIN0:2021 Dec 26 07:47:30.268 : shelf_mgr[4094]: %INFRA-SHELF_MGR-4-CARD_RELOAD : Reloading card 0/FC4  
0/RSP0/ADMIN0:2021 Dec 26 07:47:34.656 : canbus_driver[4062]: %PLATFORM-CANB_SERVER-7-CBC_PRE_RESET_NOTIFICATION : Node 0/FC4 CBC-0, reset reason CPU_RESET_PWROFF (0x0a000000)   
0/RSP0/ADMIN0:2021 Dec 26 07:47:34.657 : shelf_mgr[4094]: %INFRA-SHELF_MGR-6-HW_EVENT : Rcvd HW event HW_EVENT_RESET, event_reason_str 'HW Event RESET' for card 0/FC4  
RP/0/RSP1/CPU0:2021 Dec 26 07:47:34.666 : pfm_node_rp[222]: %PLATFORM-CROSSBAR-2-ACCESS_FAILURE : Clear|fab_xbar_sp4[5187]|0x1017000|0 on XBAR_0_Slot_10 
0/RSP0/ADMIN0:2021 Dec 26 07:47:43.139 : shelf_mgr[4094]: %INFRA-SHELF_MGR-6-HW_EVENT : Rcvd HW event HW_EVENT_POWERED_OFF, event_reason_str 'HW Event Powered OFF' for card 0/FC4  
0/RSP0/ADMIN0:2021 Dec 26 07:47:49.970 : canbus_driver[4062]: %PLATFORM-CANB_SERVER-7-CBC_POST_RESET_NOTIFICATION : Node 0/FC4 CBC-0, reset reason CPU_RESET_POR (0x05000000)   
0/RSP0/ADMIN0:2021 Dec 26 07:47:53.147 : shelf_mgr[4094]: %INFRA-SHELF_MGR-6-HW_EVENT : Rcvd HW event HW_EVENT_POWERED_ON, event_reason_str 'HW Event POWERED ON' for card 0/FC4  
0/RSP0/ADMIN0:2021 Dec 26 07:48:21.172 : shelf_mgr[4094]: %INFRA-SHELF_MGR-6-HW_EVENT : Rcvd HW event HW_EVENT_OK, event_reason_str 'HW Event OK' for card 0/FC4  
0/RSP0/ADMIN0:2021 Dec 26 07:48:21.172 : shelf_mgr[4094]: %INFRA-SHELF_MGR-6-CARD_HW_OPERATIONAL : Card: 0/FC4 hardware state going to Operational  

This causes the 100G interface to be shut down and, since it's our backbone link, it causes service disruption for the customers connected to that router. One of the first things we tried was removing the A9K-8HG-FLEX-SE card from slot 0/1, and we noticed that at that point the xbar errors stopped. We then inserted the card in slot 0/2 of the router and, as soon as the card booted back up, the fabric errors reappeared. Since the errors "followed the LC", it was deemed to be the component at fault and was therefore swapped with a new one, to no avail. We swapped it with yet another one and, even then, no dice. As soon as we insert an A9K-8HG-FLEX-SE into the router we start getting fabric errors, which eventually result in 0/FC4 being placed in the POWERED_OFF state.

Additionally, this seems to affect operations on the LC in slot 0/0 as well. We have several l2vpn services configured under the interfaces of that card; they all work except the ones under one specific physical interface, Tengig0/0/0/17. The status of those EVCs is UP, but no traffic flows.

When this problem first happened it was resolved by forcing a switchover to the standby RSP, which somehow caused traffic to start flowing again. But the error has since reoccurred even after the switchover, apparently once the 100G card was re-enabled, so my best guess right now is that the constant stream of errors causes some process to get stuck and traffic stops flowing. (This is an uneducated guess; it might very well be wrong.)

If we shut down the slot in which the A9K-8HG-FLEX-SE card resides, the fabric errors stop, but obviously we're left without a 100G uplink. At this point I am not quite sure which component is at fault. To sum it up:

 

- 3 different A9K-8HG-FLEX-SE cards all result in fabric errors being generated, regardless of the slot the card is positioned in.

- Removing the A9K-8HG-FLEX-SE card, or disabling the slot in which it is located, stops the errors.

- As long as an A9K-8HG-FLEX-SE is in the "Operational" state, fabric errors show up regardless of which RSP is currently active.
 
At this point I am not sure which of the components is at fault: is it the A9K-8HG-FLEX-SE card? The RSP? The fabric card in slot 0/FC4?

Every component seems to work well by itself and the logs seem to suggest that the fault is with the A9K-8HG-FLEX-SE card:

 

fm_node_rp[222]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Clear|online_diag_rsp[8965]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/1/CPU0, 0)  


But again, we have tried 3 LCs of this type at this point, so it seems unlikely that all 3 are broken. These errors start as soon as the card is in, regardless of the slot it is inserted in.

I am at a loss here, and don't know exactly what might be causing the fault. Could anyone provide any further insight?

Thank you 

7 Replies

tkarnani
Cisco Employee

Can we move FC4 to a different slot, or swap FC4 with FC0, to see if the problem follows the slot itself or the card?

 

thank you

ThomasD86
Level 1

Hi tkarnani,

thank you for your reply. Moving the fabric card around was an idea I considered, but I was afraid to do so: we have around 15k customers on that router.
I have no experience with this, but I think that removing a fabric card will obviously reduce the router's switching throughput. If that causes service disruption for them, it's a risk we cannot take, but I suppose there's no way of knowing beforehand how much disruption it would cause, if any.
Yesterday morning I shut down slot 0/1, where the A9K-8HG-FLEX-SE card resides, and restarted slot 0/FC4, which had been powered off by the router. For a day and a half now, there have been no fabric errors in the logs.
Could it be that, if the fabric card is at fault, the problem only shows up once the 100G card is enabled?

Regarding this error:

 

fm_node_rp[222]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Clear|online_diag_rsp[8965]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/1/CPU0, 0)  

There is a keepalive between the RSP and each network processor (NP) on every line card: the RSP sends the keepalive and the NP needs to send it back to the RSP.

When this times out, the error is set.

It's usually a fault on the NP itself, or a problem in the path from the NP to the RSP.

So: RSP <> fabric card <> line card.

If the line card is down, the keepalives won't be sent/processed.
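
To make that concrete, here is a tiny Python sketch of the idea (purely illustrative, not the actual IOS XR logic): the RSP expects a reply from each (slot, NP) pair, and once the "failure threshold is 3" from the syslog message is crossed, the PUNT_FABRIC_DATA_PATH_FAILED condition is raised.

from collections import defaultdict

# Toy model of the punt/fabric keepalive check (illustrative only, not IOS XR code).
FAILURE_THRESHOLD = 3  # "failure threshold is 3" from the syslog message

class PuntFabricDiag:
    def __init__(self):
        # consecutive missed keepalive replies per (slot, NP)
        self.misses = defaultdict(int)

    def keepalive_result(self, slot, np, reply_received):
        key = (slot, np)
        if reply_received:
            self.misses[key] = 0  # any reply clears the counter
            return None
        self.misses[key] += 1
        if self.misses[key] >= FAILURE_THRESHOLD:
            return f"PUNT_FABRIC_DATA_PATH_FAILED: (slot, NP) failed: ({slot}, {np})"
        return None

diag = PuntFabricDiag()
for _ in range(FAILURE_THRESHOLD):  # NP0 on 0/1/CPU0 never answers
    alarm = diag.keepalive_result("0/1/CPU0", 0, reply_received=False)
print(alarm)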

 

thanks

 

Here is a general document on the process:

https://www.cisco.com/c/en/us/support/docs/routers/asr-9000-series-aggregation-services-routers/116727-troubleshoot-punt-00.html

Hi,

thanks for the document. I've read it all and it helped some.
However, this part is still a bit unclear; again from the logs:

RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.268 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT :  0/FC4 (slot 10) encountered fabric fault. Fabric on this card is deactivated.  
RP/0/RSP1/CPU0:2021 Dec 26 07:51:36.453 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT :  0/FC4 (slot 10) encountered fabric fault. Fabric on this card is deactivated.  
RP/0/RSP1/CPU0:2021 Dec 26 07:57:23.495 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_INTERNAL_FAULT :  0/1/CPU0 (slot 3) encountered fabric fault. Interfaces are going to be shutdown.  
RP/0/RSP1/CPU0:2021 Dec 26 07:59:45.455 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT :  0/FC4 (slot 10) encountered fabric fault. Fabric on this card is deactivated.  
RP/0/RSP1/CPU0:2021 Dec 26 08:01:20.235 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_INTERNAL_FAULT :  0/1/CPU0 (slot 3) encountered fabric fault. Interfaces are going to be shutdown.  
RP/0/RSP1/CPU0:2021 Dec 26 08:04:20.564 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT :  0/FC4 (slot 10) encountered fabric fault. Fabric on this card is deactivated.  

Is the router telling us that there's a fault both on the fabric card in slot 0/FC4 and on the LC in slot 0/1? I assume the slot 3 and slot 10 in parentheses are some kind of internal location of the card. Are the errors consequential to one another?

On the second line we have a fabric error at 07:51 on the 0/FC4 card, which causes the router to deactivate it; at 07:57 the interfaces on the LC are placed in shutdown for a fabric error. Is this second error a consequence of the first one? I'd say yes, but then again, if the two things were related I'd expect it to happen seconds later.

Additionally there's this other output in the logs:

 

RP/0/RSP1/CPU0:2021 Dec 26 08:55:45.977 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 83, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95])(0/1/CPU0, 1, [66, 68, 69, 70, 71, 73, 74, 76]) 
RP/0/RSP1/CPU0:2021 Dec 26 10:00:45.097 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 83, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 71, 73, 74, 76]) 
RP/0/RSP1/CPU0:2021 Dec 26 11:05:41.217 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 83, 85, 86, 87, 88, 89, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 71, 73, 74, 76]) 
RP/0/RSP1/CPU0:2021 Dec 26 12:10:37.337 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 83, 85, 86, 87, 88, 89, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 71, 73, 74, 76]) 
RP/0/RSP1/CPU0:2021 Dec 26 13:15:33.459 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 83, 85, 86, 87, 88, 89, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 71, 73, 74, 76]) 
RP/0/RSP1/CPU0:2021 Dec 26 14:20:27.089 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 85, 86, 87, 88, 89, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 73, 74, 76]) 
RP/0/RSP1/CPU0:2021 Dec 26 15:25:19.208 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 85, 86, 87, 88, 89, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 73, 74, 76]) 
RP/0/RSP1/CPU0:2021 Dec 26 16:46:06.028 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [26, 29, 30, 32, 33, 36, 37, 38])(0/1/CPU0, 1, [24, 26, 28, 29, 30, 33, 34, 36]) 

Does this seem to indicate problems with both of the slices on the 0/1 LC?
Additionally, from the document you linked it appears that these keepalive packets are sent by both RSPs towards the LC, so I would expect to see these errors doubled in the logs (one from the active and one from the standby RSP), but I only ever see them from the active RSP.
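
To double-check which NPs and VQIs keep showing up in those messages, here is a small throwaway script that tallies the (slot, NP, [VQI]) tuples (just a sketch, with a couple of the log lines pasted in as a string):

import re
from collections import defaultdict

# Sketch: count which (slot, NP) pairs the PUNT_FABRIC_DATA_PATH_FAILED messages
# report and how many distinct VQIs each one lists. Paste the syslog lines into `log`.
log = """
(0/1/CPU0, 0, [82, 83, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95])(0/1/CPU0, 1, [66, 68, 69, 70, 71, 73, 74, 76])
(0/1/CPU0, 0, [26, 29, 30, 32, 33, 36, 37, 38])(0/1/CPU0, 1, [24, 26, 28, 29, 30, 33, 34, 36])
"""

tuple_re = re.compile(r"\((\S+), (\d+), \[([^\]]*)\]\)")
vqis_per_np = defaultdict(set)
for slot, np, vqis in tuple_re.findall(log):
    vqis_per_np[(slot, int(np))].update(int(v) for v in re.findall(r"\d+", vqis))

for (slot, np), vqis in sorted(vqis_per_np.items()):
    print(f"{slot} NP{np}: {len(vqis)} distinct VQIs reported")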

We still need to move the card to determine if it's the slot or the fabric card itself.

Once the fabric card fails, fabric capacity drops and the FIA on the LC will shut down, as there is not enough capacity to send traffic. Once the FIA shuts down, the interfaces mapped to that FIA will be down; use "show controller np ports all location 0/1/cpu0" to see the mapping.

 

There is a logical slot of 0/1/CPU0 and a physical chassis slot of 3; "show platform summary location all" will show you the mapping.

 

RSP1 is not getting those diagnostic packets from LC1, from both NP0 and NP1. RSP1 will be sending the packets through the fabric card. We will need to swap/move it to determine if it's the slot or the card itself.

 

thanks

Thank you, this is clear now.

I have been reading this document here:
https://www.cisco.com/c/en/us/support/docs/routers/asr-9000-series-aggregation-services-routers/117718-technote-asr9000-00.html#anc5 
According to the paragraph "Fabric Cards Requirements", the formula used to calculate the number of FCs needed is the following:

 

In order to calculate the minimum number of FCs needed for a particular LC, use this formula:

(num_ports_used*port_bandwidth)/(FC_bandwidth)

In the case of the 36x10 GigE card with 30 ports this is (30*10)/(110)=2.72 FCs, or three FCs rounded up.

In order to calculate n+1 redundancy, use this formula:

(num_ports_used*port_bandwidth)/(FC_bandwidth) + 1



In our 9906 we run 3 A99-SFC3-T cards; each one has 600 Gbps of fabric capacity, so that would give our setup a total throughput of 1.8 Tbps, and with a failed card we're down to 1.2 Tbps. The RSP5 should have an integrated fabric and also work as a fabric card, but I was unable to find how much switching fabric capacity it adds, so I am not going to include it in the calculations.

Back when the fault occurred, the router had active (by "active" I mean interfaces that were either in "Up" or "Down" status) 20 TenGig interfaces and two 1G ones on the 0/0 LC, and two 100G interfaces on the 0/1 line card. So, using the formula in the document above, the calculation should be something like this:

[(20*10)+(2*1)+(2*100)]/600 = 402/600 = 0.67, which rounds up to 1 fabric card, +1 for redundancy. Even if I assume that the document has a typo and we're meant to count the interfaces at their full-duplex speed, I get 804/600 = 1.34, so 2 fabric cards, +1 for redundancy.
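
Spelled out as a quick sanity check (the 600 Gbps per-FC figure is my assumption from above, and the port counts are the ones listed in the previous paragraph):

import math

# Sanity check of the fabric card count, assuming 600 Gbps of usable fabric
# bandwidth per SFC (my assumption, not a verified datasheet figure).
fc_bandwidth_gbps = 600
used_gbps = 20 * 10 + 2 * 1 + 2 * 100  # 20x10G + 2x1G + 2x100G = 402 Gbps

min_fcs = math.ceil(used_gbps / fc_bandwidth_gbps)          # 1
min_fcs_fdx = math.ceil(2 * used_gbps / fc_bandwidth_gbps)  # 2 if counted at full-duplex speed

print(f"simplex: {min_fcs} FC(s) + 1 for redundancy = {min_fcs + 1}")
print(f"full duplex: {min_fcs_fdx} FC(s) + 1 for redundancy = {min_fcs_fdx + 1}")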


In both cases it seems that 2 fabric cards should have been enough to handle the interfaces active on the router, but in practice that wasn't the case. Where am I wrong with my math?

You have enough fabric bandwidth to support the card even if a fabric card fails.

The RSP5 is listed here, on slide 9: https://www.ciscolive.com/c/dam/r/ciscolive/emea/docs/2019/pdf/BRKARC-2003.pdf

 

The challenge here is that the system has detected a fabric fault: if the fabric interface takes errors or needs to shut down, then the interfaces associated with that fabric interface will be shut down as a precaution.

%PLATFORM-FABMGR-2-FABRIC_INTERNAL_FAULT

If the total fabric capacity were to drop below the recommended level, the system would apply an egress rate limiter.

slide 37/38

You would see these messages:

LC/0/3/CPU0: pfm_node_lc[261]: %FABRIC-FIA-1-RATE_LIMITER_ON : Set|fialc[4795]|0x108a000|Insufficient fabric capacity for card types in use - FIA egress rate limiter applied
LC/0/5/CPU0: pfm_node_lc[207]: %FABRIC-FIA-1-RATE_LIMITER_ON : Set|fialc[4798]|0x108a000|Insufficient fabric capacity for card types in use - FIA egress rate limiter applied

 

Thanks

 

https://www.ciscolive.com/c/dam/r/ciscolive/emea/docs/2020/pdf/BRKARC-2003.pdf