12-26-2021 07:43 AM - edited 12-26-2021 07:45 AM
Hi,
as the title says, since yesterday we've been experiencing a lot of fabric-related errors.
This is our router configuration:
0/RSP0/CPU0   A9K-RSP5-SE(Standby)   IOS XR RUN    NSHUT
0/RSP1/CPU0   A9K-RSP5-SE(Active)    IOS XR RUN    NSHUT
0/FT0         ASR-9906-FAN           OPERATIONAL   NSHUT
0/FT1         ASR-9906-FAN           OPERATIONAL   NSHUT
0/0/CPU0      A9K-48X10GE-1G-SE      IOS XR RUN    NSHUT
0/1/CPU0      A9K-8HG-FLEX-SE        IOS XR RUN    NSHUT
0/FC0         A99-SFC3-T             OPERATIONAL   NSHUT
0/FC2         A99-SFC3-T             OPERATIONAL   NSHUT
0/FC4         A99-SFC3-T             POWERED_OFF   NSHUT
0/PT0         A9K-DC-PEM-V3          OPERATIONAL   NSHUT
Ever since yesterday we've been getting this kind of output from "show log":
RP/0/RSP1/CPU0:2021 Dec 26 07:47:13.009 : fab_xbar_sp4[374]: %PLATFORM-CIH-5-ASIC_ERROR_THRESHOLD : fc4xbar[0]: An interface-err error has occurred causing packet drop transient. ibbReg5.ibbExceptionHier.ibbReg5.ibbExceptionLeaf0.intIpcFnc1McRuntErr Threshold has been exceeded
RP/0/RSP1/CPU0:2021 Dec 26 07:47:17.729 : fab_xbar_sp4[374]: %PLATFORM-CIH-3-ASIC_ERROR_SPECIAL_HANDLE_THRESH : fc4xbar[0]: A link-err error has occurred causing packet drop transient. cflReg5.cflExceptionHier.cflReg5.cflExceptionLeaf0.intCflPiDemuxP0SopEbb Threshold has been exceeded
LC/0/1/CPU0:2021 Dec 26 07:47:17.942 : pfm_node_lc[132]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK0 : Clear|fab_xbar[5247]|0x1017024|Slot_10_XBAR_0
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.266 : fab_xbar_sp4[374]: %PLATFORM-CIH-0-ASIC_ERROR_RESET_THRESH_CROSS : fc4xbar[0]: Reset threshold is crossed
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.266 : fab_xbar_sp4[374]: %PLATFORM-CIH-5-ASIC_ERROR_HARD_RESET_START : fc4xbar[0]: HARD_RESET needed 0x36002064
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.266 : fab_xbar_sp4[374]: %PLATFORM-CIH-5-ASIC_ERROR_RESET_THRESH_CROSS_NOTIFICATION : fc4xbar[0]: notification of exceeded reset threshold is sent to the driver
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.266 : pfm_node_rp[222]: %PLATFORM-CROSSBAR-2-ASIC_FATAL_ERR : Set|fab_xbar_sp4[5187]|0x1017000|0 on XBAR_0_Slot_10
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.268 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT : 0/FC4 (slot 10) encountered fabric fault. Fabric on this card is deactivated.
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.268 : pfm_node_rp[222]: %PLATFORM-FABMGR-0-SPINE_SHUTDOWN : Set|fabmgr[5162]|0x1034000|Fabmgr encountered fault on Fabric card. Spine reloaded|Target Node:0/FC4
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.272 : pfm_node_rp[222]: %PLATFORM-FABMGR-0-SPINE_SHUTDOWN : Clear|fabmgr[5162]|0x1034000|Fabmgr encountered fault on Fabric card. Spine reloaded|Target Node:0/FC4
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.272 : pfm_node_rp[222]: %PLATFORM-CROSSBAR-2-ASIC_FATAL_ERR : Clear|fab_xbar_sp4[5187]|0x1017000|0 on XBAR_0_Slot_10
0/RSP0/ADMIN0:2021 Dec 26 07:47:20.267 : shelf_mgr[4094]: %INFRA-SHELF_MGR-3-FAULT_ACTION_CARD_RELOAD : Graceful reload requested for card 0/FC4. Reason: Card reset requested by XR: Process ID: 5162 (fabmgr), Target node: 0/FC4, CondID: 8705
RP/0/RSP1/CPU0:2021 Dec 26 07:47:21.673 : pfm_node_rp[222]: %PLATFORM-CROSSBAR-2-ACCESS_FAILURE : Set|fab_xbar_sp4[5187]|0x1017000|0 on XBAR_0_Slot_10
RP/0/RSP1/CPU0:2021 Dec 26 07:47:25.914 : pfm_node_rp[222]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK0 : Clear|fab_xbar_sp4[5187]|0x101700a|XBAR_0_Slot_3
0/RSP0/ADMIN0:2021 Dec 26 07:47:30.268 : shelf_mgr[4094]: %INFRA-SHELF_MGR-4-CARD_RELOAD : Reloading card 0/FC4
0/RSP0/ADMIN0:2021 Dec 26 07:47:34.656 : canbus_driver[4062]: %PLATFORM-CANB_SERVER-7-CBC_PRE_RESET_NOTIFICATION : Node 0/FC4 CBC-0, reset reason CPU_RESET_PWROFF (0x0a000000)
0/RSP0/ADMIN0:2021 Dec 26 07:47:34.657 : shelf_mgr[4094]: %INFRA-SHELF_MGR-6-HW_EVENT : Rcvd HW event HW_EVENT_RESET, event_reason_str 'HW Event RESET' for card 0/FC4
RP/0/RSP1/CPU0:2021 Dec 26 07:47:34.666 : pfm_node_rp[222]: %PLATFORM-CROSSBAR-2-ACCESS_FAILURE : Clear|fab_xbar_sp4[5187]|0x1017000|0 on XBAR_0_Slot_10
0/RSP0/ADMIN0:2021 Dec 26 07:47:43.139 : shelf_mgr[4094]: %INFRA-SHELF_MGR-6-HW_EVENT : Rcvd HW event HW_EVENT_POWERED_OFF, event_reason_str 'HW Event Powered OFF' for card 0/FC4
0/RSP0/ADMIN0:2021 Dec 26 07:47:49.970 : canbus_driver[4062]: %PLATFORM-CANB_SERVER-7-CBC_POST_RESET_NOTIFICATION : Node 0/FC4 CBC-0, reset reason CPU_RESET_POR (0x05000000)
0/RSP0/ADMIN0:2021 Dec 26 07:47:53.147 : shelf_mgr[4094]: %INFRA-SHELF_MGR-6-HW_EVENT : Rcvd HW event HW_EVENT_POWERED_ON, event_reason_str 'HW Event POWERED ON' for card 0/FC4
0/RSP0/ADMIN0:2021 Dec 26 07:48:21.172 : shelf_mgr[4094]: %INFRA-SHELF_MGR-6-HW_EVENT : Rcvd HW event HW_EVENT_OK, event_reason_str 'HW Event OK' for card 0/FC4
0/RSP0/ADMIN0:2021 Dec 26 07:48:21.172 : shelf_mgr[4094]: %INFRA-SHELF_MGR-6-CARD_HW_OPERATIONAL : Card: 0/FC4 hardware state going to Operational
This causes the 100Gb interface to be shut and, since it's our backbone link, it causes a service disruption for the customers connected to that router. One of the first things we tried was to remove the A9K-8HG-FLEX-SE card from slot 0/1, and we noticed that at that point the xbar errors stopped. We then inserted the card into slot 0/2 of the router and, as soon as it booted back up, the fabric errors started reappearing. Since the errors "followed the LC", it was deemed to be the component at fault and was therefore swapped with a new one, to no avail. We swapped it with yet another one and, even then, no dice. As soon as we insert an A9K-8HG-FLEX-SE into the router we start getting fabric errors, which eventually result in 0/FC4 being placed in a POWERED_OFF state.
Additionally, this seems to affect operations on the LC in slot 0/0 as well. We have several L2VPN services configured under the interfaces of that card; they all work except the ones under one specific physical interface, TenGigE0/0/0/17, which do not. The status of those EVCs is UP, but no traffic flows.
When this problem first happened, it was resolved by forcing a switchover to the secondary RSP; this somehow caused traffic to start flowing again. But the error has since reoccurred even after the switchover, apparently once the 100G card was re-enabled, so my best guess right now is that the constant stream of errors causes some process to get stuck and thus traffic stops flowing. (This is an uneducated guess and might very well be wrong.)
If we shut the slot in which the A9K-8HG-FLEX-SE card resides, the fabric errors stop, but obviously we're then left without our 100Gb uplink. To sum it up:
- Three different A9K-8HG-FLEX-SE cards all result in fabric errors being generated, regardless of the slot the card is positioned in.
- Removing the A9K-8HG-FLEX-SE card, or disabling the slot in which it is located, stops the errors.
- As long as the A9K-8HG-FLEX-SE is in the "Operational" state, fabric errors will show up regardless of which RSP card is currently active.
At this point I am not sure which of the components is at fault: is it the A9K-8HG-FLEX-SE card? The RSP? The fabric card in slot 0/FC4?
Every component seems to work well by itself and the logs seem to suggest that the fault is with the A9K-8HG-FLEX-SE card:
fm_node_rp[222]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Clear|online_diag_rsp[8965]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/1/CPU0, 0)
But again, we have tried three LCs of this type at this point, so it seems unlikely all three are broken. And again, these errors start to happen as soon as the card is in, regardless of the slot in which it is inserted.
I am at a loss here, and don't know exactly what might be causing the fault. Could anyone provide any further insight?
Thank you
12-26-2021 08:14 AM
Can we move FC4 to a different slot? Or swap FC4 with FC0, to see if the problem follows the slot itself or the card?
thank you
12-27-2021 06:52 AM
Hi tkarnani,
thank you for your reply. Moving the fabric card around was an idea I considered, but I was afraid to do so, the reason being that we have around 15k customers on that router.
I have no experience with this, but I think that removing a fabric card will obviously reduce the router's switching throughput; if that causes some service disruption for them, that's a risk we cannot take, though I suppose there's no way of knowing beforehand how much disruption it would cause, if any.
Yesterday morning I shut down slot 0/1, where the A9K-8HG-FLEX-SE card resides, and restarted slot 0/FC4, which had been powered off by the router. For a day and a half now there have been no fabric errors in the logs.
Could it be that, if the fabric card is at fault, the problem only shows up once the 100Gb card is enabled?
12-27-2021 07:03 AM - edited 12-27-2021 07:03 AM
This error:
fm_node_rp[222]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Clear|online_diag_rsp[8965]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/1/CPU0, 0)
There is a keepalive between the RSP and each network processor (NP) on every line card: the RSP sends the keepalive and the NP needs to send it back to the RSP.
When this times out, the error is set.
It's usually a fault on the NP itself, or a problem in the path from the NP to the RSP,
so: RSP <> fabric card <> line card.
If the line card is down, the keepalives won't be sent/processed.
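To illustrate the Set/Clear behaviour, here is a rough toy model in Python (my own sketch, not the actual IOS XR implementation), assuming consecutive missed keepalives count against the failure threshold of 3 reported in the diag message:

from collections import defaultdict

FAILURE_THRESHOLD = 3  # as reported in the PUNT_FABRIC_DATA_PATH_FAILED message

class PuntDiagMonitor:
    """Toy model: track missed keepalives per (slot, NP) and set/clear an alarm."""
    def __init__(self):
        self.misses = defaultdict(int)   # (slot, np) -> consecutive missed keepalives
        self.alarmed = set()             # (slot, np) pairs currently in failed state

    def keepalive_result(self, slot, np, replied):
        key = (slot, np)
        if replied:
            self.misses[key] = 0
            if key in self.alarmed:
                self.alarmed.discard(key)
                print(f"Clear: {slot} NP{np} punt/fabric path OK again")
        else:
            self.misses[key] += 1
            if self.misses[key] >= FAILURE_THRESHOLD and key not in self.alarmed:
                self.alarmed.add(key)
                print(f"Set: {slot} NP{np} failed punt/fabric path test")

mon = PuntDiagMonitor()
for replied in (False, False, False, True):   # three misses, then a good reply
    mon.keepalive_result("0/1/CPU0", 0, replied)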
thanks
Here is a general document on the process.
12-27-2021 11:34 AM
Hi,
thanks for the document. I've read it all and it helped some.
However, this part is still a bit unclear. Again from the logs:
RP/0/RSP1/CPU0:2021 Dec 26 07:47:20.268 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT : 0/FC4 (slot 10) encountered fabric fault. Fabric on this card is deactivated.
RP/0/RSP1/CPU0:2021 Dec 26 07:51:36.453 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT : 0/FC4 (slot 10) encountered fabric fault. Fabric on this card is deactivated.
RP/0/RSP1/CPU0:2021 Dec 26 07:57:23.495 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_INTERNAL_FAULT : 0/1/CPU0 (slot 3) encountered fabric fault. Interfaces are going to be shutdown.
RP/0/RSP1/CPU0:2021 Dec 26 07:59:45.455 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT : 0/FC4 (slot 10) encountered fabric fault. Fabric on this card is deactivated.
RP/0/RSP1/CPU0:2021 Dec 26 08:01:20.235 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_INTERNAL_FAULT : 0/1/CPU0 (slot 3) encountered fabric fault. Interfaces are going to be shutdown.
RP/0/RSP1/CPU0:2021 Dec 26 08:04:20.564 : FABMGR[460]: %PLATFORM-FABMGR-2-FABRIC_SPINE_FAULT : 0/FC4 (slot 10) encountered fabric fault. Fabric on this card is deactivated.
So the router is telling us that there's a fault both on the fabric card in slot 0/FC4 and on the LC in slot 0/1? I assume the "slot 3" and "slot 10" in parentheses are some kind of internal location of the cards. Is one error a consequence of the other?
On the second line we have a fabric error at 07:51 on the 0/FC4 card, which causes the router to deactivate it; then at 07:57 the interfaces on the LC are placed in shutdown because of a fabric fault. Is this second error a consequence of the first one? I'd say yes, but then again, if the two were related I'd expect the second to happen seconds after the first.
Additionally there's this other output in the logs:
RP/0/RSP1/CPU0:2021 Dec 26 08:55:45.977 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 83, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95])(0/1/CPU0, 1, [66, 68, 69, 70, 71, 73, 74, 76])
RP/0/RSP1/CPU0:2021 Dec 26 10:00:45.097 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 83, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 71, 73, 74, 76])
RP/0/RSP1/CPU0:2021 Dec 26 11:05:41.217 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 83, 85, 86, 87, 88, 89, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 71, 73, 74, 76])
RP/0/RSP1/CPU0:2021 Dec 26 12:10:37.337 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 83, 85, 86, 87, 88, 89, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 71, 73, 74, 76])
RP/0/RSP1/CPU0:2021 Dec 26 13:15:33.459 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 83, 85, 86, 87, 88, 89, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 71, 73, 74, 76])
RP/0/RSP1/CPU0:2021 Dec 26 14:20:27.089 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 85, 86, 87, 88, 89, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 73, 74, 76])
RP/0/RSP1/CPU0:2021 Dec 26 15:25:19.208 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 85, 86, 87, 88, 89, 92, 93, 94, 95])(0/1/CPU0, 1, [64, 66, 68, 69, 70, 73, 74, 76])
RP/0/RSP1/CPU0:2021 Dec 26 16:46:06.028 : online_diag_rsp[251]: %PLATFORM-ONLINE_DIAG-3-PUNT_FABRIC_DATA_PATH_FAILED : PuntFabricDataPath test failure detected, detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [26, 29, 30, 32, 33, 36, 37, 38])(0/1/CPU0, 1, [24, 26, 28, 29, 30, 33, 34, 36])
Does this seem to indicate problems with both of the slices on the 0/1 LC?
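For reference, here is a rough Python sketch (just a parsing helper, nothing official) of one way to tally which (slot, NP) pairs and VQIs recur across the messages above, assuming the exact text format shown:

import re
from collections import Counter

# Paste the PUNT_FABRIC_DATA_PATH_FAILED lines from the syslog here
# (only one shortened line shown for brevity).
log_lines = [
    "detail in the form of (slot, NP, [VQI's]): (0/1/CPU0, 0, [82, 83, 85, 86])(0/1/CPU0, 1, [66, 68, 69, 70])",
]

entry_re = re.compile(r"\(([^,()]+), (\d+), \[([^\]]*)\]\)")

np_hits = Counter()
vqi_hits = Counter()
for line in log_lines:
    for slot, np, vqis in entry_re.findall(line):
        np_hits[(slot, int(np))] += 1
        for vqi in vqis.split(","):
            vqi_hits[(slot, int(np), int(vqi))] += 1

print(np_hits.most_common())      # which (slot, NP) keeps failing
print(vqi_hits.most_common(10))   # which VQIs show up most often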
Additionally, reading the document you linked, it appears that these keepalive packets are sent by both RSPs towards the LC, so I would expect to see these errors doubled in the logs (one from the active and one from the standby RSP), but I only ever see the ones from the active RSP.
12-29-2021 04:41 AM
We still need to move the card to determine if it's the slot or the fabric card itself.
Once the fabric card fails, fabric capacity drops and the FIA on the LC will shut down as there is not enough capacity to send traffic; once the FIA shuts down, the interfaces mapped to that FIA will be down. Use "show controller np ports all location 0/1/cpu0" to see the mapping.
There is a logical slot of 0/1/cpu0 and a physical chassis slot of 3; "show platform summary location all" will show you the mapping.
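As a small illustration (my own notes, derived only from the syslog text above, not pulled from the router; "show platform summary location all" on the box is the authoritative source), the logical-to-physical mapping on this chassis looks like this:

# Logical location -> physical chassis slot, as seen in the FABMGR messages above.
physical_slot = {
    "0/1/CPU0": 3,    # A9K-8HG-FLEX-SE line card ("slot 3" in the messages)
    "0/FC4": 10,      # A99-SFC3-T fabric card ("slot 10" in the messages)
}

def describe(location):
    slot = physical_slot.get(location)
    if slot is None:
        return f"no mapping recorded for {location}"
    return f"{location} sits in physical chassis slot {slot}"

print(describe("0/1/CPU0"))
print(describe("0/FC4"))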
RSP1 is not getting those diagnostic packets from LC1, for both NP0 and NP1. RSP1 will be sending the packets through the fabric card. We will need to swap/move it to determine if it's the slot or the card itself.
thanks
12-30-2021 10:48 AM
Thank you, this is clear now.
I have been reading this document here:
https://www.cisco.com/c/en/us/support/docs/routers/asr-9000-series-aggregation-services-routers/117718-technote-asr9000-00.html#anc5
According to the paragraph "Fabric Cards Requirements", the formula used to calculate the number of FCs needed is the following:
In order to calculate the minimum number of FCs needed for a particular LC, use this formula:
(num_ports_used*port_bandwidth)/(FC_bandwidth)
In the case of the 36x10 GigE card with 30 ports this is (30*10)/(110)=2.72 FCs, or three FCs rounded up.
In order to calculate n+1 redundancy, use this formula:
(num_ports_used*port_bandwidth)/(FC_bandwidth) + 1
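As a sanity check, here is the same formula as a small Python helper (my own sketch; the 110 Gbps figure is the per-FC bandwidth used in the document's 36x10GE example):

import math

def min_fabric_cards(num_ports_used, port_bandwidth_gbps, fc_bandwidth_gbps, redundancy=0):
    """Minimum number of FCs per the document's formula, rounded up, plus optional n+1."""
    needed = (num_ports_used * port_bandwidth_gbps) / fc_bandwidth_gbps
    return math.ceil(needed) + redundancy

print(min_fabric_cards(30, 10, 110))                 # 300/110 = 2.72 -> 3, as in the document
print(min_fabric_cards(30, 10, 110, redundancy=1))   # 4 with n+1 redundancy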
In our 9906 we run three A99-SFC3-T cards, each with 600 Gbps of fabric capacity, which gives our setup a total throughput of 1.8 Tbps; with a failed card, we're down to 1.2 Tbps. The RSP5 should have integrated fabric and also work as a fabric card, but I was unable to find how much switching fabric capacity it adds, so I am not going to include it in the calculations.
Back when the fault occurred, the router had 20 TenGig interfaces and two 1Gb interfaces active on the 0/0 LC (by "active" I mean interfaces that were either in "Up" or "Down" status), and two 100Gb interfaces active on the 0/1 line card. So, using the formula from the document above, the calculation should be something like this:
[(20*10)+(1*2)+(100*2)]/600 = 402/600 = 0.67, which rounds up to one fabric card, +1 for redundancy. Even if I assume that the document has a typo and we're meant to count interfaces at their full-duplex speed, I get 804/600 = 1.34, so two fabric cards, +1 for redundancy.
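Spelled out as plain arithmetic in Python (same assumptions as above, 600 Gbps per A99-SFC3-T):

used_gbps = 20 * 10 + 2 * 1 + 2 * 100    # = 402 Gbps of active interface bandwidth
print(used_gbps / 600)                    # 0.67 -> 1 FC, +1 for redundancy
print((2 * used_gbps) / 600)              # 1.34 -> 2 FCs, +1 if counted at full-duplex speed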
In both cases it seems that two fabric cards should have been enough to handle the interfaces active on the router, but in practice that wasn't the case. Where am I wrong with my math?
12-30-2021 11:04 AM
You have enough fabric bandwidth to support the card even if a fabric card fails.
RSP5 is listed here, on slide 9: https://www.ciscolive.com/c/dam/r/ciscolive/emea/docs/2019/pdf/BRKARC-2003.pdf
The challenge here is that the system has detected a fabric fault; if the fabric interface takes errors or needs to shut down, then the interfaces associated with that fabric interface will be shut down as a precaution:
%PLATFORM-FABMGR-2-FABRIC_INTERNAL_FAULT
If the total fabric capacity were to drop below the recommended minimum, the system would apply an egress rate limiter (slides 37/38). You would see these messages:
LC/0/3/CPU0: pfm_node_lc[261]: %FABRIC-FIA-1-RATE_LIMITER_ON : Set|fialc[4795]|0x108a000|Insufficient fabric capacity for card types in use - FIA egress rate limiter applied
LC/0/5/CPU0:pfm_node_lc[207]: %FABRIC-FIA-1-RATE_LIMITER_ON : Set|fialc[4798]|0x108a000|Insufficient fabric capacity for card types in use - F
Thanks
https://www.ciscolive.com/c/dam/r/ciscolive/emea/docs/2020/pdf/BRKARC-2003.pdf