
ASR9K Fabric/Linecard Fault

Lyphiard
Level 1

I'm currently operating an ASR9010 on IOS XR v5.3.4 with an A9K-8T-L in LC slot 1 and 2x RSP440-SE in a redundant configuration.

 

A few weeks ago, I noticed the following errors in my log:

pfm_node_rp[357]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK0 : Set|fab_xbar[217200]|0x1017000|XBAR_0_Slot_1_FIA_1  
pfm_node_rp[357]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK0 : Clear|fab_xbar[217200]|0x1017000|XBAR_0_Slot_1_FIA_1  
pfm_node_rp[357]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK0 : Set|fab_xbar[217200]|0x1017000|XBAR_0_Slot_1_FIA_1  
pfm_node_rp[357]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK0 : Clear|fab_xbar[217200]|0x1017000|XBAR_0_Slot_1_FIA_1  

I've seen these before, and they appear to be part of a harmless bug in IOS XR (according to the Cisco bug finder tool), so I mostly ignored them.

 

Several days later, the frequency of the "PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK0" message increased, along with a new message that started showing:

FABMGR[220]: %PLATFORM-FABMGR-5-FABRIC_TRANSIENT_FAULT :  Fabric backplane crossbar link underwent link retraining to recover from a transient error: Physical slot 1

 

Initially this message was rare, but its frequency slowly increased to roughly one message every 5 seconds.

 

Until today, there had been no noticeable traffic forwarding faults. However, this morning, any traffic that traversed the line card had noticeably high round-trip times along with severe packet loss. After a full reboot (power off / power on), traffic forwarding returned to normal, with no error messages being logged and no packet loss.

 

Right before the packet loss began to happen, I noticed the following new message in the syslog:

pfm_node_rp[357]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Set|online_diag_rsp[213108]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/1/CPU0, 4)
pfm_node_rp[357]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Clear|online_diag_rsp[213108]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/1/CPU0, 4)

Searching through the log, I see this message only for (0/1/CPU0, 4) and (0/1/CPU0, 5), which appear to correlate to bridge #2 FIA #1 on the line card.
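
For anyone wanting to repeat that log search, here is a minimal Python sketch that tallies the "Set" events per (slot, NP) pair. It assumes the syslog lines look exactly like the two quoted above; the function name `count_failed_nps` and the idea that one line could list several pairs are my own assumptions, not anything IOS XR provides.

```python
import re
from collections import Counter

# Pattern derived only from the PUNT_FABRIC_DATA_PATH_FAILED lines quoted above.
PUNT_RE = re.compile(
    r"%PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : (Set|Clear)\|"
    r".*\(slot, NP\) failed: (.+)$"
)
# One "(location, NP)" tuple, e.g. "(0/1/CPU0, 4)".
PAIR_RE = re.compile(r"\(([^,()]+),\s*(\d+)\)")


def count_failed_nps(lines):
    """Count 'Set' events per (slot, NP) pair; 'Clear' events are ignored."""
    counts = Counter()
    for line in lines:
        match = PUNT_RE.search(line)
        if not match or match.group(1) != "Set":
            continue
        # Assumption: a line may list more than one (slot, NP) tuple.
        for slot, np_id in PAIR_RE.findall(match.group(2)):
            counts[(slot.strip(), int(np_id))] += 1
    return counts
```

Feeding it the lines of a saved log file gives a quick view of which NPs are failing the diagnostic and how often.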

 

Does this incident mean I have a faulty or improperly seated linecard, or is this simply a software bug that may have been fixed during the reboot? The ASR9K operated normally without errors for about 10 weeks before this happened. While the router was powered off (during the reboot process), linecard 1 was reseated in the chassis as well.

2 Replies

Aleksandar Vidakovic
Cisco Employee

A poorly seated line card could cause this kind of symptom, but it's also possible that the ASIC programming wasn't optimal. To rule out the latter, please make sure to load the most recent 5.3.4 Service Pack or the equivalent SMUs.

 

If that is already taken care of, use "sh drops all location <location>" to see whether any CRC drops are reported by the FIA. If CRC errors continue incrementing, there could be an underlying HW failure. If the LC is moved to a free slot and the errors follow the LC, then the LC should be replaced. If the errors follow the slot, the chassis should be replaced.
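
The slot-swap test above boils down to a small decision table, sketched here in Python purely as an illustration. The function name and the two boolean inputs are hypothetical, and it assumes another card is placed in the original slot so that slot can be observed as well.

```python
def interpret_slot_swap(errors_on_lc_in_new_slot, errors_in_original_slot):
    """Map the outcome of moving the LC to a free slot onto a next step.

    Both arguments are booleans the operator fills in after watching the
    CRC drop counters for a while (hypothetical inputs, not a Cisco API).
    """
    if errors_on_lc_in_new_slot and not errors_in_original_slot:
        # Errors followed the line card.
        return "replace line card"
    if errors_in_original_slot and not errors_on_lc_in_new_slot:
        # Errors stayed with the slot.
        return "replace chassis"
    # Errors in both places or in neither: the test alone does not decide.
    return "inconclusive"
```

If the errors disappear entirely after the move, the result is "inconclusive", which is consistent with a seating problem rather than a confirmed hardware failure.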

 

Hope this helps,

/Aleksandar

xr-escalation
Level 1
Hi Lyphiard

There is a high probability that the hardware has a fault, which may clear after reseating the LC.
I'd suggest opening a service request with Cisco to have it properly troubleshot.

Thanks,
Hitesh