Fabric Crossbar Errors - Typhoon A9K-24X10GE-TR

Lyphiard
Level 1

I inserted a new A9K-24X10GE-TR line card today, which prompted the following errors:

 

RP/0/RSP1/CPU0:Jul 21 22:26:39.997 : pfm_node_rp[357]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK1 : Set|fab_xbar[213105]|0x1017007|XBAR_0_Slot_3 
RP/0/RSP1/CPU0:Jul 21 22:26:50.015 : pfm_node_rp[357]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK1 : Clear|fab_xbar[213105]|0x1017007|XBAR_0_Slot_3

This error is repeated several times per hour, each time with ~10 seconds between the "Set" and "Clear" messages.
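To measure how long each alarm stays raised, the Set/Clear pairs can be matched up from a saved syslog capture. The following is a minimal sketch, assuming the log lines look exactly like the two quoted above. The regular expression, the function name `pair_alarms`, and the placeholder year are illustrative choices, not part of any Cisco tool.

```python
import re
from datetime import datetime

# Pattern derived from the two log lines quoted above; other releases may format differently.
LOG_RE = re.compile(
    r"(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d{3}) : .*?"
    r"%PLATFORM-CROSSBAR-1-(?P<alarm>\w+) : "
    r"(?P<state>Set|Clear)\|[^|]*\|[^|]*\|(?P<link>\S+)"
)

def pair_alarms(lines):
    """Return (link, alarm, set_time, seconds_until_clear) for each Set/Clear pair."""
    open_alarms = {}
    pairs = []
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        # The log carries no year, so a placeholder year (2000) is assumed for parsing.
        ts = datetime.strptime("2000 " + m["ts"], "%Y %b %d %H:%M:%S.%f")
        key = (m["link"], m["alarm"])
        if m["state"] == "Set":
            open_alarms[key] = ts
        elif key in open_alarms:
            start = open_alarms.pop(key)
            pairs.append((m["link"], m["alarm"], start, (ts - start).total_seconds()))
    return pairs

sample = [
    "RP/0/RSP1/CPU0:Jul 21 22:26:39.997 : pfm_node_rp[357]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK1 : Set|fab_xbar[213105]|0x1017007|XBAR_0_Slot_3 ",
    "RP/0/RSP1/CPU0:Jul 21 22:26:50.015 : pfm_node_rp[357]: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK1 : Clear|fab_xbar[213105]|0x1017007|XBAR_0_Slot_3",
]
for link, alarm, start, secs in pair_alarms(sample):
    print(f"{link} {alarm}: set for {secs:.3f} s")
# prints: XBAR_0_Slot_3 SERDES_ERROR_LNK1: set for 10.018 s
```

Run over a longer capture, the same function gives the number of flaps per hour and shows whether the gap between Set and Clear is always the same.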

 

Immediately after "Set" message appears:

PORT    Remote Slot  Remote Inst    Logical ID  Status
======================================================
00      0/3/CPU0            02             1        Up
01      0/3/CPU0            01             1        Up
02      0/3/CPU0            01             0        Up
03      0/3/CPU0            00             0        Up
04      0/3/CPU0            00             1        Up
05      0/3/CPU0            03             1        Up
07      0/RSP1/CPU0         00             1        Up
08      0/3/CPU0            03             0        Up
09      0/RSP0/CPU0         00             1        Down
11      0/RSP1/CPU0         00             0        Up
12      0/RSP0/CPU0         00             0        Up
14      0/RSP0/CPU0         01             1        Up
15      0/RSP1/CPU0         01             1        Up
16      0/RSP0/CPU0         01             0        Up
17      0/RSP1/CPU0         01             0        Up
24      0/3/CPU0            02             0        Up

Immediately after the message clears:

PORT    Remote Slot  Remote Inst    Logical ID  Status
======================================================
00      0/3/CPU0            02             1        Up
01      0/3/CPU0            01             1        Up
02      0/3/CPU0            01             0        Up
03      0/3/CPU0            00             0        Up
04      0/3/CPU0            00             1        Up
05      0/3/CPU0            03             1        Up
07      0/RSP1/CPU0         00             1        Up
08      0/3/CPU0            03             0        Up
09      0/RSP0/CPU0         00             1        Up
11      0/RSP1/CPU0         00             0        Up
12      0/RSP0/CPU0         00             0        Up
14      0/RSP0/CPU0         01             1        Up
15      0/RSP1/CPU0         01             1        Up
16      0/RSP0/CPU0         01             0        Up
17      0/RSP1/CPU0         01             0        Up
24      0/3/CPU0            02             0        Up
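Comparing the two tables row by row, the only difference is port 09 toward 0/RSP0/CPU0, which is Down while the alarm is set and Up after it clears. A small sketch of how that comparison could be automated is below. It assumes the five-column layout shown above, and the helper names `parse_ports` and `changed_ports` are hypothetical. The sample text is an excerpt of three rows from each table, not the full output.

```python
def parse_ports(text):
    """Map PORT -> (remote slot, remote inst, logical id, status) from the table text."""
    ports = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly five columns and start with a numeric port number.
        if len(fields) == 5 and fields[0].isdigit():
            ports[fields[0]] = tuple(fields[1:])
    return ports

def changed_ports(before, after):
    """Return {port: (row_before, row_after)} for every port whose row differs."""
    a, b = parse_ports(before), parse_ports(after)
    return {p: (a.get(p), b.get(p))
            for p in sorted(a.keys() | b.keys())
            if a.get(p) != b.get(p)}

# Excerpt of the table captured right after the "Set" message.
BEFORE = """\
PORT    Remote Slot  Remote Inst    Logical ID  Status
======================================================
07      0/RSP1/CPU0         00             1        Up
09      0/RSP0/CPU0         00             1        Down
11      0/RSP1/CPU0         00             0        Up
"""

# Excerpt of the table captured right after the "Clear" message.
AFTER = """\
PORT    Remote Slot  Remote Inst    Logical ID  Status
======================================================
07      0/RSP1/CPU0         00             1        Up
09      0/RSP0/CPU0         00             1        Up
11      0/RSP1/CPU0         00             0        Up
"""

for port, (old, new) in changed_ports(BEFORE, AFTER).items():
    print(f"port {port}: {old[3]} -> {new[3]} (remote {old[0]})")
# prints: port 09: Down -> Up (remote 0/RSP0/CPU0)
```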

 

The line card also appears to be dropping traffic (noticeable drop for users). "show drops" reveals that "Egress Uc dq pkt-len-crc/RO-seq/len error drp" is increasing rapidly.

 

The line card was reseated, rebooted, and tested in two different slots (slot 0 and slot 3). The same issues occur in both slots immediately after the line card boots.

 

I'd assume that this is a bad line card that needs to be RMA'd, but I would like to confirm, as that is a last-resort option. Is there any possibility that this error stems from an issue with the software, the chassis, or some underlying fault in one or more RSPs? I've had several other (different) issues with the fabric crossbar / FIAs on Trident LCs before, so I'm unsure whether it's just bad luck or whether another component in my system is damaged.

 

IOS XR v5.3.4 base, ASR9010 with 2x RSP440-SE. FPD on LC is latest (for this release of IOS XR).

 

8 Replies

xr-escalation
Level 1
Fabric links are trained at every initialisation, to optimise the setting of programmable ASIC parameters. So this could be a HW fault or a scenario in which SW could do a better job of training the link. Do you have the latest 5.3.4 Service Pack installed, or the equivalent individual SMUs? If yes, then this is very likely a HW fault and the line card should be replaced.

Hi,

I am currently running only the base version of XR 5.3.4 with no SMUs or SPs installed. I don't want to install any SMUs or SPs unless there is at least a remote possibility that it will help.

Is there any way to verify (i.e. by looking for a specific keyword in a SMU or SP) that a SMU/SP has an "update" specifically for link training on my line cards? Most SMU/SP upgrades require a router reboot with additional downtime, which I'd like to keep to a minimum at this time.

 

Thanks!

CSCve85121 is one example of a SMU related to fabric monitoring and debugging. I'm fairly sure we posted some others as well. Running a 'vanilla' XR installation, without any SMUs or SPs, is not something we recommend. SMUs and SPs were introduced in IOS XR precisely to help keep the installation up to date with important fixes, without the need to upgrade to a higher IOS XR release. To facilitate the install operation itself, we have delivered the CSM (Cisco Software Manager) Server platform, which significantly reduces the complexity of the network operator's task. Instead of performing the installation manually, the operator can simply watch the progress and review the pre- and post-install check logs. There's also an API to customise the pre- and post-install checks to meet your specific deployment. CSM Server is available for download at: https://software.cisco.com/download/home/282414851/type/284777134/release/4.0.

Hi,

For debugging purposes, IOS XR was upgraded to 6.4.2 (latest supported on RSP440) with SP3. However, this error is still present. I assume this means bad hardware.

Is there any way to verify that this error is 100% due to a bad line card only? What can I check (or which debugging options can I use) to make sure that my RSPs or chassis are not bad as well?

Thanks!

The usual HW troubleshooting should help. Visually inspect the connectors on the LC back-end and inside the slot to see whether there's any observable physical damage. If none is observed, insert another LC into the slot and see whether the problem persists. Based on what you wrote so far, I expect only the LC to be faulty.

Hi Lyphiard,

Good day. Have you resolved this issue? I am facing the same issue now. If you have fixed it, could you tell me how?

 

Thx

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rps-Cheers | If it solves your problem, please mark as answer. Thanks !

Ended up being a hardware issue. We had to RMA the line card.

Thanks for your reply. So it looks like a HW issue. Maybe I also need to replace the bad LC.
