I inserted a new A9K-24X10GE-TR line card today, which prompted the following errors:
RP/0/RSP1/CPU0:Jul 21 22:26:39.997 : pfm_node_rp: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK1 : Set|fab_xbar|0x1017007|XBAR_0_Slot_3
RP/0/RSP1/CPU0:Jul 21 22:26:50.015 : pfm_node_rp: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK1 : Clear|fab_xbar|0x1017007|XBAR_0_Slot_3
This error is repeated several times per hour, each time with ~10 seconds between the "Set" and "Clear" messages.
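For anyone wanting to quantify how often the alarm fires and how long each occurrence lasts, here is a minimal Python sketch that pairs each "Set" with the following "Clear" in a saved syslog capture. The function name `alarm_durations` and the `year` parameter are my own (the syslog lines carry no year, so one has to be assumed), and the regex is written only against the two messages quoted above, so it may need adjusting for other alarm formats.

```python
import re
from datetime import datetime

# The two messages quoted above, used as sample input.
LOG = """\
RP/0/RSP1/CPU0:Jul 21 22:26:39.997 : pfm_node_rp: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK1 : Set|fab_xbar|0x1017007|XBAR_0_Slot_3
RP/0/RSP1/CPU0:Jul 21 22:26:50.015 : pfm_node_rp: %PLATFORM-CROSSBAR-1-SERDES_ERROR_LNK1 : Clear|fab_xbar|0x1017007|XBAR_0_Slot_3
"""

# Pattern derived from the quoted messages only (an assumption about the general format).
EVENT_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d\.\d+) : pfm_node_rp: "
    r"%(?P<alarm>PLATFORM-CROSSBAR-\d-SERDES_ERROR_LNK\d+) : "
    r"(?P<state>Set|Clear)\|fab_xbar\|(?P<code>0x[0-9a-fA-F]+)\|(?P<where>\S+)"
)


def alarm_durations(log_text, year=2000):
    """Return (alarm, location, seconds) for every Set/Clear pair found.

    `year` is an assumed value: the log timestamps do not include one.
    """
    open_alarms = {}   # (alarm, location) -> time of the pending "Set"
    durations = []
    for m in EVENT_RE.finditer(log_text):
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
        key = (m["alarm"], m["where"])
        if m["state"] == "Set":
            open_alarms[key] = ts
        elif key in open_alarms:
            delta = ts - open_alarms.pop(key)
            durations.append((key[0], key[1], delta.total_seconds()))
    return durations


for alarm, where, seconds in alarm_durations(LOG):
    print(f"{alarm} at {where}: active for {seconds:.3f} s")
```

Run against the two lines above, this reports a single occurrence lasting about 10 seconds, matching what I see by eye.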
Immediately after the "Set" message appears:
PORT  Remote Slot   Remote Inst  Logical ID  Status
======================================================
00    0/3/CPU0      02           1           Up
01    0/3/CPU0      01           1           Up
02    0/3/CPU0      01           0           Up
03    0/3/CPU0      00           0           Up
04    0/3/CPU0      00           1           Up
05    0/3/CPU0      03           1           Up
07    0/RSP1/CPU0   00           1           Up
08    0/3/CPU0      03           0           Up
09    0/RSP0/CPU0   00           1           Down
11    0/RSP1/CPU0   00           0           Up
12    0/RSP0/CPU0   00           0           Up
14    0/RSP0/CPU0   01           1           Up
15    0/RSP1/CPU0   01           1           Up
16    0/RSP0/CPU0   01           0           Up
17    0/RSP1/CPU0   01           0           Up
24    0/3/CPU0      02           0           Up
Immediately after the message clears:
PORT  Remote Slot   Remote Inst  Logical ID  Status
======================================================
00    0/3/CPU0      02           1           Up
01    0/3/CPU0      01           1           Up
02    0/3/CPU0      01           0           Up
03    0/3/CPU0      00           0           Up
04    0/3/CPU0      00           1           Up
05    0/3/CPU0      03           1           Up
07    0/RSP1/CPU0   00           1           Up
08    0/3/CPU0      03           0           Up
09    0/RSP0/CPU0   00           1           Up
11    0/RSP1/CPU0   00           0           Up
12    0/RSP0/CPU0   00           0           Up
14    0/RSP0/CPU0   01           1           Up
15    0/RSP1/CPU0   01           1           Up
16    0/RSP0/CPU0   01           0           Up
17    0/RSP1/CPU0   01           0           Up
24    0/3/CPU0      02           0           Up
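Since the two link-status captures differ in only one place, a small script can pick out the flapping link without comparing them by eye. The following is a rough Python sketch; `parse_links` and `changed_links` are hypothetical helper names, and the sample strings contain only a few of the rows from the captures above.

```python
import re

# A few rows copied from the two captures above (not the full tables).
DURING_ALARM = """\
07 0/RSP1/CPU0 00 1 Up
09 0/RSP0/CPU0 00 1 Down
11 0/RSP1/CPU0 00 0 Up
12 0/RSP0/CPU0 00 0 Up
"""
AFTER_CLEAR = """\
07 0/RSP1/CPU0 00 1 Up
09 0/RSP0/CPU0 00 1 Up
11 0/RSP1/CPU0 00 0 Up
12 0/RSP0/CPU0 00 0 Up
"""

# One table row: port, remote slot, remote instance, logical ID, status.
# Header and separator lines do not start with digits, so they are skipped.
ROW_RE = re.compile(r"^(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(Up|Down)\s*$", re.M)


def parse_links(text):
    """Map port -> (remote_slot, remote_inst, logical_id, status)."""
    return {m[1]: (m[2], m[3], m[4], m[5]) for m in ROW_RE.finditer(text)}


def changed_links(before, after):
    """List (port, remote_slot, old_status, new_status) for links whose status differs."""
    a, b = parse_links(before), parse_links(after)
    return [
        (port, a[port][0], a[port][3], b[port][3])
        for port in sorted(a.keys() & b.keys())
        if a[port][3] != b[port][3]
    ]


print(changed_links(DURING_ALARM, AFTER_CLEAR))
# prints [('09', '0/RSP0/CPU0', 'Down', 'Up')]
```

In the captures above, the only link that changes state is port 09, whose remote end is 0/RSP0/CPU0.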
The line card also appears to be dropping traffic (noticeable drop for users). "show drops" reveals that "Egress Uc dq pkt-len-crc/RO-seq/len error drp" is increasing rapidly.
The line card was reseated, rebooted, and tested in two different slots (slot 0 and slot 3). The same issues occur in both slots immediately after the line card boots.
I'd assume that this is a bad line card that needs to be RMA'd, but would just like to confirm as that is a last-resort option. Is there any possibility that this error stemmed from an issue with the software, chassis or some underlying fault in one or more RSP's? I've had several (different) other issues with the fabric crossbar / FIA's on trident LC's before, so I'm unsure whether it's just bad luck or if another component in my system is damaged.
IOS XR v5.3.4 base, ASR9010 with 2x RSP440-SE. FPD on LC is latest (for this release of IOS XR).
I currently am only running the base version of XR 5.3.4 with no SMU's or SP's installed. I don't want to install any SMU's or SP's unless there is even a remote possibility that it will help.
Is there any way to verify (e.g. by looking for a specific keyword in a SMU or SP) that a SMU/SP contains a fix specifically for link training on my line cards? Most SMU/SP upgrades require a router reboot with additional downtime, which I'd like to keep to a minimum at this time.
For debugging purposes, IOS XR was upgraded to 6.4.2 (latest supported on RSP440) with SP3. However, this error is still present. I assume this means bad hardware.
Is there any way to verify that this error is 100% due to a bad line card only? What possibilities (or debugging options) can I check to ensure that my RSPs or chassis are not bad as well?
The usual HW troubleshooting should help. Visually inspect the connectors on the back of the LC and inside the slot for any observable physical damage. If none is found, insert another LC into the slot and see whether the problem persists. Based on what you wrote so far, I expect only the LC to be faulty.