12-04-2020 10:18 PM
Hello,
Lately, I've seen two outages (both on ASR 9901/Starlord) where an external TCAM error either caused LC 0/0/CPU0 to reboot, or caused both NP0 and NP1 to lock up completely, dropping all forwarded traffic until 0/0/CPU0 was manually reloaded or remote hands power-cycled the router chassis:
ERROR! 0x80001755 EZprmCAMC_ExtTCAMCommand: command fails due to EZprmCAMC_TCAM_CONTROL_INTERFACE_OR_DEVICE_ERR in file 'drivers/chips/np/ezchip-5c/src/host/driver/src/prm/chn/EZprmCAMC.c' line 2493
Is there a bad batch of external TCAM shipping with some ASR 9901s? Should we open a TAC case for a HW RMA, or is this one of those freak situations that can be completely resolved by a SW update?
DDTS CSCvs36064 is very vague -- it claims the issue is "very rare" and suggests a SW correction, but it also states it is "not a SW issue", which seems to point to possible bad TCAM memory or a hardware failure?
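For anyone who wants to check collected logs for the same signature, here is a minimal sketch in Python. It assumes the error line reaches a log file you can read in the form quoted above; the regular expression and the function name `find_tcam_errors` are my own and are not part of any Cisco tooling.

```python
import re

# Pattern derived from the single error line quoted in this thread.
# Whether other EZprm errors follow the same layout is an assumption.
TCAM_ERR = re.compile(
    r"ERROR!\s+(?P<code>0x[0-9a-fA-F]+)\s+EZprmCAMC_ExtTCAMCommand:.*?"
    r"(?P<reason>EZprmCAMC_TCAM_\w+)"
)


def find_tcam_errors(lines):
    """Return (line_number, error_code, reason) for each matching log line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = TCAM_ERR.search(line)
        if match:
            hits.append((number, match.group("code"), match.group("reason")))
    return hits


sample = [
    "some unrelated log line",
    "ERROR! 0x80001755 EZprmCAMC_ExtTCAMCommand: command fails due to "
    "EZprmCAMC_TCAM_CONTROL_INTERFACE_OR_DEVICE_ERR in file "
    "'drivers/chips/np/ezchip-5c/src/host/driver/src/prm/chn/EZprmCAMC.c' "
    "line 2493",
]
print(find_tcam_errors(sample))
```

Feeding it the lines of a saved log (for example `open(path).read().splitlines()`) gives the line numbers worth including in a TAC case.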
12-05-2020 05:33 AM
Hey!
Unfortunately, I have never seen this before.
I would generally suggest opening a TAC case and letting Cisco advise here.
Best regards
Juls
12-05-2020 09:43 AM - edited 12-05-2020 09:47 AM
Hi,
The TCAM is not necessarily faulty. Its resources are used for interface creation; if that process fails, we will see uidb drops.
The SMU that is available for 6.5.3 corrects this by retrying or, in the worst case, resetting the NP.
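To make the "retry, then reset the NP" idea concrete, here is a rough sketch in Python. It is only an illustration of the escalation pattern, not the actual driver code: `retry_command` and `reset_np` are hypothetical callables, and the retry count of 3 is an arbitrary value chosen for the example.

```python
def recover_tcam_fault(retry_command, reset_np, max_retries=3):
    """Retry the failed command; fall back to an NP reset if retries fail.

    retry_command and reset_np are hypothetical stand-ins for driver
    internals. retry_command returns True when the command succeeds.
    Returns "retried" or "np_reset" to show which path was taken.
    """
    for _ in range(max_retries):
        if retry_command():
            return "retried"
    # All retries failed: escalate to the heavier recovery action.
    reset_np()
    return "np_reset"


# Simulated fault that clears on the third attempt.
attempts = iter([False, False, True])
print(recover_tcam_fault(lambda: next(attempts), lambda: None))
```

The point of the pattern is that the cheap action is tried first and the disruptive one only when it does not help.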
The software fix will stop this issue from occurring; if it cannot, it will try to restore. The challenge with this bug is that once it is hit, we do not get any real alert, only NP drops while the interfaces start black-holing traffic.
I would recommend installing the SMU if you are running 6.4.2, 6.5.3, or 6.6.2.
The engineering team is working on releasing one for 7.0.2 as well; however, it is not posted yet.
Thanks
12-05-2020 09:50 AM - edited 12-05-2020 09:51 AM
Thank you for the detailed information! I'll take a look at the SMU, though the 6.5.3 CSCvs36064 SMU readme lists 5 prerequisites, which themselves have prerequisites, so the dependencies appear to be complex.
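As a way to reason about nested prerequisites like these, a topological sort gives an order in which every prerequisite comes before the SMU that needs it. The sketch below uses Python's standard `graphlib` module (3.9+). Apart from CSCvs36064, the IDs `SMU-A`, `SMU-B`, and `SMU-C` are made-up placeholders, since the readme's actual prerequisite list is not reproduced in this thread.

```python
from graphlib import TopologicalSorter


def install_order(prereqs):
    """Return SMU IDs ordered so each prerequisite precedes its dependants.

    prereqs maps an SMU ID to the set of SMU IDs it requires.
    graphlib raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(prereqs).static_order())


# Hypothetical dependency map; only CSCvs36064 comes from this thread.
prereqs = {
    "CSCvs36064": {"SMU-A", "SMU-B"},
    "SMU-A": {"SMU-C"},
    "SMU-B": {"SMU-C"},
    "SMU-C": set(),
}
order = install_order(prereqs)
print(order)
```

In this made-up map, `SMU-C` comes out first and CSCvs36064 last.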
The second question that I have -- is it normal for this to occur when nobody was working on the router? It was just forwarding traffic in production, then stopped forwarding until 0/0/CPU0 was manually reloaded. Nobody had logged into the device to create new interfaces or make config changes that might have caused interface lists to change. The device is doing LER/LSR duties, and it's not running BNG either, which might have created or edited interfaces.
Thanks!
12-05-2020 10:47 AM - edited 12-05-2020 10:49 AM
Hi James,
Yes, the SMUs and prerequisites can get complicated. There is a tool called CSM (Cisco Software Manager) which can help alleviate this problem.
The interface creation was just an example I gave of TCAM use. The main issue is the interrupt/fault not being corrected once it is hit. I work in TAC and had a similar case where all the bundle members on a particular card went down due to this issue and never recovered until the LC was reloaded.
The software fix will help recover when this fault is detected. I would install the SMU when you are able, so you don't feel this pain again.
Thanks