James Jun
Beginner

ASR 9901 with bad External TCAMs?

Hello,

 

Lately, I've seen two outages (both on ASR 9901/Starlord) where an external TCAM error either caused LC0/0/CPU0 to reboot, or caused both NP0 and NP1 to lock up completely, dropping all forwarded traffic until 0/0/CPU0 was manually reloaded or remote hands power-cycled the router chassis:

 

ERROR! 0x80001755 EZprmCAMC_ExtTCAMCommand: command fails due to EZprmCAMC_TCAM_CONTROL_INTERFACE_OR_DEVICE_ERR in file 'drivers/chips/np/ezchip-5c/src/host/driver/src/prm/chn/EZprmCAMC.c' line 2493

 

 

Is there a bad batch of external TCAM shipping with some ASR 9901s?  Should we open a TAC case for HW RMA, or is this one of those freak situations that can be completely resolved by SW update?

 

DDTS CSCvs36064 is very vague -- it claims the issue is "very rare" and suggests a SW correction, but it also states it is "not a SW issue", which seems to point to possible bad TCAM memory or a hardware failure?
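For anyone who wants to watch collected logs for this signature, here is a minimal sketch that pulls the fields out of the error line quoted above. The regular expression is modeled only on that one sample line, so treat its shape as an assumption; other releases or NP types may format the message differently. The function name `parse_tcam_error` is made up for illustration.

```python
import re

# Pattern modeled on the single error line quoted in this thread.
# Assumption: other occurrences follow the same layout.
TCAM_ERR = re.compile(
    r"ERROR!\s+(?P<code>0x[0-9A-Fa-f]+)\s+"
    r"(?P<func>EZprmCAMC_\w+):\s+command fails due to\s+"
    r"(?P<reason>EZprmCAMC_TCAM_\w+)\s+"
    r"in file '(?P<file>[^']+)'\s+line\s+(?P<line>\d+)"
)


def parse_tcam_error(log_line):
    """Return the fields of an external TCAM error line, or None if no match."""
    match = TCAM_ERR.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])  # source line number as an integer
    return fields


# The sample line from the original post.
SAMPLE = (
    "ERROR! 0x80001755 EZprmCAMC_ExtTCAMCommand: command fails due to "
    "EZprmCAMC_TCAM_CONTROL_INTERFACE_OR_DEVICE_ERR in file "
    "'drivers/chips/np/ezchip-5c/src/host/driver/src/prm/chn/EZprmCAMC.c' "
    "line 2493"
)

if __name__ == "__main__":
    print(parse_tcam_error(SAMPLE))
```

Feeding each line of a syslog export through `parse_tcam_error` and alerting on any non-None result would at least flag the error when it is logged.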

 


4 REPLIES 4
julian.bendix
Participant

Hey!

Unfortunately, I have never seen this before.

I would generally suggest opening a TAC case and letting Cisco advise here.

Best regards
Juls

tkarnani
Cisco Employee

Hi,

 

The TCAM is not necessarily faulty. Its resources are used, for example, for interface creation; if that process fails, we see those UIDB drops.

The SMU available for 6.5.3 corrects this by retrying or, in the worst case, resetting the NP.

The software fix tries to prevent this issue from occurring; if it cannot, it tries to restore the NP. The challenge with this bug is that once it is hit, we do not get any real alert, only NP drops, while the interfaces start black-holing traffic.

I would recommend installing the SMU if you are running 6.4.2, 6.5.3, or 6.6.2.

The engineering team is working on releasing one for 7.0.2 as well; however, it is not posted yet.

 

Thanks
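The "retry, then reset the NP in the worst case" behaviour described above follows a common recovery pattern. The sketch below only illustrates that pattern in general terms; it is not the SMU's actual code, and `run_with_recovery`, `FlakyCommand`, and the retry count are all hypothetical names and values chosen for the example.

```python
def run_with_recovery(command, reset_np, max_retries=3):
    """Run `command`; retry on failure, then call `reset_np` as a last resort.

    `command` and `reset_np` are hypothetical callables supplied by the caller.
    Returns "ok", "ok_after_retry", or "np_reset".
    """
    for attempt in range(1 + max_retries):
        try:
            command()
            return "ok" if attempt == 0 else "ok_after_retry"
        except RuntimeError:
            continue  # transient failure: try the command again
    reset_np()  # every attempt failed: fall back to the heavier recovery
    return "np_reset"


class FlakyCommand:
    """Stand-in for a TCAM command that fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures

    def __call__(self):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("TCAM command failed")


if __name__ == "__main__":
    resets = []
    print(run_with_recovery(FlakyCommand(2), lambda: resets.append(1)))
    print(run_with_recovery(FlakyCommand(99), lambda: resets.append(1)))
```

The point of the ordering is that a cheap retry is attempted first, and the disruptive reset happens only when retries are exhausted.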

Thank you for the detailed information!  I'll take a look at the SMU (though the 6.5.3 CSCvs36064 SMU Readme lists 5 prerequisites, which themselves also have prerequisites, so the dependencies appear to be complex).

 

The second question that I have -- is it normal for this to occur when nobody was working on the router?  It was just forwarding traffic in production, then stopped forwarding until 0/0/CPU0 was manually reloaded.  Meaning, we didn't have anyone log into the device to create any new interfaces or make any config changes that may have caused interface lists to change.  The device is doing LER/LSR duties -- it's not running BNG either, which might have created or edited interfaces.

 

Thanks!
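On the nested prerequisites: working out a valid install order by hand is a dependency-ordering problem, which can be sketched with a topological sort. The SMU names below are placeholders invented for the example and are not taken from the CSCvs36064 Readme; the real prerequisite list has to come from the Readme itself.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def install_order(prereqs):
    """Return SMUs ordered so that each one comes after its prerequisites.

    `prereqs` maps each SMU name to the set of SMUs it requires.
    Raises graphlib.CycleError if the prerequisites are circular.
    """
    return list(TopologicalSorter(prereqs).static_order())


# Hypothetical placeholder names, NOT real SMU IDs.
PREREQS = {
    "smu-target": {"smu-a", "smu-b"},
    "smu-a": {"smu-c"},
    "smu-b": {"smu-c"},
    "smu-c": set(),
}

if __name__ == "__main__":
    for name in install_order(PREREQS):
        print(name)
```

This only orders a list you have already collected; it does not check what is installed on the router.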

Hi James,

 

Yes, the SMUs and their prerequisites can get complicated. There is a tool called CSM that can help with this.

 

Interface creation was just one example I gave of TCAM use. The main issue is that the interrupt/fault is not corrected once it is hit. I work in TAC and had a similar case where all the bundle members on a particular card went down because of this issue and never recovered until the LC was reloaded.

 

The software fix will help the system recover when this fault is detected. I would install the SMU when you are able, so that you don't feel this pain again.


Thanks

 

 
