
Ask the Expert: Router & IOS Architecture And Unexpected Reboots on Routers

Lisa Latour
Level 6

Welcome to this Cisco Support Community Ask the Expert conversation. This is an opportunity to learn and ask Cisco expert Vinit Jain questions about router and IOS architecture, and about unexpected reboots on IOS routers such as the 7600, 2900, and 3900 that you might be facing in your environment and that can have a huge impact on your services.

Ask questions from Wednesday, May 27th, 2015 to Tuesday, June 9, 2015

Different routers have different architectures, capabilities, and features that you can use to enhance router performance and get certain tasks done. Reboots on routers are mainly caused by three factors:

  1. Software issue
  2. Hardware failure
  3. Memory issues (Parity Errors, Bad DRAMs). 

Vinit will be helping you with all your queries on all of the above.

Vinit Jain will also be speaking at Cisco Live in June 2015 on Troubleshooting BGP (BRKRST-3320). 

 

Vinit Jain, 3X CCIE #22854, is a Technical Lead in the HTTS (High Touch Technical Support) team, supporting customers in routing, MPLS, TE, IPv6, multicast, and a wide variety of platform issues such as high CPU and memory leaks across the IOS, IOS XE, IOS XR, and NX-OS code bases. He has delivered training within Cisco on various technologies and on platform troubleshooting topics, and has written a workbook on IOS XR fundamentals on the Cisco Support Community. Vinit holds CCIEs in R&S, SP, and Security, along with multiple certifications in programming and databases.

Find other events at https://supportforums.cisco.com/expert-corner/events.

**Ratings Encourage Participation!**
Please be sure to rate the answers to questions.


Avinash Kumar
Level 1

Hi Vinit,

I am working on an issue where we are seeing very high TCAM utilization on multiple 7609-S nodes.

The devices are running an engineering special version.

The only workaround is to reload the box, which is disruptive.

- Is there any non-disruptive workaround available?

- Is there a permanent fix for this issue?

- What could be the root cause: a config/design issue, or a hardware/software defect?

 

Regards,

Avi

Hello Avinash

Could you please share the output of the following commands in a file:

- show tcam count
- show tcam count detail
- show mls cef adj usage
- show mls cef hardware
- show mls cef summary
- show mls cef exception status
- show module
- show version

Do we know which TCAM is heavily utilized? Is there any event that triggers the issue?

Thanks,

Vinit

Thanks
--Vinit

Router1#show module

  5    2  Route Switch Processor 720 (Active)    RSP720-3CXL-GE    
  6    2  Route Switch Processor 720 (Hot)       RSP720-3CXL-GE    

 

Image

[XXX_v151_3_s1_es-xxx-XXX_v151_3_s1_es 169]

============================================

Router1#show tcam count
           Used        Free        Percent Used       Reserved
           ----        ----        ------------       --------
 Labels:(in) 22        4074            0
 Labels:(eg)  2        4094            0

ACL_TCAM
--------
  Masks:     99        3997            2                    72
Entries:    268       32500            0                   576

QOS_TCAM
--------
  Masks:      7        4089            0                    18
Entries:     42       32726            0                   144

    LOU:      8         120            6
  ANDOR:      1          15            6
  ORAND:      0          16            0
    ADJ:      3        2045            0

Router1#show tcam count detail
           Used        Free        Percent Used       Reserved
           ----        ----        ------------       --------
 Labels:(in) 22        4074            0
 Labels:(eg)  2        4094            0

ACL_TCAM
--------
HI BANK
  Masks:     64        1984            3                    72
Entries:    166       16218            1                   576

LOW BANK
  Masks:     35        2013            1                     0
Entries:    102       16282            0                     0

QOS_TCAM
--------
HI BANK
  Masks:      0        2048            0                    18
Entries:      0       16384            0                   144

LOW BANK
  Masks:      7        2041            0                     0
Entries:     42       16342            0                     0

    LOU:      8         120            6
  ANDOR:      1          15            6
  ORAND:      0          16            0
    ADJ:      3        2045            0

Router1#show mls cef adj usage
Adjacency Table Size:     1048576
ACL region usage:         3
Non-stats region usage:   106051
Stats region usage:       433893
Total adjacency usage:    539947
Router1#show mls cef hardware

  CEF TCAM v2:
  Size: 1048576 entries
        262144 rows/device, 4 device(s)
        32 entries/mask-block
        32768 total blocks (32b wide)
        4849664 s/w table memory
  Options:
        sanity check: on
        sanity interval: 301 seconds
        consistency check: on
        consistency interval: 11 seconds
        redistribution: off
            redistribution interval: 120 seconds
            redistribution threshold: 10
        compression: on
            compression interval: 31 seconds
        tcam/ssram shadowing: on
  Operation Statistics:
        Entries inserted:               0000000056771663
        Entries deleted:                0000000056278100
        Entries compressed:             0000000004028899
        Blocks inserted:                0000000000742257
        Blocks deleted:                 0000000000726725
        Blocks compressed:              0000000000317339
        Blocks shuffled:                0000000000199869
        Blocks deleted for exception:   0000000000000000
        Direct h/w modifications:       0000000000000000

  Background Task Statistics:
        Consistency Check count:        0000000003471970
        Consistency Errors:             0000000000000002
        SSRAM Consistency Errors:       0000000000000000
        Sanity Check count:             0000000000127602
        Sanity Check Errors:            0000000000000000
        Compression count:              0000000000176966

        Exception Handling status    : on
        L3 Hardware switching status : on
        Fatal Error Handling Status  : Reset  
        Fatal Errors:                   0000000000000000
        Fatal Error Recovery Count:     0000000000000000

  SSRAM ECC error summary:
        Uncorrectable ecc entries    : 0               
        Correctable ecc entries      : 0               
        Packets dropped              : 0               
        Packets software switched    : 0               

FIB SSRAM Entry status
----------------------
Key: UC - Uncorrectable error, C - Correctable error
        SSRAM banks  :  Bank0    Bank1
No ECC errors reported in FIB SSRAM.

Router1# show mls cef summary

Total routes:                     493566
    IPv4 unicast routes:          229382
    IPv4 Multicast routes:        381   
    MPLS routes:                  262577
    IPv6 unicast routes:          1226  
    IPv6 multicast routes:        3     
    EoM routes:                   0     
Router1#show mls cef exception status
Current IPv4 FIB exception state = FALSE
Current IPv6 FIB exception state = FALSE
Current MPLS FIB exception state = FALSE


====

Hello Avinash

Looking at the logs, the IPv4 + MPLS TCAM utilization is over 90%. Once it reaches 100%, the router will go into exception state. Refer to the following CCO document:

http://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/117712-problemsolution-cat6500-00.html

This document also explains how to adjust the TCAM allocation, which can be helpful, but those changes require a reload.

It seems that the growth of the Internet routing table might have been what impacted the customer's network.
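The "over 90%" figure above can be roughly reproduced from the `show mls cef summary` counters posted earlier in this thread. The sketch below is only an illustration: the 512K-entry limit for the shared IPv4 + MPLS region is an assumption (it is not shown in the posted output and should be confirmed with `show mls cef maximum-routes` on the device), and the function name is hypothetical.

```python
# Route counts copied from the "show mls cef summary" output in this thread.
IPV4_UNICAST_ROUTES = 229382
MPLS_ROUTES = 262577

# ASSUMPTION: the shared IPv4 + MPLS FIB TCAM region is 512K entries
# (1K = 1024). This value is not in the posted output; verify it with
# "show mls cef maximum-routes" on the actual device.
ASSUMED_IPV4_MPLS_LIMIT = 512 * 1024


def fib_utilization_percent(ipv4_routes: int, mpls_routes: int, limit: int) -> float:
    """Return the percentage of the shared IPv4 + MPLS region in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * (ipv4_routes + mpls_routes) / limit


if __name__ == "__main__":
    used = IPV4_UNICAST_ROUTES + MPLS_ROUTES
    pct = fib_utilization_percent(
        IPV4_UNICAST_ROUTES, MPLS_ROUTES, ASSUMED_IPV4_MPLS_LIMIT
    )
    print(f"{used} of {ASSUMED_IPV4_MPLS_LIMIT} entries in use ({pct:.1f}%)")
```

Under that assumed limit, the two counters together land a little under 94%, which is consistent with the estimate given in the reply.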

Hope this helps.

Vinit


Hi Vinit,

Router1#show mls cef exception status
Current IPv4 FIB exception state = FALSE
Current IPv6 FIB exception state = FALSE
Current MPLS FIB exception state = FALSE

What if all of these states are set to TRUE?

Hello,

This Ask the expert event is closed. Kindly post your question on the 

I hope you and your loved ones are safe and healthy
Monica Lluis
Community Manager Lead

Hongju Jung
Level 1

Hi Vinit Jain,

My name is HJ Jung, from Korea.

I support an ISP that has many WS-C6500, 7600, CRS, and other devices.

Some of these devices reboot about once a week, so I reported it to the TAC team, but they usually answer that it is a parity error and recommend monitoring.

However, neither I nor our customer can make sense of a monitoring recommendation for a parity error issue, so I have the questions below:

1. Can you explain what exactly a parity error is, and why this kind of issue occurs on each platform?

2. Why must a reset or reboot occur on a parity error or in other circumstances?

3. Can we control the parity error detection schedule, or is there any other control method via configuration?

Thanks.

Hello Mr. Jung

That is a really good set of questions, and ones that Cisco TAC sees frequently. My answers to each of them are below:

1. There are two kinds of parity errors: soft parity errors and hard parity errors. Soft parity errors are the ones that happen once in a while and can be treated as transient hardware issues. Hard parity errors are actual hardware faults (in the error logs you can see a Single-Bit, Double-Bit, or Multi-Bit parity error), in which case you should replace your hardware. There is a good CCO document on parity errors, linked below:

http://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/116135-trouble-6500-parity-00.html

2. If you are getting parity errors very frequently in your environment, you should check whether the engineers in your data center are following ESD (electrostatic discharge) precautions. Parity errors can also be caused by environmental factors. The document above covers ESD precautions well.

Please note that if a parity error occurs frequently on a single device, you need to replace it. It is advisable to replace the hardware if the device crashes due to a parity error more than once in 6 months. If it is just once, we can treat it as a soft parity error and monitor for another occurrence. If it is more than once, we treat it as a hard parity error, and it is advisable to replace the card that experienced the parity error.

3. There is no configuration to control or detect parity errors. The best way to limit them is to improve ESD precautions and make sure the environment is suitable for all those devices.
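The six-month rule of thumb from answer 2 can be expressed as a small check over the crash history of a single card. This is only a sketch of that guideline, not a TAC tool: the function name and the returned strings are hypothetical, and "6 months" is approximated as 183 days.

```python
from datetime import date, timedelta

# ASSUMPTION: "6 months" is approximated as 183 days.
SIX_MONTHS = timedelta(days=183)


def recommend_action(crash_dates: list[date]) -> str:
    """Apply the rule of thumb to the parity-error crashes of ONE card.

    More than one crash inside any six-month window is treated as a hard
    parity error (replace); otherwise it is treated as soft (monitor).
    """
    ordered = sorted(crash_dates)
    # After sorting, if any two crashes fall within the window, then some
    # pair of neighbouring crashes does too, so comparing neighbours suffices.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier <= SIX_MONTHS:
            return "replace hardware"
    return "monitor"


if __name__ == "__main__":
    # Illustrative dates only, not taken from any real log.
    print(recommend_action([date(2015, 1, 10)]))
    print(recommend_action([date(2015, 1, 10), date(2015, 3, 2)]))
```

The input should contain only crashes attributed to parity errors on the same card; mixing cards or crash causes would make the result meaningless.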

In addition to the above, I would like to see one of the logs in which you encountered parity errors (from either the 6500 or 7600 series).

Hope this information was helpful. 

Thanks,

Vinit

PS: Please do rate the post if you find the information useful.


Hi Vinit,

Thanks for the answer, but I have a question about the ESD process.

Do you have any guidelines on how to check ESD in the data center?

If you have an ESD checklist or process documentation, please let me know.

Thanks.

Hello Hongju

I was able to find a Cisco training on ESD:

http://www.cisco.com/web/learning/le31/esd/WhatIsP.html

You can click Next to learn more about ESD.

There is another CCO document that talks a bit about ESD:

http://www.cisco.com/web/tsweb/pdf/Guidelines-and-Best-Practices.pdf

Hope this helps

Vinit


Monica Lluis
Level 9

Thank you for all your responses.

