Ask the Expert: Router & IOS Architecture And Unexpected Reboots on Routers

Lisa Latour · 05-26-2015

Welcome to this Cisco Support Community Ask the Expert conversation. This is an opportunity to learn from and put questions to Cisco expert Vinit Jain about router and IOS architecture, and about unexpected reboots on IOS routers such as the 7600, 2900, and 3900 that you may be facing in your environment and that can have a major impact on your services.

Ask questions from Wednesday, May 27th, 2015 to Tuesday, June 9, 2015

Different routers have different architectures, capabilities, and features that you can use to improve router performance and get specific tasks done. Reboots on routers happen mainly because of three factors:

Software issue
Hardware failure
Memory issues (parity errors, bad DRAM)

Vinit will be helping you with all your queries on all of the above.

Vinit Jain will also be speaking at Cisco Live in June 2015 on Troubleshooting BGP (BRKRST-3320).

Vinit Jain, 3X CCIE #22854, is a Technical Lead in the HTTS (High Touch Technical Support) team, supporting customers in the areas of routing, MPLS, TE, IPv6, multicast, and a wide variety of platform issues such as high CPU and memory leaks across the IOS, IOS XE, IOS XR, and NX-OS code bases. He has delivered training within Cisco on various technologies as well as platform troubleshooting topics, and has written a workbook on IOS XR fundamentals on the Cisco Support Community. Vinit holds CCIEs in R&S, SP, and Security, along with multiple certifications in programming and databases.

Find other events at https://supportforums.cisco.com/expert-corner/events.

**Ratings Encourage Participation!**
Please be sure to rate the answers to questions.

Avinash Kumar · 05-26-2015

Hi Vinit,

I am working on an issue in which we are seeing very high TCAM utilization on multiple nodes (7609-S).

The device is running an engineering special version.

The only workaround is to reload the box, which is disruptive.

# Is there any non-disruptive workaround available?

# Is there a permanent fix for this issue?

# What could be the root cause: a config/design issue, or a hardware/software defect?

Regards,

Avi

Vinit Jain · 05-26-2015

Hello Avinash

Could you please share the following logs in a file:

- show tcam count
- show tcam count detail
- show mls cef adj usage
- show mls cef hardware
- show mls cef summary
- show mls cef exception status
- show module
- show version

Do we know which TCAM is heavily utilized? Is there any event that triggers the issue?
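If these outputs need to be gathered from several nodes, the capture can be scripted. The following is only a minimal Python sketch: `run_command` is a hypothetical callable (for example, a thin wrapper around whatever SSH library you already use) that takes a CLI command and returns its output as text. The names `collect_outputs` and `save_outputs` and the `Router1#` prompt are illustrative and not part of any Cisco tool.

```python
from typing import Callable, Iterable

# The commands requested in this post.
COMMANDS = [
    "show tcam count",
    "show tcam count detail",
    "show mls cef adj usage",
    "show mls cef hardware",
    "show mls cef summary",
    "show mls cef exception status",
    "show module",
    "show version",
]


def collect_outputs(
    run_command: Callable[[str], str],
    commands: Iterable[str] = COMMANDS,
    prompt: str = "Router1#",
) -> str:
    """Run each command via run_command and join the results into one text blob.

    run_command is a HYPOTHETICAL callable supplied by the caller; it must
    accept a CLI command string and return that command's output as text.
    """
    sections = []
    for command in commands:
        output = run_command(command)
        # Prefix each section with the prompt and command so the file reads
        # like a normal terminal capture.
        sections.append(f"{prompt}{command}\n{output.rstrip()}\n")
    return "\n".join(sections)


def save_outputs(text: str, path: str) -> None:
    """Write the collected text to a file that can be attached to a case."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


if __name__ == "__main__":
    # Stand-in for a real device session, used only to show the call pattern.
    def fake_run(command: str) -> str:
        return f"(output of '{command}' would appear here)"

    print(collect_outputs(fake_run))
```

Keeping the transport behind a plain callable means the same function works with any session library, and it can be exercised offline with a stub as shown.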

Thanks,

Vinit


Avinash Kumar · 05-26-2015

Router1#show module

5 2 Route Switch Processor 720 (Active) RSP720-3CXL-GE
6 2 Route Switch Processor 720 (Hot) RSP720-3CXL-GE

Image

[XXX_v151_3_s1_es-xxx-XXX_v151_3_s1_es 169]

============================================

Router1#show tcam count
                Used        Free        Percent Used       Reserved
                ----        ----        ------------       --------
 Labels:(in)      22        4074            0
 Labels:(eg)       2        4094            0

ACL_TCAM
--------
      Masks:      99        3997            2                 72
    Entries:     268       32500            0                576

QOS_TCAM
--------
      Masks:       7        4089            0                 18
    Entries:      42       32726            0                144

        LOU:       8         120            6
      ANDOR:       1          15            6
      ORAND:       0          16            0
        ADJ:       3        2045            0

Router1#show tcam count detail
                Used        Free        Percent Used       Reserved
                ----        ----        ------------       --------
 Labels:(in)      22        4074            0
 Labels:(eg)       2        4094            0

ACL_TCAM
--------
HI BANK
      Masks:      64        1984            3                 72
    Entries:     166       16218            1                576

LOW BANK
      Masks:      35        2013            1                  0
    Entries:     102       16282            0                  0

QOS_TCAM
--------
HI BANK
      Masks:       0        2048            0                 18
    Entries:       0       16384            0                144

LOW BANK
      Masks:       7        2041            0                  0
    Entries:      42       16342            0                  0

        LOU:       8         120            6
      ANDOR:       1          15            6
      ORAND:       0          16            0
        ADJ:       3        2045            0

Router1#show mls cef adj usage
Adjacency Table Size:     1048576
ACL region usage:         3
Non-stats region usage:   106051
Stats region usage:       433893
Total adjacency usage:    539947
Router1#show mls cef hardware

CEF TCAM v2:
Size: 1048576 entries
        262144 rows/device, 4 device(s)
        32 entries/mask-block
        32768 total blocks (32b wide)
        4849664 s/w table memory
Options:
        sanity check: on
        sanity interval: 301 seconds
        consistency check: on
        consistency interval: 11 seconds
        redistribution: off
            redistribution interval: 120 seconds
            redistribution threshold: 10
        compression: on
            compression interval: 31 seconds
        tcam/ssram shadowing: on
Operation Statistics:
        Entries inserted:               0000000056771663
        Entries deleted:                0000000056278100
        Entries compressed:             0000000004028899
        Blocks inserted:                0000000000742257
        Blocks deleted:                 0000000000726725
        Blocks compressed:              0000000000317339
        Blocks shuffled:                0000000000199869
        Blocks deleted for exception:   0000000000000000
        Direct h/w modifications:       0000000000000000

Background Task Statistics:
        Consistency Check count:        0000000003471970
        Consistency Errors:             0000000000000002
        SSRAM Consistency Errors:       0000000000000000
        Sanity Check count:             0000000000127602
        Sanity Check Errors:            0000000000000000
        Compression count:              0000000000176966

        Exception Handling status    : on
        L3 Hardware switching status : on
        Fatal Error Handling Status : Reset
        Fatal Errors:                   0000000000000000
        Fatal Error Recovery Count:     0000000000000000

SSRAM ECC error summary:
        Uncorrectable ecc entries    : 0
        Correctable ecc entries      : 0
        Packets dropped              : 0
        Packets software switched    : 0

FIB SSRAM Entry status
----------------------
Key: UC - Uncorrectable error, C - Correctable error
SSRAM banks : Bank0 Bank1
No ECC errors reported in FIB SSRAM.

Router1# show mls cef summary

Total routes:                     493566
    IPv4 unicast routes:          229382
    IPv4 Multicast routes:        381
    MPLS routes:                  262577
    IPv6 unicast routes:          1226
    IPv6 multicast routes:        3
    EoM routes:                   0
Router1#show mls cef exception status
Current IPv4 FIB exception state = FALSE
Current IPv6 FIB exception state = FALSE
Current MPLS FIB exception state = FALSE

====

Vinit Jain · 05-27-2015

Hello Avinash

Looking at the logs, IPv4 + MPLS TCAM utilization is over 90%. Once it reaches 100%, the router will go into exception state. Refer to the CCO document below:

http://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/117712-problemsolution-cat6500-00.html

This document also describes how to adjust the TCAM allocation, which can help, but those changes require a reload.

It seems that growth of the Internet routing table may be what is affecting the customer's network.
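As a rough check of the utilization figure, the arithmetic can be sketched in Python. The route counts are copied from the `show mls cef summary` output posted earlier in this thread. The capacity is an assumption: 512k entries (1k = 1024) for the shared IPv4 + MPLS region, which is the usual default on XL hardware; confirm the actual allocation with `show mls cef maximum-routes` on the box before relying on the result.

```python
# Route counts copied from the "show mls cef summary" output in this thread.
IPV4_UNICAST_ROUTES = 229_382
MPLS_ROUTES = 262_577

# ASSUMPTION: 512k entries (1k = 1024) are allocated to the shared
# IPv4 + MPLS region. Verify with "show mls cef maximum-routes".
ASSUMED_IPV4_MPLS_CAPACITY = 512 * 1024


def fib_utilization(used: int, capacity: int) -> float:
    """Return the utilization of a FIB TCAM region as a percentage."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


used = IPV4_UNICAST_ROUTES + MPLS_ROUTES
percent = fib_utilization(used, ASSUMED_IPV4_MPLS_CAPACITY)
print(f"IPv4 + MPLS: {used} of {ASSUMED_IPV4_MPLS_CAPACITY} entries ({percent:.1f}%)")
```

Under that assumed allocation, the 491,959 IPv4 + MPLS entries come to roughly 94% of the region, which is consistent with the "over 90%" estimate.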

Hope this helps.

Vinit


vasanth77 · 03-02-2016

Hi Vinit,

Router1#show mls cef exception status
Current IPv4 FIB exception state = FALSE
Current IPv6 FIB exception state = FALSE
Current MPLS FIB exception state = FALSE

What if all of these states are set to TRUE?
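Per the earlier reply in this thread, the router goes into exception state once the TCAM reaches 100%. A minimal Python sketch for spotting that condition from the command output follows; it assumes the output format is exactly as pasted above, and the names `parse_exception_status` and `protocols_in_exception` are illustrative only.

```python
import re
from typing import Dict, List

# Matches lines such as "Current IPv4 FIB exception state = FALSE".
STATE_LINE = re.compile(
    r"^Current\s+(\S+)\s+FIB exception state\s*=\s*(TRUE|FALSE)$",
    re.IGNORECASE,
)


def parse_exception_status(output: str) -> Dict[str, bool]:
    """Map each protocol name in the output to True if its state is TRUE."""
    states: Dict[str, bool] = {}
    for line in output.splitlines():
        match = STATE_LINE.match(line.strip())
        if match:
            protocol, value = match.groups()
            states[protocol] = value.upper() == "TRUE"
    return states


def protocols_in_exception(output: str) -> List[str]:
    """Return the protocols whose FIB exception state is TRUE."""
    return [name for name, hit in parse_exception_status(output).items() if hit]


SAMPLE = """\
Router1#show mls cef exception status
Current IPv4 FIB exception state = FALSE
Current IPv6 FIB exception state = FALSE
Current MPLS FIB exception state = FALSE
"""

print(protocols_in_exception(SAMPLE))
```

With the sample above the list is empty; any protocol reported as TRUE would appear in the list and would be worth raising with TAC.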

Monica Lluis · 03-02-2016

Hello,

This Ask the expert event is closed. Kindly post your question on the

I hope you and your loved ones are safe and healthy
Monica Lluis
Community Manager Lead

Hongju Jung · 05-27-2015

Hi Vinit Jain,

My name is HJ Jung, from Korea.

I support an ISP that has lots of WS-C6500, 7600, CRS, and other devices.

These devices reboot about once a week, so I reported it to the TAC team, but their answer is usually that it is a parity error and that monitoring is recommended.

However, neither I nor our customer can make sense of a recommendation to simply monitor a parity error issue, so I have some questions, as below:

1. Can you explain what exactly a parity error is, and why this kind of issue occurs on each platform?

2. Why must a reset or reboot occur on a parity error or in other circumstances?

3. Can we control the parity error detection schedule, or use some other control method, via configuration?

Thanks.

Vinit Jain · 05-27-2015

Hello Mr. Jung

That is a really good set of questions, and ones that come up often in Cisco TAC. My answers to each are below:

1. There are two kinds of parity errors: soft parity errors and hard parity errors. Soft parity errors are the ones that happen once in a while and can be treated as transient hardware issues. Hard parity errors are actual hardware issues (in the error logs you can see a single-bit, double-bit, or multi-bit parity error), in which case you should replace your hardware. There is good CCO documentation on parity errors, shared below:

http://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/116135-trouble-6500-parity-00.html

2. If you are getting parity errors very frequently in your environment, you should check whether the engineers in your data center are following ESD (electrostatic discharge) precautions. Parity errors can also be caused by environmental factors. The document above covers ESD practices well.

Please note that if a parity error occurs frequently on a single device, you need to replace it. It is advisable to replace the hardware if the device crashes due to a parity error more than once in 6 months. If it is just once, we can treat it as a soft parity error and monitor for another occurrence. If it is more than once, we treat it as a hard parity error, and it is advisable to replace the card that experienced the parity error.

3. There is no configuration to control or detect parity errors. The best way to control them is to improve ESD protection and make sure the environment is good enough for all those devices.
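The rule of thumb in point 2 (one parity crash: monitor; more than one within 6 months: replace) can be written down as a small Python sketch for tracking a fleet. This is only an illustration: "6 months" is approximated as 183 days, the function name `parity_recommendation` is hypothetical, and the crash dates in the example are made up.

```python
from datetime import datetime, timedelta
from typing import Iterable

# ASSUMPTION: "6 months" is approximated as 183 days for this sketch.
SIX_MONTHS = timedelta(days=183)


def parity_recommendation(
    crash_times: Iterable[datetime],
    window: timedelta = SIX_MONTHS,
) -> str:
    """Apply the rule of thumb to the parity-error crash times of one card."""
    ordered = sorted(crash_times)
    if not ordered:
        return "no action"
    # If any two crashes fall within the window, some pair of consecutive
    # crashes does too, so comparing neighbours is enough.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier <= window:
            return "replace hardware"
    # Only isolated crashes: treat as soft parity errors and keep watching.
    return "monitor"


# HYPOTHETICAL crash dates for a single card, for illustration only.
history = [datetime(2015, 1, 10), datetime(2015, 4, 2)]
print(parity_recommendation(history))
```

Tracking crashes per card rather than per chassis matters here, because the replacement advice applies to the specific card that hit the parity error.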

In addition to the above, I would like to see one of your logs in which you hit parity errors (from either the 6500 or the 7600 series).

Hope this information was helpful.

Thanks,

Vinit

PS: Please do rate the post if you find the information useful.


Hongju Jung · 05-27-2015

Hi Vinit,

Thanks for the answer, but I have some questions about the ESD process.

Do you have any guidelines on how to check ESD practices in the data center?

If you have an ESD checklist or process documentation, please let me know.

Thanks.

Vinit Jain · 05-27-2015

Hello Hongju

I was able to find a Cisco training on ESD:

http://www.cisco.com/web/learning/le31/esd/WhatIsP.html

You can click Next to learn more about ESD.

There is another CCO document that talks a bit about ESD:

http://www.cisco.com/web/tsweb/pdf/Guidelines-and-Best-Practices.pdf

Hope this helps

Vinit


Monica Lluis · 09-02-2015

Thank you for all your responses.

I hope you and your loved ones are safe and healthy
Monica Lluis
Community Manager Lead