
%CONST_DIAG-SP-STDBY-4-ERROR_COUNTER_WARNING: Module 6 Error counter exceeds threshold, system operation continue.

skoirala
Cisco Employee

We did a redundancy force-switchover and reinserted all cards twice. The chassis looks good, but we are still getting "%CONST_DIAG-SP-STDBY-4-ERROR_COUNTER_WARNING: Module 6 Error counter exceeds threshold, system operation continue.", TestErrorCounterMonitor is failing, and the error counter on module 6 keeps climbing very fast in this Catalyst 6509-NEB-A. Has anybody run into this issue before and resolved it? Any suggestions are appreciated in advance.

10 Replies

Mark Malone
VIP Alumni

 

http://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/41265-186-ErrormsgIOS-41265.html#module

 

%CONST_DIAG-SP-4-ERROR_COUNTER_WARNING: Module 7 Error counter exceeds threshold, system operation continue

Problem

The switch reports this error message:

%CONST_DIAG-SP-4-ERROR_COUNTER_WARNING: Module 7 Error counter 
exceeds threshold, system operation continue.
%CONST_DIAG-SP-4-ERROR_COUNTER_DATA: ID:42 IN:0 PO:255 RE:200 RM:255 DV:2 EG:2 CF:10 TF:117

Description

Check the diagnostic results:

TestErrorCounterMonitor ---------> .

    Error code ------------------> 0 (DIAG_SUCCESS)
    Total run count -------------> 33658
    Last test execution time ----> Apr 15 2012 11:17:46
    First test failure time -----> Apr 03 2012 20:11:36
    Last test failure time ------> Apr 08 2012 19:24:47
    Last test pass time ---------> Apr 15 2012 11:17:46
    Total failure count ---------> 5
    Consecutive failure count ---> 0
    Error Records ---------------> n/a

The TestErrorCounterMonitor monitors the errors/interrupts on each module in the system by periodically polling for the error counters maintained in the line card.

This error message appears when an ASIC on the line card receives packets with a bad CRC. The issue can be local to this module or can be triggered by some other faulty module in the chassis. It can also be caused by frames with a bad CRC received by the Pinnacle ASIC from the DBUS. In other words, the error messages imply that bad packets are being received across the bus on module 7.

One of the reasons for these error messages is the inability of the module to communicate properly with the backplane of the chassis because of a mis-seated module. The problem can lie with the line card (the mis-seated module), the supervisor, or the data bus; however, it is not possible to say which component is corrupting the data and causing the bad CRC.

Workaround

  • First, set the diagnostics level to complete with the diagnostic bootup level complete command, then perform a re-seat of module 7 and make sure the screws are tightened well.

  • Once the re-seat is done, full diagnostics will run on the module, and you can then confirm that there are no hardware issues on module 7 (a rough CLI sequence is sketched below).
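For reference, the sequence described in that workaround would look roughly like this from the CLI; the slot number and prompts are placeholders, and you should verify the exact syntax on your IOS release:

Switch# configure terminal
Switch(config)# diagnostic bootup level complete
Switch(config)# end
! physically re-seat module 7 and tighten both captive screws, or reset it in place:
Switch# hw-module module 7 reset
! once the module is back online, confirm the full diagnostics ran and passed:
Switch# show diagnostic bootup level
Switch# show diagnostic result module 7 detail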

Thank you Mark. I had seen that description before, but all services are working fine, there are no interrupts at all, and the hardware also looks good. So I am wondering where these errors come from. We have replaced everything and re-seated the module tightly, but the error continues.

Does the diagnostic show anything? Can you post it? Are you only receiving alarms for that module? Even if it comes back clean, I would still RMA it if those alarms keep coming in, as they are hardware-related alarms and not cosmetic ones.

No network issues yet, and many networks hang off this switch, so it is very hard to justify an RMA without clearer information. I am still trying to find out what these errors mean and what their impact is. Could you please provide more detail if you have any? Here is the diagnostic result:

MPLSMNDT33W-CORE-MOE#sh diagnostic result module 6 test 34 detail 

Current bootup diagnostic level: minimal


  Test results: (. = Pass, F = Fail, U = Untested)

  ___________________________________________________________________________

   34) TestErrorCounterMonitor ---------> F

          Error code ------------------> 1 (DIAG_FAILURE)
          Total run count -------------> 16534
          Last test testing type ------> Health Monitoring 
          Last test execution time ----> Oct 12 2015 15:47:59
          First test failure time -----> Oct 06 2015 10:35:12
          Last test failure time ------> Oct 12 2015 15:47:59
          Last test pass time ---------> Oct 10 2015 15:18:30
          Total failure count ---------> 9453
          Consecutive failure count ---> 5370
          Error Records ---------------> n/a
  ___________________________________________________________________________

Your diagnostics are failing; it's right there in the output you just provided. That blade could keep working until tomorrow or next month, or it could fail at any time.

It is always better to fix it in a planned downtime window than to have a hard-down in the middle of the night when it fails by itself. You should not have failure counts on the diagnostics check of any module.

If you run the show tech through the Output Interpreter on the Cisco website, it should pick up the problem and provide further hardware details:

http://www.cisco.com/c/en/us/products/collateral/switches/catalyst-6500-series-switches/prod_white_paper0900aecd801e659f.html
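If it helps, the outputs usually worth collecting before feeding the Output Interpreter or opening a TAC case are roughly these (module 6 here is just your slot number as an example):

show tech-support
show module
show diagnostic result module 6 detail
show logging | include CONST_DIAG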

We have had 3 different Sup720s in slot 6 of this switch, and all of them showed the same error. So would you suggest replacing the chassis, or the blade itself again?

Yes, that does not sound good, the same issue occurring on 3 different sups. I presume you have tested these sups in a different slot to confirm they work fine normally?

The pins may be damaged at the back of the chassis in that particular slot; it's hard to say without looking at it. I wouldn't go replacing any chassis yet until there is full confirmation that that's the problem.

I have checked multiple 6500s here in my network and ran diags, all clean with no errors on any, so something is definitely not right on yours. If it was on my network, I would first run the show tech through the Output Interpreter on the Cisco website https://www.cisco.com/cgi-bin/Support/OutputInterpreter/home.pl , which will give you an overall hardware analysis and may pick something up. If it does not, I would look at going through TAC, as something will probably need an RMA.

If you have no support with TAC, move away from that slot if possible. The fact that you have reseated it and had numerous blades in the slot with the same issue would, to me, indicate some form of hardware failure there; even the error output on the website states that if a reseat is done and errors still appear, there is an issue.
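A rough way to confirm whether it is the slot or the supervisor, assuming you have a maintenance window, is to move the sup to the other supervisor slot and compare the diagnostics from there, along these lines:

! after moving the supervisor from slot 6 to the other sup slot (slot 5 in a 6509)
show module
show diagnostic result module 5 detail
show logging | include ERROR_COUNTER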

We replaced the SFP and the errors were gone, but they continue when the SFP is inserted back. Chassis replacement is not easy, so I am wondering about this again.

Hi,
To complete my last comment: we re-seated and replaced the module twice, and the error count is still growing. At one point the errors went away and cleared completely, but they started again yesterday. Any suggestions or recommendations?

Appreciate it.

 

Some bits from my recent experience; maybe someone finds it helpful...

A 7606 chassis recently started to log messages like this, pointing to slots 2 and 5:

Jun 14 05:09:15.818: %CONST_DIAG-SP-4-ERROR_COUNTER_WARNING: Module 2 Error counter exceeds threshold, system operation continue.
Jun 14 05:09:15.818: %CONST_DIAG-SP-4-ERROR_COUNTER_DATA: ID:48 IN:0 PO:255 RE:679 RM:255 DV:131 EG:2 CF:10 TF:10

Jun 14 05:09:15.818: %CONST_DIAG-SP-4-ERROR_COUNTER_WARNING: Module 2 Error counter exceeds threshold, system operation continue.
Jun 14 05:09:15.818: %CONST_DIAG-SP-4-ERROR_COUNTER_DATA: ID:49 IN:0 PO:11 RE:95 RM:0 DV:133 EG:2 CF:10 TF:10
Jun 14 06:33:49.060: %CONST_DIAG-SP-4-ERROR_COUNTER_WARNING: Module 5 Error counter exceeds threshold, system operation continue.
Jun 14 06:33:49.060: %CONST_DIAG-SP-4-ERROR_COUNTER_DATA: ID:42 IN:0 PO:255 RE:200 RM:255 DV:6 EG:2 CF:10 TF:78
Jun 14 07:28:34.429: %CONST_DIAG-SP-4-ERROR_COUNTER_WARNING: Module 5 Error counter exceeds threshold, system operation continue.
Jun 14 07:28:34.429: %CONST_DIAG-SP-4-ERROR_COUNTER_DATA: ID:42 IN:0 PO:255 RE:200 RM:255 DV:3 EG:2 CF:10 TF:131

We also got complaints about service degradation (packet loss) on a VLAN that goes out of interface Gi2/12 (WS-X6724-SFP), which has an optic link to another switch. There were no errors on the ports at either end. We arranged measurements of the optic links and also changed the SFPs at both ends, with no luck.

The problem went away after we moved the link from the suspected Gi2/12 to another free port and put Gi2/12 into the shutdown state. The services now work flawlessly and the error logs are gone.
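For anyone wanting to repeat that workaround, the change was basically moving the link to a spare port and shutting the suspect one. Gi2/13 below is only an example of a free port; you would re-apply whatever switchport/VLAN configuration was on Gi2/12 to the new port first:

Switch(config)# interface GigabitEthernet2/13
Switch(config-if)# ! re-apply the switchport/VLAN config that was on Gi2/12, then move the link here
Switch(config-if)# no shutdown
Switch(config-if)# interface GigabitEthernet2/12
Switch(config-if)# shutdown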

The WS-X6724-SFP has two port ASICs (Rohini), one for each group of 12 ports. I had no luck finding out how ports are numbered internally on the Rohini ASIC, but it looks to me like the numbering starts from 0; in that case, the error log shows the probably affected port:

CONST_DIAG-SP-4-ERROR_COUNTER_DATA: ID:49 IN:0 PO:11 RE:95 RM:0 DV:133 EG:2 CF:10 TF:10

br

Agris