Bad Super Frame errors on MDS directors

mattkauffmann
Level 1

Not trying to create FUD or anything here; I'm a long-time Cisco customer and just thought folks should know. We've had two switches with this issue in the last 6 weeks. One has been in service since 2008 (it started logging SF errors in March, but they didn't become an issue until we started collapsing our old 9509 core into it by moving the storage ports onto it this past Sunday); the other was brand new.

If you're seeing odd issues between hosts (especially on poorly multipathed OSes like ESX), you may want to run these commands on your switches. We are going to have to swap out our second 9513 in 6 weeks within the next couple of days because of this. TAC will ship you xbars and you can try swapping those out and hope it fixes things, but it hasn't for us yet (and one of the two xbars they sent last night actually made the problem worse; not sure if it was also bad or what).

To check your modules for bad SF errors, run:

sh logging onboard error-stats (if you only want to see recent errors, add a start time: sh logging onboard starttime mm/dd/yy-hh:mm:ss error-stats)

If things are OK, all of your modules will come back with nothing:

----------------------------
    Module:  1
----------------------------
----------------------------
    Module:  2
----------------------------

etc etc

If they aren't, you'll see something like this:

----------------------------
    Module:  3
----------------------------


------------------------------------------------------------------------------
ERROR STATISTICS INFORMATION FOR DEVICE ID 59 DEVICE Skyline-xbar
------------------------------------------------------------------------------
                                   |                |    Time Stamp   |In|Port
    Error Stat Counter Name        |    Count       |MM/DD/YY HH:MM:SS|st|Rang
                                   |                |                 |Id|e
------------------------------------------------------------------------------
FI1_CNT_BAD_SF                     |2b              |01/20/11 21:55:20|00|1-24
FI1_CNT_BAD_DI_ERR_SF              |28              |01/20/11 21:55:20|00|1-24
FI1_CNT_AR_BAD_SF                  |2b              |01/20/11 21:55:20|00|1-24


------------------------------------------------------------------------------
ERROR STATISTICS INFORMATION FOR DEVICE ID 58 DEVICE Skyline-fwd
------------------------------------------------------------------------------
                                   |                |    Time Stamp   |In|Port
    Error Stat Counter Name        |    Count       |MM/DD/YY HH:MM:SS|st|Rang
                                   |                |                 |Id|e
------------------------------------------------------------------------------
AR1_FI_ERR_CNT                     |2b              |01/20/11 21:55:20|00|1-24
AR1_PARSER_ERR_CNT                 |11              |01/20/11 21:55:20|00|1-24
AR1_SUPERFRAME_FROM_FI_WITH_DROP_ON|2b              |01/20/11 21:55:20|00|1-24
AR1_PKT_FROM_FI_WITH_DROP_ON       |2b              |01/20/11 21:55:20|00|1-24
AR1_ERROR_PACKETS_DROPPED          |11              |01/20/11 21:55:20|00|1-24
TI_AR1_BAD_RESULT_EGR              |11              |01/20/11 21:55:20|00|1-24


------------------------------------------------------------------------------
ERROR STATISTICS INFORMATION FOR DEVICE ID 59 DEVICE Skyline-xbar
------------------------------------------------------------------------------
                                   |                |    Time Stamp   |In|Port
    Error Stat Counter Name        |    Count       |MM/DD/YY HH:MM:SS|st|Rang
                                   |                |                 |Id|e
------------------------------------------------------------------------------
AM_ACC_AR_ERR                      |2b              |01/20/11 21:55:20|00|1-24

One or two might not mean anything, but if they continue to increment you should call Cisco TAC. For us this has meant a significant hardware problem with our switches, and we have had to replace the entire switch each time: modules, sups, power supplies, everything. It's a simple enough command, and Cisco TAC has told us that 90% of the folks out there don't know about it, so I'm here to tell ya: you may want to run this across your switches (we're creating a script to run it once a day and e-mail us the results; see the sketch below). It takes a few minutes and might save you a lot of hassle, especially if the switch isn't in production yet. We're going to be swapping out a switch with 240 storage and host ports on it in the next couple of days. Getting good at it and getting tired of it.
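
In case anyone wants to do the same, here's a rough sketch of what such a daily check could look like. This is an illustration only, not our actual script: it assumes the Netmiko Python library is installed and that a local SMTP relay will accept the mail, and the switch names, credentials, and e-mail addresses below are placeholders.

#!/usr/bin/env python3
"""Daily check for Bad Super Frame counters on MDS switches (sketch only)."""
import smtplib
from email.message import EmailMessage

from netmiko import ConnectHandler  # assumption: Netmiko is available

SWITCHES = ["mds9513-a.example.com", "mds9513-b.example.com"]  # placeholders
USERNAME = "admin"       # placeholder credentials
PASSWORD = "changeme"
MAIL_TO = "san-team@example.com"
MAIL_FROM = "mds-monitor@example.com"


def collect_error_stats(host: str) -> str:
    """SSH to the switch and return the raw onboard error-stats output."""
    conn = ConnectHandler(
        device_type="cisco_nxos",
        host=host,
        username=USERNAME,
        password=PASSWORD,
    )
    try:
        return conn.send_command("show logging onboard error-stats")
    finally:
        conn.disconnect()


def looks_suspect(output: str) -> bool:
    """Flag output containing any of the counters discussed in this thread."""
    markers = ("BAD_SF", "AR_BAD_SF", "SUPERFRAME", "IPA1_CNT_BAD_CRC")
    return any(m in output for m in markers)


def main() -> None:
    report = []
    for host in SWITCHES:
        output = collect_error_stats(host)
        status = "SUSPECT" if looks_suspect(output) else "clean"
        report.append(f"===== {host} ({status}) =====\n{output}\n")

    msg = EmailMessage()
    msg["Subject"] = "Daily MDS onboard error-stats report"
    msg["From"] = MAIL_FROM
    msg["To"] = MAIL_TO
    msg.set_content("\n".join(report))

    with smtplib.SMTP("localhost") as smtp:  # assumption: local SMTP relay
        smtp.send_message(msg)


if __name__ == "__main__":
    main()

Run it from cron once a day; comparing successive reports is enough to see whether the counters are incrementing.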

1 Reply

fanewton
Level 1
I had this issue. Look for the indicators below. The module was causing superframe errors on other modules within the same VSAN. The following IPA errors indicate packets are being corrupted after being received. The module these are occurring on should be replaced.
38 THB_IPA_IPA1_CNT_BAD_CRC 0000000001883361 7-12 -
39 THB_IPA_IPA1_CNT_CORRUPT 0000000001883361 7-12
I had the same errors on module 1 as yourself. See below.
fc1/1-fc1/12 |FI1_CNT_AR_BAD_SF |774646 |07/05/18 15:11:36
fc1/1-fc1/12 |FI1_CNT_BAD_SF |774646 |07/05/18 15:11:36
fc1/1-fc1/12 |FI0_CNT_AR_BAD_SF |657052 |07/05/18 15:11:36
fc1/1-fc1/12 |FI0_CNT_BAD_SF |657052 |07/05/18 15:11:36
The following EBM errors are the same corrupted packets being received at the egress linecard. These will likely occur on several other modules that are receiving these corrupted packets. These other modules are not faulty. Replacing the module generating the IPA errors should stop these.
TBIRD_FWD_EPR1_PKT_CRC_ERR
THB_EBM0_CNT_ERR_QUEUE_DROP_SF
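
For anyone trying to work out which module is the actual culprit from a saved dump, here's a rough sketch of the triage logic described above. It's an illustration only: it keys off the counter-name substrings mentioned in this thread (IPA corruption on the faulty module vs. SF/EBM symptoms on downstream modules) and assumes the saved output keeps the "Module: N" section headers.

#!/usr/bin/env python3
"""Rough triage of a saved 'show logging onboard error-stats' dump (sketch only)."""
import re
import sys
from collections import defaultdict

# Counters that, per the reply above, point at the module corrupting frames.
SOURCE_MARKERS = ("IPA1_CNT_BAD_CRC", "IPA1_CNT_CORRUPT")
# Counters that show up on modules merely receiving the corrupted frames.
SYMPTOM_MARKERS = ("BAD_SF", "SUPERFRAME", "EBM", "PKT_CRC_ERR")


def triage(path: str) -> None:
    module = None
    hits = defaultdict(lambda: {"source": [], "symptom": []})

    with open(path) as fh:
        for line in fh:
            m = re.search(r"Module:\s+(\d+)", line)
            if m:
                module = m.group(1)
                continue
            if module is None:
                continue
            if any(mark in line for mark in SOURCE_MARKERS):
                hits[module]["source"].append(line.strip())
            elif any(mark in line for mark in SYMPTOM_MARKERS):
                hits[module]["symptom"].append(line.strip())

    for module, counters in sorted(hits.items(), key=lambda kv: int(kv[0])):
        verdict = "likely faulty" if counters["source"] else "downstream symptoms only"
        print(f"Module {module}: {verdict}")
        for entry in counters["source"] + counters["symptom"]:
            print(f"  {entry}")


if __name__ == "__main__":
    triage(sys.argv[1])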
