10-25-2022 12:56 AM
Hello
Our platform consists of two 6454 FIs, two MDS switches and a NetApp storage array. On our UCS we have only ESXi hosts, all booting from SAN.
We recently discovered massive datastore corruption during some FI cluster failover tests we ran to check that everything was OK before going to production.
It seems that the ESXi hosts sometimes lose their SAN connection. This became clear when we evacuated the traffic from FI-A and ran only on FI-B: the errors appeared. When the traffic goes through FI-A, there are no errors.
To connect the FIs to the MDS switches, we use a 4-link FC port channel, as recommended in the FlexPod documentation.
To troubleshoot the issue, we disabled each FC port in turn and found that FC 1/4 on FI-B was the source of the errors. We replaced it and have seen no errors since.
The counters for this FC port show errors:
fc1/4
    5 minutes input rate 81483648 bits/sec, 10185456 bytes/sec, 5782 frames/sec
    5 minutes output rate 79245504 bits/sec, 9905688 bytes/sec, 5315 frames/sec
    3449553 frames input, 1776008920 bytes
      0 class-2 frames, 0 bytes
      0 class-3 frames, 1776008920 bytes
      0 class-f frames, 0 bytes
      86 discards, 91 errors, 0 CRC/FCS
      0 unknown class, 0 too long, 5 too short
    3180897 frames output, 1632955644 bytes
      0 class-2 frames, 0 bytes
      3180897 class-3 frames, 1632955644 bytes
      0 class-f frames, 0 bytes
      0 discards, 0 errors
    0 timeout discards, 0 credit loss
    0 input OLS, 0 LRR, 0 NOS, 0 loop inits
    0 output OLS, 0 LRR, 0 NOS, 0 loop inits
    0 link failures, 0 sync losses, 0 signal losses
     Receive B2B Credit performance buffers is 0
      22 transmit B2B credit remaining
      0 low priority transmit B2B credit remaining
    Last clearing of "show interface" counters 00:15:22
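To pick the non-zero error counters out of output like the above without reading it line by line, a small parser can help. This is a minimal Python sketch, assuming you have saved the `show interface` text for the port; the list of counters treated as "errors" (`ERROR_FIELDS`) is my own choice, not a Cisco-defined list, and `parse_counters`/`nonzero_errors` are hypothetical helper names.

```python
import re

# Excerpt of the fc1/4 input-side counters posted above.
COUNTERS = """\
fc1/4
    3449553 frames input, 1776008920 bytes
      86 discards, 91 errors, 0 CRC/FCS
      0 unknown class, 0 too long, 5 too short
    0 link failures, 0 sync losses, 0 signal losses
"""

# Counters treated as error indicators (an assumption; adjust to taste).
ERROR_FIELDS = ("discards", "errors", "CRC/FCS", "too long", "too short",
                "link failures", "sync losses", "signal losses")


def parse_counters(text: str) -> dict[str, int]:
    """Return {counter name: value} for every '<number> <name>' pair.

    Only the first occurrence of a name is kept, so in full output the
    input-side counters win over the output-side ones printed later."""
    counts: dict[str, int] = {}
    for line in text.splitlines():
        for part in line.split(","):
            m = re.match(r"\s*(\d+)\s+(.+?)\s*$", part)
            if m:
                counts.setdefault(m.group(2), int(m.group(1)))
    return counts


def nonzero_errors(counts: dict[str, int]) -> dict[str, int]:
    """Keep only the error counters that are above zero."""
    return {k: counts[k] for k in ERROR_FIELDS if counts.get(k, 0) > 0}


flagged = nonzero_errors(parse_counters(COUNTERS))
print(flagged)  # {'discards': 86, 'errors': 91, 'too short': 5}
```

Cumulative counters only say that errors happened since the last clear; the rate at which they grow is what tells you whether a port is still misbehaving.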
My question is: why wasn't this FC port disabled by UCSM as soon as errors were detected?
Is our configuration, an FC port channel with 4 links, OK, or should we have used individual FC uplinks? One of my team members told me that the port channel protocol doesn't deal very well with intermittent errors.
How should we react if we spot FC errors?
thanks for your help
10-25-2022 06:03 AM
UCSM could bring the links down when there are errors on a port, but what if those errors propagate to the other ports?
Should UCSM shut down those newly errored ports as well? (Hopefully you can see the slippery slope here.)
This slippery slope is why UCSM decides to "do nothing" when there are errors on a switch port.
Do you have sufficient (SNMP) monitoring to detect errors on switch ports throughout your environment?
BTW, these counters look oddly familiar to ones reviewed in a TAC case review this week.
I would suggest working with the TAC engineer on your TAC case for these types of questions.
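Because UCSM won't act on port errors itself, detection falls to your monitoring. Whatever collects the counters (SNMP or otherwise; the polling mechanism is not shown here), the useful check is how fast they grow between two polls, not their absolute value. This is a minimal Python sketch under that assumption; `Snapshot`, `error_deltas` and `rates_over_threshold` are hypothetical names, and the zero-tolerance default threshold is an assumption you would tune. As sample data it uses the fc1/4 counters from the original post, taken 00:15:22 (922 s) after the counters were cleared.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Cumulative error counters for one port at one poll (hypothetical)."""
    timestamp_s: float          # poll time in seconds
    counters: dict[str, int]    # e.g. {"errors": 91, "discards": 86}


def error_deltas(prev: Snapshot, curr: Snapshot) -> dict[str, int]:
    """How much each counter grew between two polls.

    If a counter went backwards, assume it was cleared in between and
    use the current value as the increase."""
    deltas: dict[str, int] = {}
    for name, value in curr.counters.items():
        before = prev.counters.get(name, 0)
        deltas[name] = value - before if value >= before else value
    return deltas


def rates_over_threshold(prev: Snapshot, curr: Snapshot,
                         max_per_minute: float = 0.0) -> dict[str, float]:
    """Return {counter: increase per minute} for counters above the threshold."""
    elapsed_min = (curr.timestamp_s - prev.timestamp_s) / 60
    if elapsed_min <= 0:
        raise ValueError("snapshots must be in chronological order")
    over: dict[str, float] = {}
    for name, delta in error_deltas(prev, curr).items():
        rate = delta / elapsed_min
        if rate > max_per_minute:
            over[name] = round(rate, 2)
    return over


# First poll right after the counters were cleared, second poll 922 s later.
before = Snapshot(0, {"discards": 0, "errors": 0, "too short": 0})
after = Snapshot(922, {"discards": 86, "errors": 91, "too short": 5})
print(rates_over_threshold(before, after))
# {'discards': 5.6, 'errors': 5.92, 'too short': 0.33}
```

Roughly six errors per minute on a single uplink is the kind of sustained growth an alert should catch long before it shows up as datastore corruption.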
10-25-2022 06:22 AM
Hello
Thanks for your response. I also have a case open; I'll check with TAC.