08-16-2013 03:56 AM - edited 03-01-2019 11:12 AM
I have a customer with a simple configuration: one chassis, running UCS 2.1.2a. Suddenly the fans ran at full speed, and the customer then noticed that FI - IOM communication had been lost on both fabrics. See the attached file for the error message.
Power cycling the chassis resolved the issue.
Has this been seen in the field ?
08-16-2013 04:36 AM
Hello Walter,
I assume the chassis did not lose power.
Before power cycling the chassis,
-- Did they check the LED status of the IOMs and blades?
-- Did they try reseating the IOMs?
Please ask them to open a TAC service request with UCSM and Chassis techsupport log bundle.
Padma
08-16-2013 06:02 AM
Hi Padma
The customer just power cycled the chassis; no other information is available, which is why I posted the show tech files in the original message above. I would not be surprised if this was the result of a total power failure, with the FI taking much longer to boot than the IOM!
Walter.
08-16-2013 06:27 AM
The FI definitely did not go down:
Hardware
cisco UCS 6248 Series Fabric Interconnect ("O2 32X10GE/Modular Universal Platform Supervisor")
Intel(R) Xeon(R) CPU with 16622556 kB of memory.
Processor Board ID FOC17101ST9
Device name: FI-BAL16-1-B
bootflash: 29535848 kB
Kernel uptime is 15 day(s), 23 hour(s), 40 minute(s), 51 second(s)
Last reset
Reason: Unknown
System version: 5.0(3)N2(2.11.2a)
Service:
08-19-2013 11:20 PM
We had the exact same issue with one of our chassis. After several TAC cases, it turned out there was a recall on the PSUs in that chassis. These PSUs corrupted the I2C bus, which caused these symptoms.
08-19-2013 11:45 PM
Thanks! Would you mind sharing the TAC case number with us?
08-19-2013 11:57 PM
Even better (I guess):
Gold AC PSUs (N20-PAC5-2500W) with a revision below 341-0293-10 are missing the fixes implemented via ECO E106290. This fix was applied to SN QCI1534A2YR and later. One useful thing to know about the PSUs is the manufacturing date. To work out when a PSU was manufactured, take the first two digits of the serial number and add them to 1996 to get the year. The next two digits are the manufacturing week. So SN QCI1534A2YR was manufactured in week 34 of 2011 (Aug 22-28).
Platinum PSUs have the fix; however, when they were first released they had issues that look similar to I2C problems. Check hot issues on dcn-wiki for more info (CSCtz59519 / CSCtx90410).
So the manufacturing date is key to seeing whether you might be affected.
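For anyone who wants to check a batch of serial numbers, here is a minimal Python sketch of the date rule described in this post. It assumes the serial has a three-character prefix followed by YYWW (as in QCI1534A2YR), and it assumes ISO week numbering, which happens to match the Aug 22-28 range given for the example; neither assumption is confirmed by Cisco documentation. The function names are my own, and the result is only a hint: units built in the same week as the cutoff serial cannot be ordered by date alone.

```python
from datetime import date, timedelta
from typing import Tuple

BASE_YEAR = 1996  # per the post: first two digits + 1996 = manufacturing year


def psu_manufacture_week(serial: str) -> Tuple[int, int]:
    """Return (year, week) decoded from a serial such as QCI1534A2YR.

    Assumes a 3-character prefix followed by YYWW (hypothetical layout
    inferred from the single example in the thread).
    """
    digits = serial[3:7]
    if len(digits) != 4 or not digits.isdigit():
        raise ValueError("unexpected serial format: %r" % serial)
    return BASE_YEAR + int(digits[:2]), int(digits[2:])


def week_range(year: int, week: int) -> Tuple[date, date]:
    """Return (Monday, Sunday) of the given week, assuming ISO week numbering."""
    monday = date.fromisocalendar(year, week, 1)
    return monday, monday + timedelta(days=6)


def built_before(serial: str, cutoff: str = "QCI1534A2YR") -> bool:
    """True if the serial decodes to a strictly earlier week than the cutoff."""
    return psu_manufacture_week(serial) < psu_manufacture_week(cutoff)


if __name__ == "__main__":
    year, week = psu_manufacture_week("QCI1534A2YR")
    start, end = week_range(year, week)
    print(year, week, start, end)
```

A PSU for which `built_before(...)` returns True was built in an earlier week than the first serial known to carry the fix, so it is worth checking its revision against 341-0293-10.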
08-20-2013 12:45 AM
Thanks! I think this might be Field Notice http://www.cisco.com/en/US/ts/fn/636/fn63628.html
Revised August 7, 2013
July 16, 2013
08-28-2013 06:57 AM
Unfortunately, the above FN was not applicable to our case. The customer therefore opened a TAC case, SR: 627085171
08-28-2013 03:35 PM
wdey
{Disclaimer: I have not checked the logs yet}
Since there is a case opened already, be sure to check whether there is a memory leak as described in CSCuf61116. That issue should be fixed in the version the customer is running, but it is always worth making sure; I never rule a cause out until I can confirm it.
-Kenny
08-29-2013 12:07 AM
Thanks Kenny! The customer actually has a second UCS domain with exactly the same configuration (hardware and software), which didn't show this problem. One thing I noticed, however, is that the two datacenters run at different temperatures. Could it be a temperature issue? The FI out-temp shows 55 degrees C.
Walter.
08-29-2013 10:00 AM
Walter,
If that is the case, do you know whether your customer has Call Home (SCH) set up for such events? That might help track temperature as a possible factor. Has this happened more than once?
The TAC engineer apparently suspects a power failure; SCH can help with that as well. Were any other devices in the same rack/site/power circuit affected at the same time, or was the issue isolated to this chassis?
See below how to set up the policy for SCH to track this, just in case you need it. I am sure you know how to, but it may help others.
Good luck.
*RCA= ROOT CAUSE ANALYSIS
-Kenny
Message was edited by: Keny Perez
09-10-2013 12:10 PM
For your information:
The issue seems to affect particular IOMs with the version numbers below, and it is tracked under the following bug
PART NUM : 73-13196-04
PN REVISION : C0
FAB REVISION : 4
An RMA has been initiated.
Thanks to all who contributed!
Walter.
09-10-2013 01:00 PM
Thanks for updating the thread, Walter.
-Kenny
11-04-2013 11:57 AM
The customer replaced all the IOMs according to the above.
After 2 1/2 months, the same thing happened again: chassis isolated, fans running at full speed. A new TAC case has been opened.
I cannot believe that we are the only folks having this issue.