CSCwk91747 - VFC Flap during FI Reboot

This is an appalling bug.  This bug ID does NOT exist in the UCS 4.3 release notes!  To add insult to injury, UCS 4.3 (4a) is a 'gold star' recommended version at the time of writing!  WHY!?  We have had detrimental fallout as a result of this bug and still have an identical UCS domain to bring up to the same version.  This is gross negligence on Cisco's part in failing to keep customers informed of critical bugs.

To rub salt in the wound, the bug page currently reflects a "fixed" status with no documentation of a fixed version.  No workaround is provided either.  This needs to be a Sev-1 bug.  Created July 30th with no additional information, and Cisco is pushing a golden-image firmware version with this bug?  Shame, Cisco, shame!

This bug needs priority attention, now.

29 Replies

The latest developments we've noticed:

CSCwk91747 has received another update.  The Bug ID page now reflects 4.3 (5a) UCSM as the "Known Fixed Release."  If anyone has the luxury of owning a full non-production or lab stack of UCS equipment with FC connectivity and decides to serve as a guinea pig with 4.3 (5a), many inquiring minds would love to know the end result.

Another major development is that the Software Downloads page for UCS Infrastructure Software Bundles FINALLY shows that 4.3 (4a) has had its gold star revoked!  Incredible that such a nasty bug, known as far back as July 30th, survived this long, and that it took nearly a week of nagging to get the gold star ripped off.  The latest 'gold star' firmware release for 4.3 is now listed as 4.3 (3c).

Internally, we are working out the lengthy, painstaking planning process.  We will be gracefully shutting down all impacted workloads that are bound to our bugged UCS domain.  For us, this represents just under 200 virtual machines of varying complexity.  All other workloads (close to 900 VMs) will be vMotion'd to another UCS domain in our stretched/metro cluster environment.

Nothing like cramming this level of sophisticated outage right before staff start chewing through PTO near Turkey Day!
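
For anyone scripting a similar evacuation, here is a rough pyVmomi sketch of the graceful-shutdown half of the plan.  The vCenter address, credentials, and cluster name are placeholders for whatever identifies the hosts on the bugged domain in your environment, and it assumes VMware Tools is running in the guests:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Lab-only certificate handling; use proper verification in production.
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.local",
                      user="administrator@vsphere.local",
                      pwd="********", sslContext=ctx)
    content = si.RetrieveContent()

    # Find the cluster whose hosts sit behind the bugged UCS domain
    # (the cluster name here is a placeholder).
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "UCS-Domain-A-Cluster")

    # Request a guest-OS shutdown for every powered-on VM on those hosts.
    for host in cluster.host:
        for vm in host.vm:
            if vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn:
                print("Shutting down guest:", vm.name)
                vm.ShutdownGuest()

    Disconnect(si)

The vMotion half could be scripted the same way with RelocateVM_Task, but host maintenance mode plus DRS accomplishes the same thing with far less scripting.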

We were informed by Cisco this morning that the 4.3(4e) firmware has been released, and we have installed it in our test lab.

https://software.cisco.com/download/home/283612660/type/283655658/release/4.3(4e)

The update itself appears to have been a success, at least in this limited test environment. We did not have a load like we would have in our production environments, but the domain and host we tested the upgrade on have FC connectivity to several datastores. We did not see any adverse effects during this upgrade, just the expected loss of path redundancy while evac/activate was taking place on each side of the fabric.

Load or no load, I believe that if the issue were going to recur during the upgrade we would have seen APD alerts on this host, but we did not.
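
For reference, a quick-and-dirty way to double-check a host after the fact is to scan an exported vmkernel.log for fnic and APD messages.  The log path and the search patterns below are assumptions; the exact wording varies by ESXi build:

    import re
    import sys

    log_path = sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log"

    # Message fragments that typically appear around vHBA flaps and
    # all-paths-down events; adjust to match your ESXi build.
    patterns = {
        "fnic link down": re.compile(r"fnic.*link\s+down", re.IGNORECASE),
        "fnic reset":     re.compile(r"fnic.*reset", re.IGNORECASE),
        "APD":            re.compile(r"all paths down|APD", re.IGNORECASE),
    }

    hits = {name: [] for name in patterns}
    with open(log_path, errors="replace") as fh:
        for line in fh:
            for name, pattern in patterns.items():
                if pattern.search(line):
                    hits[name].append(line.rstrip())

    for name, lines in hits.items():
        print(f"{name}: {len(lines)} matching line(s)")
        for line in lines[:5]:   # print a few examples per category
            print("  " + line)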

After the upgrade to 4.3(4e) we tested the following in our test UCSM Domain:

  • Rediscovered a couple other hosts connected to the domain (so far no fnic resets / VFC flaps)
  • Removed and re-added host profiles (so far no fnic resets / VFC flaps)
  • Added a new VLAN to vNIC templates (no fnic resets / VFC flaps)
  • Reset CIMC on host (no fnic resets / VFC flaps)

So far, so good! Knock on wood!

Continuing to monitor for anything out of the ordinary.
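
Along those lines, one thing worth keeping an eye on is the FLOGI count on each fabric.  Below is a rough paramiko sketch that polls 'show flogi database' on both FIs and flags a drop; the management IPs, credentials, timers, and the very naive output parsing are all assumptions:

    import time
    import paramiko

    FIS = {"FI-A": "10.0.0.11", "FI-B": "10.0.0.12"}   # placeholder mgmt IPs
    USER, PASSWORD = "admin", "********"

    def flogi_count(ip):
        """SSH to the FI, drop into NX-OS, and count FLOGI entries."""
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(ip, username=USER, password=PASSWORD, look_for_keys=False)
        shell = client.invoke_shell()
        shell.send(b"connect nxos\n")
        time.sleep(2)
        shell.send(b"terminal length 0\n")
        time.sleep(1)
        shell.send(b"show flogi database\n")
        time.sleep(3)
        output = shell.recv(1048576).decode(errors="replace")
        client.close()
        # Very naive parse: count lines that start with an fc/vfc interface name.
        return sum(1 for line in output.splitlines()
                   if line.strip().lower().startswith(("vfc", "fc")))

    baseline = {name: flogi_count(ip) for name, ip in FIS.items()}
    print("Baseline FLOGI counts:", baseline)

    while True:
        time.sleep(60)
        for name, ip in FIS.items():
            current = flogi_count(ip)
            if current < baseline[name]:
                print(f"WARNING: {name} FLOGI count dropped "
                      f"{baseline[name]} -> {current}")
            baseline[name] = current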

 

jonmiller5555
Level 1

Just got this reply from TAC. Not sure I completely believe it, but figured I would post here. We're still deciding how to address the issue. TAC Reply:

 

After speaking with the engineering team, it seems the fixed version for this is 4.3(5a) UCSM, which is now available.

There shouldn't be any impact on the VFCs as the upgrade executes the code changes prior to rebooting. This ensures the fixed firmware is staged and will be the code implemented upon reboot.

Thanks for the update @jonmiller5555.  We have heard from another Cisco UCS customer of a 4.3 (4e) update pending release.  We have also been told by our TAC engineers that an upgrade to 4.3 (5a) or the soon-to-be-released 4.3 (4e) would be non-disruptive and would not result in FC/vHBA flapping.  We have pushed back to our assigned TAC engineer and TAC manager for more details and extra confirmation (you know, because Cisco destroyed all trust).  Something that continues to be a mystery is exactly what subsystem/part of the code is to blame for this bug.

And that leads to further questions of how it can be remediated in a firmware upgrade without instigating the bug.  If the bug manifests during FI reboots, and the FI needs to reboot to activate the new (fixed) firmware, how do they slip in the fix prior to said reboot?  At the moment, we have better odds of determining the number of licks it takes to get to the center of a Tootsie Pop... (the world may never know).

kdmh
Level 1

Experienced this bug myself on upgrade. In our experience, ANY shallow discovery will trigger the FC disconnect - I've been able to replicate the issue not only with an FI reboot, but IOM and CIMC resets as well.

Thanks for the insights @kdmh.  That seems to be the consensus at the moment...any activity or administrative action that results in a shallow discovery of a service profile will result in this bug showing up.

The number of cases is now officially 19, but I'm confident the number is higher.  Based on discussions with other Cisco UCS shops, it sounds like several customers may have been given the cold shoulder and were told by TAC to look elsewhere.  I personally suspect that many SRs were closed with no reference to CSCwk91747, and certain customers are now starting to come back out of the woodwork with their experiences of odd FC reset behavior (when doing things like updating vNICs/VLANs/CIMC resets/IOM resets, etc.).

I urge everyone reading this:  if you had an SR initially and you're circling back to this issue, please re-open your existing SR if you can, or open a new one.  If you haven't opened an SR, found yourself here with the same problem, and have a support contract, PLEASE open one.  Even if you've already come to a conclusion on the remediation path you'll take with the information available, open an SR (or even a proactive SR).  I'll even go as far as to suggest that if you've already resolved this by moving to an older version or took a chance on the newest 4.3 (5a), STILL open an SR with TAC for the paper trail and to document your experience.  Make sure TAC is associating your SR with CSCwk91747!

Numbers speak volumes.  TAC needs to be aware of every customer that is impacted.  This type of bug is inexcusable.  Cisco needs to stop focusing on layoffs.  They need to stop focusing on cramming so many new features into UCS that the majority don't want or need.  Cisco TAC needs to stop putting immediate onus on other third parties (VMware, Microsoft, etc.) when an SR is opened.  If a UCS firmware version has bugs, the NEXT release out of Software Downloads better be one that ONLY remediates said bugs....I don't need more bells and whistles shoved in the same release, nor should I need to move to an upper branch of firmware to get the bug fixed.

Cisco owes all of us an apology.

 

wko@cisco.com
Cisco Employee

CSCwm30262 and CSCwk91747 have the same root cause in UCSM code base.  The fix has been verified and released in UCSM 4.3(4e) and 4.3(5a).

 

The defect was introduced into the internal UCSM 4.3(4) code base, escaped our testing, and was released in UCSM 4.3(4a).  The effect of this defect is a 3-5 second FC traffic interruption from an interface bounce on both Fabric Interconnect (FI) switches when UCSM programs a vFC interface of a 6400- or 6500-series FI.  FI reboots, VLAN ID changes, certain UCSM process restarts, and CIMC resets can run into this defect.  For OSes and applications that are I/O sensitive, this 3-5 second interruption is severe to catastrophic.

 

The fix is in UCSM code.  Since a UCSM domain update updates the UCSM code first, followed by NX-OS, updating to a build with the fix means the fix will take effect before an FI reboot.  Hence, a domain upgrade to the fixed build will not encounter the issue.

Thank you, mysterious recent wko@cisco.com employee.  This is the first time someone directly affiliated with Cisco has acknowledged a specific component of the UCS stack associated with the bug, let alone provided reasoning for why an upgrade (or downgrade) to firmware outside of the bugged versions would not result in FC connectivity issues.  It only took half a month to get this sort of confirmation.
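
If anyone wants to sanity-check that sequencing on their own domain, the running firmware versions are queryable from the UCSM XML API.  Here is a minimal sketch with the ucsmsdk Python SDK; the VIP and credentials are placeholders, and the class/property names are from my reading of the SDK, so verify against your version:

    from ucsmsdk.ucshandle import UcsHandle

    # UCSM VIP and credentials are placeholders.
    handle = UcsHandle("ucsm-vip.example.local", "admin", "********")
    handle.login()
    try:
        # FirmwareRunning objects exist for the manager and each FI image;
        # after the UCSM phase of the upgrade, the manager entry should
        # already show the fixed build even before the FIs reboot.
        for fw in handle.query_classid("FirmwareRunning"):
            if "sys/mgmt" in fw.dn or "switch-" in fw.dn:
                print(f"{fw.dn:45} type={fw.type:12} version={fw.version}")
    finally:
        handle.logout()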

jonmiller5555
Level 1

Thanks for the additional information. We are going to attempt to upgrade to 4.3(5a) on November 10th. I'll open a TAC case ahead of time and also report back to let everyone know how it goes.

martinc_intact
Level 1

We've hit the same situation on our end.  Our target release was 4.3(4c), since 4.3(4a) was a star release for some time and (4c) fixed a high CVE.  It made sense...  So we started to deploy in early October, until we faced those Fibre Channel issues and discovered this bug last week.  We paused everything.

Now, we just upgraded some domains to 4.3(4e) tonight, and only UCSM got updated because the other images included in the package (4.3(4b)) stay the same for the FIs and IOMs.  So for us, it was a 15-minute upgrade to hopefully give us peace of mind.  Still more domains to fix though... tomorrow.

Thanks for the details on your end, @martinc_intact.  Interesting note on the FIs and IOMs remaining on 4.3 (4b).  Please do keep us apprised of your other domains and how the upgrades fare for you.

Thanks for the details on the field notice, @Eastender_admin.

We are still astonished that it has taken Cisco months to be forthcoming with this information and take appropriate action in deprecating the origin version of this issue:  4.3 (4a).  Instead, they sat quiet, released three more versions with the same bug, made the origin version a gold-star version, and gave the rest of us the run-around when it came to figuring out what the heck happened to our historical NDUs.

Our organization is in final preparations to move to 4.3 (4e) for the domain that is bugged.  I'm not thrilled about sticking with 4.3 (4), but the remaining known caveats are minimal or otherwise irrelevant to our particular environments... at least the caveats that Cisco has been transparent about sharing in public documentation.

Have others made any progress or had success with either reverting to an older version or upgrading to the fixed versions?

jonmiller5555
Level 1

I was delayed a week internally due to some other work, but we did successfully upgrade last night from 4.3(4c) to 4.3(5a) with no issues during the upgrade. It's only been 12 hours, but no issues observed so far and I'm hoping we're past this one.

@jonmiller5555, thanks for the update.

Our organization recently completed an upgrade as well on two of our UCS domains.  For one of our domains already running on the bugged gold-star version 4.3 (4a), we upgraded to 4.3 (4e).  Though we know of other UCS customers that performed the same upgrade path seamlessly, we had a peculiar situation in which the first FI (the subordinate) being updated apparently stalled out in the activation process.  Thankfully, there was no direct impact to workloads.

Strangely enough, the FSM step for activation of the firmware on the subordinate FI timed out at 20 retries.  The firmware auto-install process initially displayed as 'failed' but then automatically restarted and jumped immediately back to the activation step.  However, the FI never completed activation or rebooted on its own.  We contacted Cisco TAC and, much to our dismay, the engineer assisting with the troubleshooting was painfully unaware of what they were doing.  It was very clear to us that the engineer was merely clicking around the GUI until a more advanced engineer could direct them to check specific things.  Sadly, this inferior level of support is par for the course these days.  But I digress on the deteriorating cost-versus-value of Cisco technical support...

Long story short, we forced a manual reboot of the subordinate FI, and it came back up and completed activation.  Why this upgrade timeout did not have a hard escape is beyond us.  We chewed up half of our maintenance window waiting to reach the halfway point of the upgrade.  It wasn't until an unusually long amount of time had passed that we realized the auto-install was looping on retry failures and never threw a hard error/escape to call out the problem in the UI.  We're pretty sure it would have looped indefinitely had it been given the opportunity to do so.  Still not getting a warm, fuzzy feeling about the quality of code upgrades.
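
One way to make a silent retry loop like that stand out sooner is to keep a simple fault watch running alongside the auto-install.  A minimal ucsmsdk sketch; the VIP, credentials, polling interval, and severity filter are all arbitrary choices on my part:

    import time
    from ucsmsdk.ucshandle import UcsHandle

    # UCSM VIP and credentials are placeholders.
    handle = UcsHandle("ucsm-vip.example.local", "admin", "********")
    handle.login()
    try:
        seen = set()
        while True:
            # Print any new critical/major faults raised during the upgrade window.
            for fault in handle.query_classid("FaultInst"):
                if fault.severity in ("critical", "major") and fault.dn not in seen:
                    seen.add(fault.dn)
                    print(f"{fault.severity.upper():8} {fault.dn}: {fault.descr}")
            time.sleep(60)
    finally:
        handle.logout()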

Another tandem UCS domain supporting our stretch cluster configuration was upgraded as well from Version 4.2 (3k) to 4.3 (4e).

Both UCS domains are stable and the UCSM bug is hopefully behind us.  I will be marking wko@cisco.com's earlier post as the 'Solution' to this thread discussion.  Still waiting on that apology though.  Still watching the bug IDs increment on the number of associated cases.  Still thoroughly let down by Cisco's sheer unprofessionalism on the matter.