
CSCwk91747 - VFC Flap during FI Reboot

This is an appalling bug.  This bug ID does NOT exist in UCS 4.3 release notes!  To add insult to injury, UCS 4.3 (4a) is a 'gold star' recommended version at the time of writing!   WHY!?  We have had detrimental fallout as a result of this bug and still have an identical UCS domain to bring up to the same version.  This is gross negligence on the part of Cisco keeping customers informed of critical bugs.

For salt in the wound, the bug page currently reflects a "fixed" status with no documentation of a fixed version.  No workaround provided either.  This needs to be a Sev-1 bug.  Created July 30th with no additional information, and pushing a golden image firmware version with this bug?  Shame Cisco, shame!

This bug needs priority attention, now.

Accepted Solutions

wko@cisco.com
Cisco Employee

CSCwm30262 and CSCwk91747 have the same root cause in UCSM code base.  The fix has been verified and released in UCSM 4.3(4e) and 4.3(5a).

 

The defect was introduced into the internal UCSM 4.3(4) code base, escaped our testing, and was released in UCSM 4.3(4a).  The effect of this defect is a 3-5 second FC traffic interruption from an interface bounce on both Fabric Interconnect (FI) switches when UCSM programs a vFC interface of a 6400- or 6500-series FI.  FI reboots, VLAN ID changes, certain UCSM process restarts, and CIMC resets can trigger this defect.  For OSes and applications that are I/O sensitive, this 3-5 second interruption is severe to catastrophic.

 

The fix is in UCSM code.  Since a UCSM domain update updates the UCSM code first, followed by NXOS, updating to a build with the fix means the fix takes effect before an FI reboot.  Hence the domain upgrade to the fixed build will not encounter the issue.
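For anyone auditing several domains, the version statements above can be turned into a quick check. This is only a rough sketch based on what is stated in this thread (introduced in 4.3(4a), fixed in 4.3(4e) and 4.3(5a)); the function names and the accepted version formats are my own assumptions, not anything Cisco publishes, and the Bug ID page remains the authority on affected releases.

```python
import re
from typing import NamedTuple


class UcsmVersion(NamedTuple):
    major: int
    minor: int
    maintenance: int
    patch: str


# Accepts the spellings used in this thread: "4.3(4a)", "4.3 (4a)", "4.3.4c".
# Assumption: the patch level is a single letter.
_VERSION_RE = re.compile(r"^\s*(\d+)\.(\d+)\s*[.(]\s*(\d+)\s*([a-z])\s*\)?\s*$")


def parse_version(text: str) -> UcsmVersion:
    """Parse a UCSM infrastructure version string into its components."""
    match = _VERSION_RE.match(text.lower())
    if match is None:
        raise ValueError(f"unrecognized UCSM version string: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return UcsmVersion(int(major), int(minor), int(maintenance), patch)


def is_affected(text: str) -> bool:
    """Return True for 4.3(4) builds older than 4.3(4e).

    Assumption: only the 4.3(4a) through 4.3(4d) builds carry the defect,
    per the statements in this thread.
    """
    version = parse_version(text)
    in_43_4_train = (version.major, version.minor, version.maintenance) == (4, 3, 4)
    return in_43_4_train and version.patch < "e"


if __name__ == "__main__":
    for candidate in ["4.2(3k)", "4.3(2c)", "4.3 (4a)", "4.3.4c", "4.3(4e)", "4.3(5a)"]:
        status = "AFFECTED" if is_affected(candidate) else "not affected"
        print(f"{candidate}: {status}")
```

Note that this only tells you whether a running build is in the affected range as described here; it says nothing about FI model (the defect is described for 6400- and 6500-series FIs) or about the upgrade path itself.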


29 Replies

marce1000
VIP

 

>...This bug needs priority attention, now.

  - I understand your frustrations; this group discusses 'bug experiences' amongst customers only.
    Failing or incomplete bug reports, such as those that do not point to a real fixed version,
    must always be directed towards TAC.

 M.



-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
    When the mirror will then always respond to me with ' The only thing that exceeds your brilliance is your beauty! '

"Frustrations" does not begin to describe it....

Cisco does have representatives monitoring these discussions.  These discussions are linked back to respective Bug IDs.  TAC has already gotten more than a keyboard-full of rhetoric on the subject.  We are pursuing escalation and resolution through the appropriate channels (including TAC) and will continue to do so outside of this discussion forum.

But I digress with the 'bug experience':

Total chaos.  ESXi hosts experiencing sporadic total storage connectivity loss across multiple storage devices.  Spontaneous VM workload failures and reboots.  Partial VMware HA recoveries.  Disruption to internally-hosted load balancing services.  Impacts to costly enterprise Oracle RAC environments.  The flapping of storage connectivity was so intermittent and erratic that ESXi hosts had to rate-limit their own event generation in their logs.  To make matters worse, much of the alerting/alarming that would have clued one into an issue resulting from the upgrade process was masked by the problem impacting services supporting email and SNMP.  OH...and what bug experience would be complete without reliving it twice because it occurred in two waves of the firmware upgrade process.  Not to mention the thousands of dollars in after-hours human efforts for recovery of systems and chasing a root cause for a sudden cataclysmic event, which was the result of an otherwise routine and successful infrastructure firmware upgrade.

The best bug experience of all is the irreparable damage to trust and reputation.  For over a decade, I, "System Administrator" have held pride in performing UCS firmware upgrade after UCS firmware upgrade, seamlessly at infrastructure level, across multiple UCS domains.  Our internal team has earned a reputation of dependable infrastructure maintenance without impacts to workloads.  We sell the promise to leadership that UCS is a system worth the cost because of the flexibility, stability, and redundancy it provides.  We invest in local, robust, on-prem cloud constructs with the idea that we have the capability because of systems like UCS providing the foundation to do so.

However, our 'bug experience' has shattered that trust and reputation.  We now need to answer to our sys admins and our leadership teams.  We must provide reasoning for why a time-proven infrastructure firmware upgrade has resulted in a revenue-impacting outage.  I must come up with the explanation for why a 'golden' firmware version contains such a debilitating, undisclosed (release notes) bug.  I must now wear the dis-honorary 'scapegoat' badge for why KPIs were tarnished.

Never again will we trust that infamous 'gold star' icon next to a UCS firmware version.


Never will we read a books-worth of release notes and expect to be properly informed for making an upgrade decision.  

 

To that end (of a rant), I am truly curious if any others have experienced this bug in particular and any information TAC may have provided you in response.  We were initially told to simply 'subscribe to the Bug ID for updates.'  Surely we don't pay our support contracts with the expectation of being told to stay tuned to a Bug ID page (especially when that bug is purportedly fixed).

We were in the 4.2 (2) space for UCS infrastructure firmware prior to the ill-fated upgrade to 4.3 (4a).  At this moment, we have not heard back on whether this bug is truly fixed.  We suspect that because we are now operating one of our UCS domains on a bugged version, we will encounter this bug again when it comes time to upgrade to a 'fixed' version. It should be entertaining (in a sadistic way) to explain why we need to migrate or otherwise shut down 100% of a data center's virtualized workload to safeguard it from an infrastructure firmware upgrade.

Mild developments:

We have successfully updated another UCS domain at the infrastructure level from 4.2 (2a) to 4.2 (3k) without fallout (the way it is supposed to be).  We only ran into a benign issue in which the FIs could not complete an automatic backup as part of the auto-install process.

Interestingly enough, the Bug ID page for CSCwk91747 has been updated.  Most importantly, the bug severity level now reflects "1 Catastrophic" which is much more in line with the serious fallout from the upgrade process.  We have also noticed that CSCwk91747 now includes all four 4.3 (4) versions.  Still disconcerting is the fact that the Bug ID page still reflects a status of 'fixed.'  Furthermore, all versions of 4.3 (4) remain available for download, with 4.3 (4a) still being the 'golden' version.

Several parties have now been tagged on the Cisco side.  We expect some action and reaction to occur when regular business hours resume Monday.

Additional developments:

There have been some additional details documented on the Bug ID page for CSCwk91747.  Namely, additional symptoms (or actions that instigate the bug) are listed.  It is disconcerting to see that even just adjustments at the vNIC/vHBA level can cause this bug to rear its ugly head at the individual system level.

The number of cases referencing the bug has gone up to a count of six.  There is also 'workaround' information, though it appears to only apply to individual operating systems impacted by the bug.  I assume with the statement of "if OS is hung," Cisco is referring to either a virtual system operating on the UCS platform or a physical system that is a boot-from-SAN configuration.

For the first time, a TAC escalation manager has offered an official recommendation of downgrading to a 4.3 (3) release to avoid the bug.  I've been told the bug is reportedly fixed, but that the 4.3 (4) release which will correct the code causing this bug has not yet been released.  This is purportedly being released by the end of the month.  ¯\_(ツ)_/¯  It would be nice if bug status indicators were more clear.  Simply labeling a bug with a 'fixed' status isn't helping convey a proper message.  How about a 'Fixed - pending release' at least?!

Either way (a downgrade or future upgrade) will impact our operations and likely require us to somehow isolate all workload to perform the upgrade and prevent storage-disruption impacts.

Update: New Firmware Just Released

Looks like the 4.3 (4) version branch is being abandoned entirely for 4.3 (5).  Version 4.3 (5a) was just released.  Here are the corresponding release notes (which now also call out CSCwk91747 appropriately as an open caveat in the 4.3 (4a) section): https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/release/notes/b_release-notes-ucsm-4_3.html#ucsmRN_435a_resolved

The Bug ID page will likely call out 4.3 (5a) as the fixed version sometime soon.

Is our organization willing to continue being test subjects by moving to a bleeding edge firmware release?  Probably (definitely) not.  We are currently pursuing a plan to downgrade to 4.2 (3k) and forego some mildly-needed software enhancements in 4.3 for the time being.  Planning for this downgrade will not be easy for us, as we'll need to gracefully take down 100% of the compute workloads that will be impacted by FC flapping.

The only positive impact thus far may be the increased likelihood in convincing our leadership to invest more in lab infrastructure to mirror the foundation of our production environment.  *fingers crossed that I, 'System Administrator,' gets a new hardware budget*

jonmiller5555
Level 1

Thanks for sharing this information and all the updates.  I recently upgraded from 4.3(2c) to 4.3(4c) and had this issue. It took TAC a little bit of time, but they linked it to this bug. Disconnecting storage to 1000 guest VMs for 3 seconds was disastrous, as you can imagine. We have found corruption on around 100 servers so far that we had to fix or restore from backup.

My big issue now is how will I fix this non disruptively going forward? Assuming a new firmware version has a fix, will it cause the same condition rebooting the FIs to get them upgraded? I'm hoping Cisco is able to post more information about this soon.

@jonmiller5555:

Our escalation with our regional Cisco reps and one of our VARs has turned up the heat a little bit.  When we first encountered the issue, our internal support teams targeted the issue as something completely unrelated to the UCS firmware upgrade.  Though at the time there was an unknown event coinciding with our UCS infrastructure firmware upgrade, we initially couldn't fathom that the UCS firmware upgrade was to blame.  Regardless, we erred on the side of caution and opened an SR with TAC that evening to review what was otherwise a clean firmware upgrade.  On the following day, we relayed more information we were finding that pointed to an FC issue.  Eventually TAC made the correlation with that additional info.

As for fixing this, the latest I've heard directly is that we are pretty much guaranteed to experience this bug again regardless of an upgrade (to a fixed version) or a downgrade (to a known clean version).  This is partly because one of the instigating actions is the act of a reboot of either FI.  That technically means that performing an upgrade/downgrade will still allow for this bug to present itself when the FIs reboot to activate the firmware.

For us, we operate a metro cluster configuration, so we have two data centers configured identically and operate them as one logical data center.  We'll be able to vMotion a good portion of our 'floating' workload to our other data center.  The problem will be the several 100 remaining systems we have bound to the one impacted data center.  The only clear way to guarantee we won't corrupt those systems with FC flapping is to take a total outage of those systems and shut them down prior to the upgrade/downgrade.  Needless to say, doing so won't be an easy feat for us and will take considerable effort in planning such an outage for the firmware upgrade/downgrade.

Unless your 1000+ VMs have a second home they can move to, I imagine you're going to have a harder time.  It would be nice to know some more specifics as to what 'code' in the FIs is causing this issue and why it can't otherwise be 'repaired' in-place to fix the problem without jumping to another version altogether.  I would much prefer an in-place patch of sorts...however, I can also see where implementation of such a code fix would likely require the FI to be restarted, which will still allow for the bug to impact FC traffic.  Find me between the rock and the hard place...

I assumed the same, all VMs will need to power down. Unfortunately we are not in a stretch cluster between our data centers, so I can't easily move them and don't have enough compute to run all of our workloads in one data center anyways. I did warn my team earlier today not to reboot the FIs at all, but we really only do it for upgrades anyways.

Our local Cisco reps have been helping with this issue as well; hopefully more to come soon.

Speaking of a "no FI reboots" declaration...one thing we did hear on the backend but did not see publicly disclosed yet was reference to a similar bug with relation to modifying VLANs.  One of the symptoms now listed on the Bug ID page is: "Any changes related to vNIC or vHBA to the service profile associated to the server (Impacts that specific server for which the Service profile is modified)"

My interpretation of this undisclosed similar bug is that there could be an issue with adding/modifying/removing VLANs, particularly widespread if your vNICs rely on templates.  If "any change to vNIC" for this bug can be instigated by adjustments to VLANs as well, you'll need to tread more carefully on any changes you make that could replicate to vNIC and vHBA configurations.  We use vNIC templates.  I know when we add a new VLAN to a vNIC template, the attached vNICs are automatically updated with the new VLAN config.  This could then (in theory) be an 'any change to vNIC' and result in the bug impacting FC traffic on those systems that had a vNIC updated.

 

Internally, we've essentially called a moratorium on ANY changes in our bugged UCS domain until we've moved to a clean version.

kevin.robson.1
Level 1

encountered this bug as well doing infrastructure upgrades to 4.3(4a). no mention of it in release notes like you said. nothing. Gold star firmware release, still active on download page.

JoeJordan3877
Level 1

We just rolled out 4.3.4c the past two months and ran into this issue.  As the original poster stated, nowhere in the release notes was this mentioned.  But all of a sudden it is listed along with the 4.3.5a firmware release notes as fixed and is listed under 4.3.4a as an open caveat.  I read the release notes up and down to be sure we don't run into these types of issues.   

There is another open caveat, not there previously, listed under 4.3.4a that causes vHBA issues if you add a VLAN, and it doesn't state it is fixed: CSCwm30262.

I have escalated this issue to my Cisco team also.  They need to pull these firmware releases immediately.

@JoeJordan3877:  Thanks for calling out CSCwm30262.  I think that is exactly in line with my previous response to @jonmiller5555 about avoiding more than just FI reboots.  It is very interesting to see DME called out as a culprit with that bug.

And you certainly shouldn't discredit your due diligence in reviewing release notes.  As I first started this thread, those bugs were NOT listed.  I beat myself up for a while thinking we had missed something.

I recommend that everyone reach out to their local/regional Cisco reps and escalate accordingly.  It's now evident that far more are falling into this bug trap than even I was first led to believe.  It is disconcerting to hear of others experiencing similar uphill battles with TAC.  One of the things we should have done was insist on escalating our SR to at least Sev-2...but at the time, we just didn't believe UCS could have been to blame.  Sev-2 and above will trigger not only greater priority but should open additional escalation channels behind the scenes.

@System Administrator can you check your PMs?