3922 Views, 35 Helpful, 9 Replies

Cisco UCS blades are unreliable, unstable and buggy.

Ben Conrad
Level 1

Hi Folks,

We've been a UCS customer for about 9 months.  We have been steadily replacing HP Gen6/7/8 blades with UCS B200 M3 and B200 M4 blades while also experiencing a large amount of growth; we are at 197 M3/M4 blades at the moment, behind 6248 and 6296 Fabric Interconnects.  We had an experienced systems integrator install each cluster (we have 5 clusters), and the design aspect is solid. We are on 2.2(3c).

 

We have been experiencing an abnormal amount of failures and bugs across multiple data centers and clusters:

- Manufacturing defects on M3 blades causing abnormally high memory failures and server crashes (several per week, across data centers)

  • There is no fix for this issue except to RMA the blades

- DOA Toshiba hard drives - they go DOA when a bug in the FI code tries to downgrade the drive firmware to a level not compatible with the drive, rendering the drive and blade useless until swapped with Seagate drives.  

  • There is now a fix for this bug: new FI code.  
  • The fix was not available when we hit this bug.

- M4 MegaRAID controllers with invalid FRU information cause server reboots:  We have some controllers with a manufacturing date of '255/255/255' or a serial number that looks like Egyptian hieroglyphics

  • Causes servers to reboot when FI mgmt fails over or is upgraded with new code
  • The fix is to go into the MegaRAID firmware and force new values, or to RMA the controller
  • FI firmware incorrectly evaluates the firmware values with a shallow discovery and then performs a deep discovery; a bug fix is in the works (a read-only query sketch for auditing controllers and DIMMs follows this list)

- M4 Servers randomly reboot and go into deep discovery, coming back online after 10 minutes

  • Another issue with FI firmware incorrectly evaluating valid MegaRAID controller firmware values; a bug fix is in the works
  • Applying the bug fix (when available) will cause additional M4 servers to randomly reboot (fun).

- LDAPS stops working for UCSM and SSH, so we can't log in to UCSM

  • A regression in our current code (LDAP daemon crash); no fix is currently available.  The workaround provided is to slightly modify the LDAP config.  If we simply failed over the mgmt server (which does fix the issue), we'd trigger the FRU reboot mentioned above (yep, that happened).
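
For anyone who wants to check their own domains for the same symptoms, here is a rough, read-only sketch using the Cisco UCS Python SDK (ucsmsdk). The UCSM address and credentials are placeholders, and the "garbled serial" check is only my guess at what bad FRU data looks like, so treat this as a starting point rather than anything Cisco-supported.

    # Read-only audit of DIMM health and MegaRAID controller FRU data via ucsmsdk.
    # Placeholders: replace the host, user and password with your own UCSM details.
    from ucsmsdk.ucshandle import UcsHandle

    handle = UcsHandle("ucsm-vip.example.com", "admin", "password")
    handle.login()

    # 1) DIMMs that UCSM no longer reports as fully operable.
    for dimm in handle.query_classid("memoryUnit"):
        if dimm.presence == "equipped" and dimm.operability != "operable":
            print("DIMM issue:", dimm.dn, dimm.operability, dimm.serial)

    # 2) Storage controllers whose serial/FRU data looks garbled
    #    (non-alphanumeric characters are a red flag for the '255/255/255' issue).
    for ctrl in handle.query_classid("storageController"):
        serial = ctrl.serial or ""
        if serial and not serial.isalnum():
            print("Suspect controller FRU:", ctrl.dn, ctrl.model, repr(serial))

    handle.logout()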

 

I can say that our HP blades were not perfect, but in the 7 years we've been running HP blades (well over 150 blades) we've never hit as many nasty bugs and manufacturing issues as we have in the past 9 months with UCS.  I had a chance to buy UCS in 2009 and passed on that option due to how new the platform was; I would never have imagined having so many teething issues with a platform that has been out for ~6 years.

If you are a large(ish) customer, can you comment on your experience with the general stability of UCS?

Thanks,

Ben

9 Replies

Keny Perez
Level 8

Ben,

I am a Cisco employee, and you know what I am going to say about our own product, so my question is (while we wait for other users to comment): have you reached out to your Cisco rep to align pre-deployment resources who can make the transition as error-free as possible?  There are some issues you cannot prevent, but there are measures that can be taken to avoid large issues.

The other question is about your support experience: have you had bad experiences with TAC support as well (compared against the previous support you had in those 7 years)?

 

-Kenny

Kenny,

I have a 3x weekly call with top Cisco resources (we are talking to 2 sales guys, 1 senior internal engineer, 1 project manager, 1 professional services person and a few others) regarding these issues.

Since my post 14 days ago here are our new issues:

- Production M4 blade crashes due to memory issues

  • approx 20 virtual machines crashed
  • Sent entire blade back to Cisco via FACT return

- Brand new M4 blade from the Cisco factory inserted and threw a triple DIMM error

  • Can't turn on blade.
  • Resolution in progress...

I find it hard to believe we are the only Cisco UCS customer experiencing this level of instability.  

Ben

We also have too many memory issues with our UCS B200 blades.

 

We never have memory issues with any other brand of hardware here.

 

There can only be two reasons here:

 

1. A huge quality issue

2. A firmware bug in memory health reporting


It's been a while, so let me update with our status:

 

- Cisco RMA'ed 114 M3 blades with defective memory sockets; we received new blades

- Some new M4 blades we received had defective memory; the issue was that a group of people at the factory were damaging the memory modules during insertion.  Cisco decided to build a device that helps the factory workers install the memory modules correctly.  Rookie manufacturing mistake, Cisco; how long have you been building blades???

- We had 5 servers reboot in one domain when we lost connectivity to the default gateway for the FI management interface.  Cisco could not figure out why this happens; it makes no sense.

- We recently had to hard power-cycle an ESXi server with about 20 VMs on it due to a "faulty" MegaRAID controller.  The server was online but had to be forcibly crashed in order to get the VMs off it due to the MegaRAID issue.

Quality gear!


jbennetsen
Level 1

I have received a brand new 5108 chassis with eight new B200 M4 blades. We had 5 out of the eight blades reporting fatal memory errors right after installing them. The memory installed is 16 of the 32GB DDR4 DIMMs per blade. I have never had this many issues deploying brand new servers before. This seems to be an issue with quality control, specifically an issue of not testing the blades with the installed memory long enough. The issue appears after VMware is loaded and/or firmware is updated.

andrew.north
Level 1

We are having the same issue, with a large number of memory DIMMs either failing or dead on installation. We have 1 blade in particular that has had DIMMs, system board and CPU replaced (we would like Cisco to replace the whole blade and fix/repair the old one on their own time, given the man-hours we have put into troubleshooting and repairing it, but no such luck yet). 

We have had several replacement DIMMs fail or just not work. Cisco have gone some way toward explaining a large number of the failures (correctable ECC errors) and provided us a link to a document, which may be of some help to others experiencing similar issues.

There is a whitepaper, C11-736116.pdf:

http://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-manager/whitepaper-c11-736116.pdf
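
If you want to see how many of your memory events are correctable versus uncorrectable before working through the whitepaper's thresholds, a rough read-only tally like the sketch below can help frame the conversation with TAC. It uses the Cisco UCS Python SDK (ucsmsdk); the host and credentials are placeholders, and the simple "dimm"/"memory" substring match is an assumption you may need to tune for your fault descriptions.

    # Tally memory-related faults by severity so correctable ECC noise can be
    # compared against the more serious uncorrectable/inoperable events.
    from collections import Counter
    from ucsmsdk.ucshandle import UcsHandle

    handle = UcsHandle("ucsm-vip.example.com", "admin", "password")  # placeholders
    handle.login()

    severity_counts = Counter()
    for fault in handle.query_classid("faultInst"):
        descr = (fault.descr or "").lower()
        if "dimm" in descr or "memory" in descr:
            severity_counts[fault.severity] += 1
            print(fault.created, fault.severity, fault.dn, fault.descr)

    # Per-severity totals to compare against the whitepaper's guidance.
    print(dict(severity_counts))

    handle.logout()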

7mwolcott7
Level 1

First, I don't ever want to diminish the impact of another customer's experience.  Sounds like you've had a pretty rotten time of it, and all I can say is that I hope Cisco is doing right by you to get things fixed.  Ideally, your frustrations will only motivate them to improve.

Second, my experience has been very different.  We started a pilot about 18 months ago to test UCS and ultimately to replace Dell rack servers with B200 M3/M4 blades.  Today we have well over 200 blades across 17 domains.  We now have more Cisco than Dell in the environment.  We've had a handful of memory failures, and one or two system board or CPU hardware failures, and the RMA process for those has always been smooth.  Nothing out of the ordinary.

On a firmware level, we had some compatibility issues with a Cisco driver in VMware causing intermittent storage problems, which took us quite a while to identify.  In the end it was a bug which we were the first customer (to our knowledge) to discover, and once it was finally identified, the firmware was quickly patched for us.

Has the experience been perfect?  Nope.  But it's definitely been an upgrade to what we had with Dell, not to mention the added value that Cisco software gives us from an automation perspective.  Based on our experience alone, I highly recommend UCS to enterprise customers.

Don't use the "High Performance" memory option, which runs the memory controller and RAM at the highest voltage. Avoiding it should provide long-term reliability and eliminate a lot of RMAs.
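
If you want to verify what your BIOS policies actually push for the low-voltage memory mode, the read-only sketch below uses the Cisco UCS Python SDK (ucsmsdk). The class ID biosVfLvDIMMSupport and the vp_lv_ddr_mode property are my assumptions about how UCSM exposes the LV DDR token (power-saving-mode vs. performance-mode), so check the names against your own UCSM object model before trusting the output.

    # Sketch: list the LV DDR memory mode configured in each BIOS policy.
    # Assumption: the token is exposed as class "biosVfLvDIMMSupport" with a
    # "vp_lv_ddr_mode" property; names may differ by UCSM version.
    from ucsmsdk.ucshandle import UcsHandle

    handle = UcsHandle("ucsm-vip.example.com", "admin", "password")  # placeholders
    handle.login()

    for token in handle.query_classid("biosVfLvDIMMSupport"):
        mode = getattr(token, "vp_lv_ddr_mode", "<property not found>")
        print(token.dn, mode)

    handle.logout()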

 

You've got to be prepared to bleed to be bleeding edge.

Rick1776
Level 5

This is quite shocking to hear. I'm wondering where you are getting your equipment from: the US, APAC, or Europe? 

 

I'm in the US and just stood up a 20-chassis environment with 160 B200 M5 servers and only had 5 servers that were either DOA or had memory issues. We are running UCS 3.1(3e) with really no issues. Next week we'll be standing up DC2 with the same number of chassis and servers. I'll see if anything has changed.

 

Really sorry to hear about your issues. 
