 
					
				
		
04-29-2015 08:31 AM - edited 03-04-2019 02:26 AM
Hi Folks,
We've been a UCS customer for about 9 months. We have been steadily replacing HP Gen6/7/8 blades with UCS B200 M3 and B200 M4 blades while also experiencing a large amount of growth; we are at 197 M3/M4 blades at the moment, with 6148 and 6196 Fabric Interconnects. We had an experienced systems integrator install each cluster (we have 5 clusters) and the design aspect is solid. We are on 2.2(3c).
We have been experiencing an abnormal number of failures and bugs across multiple data centers and clusters:
- Manufacturing defects on M3 blades causing abnormally high memory failures and server crashes (several per week, across data centers)
- DOA Toshiba hard drives - they go DOA when a bug in the FI code tries to downgrade the drive firmware to a level not compatible with the drive, rendering the drive and blade useless until swapped with Seagate drives.
- M4 MegaRAID controllers with invalid FRU information cause server reboots: we have some controllers with a manufacturing date of '255/255/255' or a serial number that looks like Egyptian hieroglyphics
- M4 Servers randomly reboot and go into deep discovery, coming back online after 10 minutes
- LDAPS stops working on UCSM and SSH, so we can't log in to UCSM
I can say that our HP blades were not perfect, but in the 7 years we've been running HP blades (well over 150 blades) we've never hit as many nasty bugs and manufacturing issues as we've had in the past 9 months with UCS. I had a chance to buy UCS in 2009 and passed on that option due to how new the platform was. I would never have imagined having so many teething issues with a platform that has been out for ~6 years.
If you are a large(ish) customer can you comment on your experience with the general stability of UCS?
Thanks,
Ben
 
					
				
		
05-02-2015 01:28 PM
Ben,
I am a Cisco employee and you know what I am going to say about our own product, so my question is (while we wait for other users to comment): have you reached out to your Cisco rep to align resources pre-deployment who can make the transition as error-free as possible? There are some issues you cannot prevent, but there are measures that can be taken to avoid large issues.
The other question is, how about your support experience? Have you had bad experiences with TAC support as well (compared against the previous support you had in those 7 years)?
-Kenny
 
					
				
		
05-13-2015 02:20 PM
Kenny,
I have a 3x weekly call with top Cisco resources (we are talking to 2 sales guys, 1 senior internal engineer, 1 project manager, 1 professional services person and a few others) regarding these issues.
Since my post 14 days ago here are our new issues:
- Production M4 blade crashes due to memory issues
- Brand new M4 blade from the Cisco factory inserted and threw a triple DIMM error
I find it hard to believe we are the only Cisco UCS customer experiencing this level of instability.
Ben
05-26-2015 05:26 AM
We also have too many memory issues with our UCS B200 blades.
We never have memory issues with any other brand of hardware here.
There can only be two reasons here:
1. Huge Quality Issue
2. FW Bug regarding Memory health
 
					
				
		
10-08-2015 06:23 AM
It's been a while so let me update with our status:
- Cisco RMA'ed 114 M3 blades with defective memory sockets, we received new blades
- We received some new M4 blades with defective memory; the issue was that some group of people at the factory were damaging the memory modules during insertion. Cisco decided to build a device that helps the factory workers install the memory modules correctly. Rookie manufacturing mistake, Cisco. How long have you been building blades???
- We had 5 servers reboot on one domain when we lost connectivity to the default gateway for the FI management interface. Cisco could not figure out why this happens, makes no sense.
- We recently had to hard power cycle an ESXi server with about 20 VMs on it due to a "faulty" MegaRAID controller. The server was online but had to be force crashed in order to get the VMs off the server due to the MegaRAID issue.
Quality gear!
 
					
				
		
01-26-2016 07:46 AM
I have received a brand new 5108 chassis with eight new B200 M4 blades. We had 5 out of the eight blades reporting fatal memory errors right after installing them. The memory that was installed is 16 of the 32GB DDR4 DIMMs per blade. I have never had this many issues with deploying brand new servers before. This seems to be an issue with quality control, specifically of not testing the blades with the installed memory long enough. The issue appears after VMware is loaded and/or firmware is updated.
03-23-2016 03:07 AM
We are having the same issue with a large number of memory DIMMs either failing or failed from installation. We have 1 blade in particular that has had DIMMs, system board and CPU replaced. (We would like Cisco to replace the whole blade and repair the old one in their own time, given the man hours we have put in to troubleshooting and repairing it, but no such luck yet.)
We have had several replacement DIMMs fail or just not work. Cisco have gone some way to explain a large number of failures (Correctable ECC Error) and provided us a link to a document - which may be of some help/use to others experiencing similar/same issues.
There is a whitepaper C11-736116.pdf
05-03-2017 02:58 PM
First, I don't ever want to diminish the impact of another customer's experience. Sounds like you've had a pretty rotten time of it, and all I can say is that I hope Cisco is doing right by you to get things fixed. Ideally, your frustrations will only motivate them to improve.
Second, my experience has been very different. We started a pilot about 18 months ago to test UCS and ultimately to replace Dell rack servers with B200 M3/M4 blades. Today we have well over 200 blades across 17 domains. We now have more Cisco than Dell in the environment. We've had a handful of memory failures, and one or two system board or CPU hardware failures, and the RMA process for those has always been smooth. Nothing out of the ordinary.
On a firmware level, we had some compatibility issues with a Cisco driver in VMware causing intermittent storage problems, which took us quite a while to identify. In the end it was a bug which we were the first customer (to our knowledge) to discover, and once it was finally identified, the firmware was quickly patched for us.
Has the experience been perfect? Nope. But it's definitely been an upgrade to what we had with Dell, not to mention the added value that Cisco software gives us from an automation perspective. Based on our experience alone, I highly recommend UCS to enterprise customers.
 
					
				
		
01-30-2018 10:19 AM
Don't use the "High Performance" memory option, which runs the memory controller and RAM at the highest voltage. Avoiding it should improve long-term reliability and eliminate a lot of RMAs.
You've got to be prepared to bleed to be bleeding edge.
 
					
				
		
01-30-2018 01:03 PM - edited 01-30-2018 01:31 PM
This is quite shocking to hear. I'm wondering where you are getting your equipment from: US, Asia/Pacific, or Europe?
I'm in the US and just stood up a 20 chassis environment with 160 B200 M5 servers and only had 5 servers that were either DOA or had memory issues. We are running UCS 3.1(3e) with really no issues. Next week we'll be standing up DC2 with the same number of chassis and servers. I'll see if anything has changed.
Really sorry to hear about your issues.
 
					
				
				
			
		