During the last month, we've found a problem in some CM servers...
What happens is that the customer reports that something is wrong with one of the CM servers in the cluster, and we verify that:
- there's no web access (or sometimes, when we browse to the CM IP address, instead of the usual web page we get another one with just the word Platform, and if we follow that link we get a Tomcat error: HTTP Status 404 - /iptplatform - Apache Tomcat/5.5.28)
- there's no SSH access
- using a keyboard and monitor, we see nothing displayed
- it replies to ping requests (this is the only thing that works)
- telephones register to the next CM server in the CM list
- SIP traffic sent to this server is lost... (it seems that none of the services is up and running)
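In case it is useful for spotting this state quickly, here is a minimal sketch of a probe for the symptom pattern listed above (replies to ping, but no SSH and no web). It is only an illustration, not a Cisco tool: the function names, the ports (22 for SSH, 443 for the web page), and the Linux-style ping flags are all assumptions. It only checks whether the TCP ports accept a connection, so the case where a page with just the word Platform is served would show up as "degraded" rather than "hung".

```python
import socket
import subprocess
import sys


def ping_ok(host, timeout_s=2):
    """Send one ICMP echo using the system ping (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def tcp_port_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def classify(ping, ssh, web):
    """Map the three probe results to a coarse state (labels are hypothetical)."""
    if not ping:
        return "unreachable"
    if not ssh and not web:
        return "hung"  # answers ping only, as in the symptoms above
    if ssh and web:
        return "healthy"
    return "degraded"


if __name__ == "__main__" and len(sys.argv) > 1:
    host = sys.argv[1]
    state = classify(
        ping_ok(host),
        tcp_port_open(host, 22),   # SSH port, assumed default
        tcp_port_open(host, 443),  # HTTPS port, assumed
    )
    print(host, state)
```

Run it with the server address as the only argument; a result of "hung" matches what we see on the affected servers.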
Up until now, we've faced this on 7 servers:
Cluster 1 - Pub CM v 184.108.40.2060 7816I4
Cluster 2 - Pub CM v 220.127.116.1100 7825I4
Cluster 3 - Pub and, after that, Subs CM v 18.104.22.1680 7825I4
Cluster 4 - Pub and, after that, Subs CM v 22.214.171.1240 7825I4
Cluster 5 - Pub CM v 126.96.36.19900 7825I4
As you can see, the CM version is not always the same, but all of them are IBM I4 servers.
This is the only common factor we've found among them: we have installed many other clusters on other server types, and none of those shows this issue.
Usually we recover the server by rebooting or, if that doesn't work, by using the recovery disk, but we are afraid this is a hardware bug that could recur after some time or appear in other new deployments.
Has anyone faced something similar?
Does anyone know about any problem with these platforms?
Thanks in advance
Just in case it helps anyone else...
We've found that there are two bugs already open for this issue:
- See CSCti58651 if you encounter this issue on a MCS-7816-I4.
Neither of them is resolved, and the workaround provided is the same one I described previously.
Carmen is right on with those two defects. Lots of customers are seeing this with the 7816, 7825, and 7828 I4 servers.
A new firmware patch for the hard drives on these servers was posted over the weekend. While we aren't yet calling it fixed, we strongly believe the firmware will help. See the document linked in Carmen's post for details.
OK, so do I need to update if any of my servers have 3b04 OR 3b05 firmware on the hard drives with the referenced model numbers? The example output shows 3b04, but I could swear it used to show 3b05.