08-24-2005 10:46 AM
We are running backups over the SAN. The HBAs on several client hosts are connected via the SAN to an HBA on the backup server. All HBAs are assigned IP addresses. All hosts run Solaris 10. The HBAs are Emulex LP9002 with the 6.01f driver and 3.92a3 firmware. The switches are Cisco MDS running 1.3(4b) code.
In the last several weeks, we have begun experiencing problems. A backup will be running fine, but will then time out with the following message: ANS1017E (RC-50) Session rejected: TCP/IP connection failure
After receiving the message, the operator will try to ping the backup server over the SAN, and the ping will fail.
When I look at the switch, I see a solid connection and no errors. I can issue an FCping command from the switch to the HBAs and get successful replies. If I reset the port on the switch, service is restored, but it eventually fails again (although it may be several days before the next failure), and all the while I see no link errors on the switch. The failures have occurred during periods of both heavy and light SAN traffic. Often one host will fail while another continues without any problem.
Does anyone have any ideas on what might be causing these failures?
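To pin down exactly when the IP path drops while the FC link stays up, one option is to log reachability changes from a client host and compare the timestamps with the switch logs. The following is only a minimal sketch: the function names are made up for illustration, it assumes Python 3 is available on the host, and it assumes the Solaris form of ping (`ping host timeout`), which exits with status 0 when the host answers.

```python
import subprocess


def ping_once(host, timeout_s=2):
    """Return True if `host` answers one ping.

    Assumes the Solaris syntax `ping host timeout`; adjust for other systems.
    """
    result = subprocess.run(
        ["ping", host, str(timeout_s)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def record_transitions(samples):
    """Reduce (timestamp, reachable) samples to the points where state changes.

    The first sample is always reported so the starting state is known.
    """
    events = []
    previous = None
    for timestamp, reachable in samples:
        if reachable != previous:
            events.append((timestamp, "UP" if reachable else "DOWN"))
            previous = reachable
    return events
```

Calling `ping_once` against the backup server's SAN IP address every minute or so and feeding the results to `record_transitions` would give a short list of outage start and end times for each host.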
08-25-2005 11:34 AM
I would think that this is an issue with the Emulex cards (firmware or driver). The switch just carries the FC traffic and doesn't care what's encapsulated inside it; the Emulex cards do the IP work. Since you say that you can successfully FCping both interfaces and there are no errors on the interfaces, the switch is doing its job perfectly. In other words, the TCP/IP session didn't fail because of some sort of FC service disruption in the switch. Resetting the ports probably clears the problem temporarily because the HBAs are then forced to log in to the fabric again.
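The reasoning above amounts to checking the layers from the bottom up and blaming the first one that fails. A rough sketch of that triage follows; the function name and the returned labels are invented for illustration and do not come from any Cisco or Emulex tool.

```python
def likely_fault_layer(fc_link_up, fcping_ok, ip_ping_ok):
    """Return the lowest layer that is failing, checked from the bottom up.

    The labels are illustrative only. The three inputs are what the operator
    observes: switch port state, FCping result, and IP ping result.
    """
    if not fc_link_up:
        return "fc-link"        # physical link or port problem
    if not fcping_ok:
        return "fc-fabric"      # link is up but the FC path is not
    if not ip_ping_ok:
        return "ip-over-fc"     # FC is fine, so suspect HBA driver/firmware
    return "none"
```

For the symptoms described in this thread (link up, FCping succeeds, IP ping fails), `likely_fault_layer(True, True, False)` returns `"ip-over-fc"`, which points at the HBAs rather than the switch.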
I would definitely suspect the HBAs first, and seek support from Emulex. I noticed that there are newer driver and firmware revisions on their site. Perhaps they address the issue?
08-25-2005 09:16 PM
Thanks for your thoughts. That is what I was thinking. We upgraded the driver on one of the servers to 6.02F today and are currently running a test backup. It's been running without interruption for about 5 hours now. So far so good.