01-22-2013 07:12 AM - edited 03-07-2019 11:14 AM
I work at a hospital, and one of our departments uses specialized software created by Varian Medical Systems. It has been brought to my attention that one of those applications has trouble opening during the day. I had the users demonstrate the issue, and from what they explained, they are supposed to be able to open the application, log in, and be presented with a list of radiology images to choose from. Unfortunately, during the day this often fails, and they have to try 3 or 4 times before it actually works. It behaves differently after 4 PM, usually working on the 1st or 2nd attempt at that time of day. According to what I've been told, this has been an issue for as long as they can remember. They asked me to take a look at it, hoping that I could help.
Varian has told me that they have done a number of things on their side to rule out their software, and they think it is a network issue. We used the SolarWinds Engineer's Toolset (specifically the Network Performance Monitor) to monitor their switch; it reports no errors, and the utilization graphs show that the ports involved have very little utilization. The most heavily utilized port (Fa0/40) hovers between 5 and 10 percent. I've included a network diagram, but basically we have one 10/100 Cisco 3550 switch (c3550-ipservices-mz.122-25.SEB4.bin), 4 clients, and 2 servers involved. They are all connected to the same switch at auto-negotiated 100 Mb/s full duplex (A-Full/A-100Mbps). Although the Network Performance Monitor doesn't show any errors or overutilization of the ports, in the CLI I do see 35 output buffer failures and 35 underruns on the port connected to one of the servers (Fa0/40). They were a little higher; I cleared them about two weeks ago and then rebooted the switch, because I found that could alleviate these types of errors.
They say the software uses ports 5000, 55000, 55010, and 55020. We tried a packet capture, but I didn't have enough experience/knowledge to get anything useful out of it. I also checked the event logs on the clients and servers, and nothing there indicates an issue in the software. They want us to replace the switch with a gigabit switch, but we have a REALLY limited budget and I would rather not if it isn't necessary. Do any of you have any suggestions as to what I could try in order to rule out the network?
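One low-effort way to gather evidence during the affected hours, without needing to interpret a packet capture, is to time TCP connections to the application ports. A minimal sketch (the port list is from the post above; the server hostname is a placeholder you would replace with the real app server name):

```python
# Sketch: measure TCP handshake time to the Varian application ports.
# Ports are from the vendor's list; the hostname is a placeholder.
import socket
import time

APP_PORTS = [5000, 55000, 55010, 55020]

def tcp_connect_time(host, port, timeout=5.0):
    """Return seconds taken to complete a TCP handshake, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

def probe(host, ports=APP_PORTS):
    """Probe each application port once and print the result."""
    results = {}
    for port in ports:
        elapsed = tcp_connect_time(host, port)
        status = "FAILED" if elapsed is None else f"{elapsed * 1000:.1f} ms"
        results[port] = elapsed
        print(f"{host}:{port} -> {status}")
    return results

# Example: probe("app-server")   # placeholder hostname
```

Run it every few minutes (e.g. from Task Scheduler or cron) and compare the daytime numbers against the after-4-PM numbers; failed or slow handshakes during the bad window would point at the network or server, while clean handshakes would point back at the application.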
01-22-2013 07:36 AM
Hi,
OK, these ones are always fun, and a top-down troubleshooting approach is the best way to attack it. This is where you review and analyse from the application layer down to the network layer. You may have to guide the application vendor down the correct troubleshooting path here also; I would not take their word for anything at this point (trust, but verify).
So, what would I do...
First thing: is this happening across all users?
If you launch the app on the App Server during the affected times, does the issue occur?
You are sure that the issue stops after 4PM?
I note an App and DB server in the diagram; ask the person/vendor that manages the servers to provide historical, and even daily, stats on disk I/O, memory, and CPU usage.
Are any other apps running on either of these servers that can be checked against these times?
From a network perspective (in no particular order): ensure the servers are hard-coded 100/full on their NICs, and that the switch ports are hard-coded to match.
Check each switch port for errors and discards.
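The switch-side checks above can be done from the 3550's CLI. A sketch, using the Fa0/40 interface named in the original post (commands are from mainline Catalyst IOS 12.2; verify the exact syntax against your image):

```
! Hard-code speed/duplex on the server-facing port; the server NIC must be
! set to 100/full as well, or you will create a duplex mismatch.
interface FastEthernet0/40
 speed 100
 duplex full
!
! Watch the error counters and the buffer/underrun counts during the bad window:
show interfaces FastEthernet0/40 counters errors
show interfaces FastEthernet0/40 | include buffer|underruns
!
! Check switch CPU usage:
show processes cpu
```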
Check CPU usage in the switch.
Any DNS issues in AD at these times, e.g. slow DNS lookups?
Check server response times using ping during the affected times.
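The DNS check above is easy to script as well. A minimal sketch that times name resolution (the two hostnames are placeholders for the app and DB server names):

```python
# Sketch: time DNS lookups to spot slow resolution during the affected hours.
import socket
import time

def dns_lookup_time(hostname):
    """Return seconds taken to resolve a hostname, or None on failure."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
        return time.monotonic() - start
    except socket.gaierror:
        return None

for host in ["app-server", "db-server"]:  # placeholder hostnames
    elapsed = dns_lookup_time(host)
    print(host, "unresolvable" if elapsed is None else f"{elapsed * 1000:.1f} ms")
```

Resolution times that jump from a few milliseconds to hundreds during business hours would implicate the AD/DNS servers rather than the switch.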
That might do it for now. Remember to gather as much information as possible before throwing any new equipment into the mix. There is always a root cause, and extra bandwidth is rarely the answer.
Regards
Stephen
==========================
http://www.rConfig.com
A free, open source network device configuration management tool, customizable to your needs!
- Always vote on an answer if you found it helpful
01-22-2013 07:32 AM
Imagery-intensive applications generally produce very "bursty" (should be a word) traffic. These quick bursts of traffic may not show up on your monitoring tools because they get leveled out over the polling interval. This is somewhat supported by the errors you are seeing, especially if they are on the imagery server. It is highly likely that you would see an improvement by upgrading to a gigabit switch: in addition to a higher line speed, gigabit switches generally have larger port buffers that can absorb and level out the surges.
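The averaging effect is easy to see with a toy calculation: a one-second 90 Mb/s burst on a 100 Mb/s link all but disappears in a 5-minute utilization average, which is all a poller reports (the numbers below are illustrative, not from the original post):

```python
# Toy illustration: a microburst hidden by polling-interval averaging.
LINK_MBPS = 100
POLL_SECONDS = 300  # typical 5-minute polling interval

# Per-second traffic in Mb/s: near-idle except one large image-transfer burst.
samples = [1.0] * POLL_SECONDS
samples[42] = 90.0  # a one-second microburst that can overflow port buffers

peak = max(samples)
average = sum(samples) / len(samples)

print(f"peak utilization:   {peak / LINK_MBPS:.0%}")     # 90%
print(f"polled utilization: {average / LINK_MBPS:.1%}")  # ~1.3%
```

This is why the monitor can show 5-10% utilization while the switch CLI still racks up output buffer failures on the same port.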
Best Regards,
Greg
01-22-2013 08:32 AM
Disclaimer
The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.
Liability Disclaimer
In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.
Posting
From what you've described, it does sound like transient congestion is adversely affecting this particular application. The fact that it's only seen during prime business hours points to transient congestion. If just this one application is impacted, that often points to the application itself (some applications don't tolerate any network degradation well) or to the hosts that support the application (some OSs or NICs, especially with out-of-date drivers or OS patches, may not tolerate high loads).
Unfortunately, determining the actual cause of this kind of issue can be very difficult. Often the "easiest" method is: replace X; problem resolved? If it is, the problem was X. If not, replace Y. . .
BTW, don't discount the possibility that the problem lies within the application. Often such vertical-market vendors don't, or are unable to, stress-test their application. They often avoid updating drivers or the OS being used, as "if it ain't broke, don't fix it". To be fair, upgrades or patches can break things too, but when drivers or OSs are patched, it's often for some discovered defect. So, it's the devil (they think they) know vs. the devil they don't know.
01-30-2013 07:22 AM
From what I understood, the application was failing when accessing the database. Once they could access the database, they could select a set of images and then view them.
I asked the department a few of the suggested questions and found that all users were affected, but only by that specific application. Other applications used by that department, developed by the same vendor, worked fine.
The vendor didn't provide us with any statistics. They seem to have washed their hands of it, insisting that they've checked everything and done everything they could. Both servers run Windows Server 2003, so we enabled Performance logging.
Performance Monitor gave us a lot of useful information. We were able to identify the hard disks as the bottleneck. The vendor adjusted some things on the server, and we are waiting to see if that has resolved the problem. Luckily, the department was going to upgrade those servers in June anyway, so now that they see the servers are the bottleneck, maybe they will do it sooner.
We had checked the switch pretty thoroughly and didn't see any problems. If something like this comes up again, I will try hard-coding the switch interface and the server NIC.
THANK YOU all very much. I really do appreciate your help.