Step 1: where what how?

Cisco Employee

Hello again,

We're onto the next step in fixing a potential WAAS issue. This is assuming that you have not just removed the WAAS from the picture as outlined in the last post.

This post will be light on technical matters because as a first step we need to find out more about the problem. The best way to solve complex problems like this is to have a methodology. In the TAC we're great fans of Kepner-Tregoe and often use this process to guide our questions, but any method will be useful.

We do need a method because there are simply too many components, users and systems involved to search randomly.

I'm going to outline a few thoughts on finding the problem, but these are of course no substitute for a "real" root cause analysis method.

  • Who is having the problem? If for example all users at a certain branch are impacted, but other branches that use the same systems on the same servers are working fine, then it is likely that the problem is located at the branch. If for example all users of a certain datacenter are impacted for all protocols, then the problem is more likely to be at the datacenter. We need to find the smallest "from where to where" path that still shows the issue.
  • What application or protocol is having the problem? If for example HTTP traffic is working, but CIFS traffic is not, then this could be an indication that the CIFS AO is having some issue. It could also be that the server is not working, however. So when you compare protocols, try to find working and non-working protocols that are talking from the same client to the same server, so we don't accidentally think that the difference is the protocol while in fact we are talking to a different server on another continent.
  • Since when are we seeing the problem? Did someone test this before we noticed the problem? Often a 'critical new problem' turns out to be a new untested protocol that someone thought would create no problems...
  • Sometimes the problem is periodic, which can be very interesting to correlate with other events.
  • How frequent is the problem? Are connections failing all the time, only sometimes, or did it happen just once (but to the CEO)?

There will be a lot of pressure, not least from yourself, to "just get on with it" and stop asking useless questions.

Resist that pressure. Also try to spot people who are glossing over the facts by asking them for details. For example, if this was tested before: when, by whom, and what were the results of the tests? People sometimes think something was done in a certain way, but are in fact wrong.

Another thing to watch out for is concentrating only on the people having the problem. You should also ask yourself "who could have the same problem but is not complaining?". You need to ask those people if they are seeing the problem and, if not, try to find out what makes the people with the problem different from the people without the problem.
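To make the "what is different" question concrete, here is a minimal sketch that takes a list of observations and reports which attributes have completely different values for the failing and the working cases. The function name, the attribute names and all site names except London are hypothetical; this is just one simple way to organise the facts, not a Kepner-Tregoe implementation.

```python
def distinguishing_attributes(observations):
    """Return the attributes whose values never overlap between
    failing and working observations.

    Each observation is a dict with a boolean "works" key plus any
    number of descriptive attributes (site, protocol, server, ...).
    """
    failing = [o for o in observations if not o["works"]]
    working = [o for o in observations if o["works"]]
    if not failing or not working:
        # Without both groups there is nothing to compare against.
        return {}

    keys = set().union(*(o.keys() for o in observations)) - {"works"}
    result = {}
    for key in sorted(keys):
        fail_values = {o.get(key) for o in failing}
        work_values = {o.get(key) for o in working}
        # Only keep attributes that cleanly separate the two groups.
        if fail_values.isdisjoint(work_values):
            result[key] = {"failing": fail_values, "working": work_values}
    return result


# Hypothetical observations: same server, same protocol, different sites.
observations = [
    {"site": "Paris", "waas": True, "protocol": "CIFS", "server": "SRV42", "works": False},
    {"site": "Berlin", "waas": True, "protocol": "CIFS", "server": "SRV42", "works": False},
    {"site": "Madrid", "waas": False, "protocol": "CIFS", "server": "SRV42", "works": True},
    {"site": "London", "waas": False, "protocol": "CIFS", "server": "SRV42", "works": True},
]

for name, values in distinguishing_attributes(observations).items():
    print(name, values)
```

With this data the sketch reports `site` and `waas` as candidates, while `protocol` and `server` drop out because they are identical in both groups. It only works if you also collected observations from the people who are not complaining.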

We're often so concentrated on fixing the problem at hand we forget to check the things that still work. Personally I find this one of the hardest things to remember, so it was written on my wall for some time...

In the ideal case you should now have a clear problem description with steps to reproduce the problem at will. Something like:

All users of the \\SRV42\foobar share running on server SRV42 in the London datacenter cannot open the share. This was tried from several other sites: opening the share works for all non-WAAS enabled sites and in the datacenter itself, but always fails for the WAAS enabled sites.

We first noticed the problem yesterday, soon after the creation of the \\SRV42\foobar share. SRV42 itself is also new as of yesterday and is running Windows Server 2008R2 64bit.

We can open the remote desktop (RDP) on that server from the WAAS enabled sites and we can also browse the website on that server from the WAAS enabled sites.

Clearly we have narrowed down the problem a lot already, and from a state of "the sky is falling! panic!" we now have a limited problem on a new and untested system with only one protocol.

Please note that this will not only guide you in your search, but also make your manager happy (or at least happier). We now have a clear problem description with a defined scope and a direction for the problem analysis to take.

There are some tips and thoughts that you can use in finding this description:

  • Only IPv4 TCP connections are touched by WAAS. If ping or DNS does not work anymore, it is most likely not a WAAS issue.
  • The AOs are relatively independent: if the HTTP AO works, it can very well be that the CIFS AO fails. However, all AOs are based on the generic DRE/TFO system, so a failure there will affect all protocols.
  • WAAS works in pairs, and both devices need to work correctly.
  • Every connection that is handled by WAAS is independently discovered and accelerated, and so might involve different WAAS devices with different policies.
  • WAAS will drop any packets it sees twice, as this is a sign of a routing loop. This cannot be changed.
  • Knowing what should work and what should not is key in this. If ping does not work, but you have no idea if this is normal, then you cannot base any conclusions on it.
  • Network device and link outages can divert traffic from the 'normal' path onto unexpected paths that then cause problems, for example with traffic now flowing twice across routers doing WCCP redirection.
  • Never underestimate external factors like the ISP introducing QoS or extra security unannounced, or the financial department who forgot to pay the ISP bill.
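Since only TCP connections are touched by WAAS, a quick way to compare protocols from the same client to the same server is to try a plain TCP connect on each port. The sketch below uses only the Python standard library. The function name is hypothetical, and the port numbers are the usual defaults (445 for CIFS, 80 for HTTP, 3389 for RDP), which may differ in your environment.

```python
import socket


def probe_tcp_ports(host, ports, timeout=3.0):
    """Try a TCP connect to each named port on one host.

    `ports` maps a label (for example "CIFS") to a port number.
    Returns a dict mapping each label to "open" or "failed: <error type>".
    """
    results = {}
    for name, port in ports.items():
        try:
            # create_connection resolves the name and completes the handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = "open"
        except OSError as exc:
            # Covers refused connections, timeouts and name resolution errors.
            results[name] = f"failed: {exc.__class__.__name__}"
    return results
```

Calling `probe_tcp_ports("SRV42", {"CIFS": 445, "HTTP": 80, "RDP": 3389})` from a client at a failing site and again from a working site gives you directly comparable results. Keep in mind that a successful connect only shows the TCP handshake completed; the application on top, such as opening the share, can still fail afterwards.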

Next time: some practical problem searching on a WAE.

Peter