How could we solve intermittent network failure issues

tmsundar81 · ‎01-29-2007

I have a problem whose root cause I am finding difficult to identify. The end users are connected to access switches, which in turn are connected to core switches running HSRP; all of the switches are 3550s. Intermittently, users are thrown out of their applications and the switches become unreachable from the NOC. Both the NOC and the application servers are outside this network: the application is reached over the WAN, and the NOC is on another LAN. There are no suspicious logs on the switches, and the problem clears by itself. Is there any logging I can enable on the switches to find the cause, such as logging of STP changes?

tmsundar81 · ‎01-29-2007

It's not all of the switches; only a few, about 2 out of 10, have this problem, and the affected users are connected to those two switches.

gopi.tadikonda · ‎01-29-2007

Can you check how much memory is being utilized on the switch, and also try running a continuous ping to both switches?
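A continuous ping is most useful when each state change is timestamped, so that an outage can later be lined up against the switch logs. The following is only a minimal Python sketch of that idea, not a tested monitoring tool. The names `ping_once` and `monitor` are hypothetical, it assumes the system `ping` command is on the PATH, and the switch addresses in the usage note are placeholders.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2.0):
    """Send one ICMP echo with the system ping; True if it exits with 0.

    Assumption: exit code 0 means a reply was received. Some platforms
    report 0 in other cases too, so treat the result as a hint only.
    """
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def monitor(hosts, probe=ping_once, interval_s=5.0, cycles=None,
            clock=datetime.now, sleep=time.sleep):
    """Probe each host repeatedly and record only UP/DOWN transitions.

    cycles=None runs until interrupted; a number limits the loop.
    Returns a list of (timestamp, host, state) tuples.
    """
    last_state = {}
    events = []
    done = 0
    while cycles is None or done < cycles:
        for host in hosts:
            up = probe(host)
            if last_state.get(host) != up:
                event = (clock().isoformat(timespec="seconds"), host,
                         "UP" if up else "DOWN")
                events.append(event)
                print(*event)
                last_state[host] = up
        done += 1
        if cycles is None or done < cycles:
            sleep(interval_s)
    return events
```

To use it, call something like `monitor(["192.0.2.11", "192.0.2.12"])` with the management addresses of the two affected switches, and leave it running from a host on the NOC LAN. Because the problem shows up only every month or two, only transitions are printed rather than every reply.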

tmsundar81 · ‎01-29-2007

I don't find any issue with the memory, and the problem is intermittent, occurring about once a month or once every two months; it's not continuous.

tmar · ‎01-30-2007

There's not enough detail here to talk definitively about the issue, but I'll throw a couple of suggestions your way.

You need to identify which layer the failure is occurring at: L2, L3, etc.

If logging is in fact enabled and you were losing any of the routing or uplink interfaces on the switches, there should be a message generated in the log. Likewise, routing between the networks should only fail if a routed interface (i.e. the SVI) went down, or if the routing protocol shows that the route for the application network or for the end-user network changed. A show ip route command should tell you whether this is happening: the route shows a last-update timer, which would have reset at the time of the failure.
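To compare that last-update timer with the time users were dropped, it helps to turn the age string into a real duration. Here is a minimal Python sketch, assuming the age appears either as hh:mm:ss (for example 00:12:34) or in the compact unit form such as 1d02h, 3w4d or 1y0w. The function name `ios_age_to_timedelta` is hypothetical, and a year is treated as 365 days, which is only an approximation.

```python
import re
from datetime import timedelta

# Seconds per unit letter; "y" as 365 days is an approximation.
UNIT_SECONDS = {"y": 365 * 86400, "w": 7 * 86400, "d": 86400, "h": 3600}


def ios_age_to_timedelta(age):
    """Convert an age string such as '00:12:34' or '1d02h' to a timedelta."""
    age = age.strip()
    if re.fullmatch(r"\d+:\d{2}:\d{2}", age):
        hours, minutes, seconds = (int(part) for part in age.split(":"))
        return timedelta(hours=hours, minutes=minutes, seconds=seconds)
    parts = re.findall(r"(\d+)([ywdh])", age)
    # Reject strings that contain anything other than number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError(f"unrecognized age format: {age!r}")
    total = sum(int(n) * UNIT_SECONDS[u] for n, u in parts)
    return timedelta(seconds=total)
```

Subtracting the result from the time the command was run gives an approximate time of the last route change, which can then be checked against the time of the outage.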

If you're not seeing errors in the log files on any of the switches and the layer 3 routing doesn't appear to be the culprit, then I would investigate spanning tree. You may be experiencing a spanning-tree reconvergence which is blocking your users. One quick way to check is to do a show spanning-tree detail and look at the last topology change for the VLAN that's experiencing the problem.

The output should look something like this:

 VLAN0001 is executing the rstp compatible Spanning Tree protocol
  Bridge Identifier has priority 32768, sysid 1, address 0000.0000.0000
  Configured hello time 2, max age 20, forward delay 15
  We are the root of the spanning tree
  Topology change flag not set, detected flag not set
  Number of topology changes 1 last change occurred 1y0w ago
          from Port-channel1
  Times:  hold 1, topology change 35, notification 2
          hello 2, max age 20, forward delay 15
  Timers: hello 0, topology change 0, notification 0, aging 300

If the "Number of topology changes" count is high and the "last change" is fairly recent, then spanning tree has reconverged often or recently for the VLAN that is experiencing the problem. If this is the case, review your switching design against best practices: turn PortFast on for all end-user ports, and remove loops in the topology where you can without sacrificing redundancy. If you turn on PortFast for end-user ports, make sure you also use protective features on the switches such as bpduguard, loopguard, storm-control, etc., so that a user plugging a hub into your network can't take it down with a broadcast storm.
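If you capture show spanning-tree detail from several switches, pulling out just the topology change counters makes them easier to compare. The following is a rough Python sketch, assuming the output looks like the sample above (per-VLAN sections that start with "VLANnnnn is executing"). The function name `parse_topology_changes` is hypothetical, and other output formats are not handled.

```python
import re

VLAN_RE = re.compile(r"^\s*(VLAN\d+) is executing")
CHANGE_RE = re.compile(
    r"Number of topology changes (\d+) last change occurred (\S+) ago"
)
FROM_RE = re.compile(r"^\s*from (\S+)")


def parse_topology_changes(output):
    """Return one dict per VLAN with its topology change count,
    the age of the last change, and the port it came from."""
    results = []
    current = None
    for line in output.splitlines():
        match = VLAN_RE.match(line)
        if match:
            current = {"vlan": match.group(1), "changes": None,
                       "last_change": None, "from_port": None}
            results.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first VLAN header
        match = CHANGE_RE.search(line)
        if match:
            current["changes"] = int(match.group(1))
            current["last_change"] = match.group(2)
            continue
        match = FROM_RE.match(line)
        if match and current["from_port"] is None:
            current["from_port"] = match.group(1)
    return results
```

Run against the sample above, this yields a single entry for VLAN0001 with 1 change, last seen 1y0w ago, from Port-channel1. On the two problem switches, a large count with a recent age on the users' VLAN would point toward spanning tree.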

Hope this is helpful.

Tod Martinsen