Major issues on LAN - help

carl_townshend · ‎12-23-2014

Hi All

Yesterday we had major issues on our LAN. We have 2 x 4507 core switches, with 2960S switches connected at 10 Gig to both cores.

The site called saying nothing was accessible. I checked the cores and both were at 100% CPU.

The process using the CPU was K5CpuMan Review.

A lot of the uplinks to the switches had been err-disabled with the reason "loopback error", and I could not reach most of the switches!

On one of the switches I managed to get onto, I ran show controllers utilization: 2 ports were transmitting at 100% and the uplink was receiving at 100%.

3 switches are still powered off and the core is running at 50% CPU. Does this seem high?

What could the issue be? A loop or broadcast storm, maybe?

What are the best commands on the 4507 for seeing what's going on?

Cheers

Sarbjit-2014 · ‎12-23-2014

Hi Carl,

Try Switch#show processes cpu
This command will hopefully narrow down which process is taking up a lot of CPU. This might be a silly question, but are you running debug by mistake?
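
On a Catalyst 4500 the generic process list is usually only the first step, because platform processes such as K5CpuMan Review are broken out in a separate platform view. A rough sequence, assuming a supervisor running classic IOS, might look like the following. Output formats differ between releases, so no sample output is shown.

```
! IOS processes currently consuming CPU (idle entries hidden)
show processes cpu sorted | exclude 0.00%
! Platform-level processes, where K5CpuMan Review is listed
show platform health
! Which CPU queues are receiving packets, and at what rate
show platform cpu packet statistics
```

If the last command shows one queue receiving far more packets than the others, that queue is a reasonable place to start looking for the traffic that is reaching the CPU.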

carl_townshend · ‎12-23-2014

The issue still seems to be there, but it is better.

The CPU on core 1 is around 50%; core 2 is about 30%.

Is that high?

It's not a big network: about 25 2960S switches connected at 10 Gig!

I did a packet capture on a trunk port on the core, and I'm seeing around 5000 packets per second of multicast to 224.0.0.2 (HSRP)!

That seems highly excessive!

How can we get a loop like this on all VLANs?

Could it be an issue on the cores?

Madhukrishnan Gopinathan Nair · ‎12-23-2014

CPU running at 50% is okay, but how much was it before the incident?

I am not quite sure about the packets you see in the sniffer. What are the sources of that traffic?

Are they all from a particular MAC address?

Also share the output of the following:

sh ver

sh proc cpu sorted | ex 0.00%

sh mod

Peter Paluch · ‎12-23-2014

Carl,

Some of the symptoms you describe are consistent with a switching loop indeed.

One of the most important clues is the number of err-disabled ports due to loopback. The "loopback" cause for err-disabling a port is, to the best of my knowledge, always related to a switch receiving back its own LOOP frame. A LOOP frame is sent every 10 seconds out of each switchport, and is both sourced from and destined to the MAC address of the port from which it was sent; in other words, the source and destination MAC addresses of a LOOP frame are identical and set to the MAC address of the switchport that originated the frame. A neighboring switch receiving such a LOOP frame would, ordinarily, never send the frame back, because that would constitute forwarding the frame back out of the very port on which it was received, and switches should never do that.

So for a switch to actually receive back its own LOOP frame on the port it was originally sent from, either a switching loop must have occurred somewhere in the network, or the neighboring switch must have undergone some strange moment of "mind-blindness" that caused it to forward a frame back out of the ingress port. The cause for this is so far unknown.
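
To see which ports were affected and with what reason, commands along these lines can be run on each switch. This is only a sketch: the interface name TenGigabitEthernet1/1 is a placeholder, not taken from your network, and a port should be re-enabled only once the loop itself has been dealt with.

```
! List ports that are currently err-disabled, with the reason for each
show interfaces status err-disabled
! Show which err-disable causes have automatic recovery enabled
show errdisable recovery
!
! Manually re-enable one err-disabled port (placeholder interface name)
configure terminal
 interface TenGigabitEthernet1/1
  shutdown
  no shutdown
end
```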

A number of questions and recommendations - please try to respond to each one of them:

What kind of STP are you running in your network? Specifically, with MST, I have seen these situations occur during major reconvergence events - I have not been able yet to pinpoint the exact cause. I do not think MST alone is to blame; rather, something about its implementation in Catalysts is not quite right.
Are you running Loop Guard on all your switches? If not, be sure to activate it using the global command spanning-tree loopguard default. Loop Guard is a preventive measure helpful on all switches and all port types, including copper ports.
Do you have any fiber links between switches in your network? If yes, are you running UDLD on them? Make sure that each switch that has a fiber link to another switch is running UDLD aggressive; accomplish that by using udld aggressive global configuration command (it automatically applies to fiber ports only). Do not activate UDLD on copper ports, though.
Do you have a network topology diagram with all switches and links in it, and with all STP facts indicated, including the root switch, root ports on other switches, and designated and non-designated port roles and states? If not, create one. Then, as this appears to be a switching loop issue, verify using the show span, show span blockedports, show span root and show span bridge commands that on each switch in the topology, the placement of the root switch, root ports, designated and non-designated ports corresponds to the baseline network topology diagram. Any port whose role or state differs from what it should be should be shut down, and the cause of the discrepancy investigated.
If you are using any PortFast ports in your network, be absolutely sure to have them protected using BPDU Guard. It is not a 100% prevention against switching loops caused by interconnecting two PortFast ports but it certainly is an additional layer of protection.
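
The global protections recommended above could be applied roughly as follows. Treat this as a sketch under assumptions: it uses classic IOS syntax, and some newer releases spell the BPDU Guard default as spanning-tree portfast edge bpduguard default, so check the syntax against your software version.

```
configure terminal
 ! Loop Guard as the default on all ports
 spanning-tree loopguard default
 ! UDLD in aggressive mode; the global command applies to fiber ports only
 udld aggressive
 ! BPDU Guard on every PortFast-enabled port
 spanning-tree portfast bpduguard default
end
!
! Verify: the summary lists the Loop Guard and BPDU Guard defaults
show spanning-tree summary
! Verify UDLD state per port
show udld
```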

Best regards,
Peter

paul driver · ‎12-23-2014

Hello

My initial thought would indeed be a broadcast storm or STP loop. This can certainly bring down a network, and you also mentioned loopback err-disabled interfaces.

You don't say what precautions are in place to guard against such issues, but manually defining STP roots and enabling port security would be the way to go.
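
As an illustration only, the two precautions might look like this. The VLAN range, priority value, interface name and MAC address limit are all assumptions chosen for the example. The spanning-tree vlan form applies to PVST+ or Rapid PVST+; with MST the equivalent would be spanning-tree mst 0 priority 4096.

```
! On core 1: make it the STP root (core 2 could use priority 8192)
configure terminal
 spanning-tree vlan 1-4094 priority 4096
end
!
! On an access switch: port security on a user-facing port
configure terminal
 interface GigabitEthernet1/0/1
  switchport mode access
  switchport port-security
  ! Allow at most 2 MAC addresses; drop and log violations
  switchport port-security maximum 2
  switchport port-security violation restrict
end
```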

Is it possible that a loop was introduced into your network by an end user?

I was unfortunate enough once to have a really big outage related to a loop, and I also wasn't able to isolate the problem or access any devices to troubleshoot.

So what I did was manually disconnect, from the core switches, each interconnect to my distribution closets, while monitoring the network via ICMP.

My thinking was that when the loop was broken, ICMP would return, and I could then trace the root cause via the interconnect that had broken the loop.

This did indeed work for me that time, and I was fortunate that it isolated the offending unmanaged network device causing the outage. That was on a 3Com network, and none of the precautions mentioned above were in place.
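
On Cisco kit, the same elimination process could be sketched like this from a core switch console, with a continuous ping to an affected device running from a separate workstation. The interface names are placeholders, not the actual uplinks in your network.

```
configure terminal
 ! Step 1: disable the first closet uplink, then watch the ping
 interface TenGigabitEthernet1/1
  shutdown
 ! No change in the ping: restore this uplink and try the next one
  no shutdown
 interface TenGigabitEthernet1/2
  shutdown
 ! Ping replies return: the loop is behind this uplink, so leave it
 ! down and investigate the switch and devices connected to it
end
```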

res

paul

Kind Regards
Paul

Madhukrishnan Gopinathan Nair · ‎12-23-2014

Dear Carl,

This does seem to be due to traffic hitting the CPU.

Please send me the output of the following from the core switch.

sh ver

sh mod

sh proc cpu sorted | ex 0.00%

Thanks,

M