Solved: SG500-52 Random CPU spikes and packet loss

HQ_Establishment · ‎11-07-2016

We are facing a problem at an end-user site (a school). The symptom is intermittent connectivity: packet loss occurs randomly, and pings between VLANs drop every so often (3 or 4 "request timed out" replies in a row). At the same time the CPU spikes to 80-100%, especially during school working hours, and then the ping returns to normal. Within the same VLAN (L2 switching) there is no problem and everything works properly.

The core switch is a stack of two SG500-52 (52-port) switches in Native Stacking mode, with a 1G stack connection. The firmware was downgraded from 1.4.5 to 1.4.2 as part of testing multiple firmware versions, but the same problem still appeared.

The core switch (the SG500 stack) has 19 uplinks: 17 of them are link aggregations between the core switch and the edge switches, and 2 are single, non-aggregated links from the core to an edge switch.

Access lists are configured on the switch, but the problem was occurring before the access lists were implemented.

There are 17 VLANs, with DHCP relay enabled on 15 of them. The DHCP server role runs on Windows Server 2008.

Switch                     Location             IP Address
Core Cisco (SG500-52) * 2  Main Cabinet         192.168.101.1
Cisco (SG500-28)           Server Cabinet       192.168.101.2
HP (1810-24)               Main Cabinet         192.168.101.3
Cisco (SG300-28P)          DT Lab               192.168.101.4
Cisco (SG300-28P)          S Lab 1              192.168.101.5
Cisco (SG300-28P)          S Lab 2              192.168.101.6
Cisco (SG300-28P)          S Lab 3              192.168.101.7
Cisco (SG300-28P)          J Lab 1              192.168.101.8
Cisco (SG300-28P)          J Lab 2              192.168.101.9
Cisco (SG300-28P)          LSD                  192.168.101.10
Cisco (SG300-28P)          S SLab               192.168.101.20
Cisco (SG300-52P)          S1                   192.168.101.21
HP (1810-24)               S2                   192.168.101.22
HP (1810-24)               S3                   192.168.101.23
Cisco (SG300-28P)          Meeting              192.168.101.24
HP (1810-24)               Senior English Room  192.168.101.25
HP (1810-48)               N3                   192.168.101.30
HP (1810-48)               N4                   192.168.101.31
Cisco (SG300-52P)          N5                   192.168.101.32
Cisco (SG300-52P)          KG                   192.168.101.40
Cisco (SG300-28P)          GYM 1.1 B            192.168.101.50
Cisco (SG300-52P)          GYM 1.2 B            192.168.101.51
Cisco (SG300-28P)          GYM 1st              192.168.101.52
Cisco (SG300-52P)          GYM 2nd              192.168.101.53
Cisco (SG300-52P)          GYM 3rd              192.168.101.54

All in all, I'm trying to find the source of the disconnection problem and the sudden CPU spikes.

Could you please help me solve this problem?

Best regards

Michal Bruncko · ‎11-21-2016

Those CPU spikes can be related to occasional switching loops/storms in your network. I assume that your SG500 stack is the only routing device in the whole switching environment, right?

Check the STP configuration. Your SG500 stack should be configured explicitly as the root bridge. You should have RSTP enabled (on all switches). Or are you using an MST configuration?
Enable BPDU guard on all access ports toward end users (additionally, you can configure err-recovery).
Configure root guard on all SG500 downlink ports.
Enable STP loop protection and also the loop-protect features, so that loops do not stay active in the network.
Check that all EtherChannels are correctly bundled on both sides of the connection (always use LACP for bundling links together).
Configure a single system as the syslog server and send the logs from all switches there, so you can build a chronology of what is happening in the network.
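
As a rough illustration of the last point, here is a minimal Python sketch of a central collector that tags each syslog datagram with its source switch and receive time and returns the messages in chronological order. It is only a sketch under assumptions: the function names, the port 5514 and the message limit are made up for this example, and a real deployment would more likely use an existing syslog daemon.

```python
import re
import socket
from datetime import datetime, timezone

# RFC 3164 messages start with "<PRI>", where PRI = facility * 8 + severity.
PRI_RE = re.compile(r"^<(\d{1,3})>")


def decode_pri(message):
    """Return (facility, severity) from a syslog message, or None if no PRI."""
    match = PRI_RE.match(message)
    if match is None:
        return None
    return divmod(int(match.group(1)), 8)


def build_timeline(events):
    """Sort (received_at, source_ip, message) tuples into chronological order."""
    return sorted(events, key=lambda event: event[0])


def collect(bind_addr="0.0.0.0", port=5514, max_messages=100):
    """Receive UDP syslog datagrams and return them as a timeline.

    Port 5514 is an arbitrary unprivileged choice for this sketch (assumption);
    syslog normally uses UDP 514, so the switches must be pointed at this port
    or the collector must be allowed to bind to 514.
    """
    events = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, port))
        while len(events) < max_messages:
            data, (source_ip, _source_port) = sock.recvfrom(4096)
            received_at = datetime.now(timezone.utc)
            text = data.decode("utf-8", "replace").strip()
            events.append((received_at, source_ip, text))
    return build_timeline(events)
```

Calling `collect()` blocks until `max_messages` datagrams have arrived. Stamping the messages on the collector avoids depending on each switch having a correct clock.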

HQ_Establishment · ‎11-22-2016

BPDU guard has been enabled on all access ports and root guard on all core switch downlink ports, and the network is now stable. I will wait a couple of days and update you.

I appreciate your advice.

HQ_Establishment · ‎12-04-2016

Hi Michal,

I would like to thank you for your help; you pinpointed the problem at the first shot.

One extra question, if I may.

We have had syslog enabled on all switches from day 1. I did not notice major or repeated topology changes on the syslog server, and I certainly did not find any log of a root bridge re-election. So is there any article or explanation of why root guard was of such use?

Best regards and many thanks again.

Michal Bruncko · ‎12-04-2016

Hi

I am glad that it helped.

Regarding your question, there are two possible explanations:

Spanning-tree changes are not shown in the syslog output on small business switches. For example, on enterprise-class switches you have to enable that explicitly with the command "logging event spanning-tree status". The only way to see some STP "history" is the "show spanning-tree" command, with output like: "Number of topology changes 276 last change occurred 124:53:06 ago".
The second explanation would be that root guard is not the killer feature that helped in your network, but the other one, BPDU guard, which blocks all ports with rogue switches connected that could cause loops. Anyway, if root guard really did help, then you probably hit a situation where some of your core switch downlinks went into the "root-inconsistent" state (due to receipt of a BPDU with a better priority, trying to take over the root bridge role). That means the downlink is put into a disabled state, which blocks the whole part of the network behind that downlink.
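
Since the topology-change counter from "show spanning-tree" may be the only STP history available, one option is to capture that output periodically and compare the counters. Below is a small Python sketch that parses the line quoted above. The function names are made up for this example, and it assumes the output line has exactly the quoted format, which may differ between models and firmware versions.

```python
import re
from datetime import timedelta

# Matches the line quoted above, e.g.
# "Number of topology changes 276 last change occurred 124:53:06 ago"
TOPOLOGY_RE = re.compile(
    r"Number of topology changes\s+(\d+)\s+"
    r"last change occurred\s+(\d+):(\d{2}):(\d{2})\s+ago"
)


def parse_topology_changes(show_output):
    """Return (change_count, time_since_last_change), or None if not found."""
    match = TOPOLOGY_RE.search(show_output)
    if match is None:
        return None
    count, hours, minutes, seconds = (int(group) for group in match.groups())
    return count, timedelta(hours=hours, minutes=minutes, seconds=seconds)


def changes_between(earlier_output, later_output):
    """Return how many topology changes happened between two captures."""
    earlier = parse_topology_changes(earlier_output)
    later = parse_topology_changes(later_output)
    if earlier is None or later is None:
        return None
    return later[0] - earlier[0]
```

A counter that keeps climbing during school hours while the CPU spikes would point toward STP instability rather than a routing problem.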

Anyway, all events resulting from BPDU guard or root guard activity are logged, so you should see them appearing in the syslog output.
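
To make the root guard behaviour more concrete, here is a simplified Python sketch of the decision it takes on a protected port. In STP the bridge with the numerically lowest bridge ID (priority first, then MAC address) wins the root election, and root guard reacts when a protected port receives a BPDU advertising a better root than the current one. The function names, priorities and MAC addresses below are illustrative assumptions; real STP compares a longer priority vector (root ID, path cost, sender bridge ID, port ID), so treat this as a model of the idea only.

```python
def bridge_id(priority, mac):
    """Return a comparable bridge ID as (priority, MAC as integer); lower wins."""
    return (priority, int(mac.replace(":", "").replace("-", ""), 16))


def is_superior(advertised_root, current_root):
    """True if the advertised root would win the election over the current root."""
    return advertised_root < current_root


def root_guard_verdict(advertised_root, current_root):
    """Return the state root guard gives a protected port for a received BPDU."""
    if is_superior(advertised_root, current_root):
        # A downstream device is claiming a better root: block the port.
        return "root-inconsistent"
    return "unchanged"


# Hypothetical values: the core stack is configured with a low priority,
# an edge switch keeps the default 32768, and a rogue device advertises 0.
core = bridge_id(4096, "00:00:5e:00:53:01")
edge = bridge_id(32768, "00:00:5e:00:53:10")
rogue = bridge_id(0, "00:00:5e:00:53:ff")

print(root_guard_verdict(edge, core))   # edge switch cannot win the election
print(root_guard_verdict(rogue, core))  # rogue device would take over as root
```

Without root guard, the rogue device in this model would simply become the root bridge and the topology would reconverge around it, which is consistent with the advice to configure the core stack explicitly as root.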

SG500-52 Random CPU spikes and packet loss