05-27-2015 04:11 AM - edited 03-08-2019 12:12 AM
Hi,
We have a large network with about 50 edge switches (C3750X and C3560) and two C6509 core switches. STP is set up. Each of our cabinets (a switch or a stack of switches) has 2 links to each core.
A couple of days ago we had a broadcast storm that drove a couple of switches to 99% CPU usage and the core switches to 50% CPU usage. While we were investigating and trying to find the problem, our backbone stack of four 3750X switches, which carries all comms between servers, SAN, etc., went up to 99% CPU usage and most of the network became unresponsive.
We tested the uplinks on this stack and then restarted the stack, with no result. At this point the core switches went up to 99% and became unresponsive as well. We had the core switches restarted, after which the systems were accessible. We then started fault finding, turning off one cabinet at a time to pinpoint the source of the ARP requests that were flooding the core.
By turning off links we then found the stack of four 3750X switches that was causing this issue. We turned on one link at a time and found that having both links on causes the issue to happen again. We have since swapped all cables and SFPs on this stack and the core and tested again, with the same results: having these 2 links on at the same time causes our backbone switch to go up to 99% CPU usage and the core switches to go up to 50% CPU usage. Eventually the cores become unresponsive if left like this.
Now, while troubleshooting at the beginning we had unplugged 1 link from the SERVERBACKBONE switch to the pri-core switch. We hadn't plugged it back in until yesterday. When I did, the CPU usage of the SERVERBACKBONE switch went to 99% and we saw ARP requests coming in with the MAC address of the root bridge. We enabled ARP debug and then terminal monitor on the SERVERBACKBONE to view this. Here are some logs:
SERVERBACKBONE#sh proc cpu sorted
CPU utilization for five seconds: 98%/6%; one minute: 98%; five minutes: 99%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
234 246760662 652863904 377 25.32% 25.68% 25.57% 0 HULC DAI Process
158 358036994 1223826022 292 23.24% 23.19% 23.35% 0 Hulc LED Process
243 368036878 823215038 447 15.86% 15.53% 15.44% 0 IP Host Track Pr
80 3706752493 627459056 5907 4.64% 4.45% 4.55% 0 RedEarth Tx Mana
79 1243780293 1046208607 1188 4.16% 3.88% 3.96% 0 RedEarth I2C dri
122 801262125 49086849 16323 2.40% 2.13% 2.15% 0 hpm counter proc
100 62847439 1249279689 50 1.76% 2.16% 2.19% 0 HLFM address lea
188 129231690 399861229 323 1.12% 0.76% 0.71% 0 Auth Manager
4 943455916 41214178 22891 0.96% 0.84% 0.80% 0 Check heaps
227 260103995 610366929 426 0.80% 1.09% 1.13% 0 Spanning Tree
236 129781056 489282578 265 0.80% 0.58% 0.56% 0 HRPC ip device t
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
1 1142277 15160342 75 0.80% 0.27% 0.25% 0 Chunk Manager
150 11656436 241612154 48 0.64% 0.58% 0.58% 0 Hulc Storm Contr
38 43247367 2617923 16519 0.64% 0.41% 0.12% 0 crypto sw pk pro
89 219361151 628845123 348 0.64% 0.73% 0.78% 0 hrpc <- response
118 89165092 983004879 90 0.48% 0.64% 0.59% 0 hpm main process
239 51429378 48581317 1058 0.48% 0.46% 0.47% 0 PI MATM Aging Pr
298 36570618 107544262 340 0.48% 0.17% 0.19% 0 Marvell wk-a Pow
172 188017913 9765876 19252 0.48% 0.42% 0.42% 0 HQM Stack Proces
173 167841714 58364536 2875 0.32% 0.33% 0.32% 0 HRPC qos request
240 16193895 500821238 32 0.32% 0.34% 0.30% 0 UDLD
297 47265954 135448215 348 0.16% 0.16% 0.15% 0 Inline Power
I checked the logs on our SolarWinds monitor and this is what started around that time:
22/05/2015 09:38:54 | hh_cab06_sw1 | Warning | 991325: Host c8f9.f958.c000 in vlan 910 is flapping between port Gi3/1/1 and port Gi1/1/1 | |
22/05/2015 09:38:52 | hh_cab06_sw1 | Warning | 991324: Host c8f9.f958.c000 in vlan 900 is flapping between port Gi1/1/1 and port Gi3/1/1 | |
22/05/2015 09:38:52 | hh_cab06_sw1 | Warning | 991323: Host c8f9.f958.c000 in vlan 900 is flapping between port Gi1/1/1 and port Gi3/1/1 | |
22/05/2015 09:38:51 | hh_cab06_sw1 | Warning | 991321: Host c8f9.f958.c000 in vlan 900 is flapping between port Gi1/1/1 and port Gi3/1/1 | |
22/05/2015 09:38:49 | hh_cab06_sw1 | Warning | 991316: Host c8f9.f958.c000 in vlan 910 is flapping between port Gi1/1/1 and port Gi3/1/1 | |
22/05/2015 09:38:49 | hh_cab06_sw1 | Warning | 991317: Host c8f9.f958.c000 in vlan 902 is flapping between port Gi3/1/1 and port Gi1/1/1 | |
22/05/2015 09:38:48 | hh_cab23b_sw1 | Warning | 22743: Host 00c0.b767.6b64 in vlan 902 is flapping between port Fa0/24 and port Gi0/1 | |
22/05/2015 09:38:46 | hh_cab23_sw1 | Warning | 187822: Host 00c0.b767.6b64 in vlan 902 is flapping between port Gi1/1/1 and port Gi1/1/4 | |
22/05/2015 09:38:46 | hh_cab23b_sw1 | Warning | 22742: Host 00c0.b767.6b64 in vlan 902 is flapping between port Gi0/1 and port Fa0/24 | |
22/05/2015 09:38:46 | hh_cab23_sw2 | Warning | 16432: Host 00c0.b767.6b64 in vlan 902 is flapping between port Gi0/4 and port Gi0/2 | |
22/05/2015 09:38:45 | hh_cab06_sw1 | Warning | 991313: Host c8f9.f958.c000 in vlan 900 is flapping between port Gi1/1/1 and port Gi3/1/1 | |
22/05/2015 09:38:45 | hh_cab06_sw1 | Warning | 991312: Host 00c0.b767.6b64 in vlan 902 is flapping between port Gi1/1/1 and port Gi3/1/1 | |
22/05/2015 09:38:43 | hh_cab06_sw1 | Warning | 991309: Host c8f9.f958.c000 in vlan 900 is flapping between port Gi1/1/1 and port Gi3/1/1 | |
22/05/2015 09:38:43 | hh_cab06_sw1 | Warning | 991308: Host c8f9.f958.c000 in vlan 900 is flapping between port Gi1/1/1 and port Gi3/1/1 | |
22/05/2015 09:38:42 | hh_cab06_sw1 | Warning | 991306: Host c8f9.f958.c000 in vlan 900 is flapping between port Gi1/1/1 and port Gi3/1/1 | |
22/05/2015 09:38:35 | hh_cab06_sw1 | Warning | 991302: Host c8f9.f958.c000 in vlan 910 is flapping between port Gi3/1/1 and port Gi1/1/1 |
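A quick way to see which hosts and port pairs dominate a log export like the one above is to tally the flap messages. The following is only a minimal sketch in Python: the regular expression is an assumption based on the line layout shown here, and `tally_flaps` is a hypothetical helper name, not part of any monitoring tool. The same MAC being learned alternately on two uplinks of one switch is typically what a Layer 2 loop looks like, so the port pairs with the highest counts are a reasonable place to start.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> | <switch> | <severity> | <seq>: Host <mac> in vlan <id>
# is flapping between port <a> and port <b> | |", as in the export above.
FLAP_RE = re.compile(
    r"\|\s*(?P<switch>\S+)\s*\|.*?Host (?P<mac>[0-9a-f.]+) in vlan (?P<vlan>\d+) "
    r"is flapping between port (?P<a>\S+) and port (?P<b>\S+)"
)

def tally_flaps(lines):
    """Count flap events per (switch, MAC, unordered port pair)."""
    counts = Counter()
    for line in lines:
        m = FLAP_RE.search(line)
        if m:
            # Sort the two ports so "A and B" and "B and A" count as the same pair.
            ports = tuple(sorted((m["a"], m["b"])))
            counts[(m["switch"], m["mac"], ports)] += 1
    return counts

# Three lines copied from the log above.
sample = [
    "22/05/2015 09:38:54 | hh_cab06_sw1 | Warning | 991325: Host c8f9.f958.c000 in vlan 910 is flapping between port Gi3/1/1 and port Gi1/1/1 | |",
    "22/05/2015 09:38:52 | hh_cab06_sw1 | Warning | 991324: Host c8f9.f958.c000 in vlan 900 is flapping between port Gi1/1/1 and port Gi3/1/1 | |",
    "22/05/2015 09:38:48 | hh_cab23b_sw1 | Warning | 22743: Host 00c0.b767.6b64 in vlan 902 is flapping between port Fa0/24 and port Gi0/1 | |",
]

counts = tally_flaps(sample)
for (switch, mac, ports), n in counts.most_common():
    print(switch, mac, ports, n)
```

Fed the full export, this groups the messages by switch, so it becomes easier to see whether one MAC (here c8f9.f958.c000 on hh_cab06_sw1) is bouncing between exactly the two uplinks that were re-enabled.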
So, I have been trying to find the root cause of this problem. We do need to turn the secondary link back on for the cabinets where it started: CAB06 and SERVERBACKBONE, where we first saw the CPU usage going high.
Any ideas, guys?
Regards,
Sheikh
05-27-2015 04:37 AM
Hi Sheikh,
Could you provide a layout of your configuration?
This sounds a bit like a loop.
Regards,
Markus
05-27-2015 06:23 AM
05-27-2015 06:27 AM
Are Core Switch 1 and 2 in a VSS configuration, or are they individual switches?
If they are in VSS mode, the uplinks from Cab06 must be configured as a trunk.
I saw you have a lot of spanning-tree manipulations. What is the reason for them?
For this environment you should only make sure Core 1 and 2 are root; the rest can stay at the defaults, from my point of view.
You should only tune spanning-tree if you 100% understand what you're doing and why, otherwise you may easily break the network with it.
Regards,
Markus
05-27-2015 07:58 AM
Hi Markus,
Core1 is the root and Core2 is standby. They are individual switches. Removing STP will probably require downtime, which we can't afford at the moment. I would need a change control for that.
It is most likely an STP issue, but the question is how to be sure and how to find the root cause/loop.
Regards,
Sheikh
05-27-2015 08:06 AM
OK, that's good.
Do you run Rapid PVST+?
Please check the status for all VLANs involved.
Also check spanning-tree on the other switches for any inconsistency.
05-27-2015 11:01 PM
I would go back to a basic config and make sure everything works as expected.
Standard Rapid PVST+ config with Core 1 and 2 as root.
If that works fine, you can start adding spanning-tree configurations step by step.
Or you troubleshoot the current environment. But as far as I understand, this is a production environment and you cannot just re-create the problem and troubleshoot, correct?
Regards,
Markus
05-27-2015 07:42 AM
OK, I think I understand the topology.
The uplinks are ether-channels, correct? So you have 4 uplinks in total, 2 to each core switch?
Are you 100% sure that Core 1 and Core 2 are the elected spanning-tree root?
Could you please verify this? (Run show spanning-tree; you should see "This bridge is the root".)
If you're not 100% sure about your spanning-tree config, root guard, etc., please remove all of it and check whether it works or not.
I am almost certain it is a spanning-tree issue.
Regards,
Markus
05-27-2015 05:24 AM
Can you provide the config for the ports on both the 3750 stack side (i.e. the ones you narrowed the problem down to) and the switch they connect to?
05-27-2015 05:59 AM
Hi,
Here are the relevant configs.
This is the switch that gets hammered when the link on cab06 is turned on:
SERVERBACKBONE#sh cdp neigh
Capability Codes: R - Router, T - Trans Bridge, B - Source Route Bridge
S - Switch, H - Host, I - IGMP, r - Repeater, P - Phone,
D - Remote, C - CVTA, M - Two-port Mac Relay
Device ID Local Intrfce Holdtme Capability Platform Port ID
THHCORE2.hilldomain.thh.nhs.uk
Gig 3/1/2 158 R S I WS-C6509- Gig 9/4
THHCORE2.hilldomain.thh.nhs.uk
Gig 3/1/1 165 R S I WS-C6509- Gig 9/3
SERVERBACKBONE#sh run int gi 3/1/1
Building configuration...
Current configuration : 329 bytes
!
interface GigabitEthernet3/1/1
description THHCORE2
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 42,43,53-57,200,300,607,620,900,902,905,910
switchport trunk allowed vlan add 930-933,935,936,940,950-953,990,995,996
switchport mode trunk
spanning-tree guard loop
channel-group 2 mode desirable
end
SERVERBACKBONE#sh run int gi 3/1/2
Building configuration...
Current configuration : 329 bytes
!
interface GigabitEthernet3/1/2
description THHCORE2
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 42,43,53-57,200,300,607,620,900,902,905,910
switchport trunk allowed vlan add 930-933,935,936,940,950-953,990,995,996
switchport mode trunk
spanning-tree guard loop
channel-group 2 mode desirable
end
SERVERBACKBONE#sh run int gi 1/1/1
Building configuration...
Current configuration : 339 bytes
!
interface GigabitEthernet1/1/1
description THHCORE1
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 42,43,53-57,200,300,607,620,900,902,905,910
switchport trunk allowed vlan add 930-933,935,936,940,950-953,990,995,996
switchport mode trunk
shutdown
spanning-tree guard loop
channel-group 1 mode desirable
end
SERVERBACKBONE#sh run int gi 1/1/2
Building configuration...
Current configuration : 339 bytes
!
interface GigabitEthernet1/1/2
description THHCORE1
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 42,43,53-57,200,300,607,620,900,902,905,910
switchport trunk allowed vlan add 930-933,935,936,940,950-953,990,995,996
switchport mode trunk
shutdown
spanning-tree guard loop
channel-group 1 mode desirable
end
SERVERBACKBONE#
SERVERBACKBONE#
SERVERBACKBONE#
SERVERBACKBONE#
SERVERBACKBONE#sh run int po1
Building configuration...
Current configuration : 287 bytes
!
interface Port-channel1
description THHCORE1 ETHERCHANNEL
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 42,43,53-57,200,300,607,620,900,902,905,910
switchport trunk allowed vlan add 930-933,935,936,940,950-953,990,995,996
switchport mode trunk
shutdown
end
SERVERBACKBONE#sh run int po2
Building configuration...
Current configuration : 301 bytes
!
interface Port-channel2
description THHCORE2 ETHERCHANNEL
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 42,43,53-57,200,300,607,620,900,902,905,910
switchport trunk allowed vlan add 930-933,935,936,940,950-953,990,995,996
switchport mode trunk
spanning-tree cost 200
end
SERVERBACKBONE#
This is the cab whose links we turned off to stabilise the network:
HH_CAB06_SW1#sh cdp neigh
Capability Codes: R - Router, T - Trans Bridge, B - Source Route Bridge
S - Switch, H - Host, I - IGMP, r - Repeater, P - Phone,
D - Remote, C - CVTA, M - Two-port Mac Relay
Device ID Local Intrfce Holdtme Capability Platform Port ID
THHCORE2.hilldomain.thh.nhs.uk
Gig 1/1/1 154 R S I WS-C6509- Gig 7/7
HH_CAB06_SW1#sh run int gi 1/1/1
Building configuration...
Current configuration : 176 bytes
!
interface GigabitEthernet1/1/1
description THHCORE2
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 106,144,900,902,910,996
switchport mode trunk
end
HH_CAB06_SW1#sh run int gi 1/1/2
Building configuration...
Current configuration : 180 bytes
!
interface GigabitEthernet1/1/2
description UPLINK
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 106,900,902,910,996
switchport mode trunk
shutdown
end
HH_CAB06_SW1#sh run int gi 3/1/1
Building configuration...
Current configuration : 186 bytes
!
interface GigabitEthernet3/1/1
description THHCORE1
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 106,144,900,902,910,996
switchport mode trunk
shutdown
end
HH_CAB06_SW1#sh run int gi 3/1/2
Building configuration...
Current configuration : 180 bytes
!
interface GigabitEthernet3/1/2
description UPLINK
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 106,900,902,910,996
switchport mode trunk
shutdown
end
HH_CAB06_SW1#
These are the links on the core switches:
THHCORE1#sh run int po2
Building configuration...
Current configuration : 321 bytes
!
interface Port-channel2
description SERVERBACKBONE ETHERCHANNEL
switchport
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 42,43,53-57,200,300,607,620,900,902,905,910
switchport trunk allowed vlan add 930-933,935,936,940,950-953,990,995,996
switchport mode trunk
spanning-tree guard root
end
THHCORE1#sh run int gi 3/8
Building configuration...
Current configuration : 716 bytes
!
interface GigabitEthernet3/8
description ##Connects to S06SW1##
switchport
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 106,144,900,902,910,996
wrr-queue bandwidth 30 70
wrr-queue queue-limit 40 30
wrr-queue random-detect min-threshold 1 40 80
wrr-queue random-detect min-threshold 2 70 80
wrr-queue random-detect max-threshold 1 80 100
wrr-queue random-detect max-threshold 2 80 100
wrr-queue cos-map 1 1 1
wrr-queue cos-map 1 2 0
wrr-queue cos-map 2 1 2 3 4
wrr-queue cos-map 2 2 6 7
mls qos trust dscp
storm-control broadcast level 50.00
storm-control multicast level 50.00
storm-control action trap
spanning-tree guard root
service-policy input SCANNER
end
THHCORE1#sh int status
Gi3/8 ##Connects to S06S notconnect 1 full 1000 1000BaseSX
Gi9/2 SERVERBACKBONE notconnect 1 full 1000 1000BaseSX
Gi9/3 SERVERBACKBONE notconnect 1 full 1000 1000BaseSX
Po2 SERVERBACKBONE ETH notconnect 1 auto auto
THHCORE1#
THHCORE2#sh int status
Gi3/6 ##Connects to S06S notconnect 1 full 1000 1000BaseSX
Gi7/7 HH_CAB06_SW1 - 1/1 connected trunk full 1000 1000BaseSX
Gi9/3 SERVERBACKBONE connected trunk full 1000 1000BaseSX
Gi9/4 SERVERBACKBONE connected trunk full 1000 1000BaseSX
Po2 SERVERBACKBONE ETH connected trunk a-full a-1000
THHCORE2# sh run int po2
Building configuration...
Current configuration : 341 bytes
!
interface Port-channel2
description SERVERBACKBONE ETHERCHANNEL
switchport
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 42,43,53-57,200,300,607,620,900,902,905,910
switchport trunk allowed vlan add 930-933,935,936,940,950-953,990,995,996
switchport mode trunk
mls qos trust dscp
spanning-tree guard root
end
THHCORE2# sh run int gi 9/3
Building configuration...
Current configuration : 339 bytes
!
interface GigabitEthernet9/3
description SERVERBACKBONE
switchport
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 42,43,53-57,200,300,607,620,900,902,905,910
switchport trunk allowed vlan add 930-933,935,936,940,950-953,990,995,996
switchport mode trunk
mls qos trust dscp
channel-group 2 mode desirable
end
THHCORE2# sh run int gi 9/4
Building configuration...
Current configuration : 339 bytes
!
interface GigabitEthernet9/4
description SERVERBACKBONE
switchport
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 42,43,53-57,200,300,607,620,900,902,905,910
switchport trunk allowed vlan add 930-933,935,936,940,950-953,990,995,996
switchport mode trunk
mls qos trust dscp
channel-group 2 mode desirable
end
THHCORE2#sh run int gi 7/7
Building configuration...
Current configuration : 684 bytes
!
interface GigabitEthernet7/7
description HH_CAB06_SW1 - 1/1/1
switchport
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 106,144,900,902,910,996
switchport mode trunk
wrr-queue bandwidth 30 70
wrr-queue queue-limit 40 30
wrr-queue random-detect min-threshold 1 40 80
wrr-queue random-detect min-threshold 2 70 80
wrr-queue random-detect max-threshold 1 80 100
wrr-queue random-detect max-threshold 2 80 100
wrr-queue cos-map 1 1 1
wrr-queue cos-map 1 2 0
wrr-queue cos-map 2 1 2 3 4
wrr-queue cos-map 2 2 6 7
mls qos trust dscp
storm-control broadcast level 50.00
storm-control multicast level 50.00
service-policy input SCANNER
end
Let me know if you need anything more.
Many thanks for your help.
Regards,
Sheikh
05-27-2015 06:06 AM
It is a bit difficult to understand all of it without a picture.
But it is possible that you're missing an ether-channel configuration on HH_CAB06_SW1.
If you have more than one physical link between two devices or a VSS cluster, you need to aggregate them into a port channel.
As far as I can see, your uplinks are not part of a port-channel.
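The aggregation rule above can be checked mechanically from a link inventory: group links by the pair of devices they connect and flag any pair with more than one link. This is only a sketch in Python. The `links` list is hand-transcribed from the CDP output and interface descriptions posted earlier, `parallel_links` is a hypothetical helper, and each core is treated as a separate device (which would not hold if the cores were a VSS pair).

```python
from collections import defaultdict

# (local switch, local port, neighbour) taken from the posted CDP output and
# interface descriptions. Ports described only as "UPLINK" are left out
# because their far end is not stated.
links = [
    ("SERVERBACKBONE", "Gi1/1/1", "THHCORE1"),
    ("SERVERBACKBONE", "Gi1/1/2", "THHCORE1"),
    ("SERVERBACKBONE", "Gi3/1/1", "THHCORE2"),
    ("SERVERBACKBONE", "Gi3/1/2", "THHCORE2"),
    ("HH_CAB06_SW1", "Gi1/1/1", "THHCORE2"),
    ("HH_CAB06_SW1", "Gi3/1/1", "THHCORE1"),
]

def parallel_links(links):
    """Return device pairs joined by more than one physical link."""
    by_pair = defaultdict(list)
    for local, port, neighbour in links:
        by_pair[(local, neighbour)].append(port)
    return {pair: ports for pair, ports in by_pair.items() if len(ports) > 1}

result = parallel_links(links)
for (local, neighbour), ports in result.items():
    print(f"{local} -> {neighbour}: {ports} should be bundled in a port-channel")
```

With this inventory only the SERVERBACKBONE pairs are flagged, and the posted configs already place those ports in channel-group 1 and 2. Under the separate-device assumption, HH_CAB06_SW1 has one link per core, so those two uplinks are left to spanning tree rather than to a port-channel.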
05-27-2015 06:19 AM
I think a topology diagram is needed; I can't work it out from the above configuration.
Are your Core switches a VSS pair or individual units connected together via an Etherchannel?
05-27-2015 06:27 AM
Hi,
They are individual C6509s with a fibre EtherChannel between them.
Cheers,
Sheikh
05-27-2015 06:26 AM
Hi Gents,
There is a port channel between the SERVERBACKBONE and the core switches. But the links between cab06 and the core switches are not port channels. They are single fibre links to each core, as with any other cabinet in our organization.
Also, the links between the cores and the SERVERBACKBONE are fibre.
I agree with you guys that it may be a loop. How can we confirm this, please?
Regards,
Sheikh