Catalyst 3750X stack - High CPU and CPUHOG by hpm main process when links flapping

Jevgeni Rõžov
Level 1

Hello, 

It looks like I am hitting bug CSCtl04815. I have narrowed the setup down to the following: two 3750X switches in a stack and one Dell R620 server connected to it. No L3 magic is being run on the stack.

The problem appears when all of the server interfaces flap. This is actually an expected situation, which might happen when the server gets rebooted, for instance. When it occurs, all other CPU-processed tasks get throttled (including LACP keepalives, etc.).

The server is connected to the stack via 4 cables, two per stack member. The switch stack is an ex-member (client) of a VTP domain, but is now completely isolated from the rest of the infrastructure, aside from the connected Fa0 interface. The stack is currently running IOS 12.2(55)SE5, but the same behavior is observed on a similar stack with 15.2(2)E4.

Limiting the allowed VLANs on the trunk ports (switchport trunk allowed vlan &lt;blah&gt;) does have a positive impact, but this is pretty much out of the question because it would leave me with a huge amount of overhead when adding new VLANs or managing existing downlinks to servers. Hard-coding speed or duplex settings slightly improves the situation at best, but does not eliminate the problem.
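For illustration, the VLAN-limiting workaround mentioned above would look something like this on each server downlink (the interface name and VLAN list here are placeholders):

```
interface GigabitEthernet1/0/13
 switchport trunk allowed vlan 100,110,120
```

The operational cost is exactly the overhead described: every new VLAN would have to be added to the allowed list on every server trunk by hand.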

Could anyone please help me debug this issue?

Relevant stack configuration:

lab-ns01#show version
<truncated>
Switch Ports Model SW Version SW Image
------ ----- ----- ---------- ----------
* 1 54 WS-C3750X-48 12.2(55)SE5 C3750E-UNIVERSALK9-M
2 54 WS-C3750X-48 12.2(55)SE5 C3750E-UNIVERSALK9-M
</truncated>
lab-ns01#sho lic
Index 1 Feature: ipservices
Period left: 8 weeks 4 days
License Type: Evaluation
License State: Active, Not in Use, EULA not accepted
License Priority: None
License Count: Non-Counted

Index 2 Feature: ipbase
Period left: Life time
License Type: Permanent
License State: Active, In Use
License Priority: Medium
License Count: Non-Counted

Index 3 Feature: lanbase
Period left: 0 minute 0 second

lab-ns01#sho vtp status
VTP Version capable : 1 to 3
VTP version running : 3
VTP Domain Name : <somevtpdomain>
VTP Pruning Mode : Disabled
VTP Traps Generation : Disabled
Device ID : <stack id>

Feature VLAN:
--------------
VTP Operating Mode : Client
Number of existing VLANs : 256
Number of existing extended VLANs : 239
Configuration Revision : 530
Primary ID : <vtp server id>
Primary Description : <vtp server>
MD5 digest : <md5hash>

Feature MST:
--------------
VTP Operating Mode : Client
Configuration Revision : 0
Primary ID : 0000.0000.0000
Primary Description :
MD5 digest : 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Feature UNKNOWN:
--------------
VTP Operating Mode : Transparent

lab-ns01#sho vlan sum
Number of existing VLANs : 495
Number of existing VTP VLANs : 256
Number of existing extended VTP VLANS : 239

All server interfaces have the following configuration:

switchport trunk encapsulation dot1q
switchport mode trunk
switchport nonegotiate
no vtp
spanning-tree portfast trunk
spanning-tree bpduguard enable

Some logging/debug information:

Feb 19 08:55:36.698: %SYS-3-CPUHOG: Task is running for (2097)msecs, more than (2000)msecs (0/0),process = hpm main process.
-Traceback= 1EB4880 1BD65B8 1BD609C AE9B0 AEE24 1BD6148 1BD609C AE9B0 AEF30 1BDBD2C 1C1DEC4 10BD09C 1C1FA70 1C19B78 1C19E4C 1C0CDB8
Feb 19 08:55:37.461: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1/0/13, changed state to down
Feb 19 08:55:38.090: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1/0/14, changed state to down
Feb 19 08:55:39.567: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1/0/13, changed state to up
Feb 19 08:55:42.897: %SYS-3-CPUHOG: Task is running for (2098)msecs, more than (2000)msecs (3/3),process = hpm main process.
-Traceback= 134C08C 134C164 13444E4 13489E8 1FDC914 17C41C8 17C428C 1981614 197DEB4 197DBD8 1BD62C4 1BD609C AE9B0 AEE24 1BD6150 1BD609C
Feb 19 08:55:43.098: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1/0/13, changed state to down
Feb 19 08:55:44.683: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2/0/14, changed state to down
Feb 19 08:55:46.311: %LINK-3-UPDOWN: Interface GigabitEthernet2/0/14, changed state to down
Feb 19 08:55:46.311: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1/0/13, changed state to up
Feb 19 08:55:47.477: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1/0/14, changed state to up
Feb 19 08:55:48.777: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2/0/13, changed state to down
Feb 19 08:55:52.149: %LINK-3-UPDOWN: Interface GigabitEthernet2/0/14, changed state to up
Feb 19 08:55:52.333: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2/0/13, changed state to up
Feb 19 08:55:53.156: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2/0/14, changed state to up

The following process listing was captured at the exact moment the problem appeared:

lab-ns01#sho proc cpu sort | ex 0.0
CPU utilization for five seconds: 99%/0%; one minute: 50%; five minutes: 30%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
112 541034 693028 780 29.83% 16.29% 6.04% 0 hpm main process
268 132835 812 163589 19.53% 4.06% 1.60% 0 IGMPSN L2MCM
273 123349 801 153993 15.53% 3.61% 1.45% 0 MLDSN L2MCM
281 99034 479 206751 13.99% 3.62% 1.29% 0 SpanTree Helper
302 28656 275269 104 5.69% 0.92% 0.27% 0 PM Callback
156 1427184 1330717 1072 3.07% 2.24% 2.26% 0 Hulc LED Process
238 14863 1796 8275 1.23% 0.32% 0.13% 0 VMATM Callback
116 1039582 118185 8796 1.07% 1.39% 1.57% 0 hpm counter proc
220 7452 2723 2736 1.07% 0.26% 0.09% 0 802.1x switch
230 994391 56672 17546 0.92% 1.40% 1.56% 0 PI MATM Aging Pr
74 1713574 275054 6229 0.61% 2.05% 2.55% 0 RedEarth Tx Mana
73 284677 416352 683 0.61% 0.50% 0.50% 0 RedEarth I2C dri
167 216913 11665 18595 0.30% 0.35% 0.32% 0 HQM Stack Proces
95 6963 5624 1238 0.30% 0.16% 0.07% 0 HRPC hlfm reques
38 9446 56747 166 0.15% 0.04% 0.01% 0 Per-Second Jobs
34 15511 68449 226 0.15% 0.06% 0.02% 0 Net Background
186 19184 10449 1835 0.15% 0.08% 0.04% 0 CDP Protocol
168 88458 46424 1905 0.15% 0.13% 0.13% 0 HRPC qos request
69 4845 11634 416 0.15% 0.02% 0.00% 0 Compute load avg
333 1437 1084 1325 0.15% 0.02% 0.00% 0 SNMP Traps
75 4135 208155 19 0.15% 0.01% 0.00% 0 RedEarth Rx Mana


lab-ns01#show spanning-tree summary
Switch is in mst mode (IEEE Standard)
Root bridge for: MST0-MST1
EtherChannel misconfig guard is enabled
Extended system ID is enabled
Portfast Default is disabled
PortFast BPDU Guard Default is disabled
Portfast BPDU Filter Default is disabled
Loopguard Default is disabled
UplinkFast is disabled
Stack port is StackPort1
BackboneFast is disabled
Configured Pathcost method used is short (Operational value is long)
Name Blocking Listening Learning Forwarding STP Active
---------------------- -------- --------- -------- ---------- ----------
MST0 0 0 0 4 4
MST1 0 0 0 4 4
---------------------- -------- --------- -------- ---------- ----------
2 msts 0 0 0 8 8

Example port statistics taken at the moment the problem appeared:

lab-ns01#sho inter g1/0/14
GigabitEthernet1/0/14 is down, line protocol is down (notconnect)
Hardware is Gigabit Ethernet, address is <macaddress> (bia <macaddress>)
Description: <description>
MTU 9198 bytes, BW 1000000 Kbit, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output 00:00:03, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
135 packets input, 10358 bytes, 0 no buffer
Received 135 broadcasts (135 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 135 multicast, 0 pause input
0 input packets with dribble condition detected
10694 packets output, 1450635 bytes, 0 underruns
0 output errors, 0 collisions, 17 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out

Mark Malone
VIP Alumni

Hey, I don't think that's the exact bug; even though it's very similar, that one applies only to the 3560 platform.

The recommended version for the 3750X from Cisco is 15.0.2-SE9(MD).

That's the safe-harbour image, and I would try it as it's currently the most stable. If that does not work, you will need to open a TAC case to get the exact image the bug is fixed in. Since it's a process/CPU bug, there isn't really any config someone could provide to fix it, unless they have gone through the exact same issue and TAC supplied a workaround; but nine times out of ten, the fix requires the correct software release.
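On a 3750X stack, an upgrade like that is typically done with archive download-sw; a minimal sketch, assuming a reachable TFTP server (the server address and exact tar file name below are placeholders):

```
archive download-sw /overwrite /reload tftp://192.0.2.10/c3750e-universalk9-tar.150-2.SE9.tar
```

The /reload keyword reboots the stack once the image is installed on all members, so plan a maintenance window.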
