
WS-C3560E-12D high cpu utilization

gnijs
Level 4

All,

We have two C3560E-12D switches that are losing OSPF and PIM neighborships all the time. I guess this is because of high CPU. The switch constantly runs at around 40-50%, with the "Adjust Regions" and "hpm main process" processes on top:

#sh process cpu sorted
CPU utilization for five seconds: 53%/1%; one minute: 45%; five minutes: 44%
PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
  62    11625769     28660     405644 15.35% 12.91% 13.23%   0 Adjust Regions  
  93   129026816  41569346       3103 12.95%  9.19%  8.86%   0 hpm main process

I know the C3560E is not the most powerful platform, but it is only running:

6 OSPF neighbors with default hello timers

8 PIM neighbors with 200 ms timers (ip pim query-interval 200 msec)

2 PIM neighbors with default timers

10 HSRP groups with subsecond timers (standby 0 timers msec 200 msec 750)

a routing table of 2000 routes (687796 bytes in total according to sh ip route summary), no VRFs

(a normal access aggregation switch, I would say; the relevant timer configuration is sketched below)

(note: each 10GE interface is equipped with a TwinGig converter to obtain 24 Gig ports)
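
For reference, the timer-related part of the configuration looks roughly like this (a sketch only; the interface names, group number, and addresses are placeholders, not the exact config):

! PIM neighbor with the aggressive query interval (sparse-mode assumed here for illustration)
interface GigabitEthernet0/1
 ip pim sparse-mode
 ip pim query-interval 200 msec
!
! HSRP group with subsecond timers (200 ms hello / 750 ms hold)
interface Vlan100
 standby 0 ip 10.0.0.254
 standby 0 timers msec 200 msec 750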

We have run this same configuration on C3750-12S switches without problems.

* Is there a difference in CPU between the two models?

* The C3560E has twice the memory of the C3750-12S (256 MB vs 128 MB), but that does not seem to help.

* The C3560E runs 12.2(50)SE2.

Note: we are running the sdm prefer routing template.
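
For completeness, this is roughly how that template is selected and verified (a sketch; the exact show sdm prefer output differs per platform and release):

show sdm prefer                 <-- confirms which template is currently active
configure terminal
 sdm prefer routing
 end
reload                          <-- a template change only takes effect after a reload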

regards,

Geert

2 Replies

gnijs
Level 4

Must be a CPU processing issue. We increased the PIM timers back to their defaults and moved the HSRP timers to 1 s hello / 3 s dead, and the problem went away. If we change one single VLAN interface on the box back to subsecond HSRP timers (250 ms hello / 1 s dead), that VLAN starts flapping.
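
In other words, with something like the first block below the group is stable, and with the second it starts flapping (the interface and group number are taken from the debugs further down, purely as an illustration):

! stable: 1 s hello / 3 s hold
interface Vlan203
 standby 7 timers 1 3
!
! flapping: 250 ms hello / 1 s hold
interface Vlan203
 standby 7 timers msec 250 1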

I checked UDLD (OK) and checked for errors on the fibers of this VLAN (also OK).

Debugging shows that one switch simply stops receiving hello packets (at the CPU at least; I think they arrive on the wire but reach the CPU too late):


668379: Jan 27 12:54:30.340 CET: HSRP: Vl203 Grp 7 Hello  in  10.103.135.253 Active  pri 200 vIP 10.103.135.254
668381: Jan 27 12:54:30.382 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.252 Standby pri 100 vIP 10.103.135.254
668386: Jan 27 12:54:30.542 CET: HSRP: Vl203 Grp 7 Hello  in  10.103.135.253 Active  pri 200 vIP 10.103.135.254
668387: Jan 27 12:54:30.567 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.252 Standby pri 100 vIP 10.103.135.254
668390: Jan 27 12:54:30.751 CET: HSRP: Vl203 Grp 7 Hello  in  10.103.135.253 Active  pri 200 vIP 10.103.135.254
668391: Jan 27 12:54:30.760 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.252 Standby pri 100 vIP 10.103.135.254
668394: Jan 27 12:54:31.682 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.252 Standby pri 100 vIP 10.103.135.254
668395: Jan 27 12:54:31.682 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.252 Standby pri 100 vIP 10.103.135.254
668398: Jan 27 12:54:31.682 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.252 Standby pri 100 vIP 10.103.135.254
668403: Jan 27 12:54:31.691 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.252 Standby pri 100 vIP 10.103.135.254
668404: Jan 27 12:54:31.691 CET: HSRP: Vl203 Grp 7 Standby: c/Active timer expired (10.103.135.253)
668405: Jan 27 12:54:31.691 CET: HSRP: Vl203 Grp 7 Active router is local, was 10.103.135.253
668406: Jan 27 12:54:31.691 CET: HSRP: Vl203 Nbr 10.103.135.253 no longer active for group 7 (Standby)
668407: Jan 27 12:54:31.691 CET: HSRP: Vl203 Nbr 10.103.135.253 Was active or standby - start passive holddown
668408: Jan 27 12:54:31.691 CET: HSRP: Vl203 Grp 7 Standby router is unknown, was local
668409: Jan 27 12:54:31.691 CET: HSRP: Vl203 Grp 7 Standby -> Active
668410: Jan 27 12:54:31.691 CET: %HSRP-5-STATECHANGE: Vlan203 Grp 7 state Standby -> Active
668411: Jan 27 12:54:31.691 CET: HSRP: Vl203 Interface adv out, Active, active 1 passive 0
668412: Jan 27 12:54:31.691 CET: HSRP: Vl203 Grp 7 Redundancy "HS01_203" state Standby -> Active
668413: Jan 27 12:54:31.691 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.252 Active  pri 100 vIP 10.103.135.254
668414: Jan 27 12:54:31.691 CET: HSRP: Vl203 Added 10.103.135.254 to ARP (0000.0c07.ac07)
668415: Jan 27 12:54:31.691 CET: HSRP: Vl203 Grp 7 Activating MAC 0000.0c07.ac07
668416: Jan 27 12:54:31.691 CET: HSRP: Vl203 Grp 7 Adding 0000.0c07.ac07 to MAC address filter
668417: Jan 27 12:54:31.691 CET: HSRP: Vl203 IP Redundancy "HS01_203" standby, local -> unknown
668418: Jan 27 12:54:31.691 CET: HSRP: Vl203 IP Redundancy "HS01_203" update, Standby -> Active
668420: Jan 27 12:54:31.699 CET: HSRP: Vl203 Grp 7 Hello  in  10.103.135.253 Active  pri 200 vIP 10.103.135.254
668421: Jan 27 12:54:31.699 CET: HSRP: Vl203 Grp 7 Active router is 10.103.135.253, was local

The other switch keeps sending:

683706: Jan 27 12:54:29.729 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.253 Active  pri 200 vIP 10.103.135.254
683709: Jan 27 12:54:29.830 CET: HSRP: Vl203 Grp 7 Hello  in  10.103.135.252 Standby pri 100 vIP 10.103.135.254
683710: Jan 27 12:54:29.947 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.253 Active  pri 200 vIP 10.103.135.254
683713: Jan 27 12:54:30.014 CET: HSRP: Vl203 Grp 7 Hello  in  10.103.135.252 Standby pri 100 vIP 10.103.135.254
683715: Jan 27 12:54:30.149 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.253 Active  pri 200 vIP 10.103.135.254
683716: Jan 27 12:54:30.207 CET: HSRP: Vl203 Grp 7 Hello  in  10.103.135.252 Standby pri 100 vIP 10.103.135.254
683720: Jan 27 12:54:30.333 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.253 Active  pri 200 vIP 10.103.135.254
683722: Jan 27 12:54:30.384 CET: HSRP: Vl203 Grp 7 Hello  in  10.103.135.252 Standby pri 100 vIP 10.103.135.254
683727: Jan 27 12:54:30.551 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.253 Active  pri 200 vIP 10.103.135.254
683728: Jan 27 12:54:30.577 CET: HSRP: Vl203 Grp 7 Hello  in  10.103.135.252 Standby pri 100 vIP 10.103.135.254
683731: Jan 27 12:54:30.753 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.253 Active  pri 200 vIP 10.103.135.254
683732: Jan 27 12:54:30.761 CET: HSRP: Vl203 Grp 7 Hello  in  10.103.135.252 Standby pri 100 vIP 10.103.135.254
683735: Jan 27 12:54:30.937 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.253 Active  pri 200 vIP 10.103.135.254
683737: Jan 27 12:54:31.147 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.253 Active  pri 200 vIP 10.103.135.254
683741: Jan 27 12:54:31.332 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.253 Active  pri 200 vIP 10.103.135.254

683743: Jan 27 12:54:31.550 CET: HSRP: Vl203 Grp 7 Standby router is unknown, was 10.103.135.252
683744: Jan 27 12:54:31.550 CET: HSRP: Vl203 Nbr 10.103.135.252 no longer standby for group 7 (Active)
683745: Jan 27 12:54:31.550 CET: HSRP: Vl203 Nbr 10.103.135.252 Was active or standby - start passive holddown
683746: Jan 27 12:54:31.550 CET: HSRP: Vl203 Grp 7 Hello  out 10.103.135.253 Active  pri 200 vIP 10.103.135.254
683747: Jan 27 12:54:31.642 CET: HSRP: Vl103 Grp 6 Hello  out 10.103.134.253 Standby pri 100 vIP 10.103.134.254
683748: Jan 27 12:54:31.692 CET: HSRP: Vl200 Grp 1 Hello  in  10.103.129.252 Active  pri 200 vIP 10.103.129.254
683749: Jan 27 12:54:31.692 CET: HSRP: Vl203 Grp 7 Hello  in  10.103.135.252 Standby pri 100 vIP 10.103.135.254
683750: Jan 27 12:54:31.692 CET: HSRP: Vl203 Grp 7 Standby router is 10.103.135.252
683751: Jan 27 12:54:31.692 CET: HSRP: Vl203 Nbr 10.103.135.252 is no longer passive

Where can I see the queue of packets waiting for CPU processing? It seems the received packets do not reach the CPU fast enough. I don't see any problems in CPU processing (outputs below; a couple of commands that might help are sketched after them):

#             sh ipc queue
There are 0 IPC messages waiting for acknowledgement in the transmit queue.
There are 0 IPC messages waiting for a response.
There are 0 IPC messages waiting for additional fragments.
There are 0 IPC messages currently on the IPC inboundQ.
There are 0 IPC messages currently on the zone inboundQ.
Messages currently in use                     :          4
Message cache size                            :       1000
Maximum message cache usage                   :       1000

0  times message cache crossed       5000 [max]

Emergency messages currently in use           :          0

There are 3 messages currently reserved for reply msg.

#sh ipc status

                            IPC System Status

Time last IPC stat cleared : never

This processor is the IPC master server.
Do not drop output of IPC frames for test purposes.

1000 IPC Message Headers Cached.

                                                    Rx Side     Tx Side

Total Frames                                         13938966    27877932
Total from Local Ports                               21305517    23785104
Total Protocol Control Frames                         2479587     2046414
Total Frames Dropped                                        0           0

                             Service Usage

Total via Unreliable Connection-Less Service          9412965     9412965
Total via Unreliable Sequenced Connection-Less Svc          0           0
Total via Reliable Connection-Oriented Service        2479587     2479587

                      IPC Protocol Version 0

Total Acknowledgements                                2479587     2046414
Total Negative Acknowledgements                             0           0

                             Device Drivers

Total via Local Driver                               13938966    13938966
Total via Platform Driver                                   0           0
Total Frames Dropped by Platform Drivers                    0           0
Total Frames Sent when media is quiesced                                0

                     Reliable Tx Statistics

Re-Transmission                                                         0
Re-Tx Timeout                                                           0

          Rx Errors                              Tx Errors

Unsupp IPC Proto Version          0  Tx Session Error                  0
Corrupt Frame                     0  Tx Seat Error                     0
Duplicate Frame                   0  Destination Unreachable           0
Rel Out-of-Seq Frame              0  Unrel Out-of-Seq Frame            0
Dest Port does Not Exist          0  Tx Driver Failed                  0
Rx IPC Msg Alloc Failed           0  Rx IPC Frag Dropped               0
Rx IPC Transform Errors           0  Tx IPC Transform Errors           0
Unable to Deliver Msg             0  Tx Test Drop                      0
Ctrl Frm Alloc Failed             0  Rx Msg Callback Hog               0

          Buffer Errors                          Misc Errors

IPC Msg Alloc                     0  IPC Open Port                     0
Emer IPC Msg Alloc                0  No HWQ                            0
IPC Frame PakType Alloc           0  Hardware Error                    0
IPC Frame MemD Alloc              0  Invalid Messages                  0

          Tx Driver Errors

No Transport                      0
MTU Failure                       0
Dest does not Exist               0

sh mls qos int gi0/20 statistics
GigabitEthernet0/20 (All statistics are in packets)

  dscp: incoming 
-------------------------------

  0 -  4 :   845109533            0            0            0            0 
  5 -  9 :           0            0            0            0            0 
10 - 14 :           0            0            0            0            0 
15 - 19 :           0            0            0            0            0 
20 - 24 :           0            0            0            0            0 
25 - 29 :           0            0            0            0            0 
30 - 34 :           0            0            0            0            0 
35 - 39 :           0            0            0            0            0 
40 - 44 :           0            0            0            0            0 
45 - 49 :           0            0            0      2119333            0 
50 - 54 :           0            0            0            0            0 
55 - 59 :           0            0            0            0            0 
60 - 64 :           0            0            0            0 
  dscp: outgoing
-------------------------------

  0 -  4 :   650376000            0            0            0            0 
  5 -  9 :           0            0            0            0            0 
10 - 14 :           0            0            0            0            0 
15 - 19 :           0            0            0            2            0 
20 - 24 :           0            0            0            0            0 
25 - 29 :           0            0            0            0            0 
30 - 34 :           0            0            0            0            0 
35 - 39 :           0            0            0            0            0 
40 - 44 :           0            0            0            0            0 
45 - 49 :           0            0            0    397487907            0 
50 - 54 :           0            0            0            0            0 
55 - 59 :           0            0            0            0            0 
60 - 64 :           0            0            0            0 
  cos: incoming 
-------------------------------

  0 -  4 :   847200033            0            0            0            0 
  5 -  7 :         664      2119333       411648 
  cos: outgoing
-------------------------------

  0 -  4 :   650639386            0            2            0            0 
  5 -  7 :           0    397487907     29946292 
  output queues enqueued:
queue: threshold1 threshold2 threshold3
-----------------------------------------
queue 0:           0           0           0
queue 1:           0    13655885  1619604162
queue 2:           0           0           0
queue 3:           0           0   650337316

  output queues dropped:
queue: threshold1 threshold2 threshold3
-----------------------------------------
queue 0:            0            0            0
queue 1:            0            0            0
queue 2:            0            0            0
queue 3:            0            0            0
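
Commands that might show the queues of packets punted to the CPU on this platform (availability and exact output can vary by release, so treat this as a pointer rather than a definitive answer):

show controllers cpu-interface          <-- per-queue counters for frames retrieved by / dropped towards the CPU
show platform port-asic stats drop      <-- port-ASIC drop counters, including the CPU queues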

Hello,

What does sh proc cpu history show? Does it show the CPU spiking to 80-90%?

I would say 40-50% should not cause the PIM or OSPF flaps.

One more place to check whether packets destined to the CPU are getting dropped is show interfaces. Check for input queue drops: a rising input queue drop counter probably means loss of OSPF/HSRP/PIM packets (again, this depends on the code you are running).
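
For example (the VLAN interface here is just the one from the debugs above):

show processes cpu history          <-- CPU utilization graph for the last 60 seconds / 60 minutes / 72 hours
show interfaces Vlan203             <-- look at the "Input queue: ... (size/max/drops/flushes)" counter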

Regards,