6509 CPU Utilization Keeps Climbing

sbader48220 · 05-22-2007

I have a Catalyst 6509 with a SUP720 running a modular IOS. The IOS filename is s72033-adventerprisek9_wan-vz.122-18.SXF6.bin. I have noticed that the CPU utilization on this switch has been increasing steadily since it was last rebooted. On all of our other switches and routers, the CPU is higher during the day and lower during the evening, but on this 6509 the CPU climbs constantly and never decreases. The climb is slow: it starts at about 10% utilization after a reboot and is near 40% within about 2 months.
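To put a number on "constantly climbs", here is a rough trend check using only the figures above (about 10% after a reboot, about 40% two months later). It is a minimal Python sketch that assumes the climb is linear and takes "2 months" as 60 days; the 80% threshold and the helper names `daily_climb` and `days_until` are hypothetical and only illustrate the extrapolation.

```python
# Rough trend check using the figures in the post above.
START_PCT = 10.0   # CPU shortly after a reboot
END_PCT = 40.0     # CPU "within 2 months"
DAYS = 60          # assumption: "2 months" taken as 60 days


def daily_climb(start: float, end: float, days: float) -> float:
    """Average increase in percentage points per day, assuming a linear climb."""
    return (end - start) / days


def days_until(threshold: float, start: float, rate: float) -> float:
    """Days after a reboot until CPU reaches `threshold`, if the trend stays linear."""
    return (threshold - start) / rate


if __name__ == "__main__":
    rate = daily_climb(START_PCT, END_PCT, DAYS)
    print(f"{rate:.2f} points/day")                          # 0.50 points/day
    print(f"{days_until(80.0, START_PCT, rate):.0f} days")   # 140 days (hypothetical 80% threshold)
```

A linear fit is the simplest possible model here; the later posts in this thread describe a stair-step rather than a smooth ramp, so treat the projection as an order-of-magnitude estimate only.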

I have another 6509 that was just deployed and is also running a modular IOS (but a different version), and I am seeing exactly the same thing. On other 6509s that are running the standard (non-modular) IOS, we do not see this.

Does anyone know if there are any known issues like this? I tried searching the bug lists, but I didn't see any obvious bugs.

Thanks,

-Steve

salmodov · 05-22-2007

Please post the output of show proc cpu, and let's see what is using up your resources.

Thanks

Steve

sbader48220 · 05-22-2007

Here is the output from the 'show proc cpu' command.

CPU utilization for five seconds: 30%; one minute: 33%; five minutes: 34%

  PID   5Sec   1Min   5Min  Process
    1   0.3%  15.0%  16.0%  kernel
    3   0.0%   0.0%   0.0%  qdelogger
    4   0.0%   0.0%   0.0%  devc-pty
    5   0.0%   0.0%   0.0%  devc-mistral.proc
    6   0.0%   0.0%   0.0%  pipe
    7   0.0%   0.0%   0.0%  dumper.proc
 4104   0.0%   0.0%   0.0%  pcmcia_driver.proc
 4105   0.0%   0.0%   0.0%  bflash_driver.proc
20490   0.0%   0.0%   0.0%  mqueue
20491   0.0%   0.0%   0.0%  flashfs_hes.proc
20492   0.0%   0.0%   0.0%  dfs_bootdisk.proc
20493   0.0%   0.0%   0.0%  ldcache.proc
20494   0.0%   0.0%   0.0%  watchdog.proc
20495   0.0%   0.0%   0.0%  syslogd.proc
20496   0.0%   0.0%   0.0%  name_svr.proc
20497   0.0%   0.0%   0.0%  wdsysmon.proc
20498   0.0%   0.0%   0.0%  sysmgr.proc
24578   0.0%   0.0%   0.0%  chkptd.proc
24595   0.0%   0.0%   0.0%  sysmgr.proc
24596   0.0%   0.0%   0.0%  syslog_dev.proc
24597   0.0%   0.0%   0.0%  itrace_exec.proc
24598   0.0%   0.0%   0.0%  packet.proc
24599   0.0%   0.0%   0.0%  installer.proc
24600  25.9%  16.6%  16.7%  ios-base
24601   0.0%   0.0%   0.0%  fh_fd_oir.proc
24602   0.0%   0.0%   0.0%  fh_metric_dir.proc
24603   0.0%   0.0%   0.0%  fh_fd_snmp.proc
24604   0.0%   0.0%   0.0%  fh_fd_none.proc
24605   0.0%   0.0%   0.0%  fh_fd_intf.proc
24606   0.0%   0.0%   0.0%  fh_fd_gold.proc
24607   0.0%   0.0%   0.0%  fh_fd_timer.proc
24608   0.0%   0.0%   0.0%  fh_fd_ioswd.proc
24609   0.0%   0.0%   0.0%  fh_fd_counter.proc
24610   0.0%   0.0%   0.0%  fh_fd_rf.proc
24611   0.0%   0.0%   0.0%  fh_fd_cli.proc
24612   0.0%   0.0%   0.0%  fh_server.proc
24613   0.0%   0.0%   0.0%  fh_policy_dir.proc
24614   2.8%   0.3%   0.2%  tcp.proc
24615   0.0%   0.0%   0.0%  ipfs_daemon.proc
24616   0.4%   0.2%   0.2%  raw_ip.proc
24617   0.0%   0.0%   0.0%  inetd.proc
24618   0.0%   0.1%   0.2%  udp.proc
24619   0.0%   0.1%   0.1%  iprouting.iosproc
24620   0.2%   0.1%   0.1%  cdp2.iosproc
salmodov · 05-22-2007

What about your logs? Is there anything in them that indicates a problem?

I wonder if you are having issues because packets are being punted to the CPU instead of being CEF-switched in hardware.

Please include the output of:

show logging
show ip arp sum
show processes cpu | exclude 0.00
show mls statistics

sbader48220 · 05-22-2007

There is nothing in the logs that indicates any sort of problem. As I said, the CPU ramps up over several months, and after a reboot the cycle starts again. I was also wondering about packets being punted to the CPU, but everything appears to be running CEF. There are no ACLs or anything like that either.

Here are the outputs you requested. I omitted the 'show log' output because it is quite long and contains nothing useful.

6509#show ip arp sum
1222 IP ARP entries, with 16 of them incomplete

6509#show processes cpu | exclude 0.0
CPU utilization for five seconds: 19%; one minute: 42%; five minutes: 43%

  PID   5Sec   1Min   5Min  Process
    1   0.1%  11.9%  15.6%  kernel
24600  17.1%  26.8%  24.7%  ios-base
24614   1.7%   1.8%   1.4%  tcp.proc
24616   0.2%   0.3%   0.3%  raw_ip.proc
24619   0.1%   0.2%   0.1%  iprouting.iosproc

6509#show mls statistics
Statistics for Earl in Module 5

  L2 Forwarding Engine
    Total packets Switched              : 48640339585

  L3 Forwarding Engine
    Total packets L3 Switched           : 48594000495 @ 15575 pps
    Total Packets Bridged               : 26677053072
    Total Packets FIB Switched          : 20778492539
    Total Packets ACL Routed            : 0
    Total Packets Netflow Switched      : 0
    Total Mcast Packets Switched/Routed : 127984744
    Total ip packets with TOS changed   : 2
    Total ip packets with COS changed   : 2
    Total non ip packets COS changed    : 0
    Total packets dropped by ACL        : 0
    Total packets dropped by Policing   : 0
    Total packets exceeding CIR         : 0
    Total packets exceeding PIR         : 0

  Errors
    MAC/IP length inconsistencies       : 13
    Short IP packets received           : 0
    IP header checksum errors           : 0

Total packets L3 Switched by all Modules: 48594000495 @ 15575 pps

salmodov · 05-22-2007

Can we try clearing the ARP table to see whether the issue is related to the MAC/IP length inconsistencies?

sbader48220 · 05-22-2007

I cannot clear the ARP table at this time. This switch is in production.

I don't believe the 13 errors are the cause of this issue, though. There have been only 13 of them, and the switch has been up almost 8 weeks. We have other switches with almost 40 inconsistencies, and they have no problems.

-Steve

salmodov · 05-22-2007

http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a00804916e0.shtml

The release notes at the link below have a lot of information on high CPU with your code version in different environments.

http://www.cisco.com/univercd/cc/td/doc/product/lan/cat6000/122sx/ol_4164.htm#wp3144175

Steve

sbader48220 · 05-22-2007

I am probably going to revert to a non-modular IOS, and I need to schedule downtime to do this. I was just hoping someone had some input or had seen this before.

Thanks.

-Steve

avmabe · 05-25-2007

I haven't seen anyone in this thread ask, but do you by chance run BGP on this box? We had this issue with 7600s (same architecture) when using the SUP720-B.

Which version of the SUP720 are you using, and are you taking the full BGP table?

jpl861 · 05-25-2007

I just want to share this: also check the IOS version on the switch. We had an incident where both of our 4500 core switches crashed at the same time because they had been powered on at the same time, so redundancy was useless. The cause was a memory leak.

JEFFREY SESSLER · 10-31-2007

Did you ever resolve this?

I'm also seeing this exact problem on my 6509 running 12.2(33)SXH modular: a slow creep over time.

The CPU output seems to indicate this is in the kernel process. When I look at the details, the high CPU use comes from TID 14 of PID 1:

1 14 10 Running 0 (128K) 5h26m procnto-cisco

I can't find any other information past this.

sbader48220 · 10-31-2007

The only resolution I found was to move the IOS to a non-modular image. I had to do this on 2 switches and haven't had a problem since.

About 2 weeks after I had this problem, a Cisco engineer was on site, and he said not to use modular IOS on production systems until it matures some more.

Hope this helps!

-Steve

cbeswick · 11-01-2007

Hi,

We have had exactly the same issues with Modular IOS on the Sup720.

CPU was high in the ios-base process. We also saw excessively high CPU when issuing show tech-support, or even show run.

We are currently downgrading all sups to non modular ios on the latest safe harbour 12.2(18)SXF8.

This seems to have solved all our issues.

Stay away from the modular IOS!

JEFFREY SESSLER · 11-01-2007

I've got a TAC case open on this now, so I'll let you know what happens. I have a graph of CPU over time, and it shows a clear stair-step progression as the CPU climbs. On Oct 27th I was at my normal baseline of 8%; now I'm at almost 30%. It looks like it jumps up every 5.5 to 6 hours.
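Taking the numbers in this post at face value, a quick calculation shows how large each step of the stair-step would be. This is a minimal Python sketch that assumes the "almost 30%" reading is 30% and was taken on the day of the post (11-01-2007), so about 22 points of rise over roughly 5 days; `step_estimate` is a hypothetical helper, not anything from the TAC case.

```python
from datetime import date

BASELINE_PCT = 8.0          # "normal baseline of 8%" on Oct 27th
CURRENT_PCT = 30.0          # "almost 30%", rounded to 30 here (assumption)
START = date(2007, 10, 27)
NOW = date(2007, 11, 1)     # assumption: reading taken on the day of the post


def step_estimate(hours: float, interval_h: float, rise_pct: float):
    """Number of steps in `hours` and the average rise per step, assuming evenly spaced steps."""
    steps = hours / interval_h
    return steps, rise_pct / steps


if __name__ == "__main__":
    hours = (NOW - START).days * 24          # 5 days -> 120 hours
    for interval in (5.5, 6.0):
        steps, per_step = step_estimate(hours, interval, CURRENT_PCT - BASELINE_PCT)
        print(f"every {interval} h: ~{steps:.0f} steps, ~{per_step:.1f} points each")
    # every 5.5 h: ~22 steps, ~1.0 points each
    # every 6.0 h: ~20 steps, ~1.1 points each
```

Under these assumptions each jump adds roughly one percentage point of CPU, which is consistent with a fixed amount of work being added on a periodic schedule rather than with traffic-driven load.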

Worst case, I'm thinking I'll just force a failover to the redundant Sup and see if that at least helps while Cisco sorts it out.