
6509 CPU Utilization Keeps Climbing

sbader48220
Level 1

I have a Catalyst 6509 with a Sup720 running a modular IOS; the image filename is s72033-adventerprisek9_wan-vz.122-18.SXF6.bin. I have noticed that the CPU utilization on this switch climbs steadily from the moment it is rebooted. On all of our other switches and routers the CPU is higher during the day and lower in the evening, but on this 6509 the CPU only climbs and never decreases at all. The climb is slow: starting at about 10% utilization, it reaches nearly 40% within two months.

I have another 6509 that was just deployed and is also running a modular IOS (a different version), and I am seeing exactly the same thing. On other 6509s running the standard (non-modular) IOS we do not see this.

Does anyone know if there are any known issues like this? I tried searching the bug lists, but I didn't see any obvious bugs.
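To put numbers on the climb in the meantime, a minimal polling sketch like the one below can log the switch's 5-minute CPU average over time. It assumes Net-SNMP's snmpget is installed and SNMP is enabled on the switch; the host, community string, and the exact cpmCPUTotal5minRev instance from CISCO-PROCESS-MIB are placeholders to verify against your own box:

```python
#!/usr/bin/env python3
"""Log a 6509's 5-minute CPU average so the slow climb can be graphed."""
import subprocess
import time

HOST = "192.0.2.1"    # placeholder switch address
COMMUNITY = "public"  # placeholder read-only community
# CISCO-PROCESS-MIB::cpmCPUTotal5minRev; the .1 instance index may differ.
OID = "1.3.6.1.4.1.9.9.109.1.1.1.1.8.1"

while True:
    # -Oqv prints only the value, so the output is just the percentage.
    value = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, OID],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} cpu5min={value}%", flush=True)
    time.sleep(300)  # one sample per 5-minute averaging window
```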

Thanks,

-Steve

21 Replies

Hi,

Did you ever get a solution from your TAC case?

I see the same issue.

6509, Sup720, IOS 12.2(33)SXH, and no fancy features enabled.

The CPU load increases in steps of approximately 15% every 3 weeks, and it is ios-base that takes the CPU.

The only solution I ever received to this problem was to run the non-modular IOS. Ever since switching to the non-modular IOS, I have not had any problems at all.

Yes, Cisco did fix the problem for us. There was a memory leak in UDP that would slowly consume all available memory, and as memory became scarce, the CPU spent more and more time freeing memory.

Cisco supplied us with a patched IOS, and the fix was then rolled into 12.2(33)SXH1. With the patch, and now SXH1, installed, we're no longer seeing the issue.

If you've got dual Sups and you can't upgrade IOS right away, just force a failover between the two Sups every couple of weeks to keep the problem at bay.
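If you want to confirm you're hitting the same leak before scheduling an upgrade, a rough sketch like this can trend processor-pool free memory alongside the CPU. It assumes Net-SNMP's snmpget; the host and community are placeholders, and the ciscoMemoryPoolFree instance from CISCO-MEMORY-POOL-MIB (index 1 is usually the processor pool) should be verified against your box:

```python
#!/usr/bin/env python3
"""Trend processor-pool free memory to catch a slow leak."""
import subprocess
import time

HOST = "192.0.2.1"    # placeholder switch address
COMMUNITY = "public"  # placeholder read-only community
# CISCO-MEMORY-POOL-MIB::ciscoMemoryPoolFree; index 1 is usually "Processor".
FREE_OID = "1.3.6.1.4.1.9.9.48.1.1.1.6.1"

def free_bytes() -> int:
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, FREE_OID],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out)

baseline = free_bytes()
while True:
    now = free_bytes()
    # A steadily negative drift over days or weeks is the leak signature.
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
          f"free={now} drift={now - baseline:+d} bytes", flush=True)
    time.sleep(3600)  # hourly is plenty for a leak that plays out over weeks
```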

Jeff

Okay, here is the status after a day of working on it (and a possible way of replicating the issue).

The jumps in CPU load seem to originate from an open source network tool, netdisco (search Google or SourceForge for more info on this very nice tool).

It is based on the Net-SNMP package, which does all the SNMP I/O.

I was running the latest CVS version.

I don't have a spare 6509 to replicate the issue on, but as part of normal netdisco operation there was a cron job which performed a topology discovery. The command is ./netdisco -r CORESWITCH

I have not had time to dig into why -r would behave differently from netdisco's other operations, for instance mac-suck or arp-suck. Only -r seems to cause the additional CPU load.

I don't have a backup Sup; the money was spent on a 2nd 6509 (to get the number of ports needed), so I really need software that is stable.

I have involved my vendor, and if anybody is interested I will report back here regarding the outcome.

Joseph W. Doherty
Hall of Fame

I haven't checked recently, but I recall that none of the modular versions had passed Safe Harbor testing. If that is still true, then unless the modular feature is really, really important to you, you might consider moving back to a non-modular image.

We recently upgraded 6 of our 6506 switches in our development environment. On 2 of the switches we are seeing a steady CPU utilization increase. "show proc cpu detail sort" shows it is due to ios-base, which really doesn't tell us much. The logs on those 2 switches show some SNMP authentication error messages but not much else; the other switches don't have this SNMP error message. Any ideas? We are running 12.2(33)SXH2a. I have a Cisco ticket open too.

I am 90% sure that my situation was as described below; however, I have not been able to test it afterwards.

The problem came after I installed a network tool which did SNMP reads via snmpwalk.

I have not read through the RFCs, but there is an older snmpwalk (one GETNEXT request per object) and a newer bulk retrieval (GETBULK, which fetches many objects per request) for reading multiple SNMP values.

On my 6509s and 3750s there are large routing tables, and the boxes were sometimes slow to complete an snmpwalk of the whole OID tree.

What happened was that the job got stuck and had not finished before the next job was initiated. This happened once a week!
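One way to keep a stuck run from piling up on the next one is to wrap the cron job in a non-blocking lock plus a hard timeout. A minimal sketch of that guard (the lock path, the wrapped command, and the 6-hour limit are placeholder choices, not anything netdisco ships with):

```python
#!/usr/bin/env python3
"""Guard a cron-driven poll so a hung run can never overlap the next one."""
import fcntl
import subprocess
import sys

LOCK_PATH = "/var/lock/netpoll.lock"            # placeholder lock file
POLL_CMD = ["./netdisco", "-r", "CORESWITCH"]   # the cron job from above

with open(LOCK_PATH, "w") as lock:
    try:
        # Fail immediately if the previous run still holds the lock.
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("previous poll still running; skipping this cycle")
    try:
        # Kill the job well before the next cron slot comes around.
        subprocess.run(POLL_CMD, timeout=6 * 3600, check=True)
    except subprocess.TimeoutExpired:
        sys.exit("poll exceeded 6 hours and was killed")
```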

My solution was to tweak the tool to use bulk gets, and I have not seen the problem since.
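For anyone who wants to measure the difference between the two retrieval styles against their own box, here is a rough timing sketch wrapping Net-SNMP's snmpwalk and snmpbulkwalk command-line tools; the host, community, and the example IF-MIB::ifDescr subtree are placeholders:

```python
#!/usr/bin/env python3
"""Time a GETNEXT-based walk against a GETBULK-based walk of one subtree."""
import subprocess
import time

HOST = "192.0.2.1"    # placeholder switch address
COMMUNITY = "public"  # placeholder read-only community
SUBTREE = "1.3.6.1.2.1.2.2.1.2"  # IF-MIB::ifDescr, as an example subtree

def timed_walk(cmd: list[str]) -> tuple[int, float]:
    start = time.monotonic()
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return len(out.stdout.splitlines()), time.monotonic() - start

# snmpwalk costs one GETNEXT round trip per row; snmpbulkwalk with -Cr25
# pulls up to 25 rows per GETBULK response, so far fewer round trips.
rows, secs = timed_walk(["snmpwalk", "-v2c", "-c", COMMUNITY, HOST, SUBTREE])
print(f"snmpwalk:     {rows} rows in {secs:.2f}s")
rows, secs = timed_walk(
    ["snmpbulkwalk", "-v2c", "-c", COMMUNITY, "-Cr25", HOST, SUBTREE])
print(f"snmpbulkwalk: {rows} rows in {secs:.2f}s")
```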

I still believe that the IOS has an SNMP bug, because it looks like a DoS if the box cannot complete an snmpwalk of the full tree within one week.

So you might check whether you have the same situation: a tool doing snmpwalk (mine was the open source tool netdisco).
