I have a Cisco 3550-12G switch experiencing high CPU loads. For a long time (years?), CPU load has been minimal (a couple of percent; this is all graphed by our monitoring systems). About 4 weeks ago, the switch rebooted, possibly due to some power work being done in the same building. Ever since then, CPU is way above baseline and is causing alarms with our monitoring.
IOS is IP Services 12.2(25)SEE2. See attachment for show proc cpu output.
A few minutes ago 5 second CPU was about 76%/36% with the HMATM Learn proc taking about 36% of the CPU. Now it is 66%/39% with 25% going to this process.
Again, it varies, but it is well above what baseline was before the reboot.
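For anyone reading the figures above: in `show processes cpu`, the five-second value is printed as two numbers, where the first is total CPU utilization and the second is the share spent at interrupt level. The difference is what scheduled processes (such as HMATM Learn) consume. A minimal sketch of that arithmetic follows; the function name `split_cpu` is made up for illustration and is not part of any Cisco tool.

```python
import re


def split_cpu(figure: str) -> dict:
    """Split an IOS 'X%/Y%' five-second CPU figure into its parts.

    X is total utilization, Y is the interrupt-level share, and
    X - Y is the time spent in scheduled processes.
    """
    match = re.search(r"(\d+)%/(\d+)%", figure)
    if match is None:
        raise ValueError(f"no 'X%/Y%' figure found in: {figure!r}")
    total, interrupt = int(match.group(1)), int(match.group(2))
    return {"total": total, "interrupt": interrupt, "process": total - interrupt}


# The two samples quoted in the post above.
for sample in ("76%/36%", "66%/39%"):
    print(sample, split_cpu(sample))
```

Separating the two shares matters because interrupt-level load usually points at traffic being punted to the CPU, while process-level load points at a specific process.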
I saw that, but nothing in there referenced the HMATM Learn proc. (It was included in sample outputs, but CPU was always 0% and not mentioned as a cause.)
I have received over 14,000,000 broadcasts on one of the interfaces (a link to a downstream L2 4003). However, I don't know if that was normal or not prior to the reboot. I learned that HMATM is the Hardware MAC Address Table Manager. But is there really anything to "manage" when a broadcast comes through? I watched the MAC table for a period of time (every 10 or 15 seconds for about 10 minutes) and never saw a significant change in the number of MAC addresses total on the switch. Total MAC addresses for the system ranged from about 96 to 108.
I work better by knowing what I'm working with. Exactly what does this HMATM Learn proc do? What are the conditions that trigger it to do something? From the bit that I've read, it seems that it adds and removes MAC addresses from the hardware table when it sees a new address or when an address expires. If I'm not seeing huge changes in the table, then why else would it be using so much processor?
Finally, I don't know any of our devices that would be using SNAP encapsulation. I could take a network capture and see, but I doubt that will get me anywhere.
All of the other reasons mentioned in there should be showing different symptoms in the proc cpu output if they were applicable.
The document is only a starting point for the analysis.
I wonder whether you are using HSRP or a similar feature?
I'm wondering whether the process also manages the MAC addresses used by the device itself (the MAC address filter of the switch's own NICs).
The MAC address activity looks normal, so you are certainly not under an L2 MAC flood attack.
The number of broadcasts should be compared with the total packets received to see whether it is normal or too high.
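As a rough sketch of that comparison: take the broadcast counter and the total input packet counter from the same interface and work out the percentage. The 14,000,000 figure is the one from the original post; the total packet count below is a made-up placeholder, and `broadcast_share` is a hypothetical helper name.

```python
def broadcast_share(broadcasts: int, total_packets: int) -> float:
    """Return broadcasts as a percentage of all packets received."""
    if total_packets <= 0:
        raise ValueError("total_packets must be positive")
    if not 0 <= broadcasts <= total_packets:
        raise ValueError("broadcasts must be between 0 and total_packets")
    return 100.0 * broadcasts / total_packets


# 14,000,000 broadcasts is from the thread; 400,000,000 total input
# packets is an invented number used only to show the calculation.
share = broadcast_share(14_000_000, 400_000_000)
print(f"broadcast share: {share:.1f}%")
```

A single percentage says little on its own. It is more useful to clear the counters, sample both values over a known interval, and compare the result against another port or an earlier measurement.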
I would suggest shutting down the port with the high broadcast count during a maintenance window to see whether CPU usage changes.
Have you verified whether the reload was caused by a power cycle or by a software crash?
There could also have been a change in services, such as new servers added to the network in the last few weeks.
Hope to help
We are using HSRP. However, I fall back to the fact that we were using HSRP prior to the reboot and it wasn't doing this.
The reload was due to a power cycle. At least, the switch reports it returned by power-on.
We activated a new core switch with new sup720's within the last few weeks. This core switch is the switch with which this 3550 partners in HSRP. However, everything was fine, literally, until it rebooted. Our graphs show it was immediately at that point when CPU use left baseline and has not returned. And the new core switch was online and partnered in HSRP before this reboot.
I may do a capture on this port; perhaps that will show me a misconfigured host. But this is a radiology network (I work for a hospital) and new servers don't go on it without us helping to assign ports, IP addresses, etc. So I'm reasonably certain it isn't a new server.
I asked about HSRP because we have seen C6509s go up to 100% CPU when two routers claim to be Active for the same group at the same time. This happened twice: the first time because of a misconfiguration that joined two VTP domains and let two VLANs that had been designed to be separate communicate; the second time because of a problem involving monitor sessions and the FWSM.
Verify that both routers agree on which one is the active router.
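One way to make that check systematic is to note, for each HSRP group, the state that each router reports for itself and flag any group where more than one router says Active. This is only a sketch: the router names, group numbers, and states below are invented sample data, and `find_dual_active` is a hypothetical helper, not a Cisco command.

```python
from collections import defaultdict


def find_dual_active(states):
    """Return {group: [routers]} for groups with more than one Active router.

    `states` is an iterable of (router, group, state) tuples, written
    down by hand from what each router reports for itself.
    """
    active = defaultdict(list)
    for router, group, state in states:
        if state.strip().lower() == "active":
            active[group].append(router)
    return {group: routers for group, routers in active.items() if len(routers) > 1}


# Invented sample data: group 1 is healthy, group 2 has two Active routers.
reported = [
    ("core-6500", 1, "Active"),
    ("sw-3550", 1, "Standby"),
    ("core-6500", 2, "Active"),
    ("sw-3550", 2, "Active"),
]
print(find_dual_active(reported))
```

An empty result means every group has at most one router claiming the Active role.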
If the broadcasts are ARP messages, they need to be processed by the CPU, which would explain the increase in CPU usage.
They could also be DHCP requests.
Hope to help