Solved: Very weird issue last night

John Blakley · 10-01-2010

All,

First a little background:

I had an issue last night that I was able to track down to at least 3 workstations. At about 6:30 PM, I started getting notifications from my monitoring server that all of my VPN sites were down, and then the big one came: my load balancer wasn't responding. I was still able to connect to the VPN, which goes through the load balancer, so that was strangeness #1. Long story short, I ran Wireshark on a workstation that I remoted into, because processor usage on my core switch (a stack of 3750s) was at 99%. The Mwheel process (what process is this??) was at 60%. In Wireshark, I was seeing a server throw TONS of pings to random multicast addresses. Someone I work with got on that server, and it showed a notification that our AV had caught the W32.Spybot worm. He rebooted the server and all was well. The processor went down to 30% on the switch stack.

I then looked at my multicast routing table and filtered it by my local subnet ("sh ip mroute | in 10.12"). OMG... there were literally HUNDREDS of these random addresses from 3 computers (not including the one above). I remoted into another of these boxes and ran Wireshark on it. The AV had found the virus, but before I rebooted I started a capture to see if it was doing the multicast ping, and it was. After the reboot, it stopped.
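For anyone who wants to quantify this rather than eyeball it, here is a minimal Python sketch that tallies (S,G) entries per source from saved "show ip mroute" output. The sample text and the function name count_sources are made up for illustration, and the regex assumes entry lines of the usual "(source, group), uptime/expires, flags: ..." form, so adjust it if your IOS version prints something different.

```python
import re
from collections import Counter

# Matches the start of an (S,G) entry line, for example:
#   (10.12.4.21, 231.14.7.90), 00:01:10/00:02:49, flags: PT
# (*,G) entries are skipped because "*" is not a dotted-quad source.
SG_ENTRY = re.compile(r"^\((\d+\.\d+\.\d+\.\d+),\s*(\d+\.\d+\.\d+\.\d+)\)")


def count_sources(mroute_text, prefix=""):
    """Count (S,G) entries per source IP, limited to sources starting with prefix."""
    counts = Counter()
    for line in mroute_text.splitlines():
        match = SG_ENTRY.match(line.strip())
        if match and match.group(1).startswith(prefix):
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample, NOT captured from the switch in this thread.
SAMPLE = """\
(*, 239.255.255.250), 1d02h/00:02:51, RP 0.0.0.0, flags: DC
(10.12.4.21, 231.14.7.90), 00:01:10/00:02:49, flags: PT
(10.12.4.21, 236.201.33.5), 00:01:08/00:02:51, flags: PT
(10.12.9.77, 225.60.12.200), 00:00:42/00:02:17, flags: PT
(10.40.1.8, 239.1.1.1), 02:11:05/00:03:12, flags: T
"""

if __name__ == "__main__":
    # Print the noisiest sources first.
    for source, entries in count_sources(SAMPLE, prefix="10.12.").most_common():
        print(source, entries)
```

A source with hundreds of entries pointing at random groups is a good candidate for the kind of infected host described above.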

Now the question:

Since the switch was saturated, the monitoring server couldn't ping out consistently. The switch had a message in the log that stated something like, "A multicast storm was detected....." and "Recv queue starved...." (Not exact errors.) I could ping the last device that's in the chain before it hits the outside world, and I'd lose packets, BUT (here's where the weirdness comes in) I could ping PAST the device and not lose any traffic. So it looks like this:

host -> switch -> 192.168.1.5 -> Internet -> 4.2.2.1

Pinging 192.168.1.5 from the host would result in:

Reply from 192.168.1.5
Reply
Request timed out
Reply

But pinging 4.2.2.1 from the same host, at the same time, would result in nothing but replies:

Reply from 4.2.2.1

The only thing that I can figure is that the device had to process the ICMP packet itself when it was the destination, and it was too overwhelmed to generate a reply, whereas in the second scenario it only needed to pass the packet through. Am I on the right track? And what is the Mwheel process? I'm assuming that it's multicast related.

Thanks!

John

HTH, John *** Please rate all useful posts ***

Jon Marshall · 10-01-2010

John

I think Mwheel is one of the timers multicast uses within IOS, i.e. IOS has a number of timer wheels used for different things and multicast uses one of them, but I can't be any more specific than that.

As for your ping issue: it's not entirely clear what device 192.168.1.5 is, but if it was the switch or another network device having to respond to the ping, then yes, you have answered your own question. An ICMP packet going through the device will simply be hardware switched, whereas an ICMP packet addressed to the device itself needs to be handled by the main CPU, and as that was busy dealing with all the multicasts it would be unable to process the packet in time.
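To make the difference concrete, here is a toy Python model of the two paths. It is only a sketch under invented assumptions: the queue limit, the CPU drain rate, the flood pattern, and the function name simulate are all hypothetical, and a real 3750 schedules its CPU queues in a more involved way. The point it illustrates is just that pings addressed to the switch share a bounded CPU queue with the multicast flood, while transit pings never touch that queue.

```python
from collections import deque


def simulate(ticks, flood_pattern, cpu_per_tick, queue_limit):
    """Send one ping TO the switch and one ping THROUGH it per tick.

    Returns (replies_from_switch, replies_from_far_end).
    Transit pings are treated as hardware switched and always succeed.
    Pings to the switch are dropped if the CPU queue is full on arrival.
    """
    queue = deque()
    to_switch_replies = 0
    transit_replies = 0
    for tick in range(ticks):
        # Hardware path: unaffected by CPU load in this model.
        transit_replies += 1

        # Multicast flood hits the CPU queue first; overflow is dropped.
        for _ in range(flood_pattern[tick % len(flood_pattern)]):
            if len(queue) < queue_limit:
                queue.append("mcast")

        # The ping to the switch only gets in if there is room left.
        if len(queue) < queue_limit:
            queue.append("ping")

        # The CPU drains a fixed number of packets per tick.
        for _ in range(min(cpu_per_tick, len(queue))):
            if queue.popleft() == "ping":
                to_switch_replies += 1
    return to_switch_replies, transit_replies


if __name__ == "__main__":
    quiet = simulate(ticks=7, flood_pattern=[0], cpu_per_tick=5, queue_limit=10)
    storm = simulate(ticks=7, flood_pattern=[8, 8, 2], cpu_per_tick=5, queue_limit=10)
    print("no flood  (to switch, through switch):", quiet)
    print("flooding  (to switch, through switch):", storm)
```

With a bursty flood the pings to the switch come back intermittently, much like the mix of replies and timeouts in the original post, while the pings through it are untouched.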

Jon

John Blakley · 10-01-2010

Thanks Jon....

John


Giuseppe Larosa · 10-02-2010

Hello John,

>> The only thing that I can figure is that the device was having to process the icmp packet when it would come through and it was too overwhelmed to generate a reply, whereas in the second scenario it was just needing to send it through. Am I on the right track?

Yes, here we see CEF and multilayer switching in action: when the packet has a destination beyond the switch, it is multilayer switched; when the destination is the switch management address, it has to be process switched, i.e. sent to the main CPU, which was very busy attempting to create state for so many multicast routes.

And all of this was started by a virus running on a few PCs. So this can be seen as an example of why security measures have to be taken everywhere, including on end-user devices.

This was a real denial of service, and you had to fight a long battle to fix it.

Hope to help

Giuseppe