05-28-2010 02:08 AM - edited 03-06-2019 11:19 AM
I am trying to troubleshoot an issue where one of our core 4506s has very slow response times for a couple of minutes once a week, at exactly the same time.
Response times go from <1ms to 1.5s and then return to normal. This affects some servers/services more than others, with some losing connectivity completely.
The logs show nothing, and the processor usage doesn't really change.
TAC have looked at logs and can find no obvious issues.
There are no big backup jobs starting at this time (and I doubt a backup would kill the switch even if there were).
Could anyone suggest possible causes? We are really scratching our heads!
C4506 running Sup IV
cat4500-entservicesk9-mz.122-31.SG.bin
05-28-2010 03:48 AM
Hi,
How many modules are connected to this switch? Do the users or servers connected to a single module experience the latency problem, or do users and servers on all modules experience it?
Try to rule out hardware issues with the modules on the switch first.
HTH
Hitesh Vinzoda
Please rate useful posts.
05-28-2010 04:48 AM
Since this happens once a week at the exact same time, my guess would be that something is taking up the bandwidth. I don't think this is a problem with the physical switch, but don't rule that out. A couple of things to try:
Start the day before this happens, a couple of minutes before the time it happens. Say the problem occurs on Tuesdays at 9:30am and lasts until 9:35am: on Monday at 9:25am, clear the counters on all interfaces, then at 9:40am capture the show interfaces output. This gives you a baseline. On Tuesday, clear the counters again at 9:25am, capture the show interfaces output again at 9:40am, and compare the two to see if one interface stands out as getting much more traffic than the day before.
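A minimal sketch of that baseline capture on the switch CLI (the include filter just trims the output down to the counters you care about; adjust it to taste):

```
! Monday, 9:25am - zero all interface counters
Switch# clear counters
! Monday, 9:40am - capture the baseline and save it off-box
Switch# show interfaces | include (line protocol|input rate|output rate|drops)
! Tuesday - repeat at the same times, then diff the two captures
Switch# clear counters
Switch# show interfaces | include (line protocol|input rate|output rate|drops)
```

Capturing via a terminal logging session (or `show interfaces` redirected to a TFTP server) makes the Monday/Tuesday comparison easy to diff offline.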
I know this is very basic but it might lead you down a path.
You may also want to ask around whether there are any jobs at all running during that time frame. We tracked down a very similar issue here a few years back this way. Every day at the same time during lunch, one person had a scheduled task to go out and download some reports from a website (it was work related). She had been doing it manually for a while, so it would not always happen at the same time; once she found out how to do it automatically, we could finally track it down, because we saw the spike in traffic on her interface.
Mike
05-28-2010 04:00 PM
I have been doing some reading, and it appears that something is hitting the CPU when it should be processed in hardware.
I have been reading this:
http://www.cisco.com/en/US/products/hw/switches/ps663/products_tech_note09186a00804cef15.shtml#tools
and got the following stats out of the switch:
Switch#show platform cpu packet statistics
Queue Total 5 sec avg 1 min avg 5 min avg 1 hour avg
---------------------- --------------- --------- --------- --------- ----------
Esmp 2668726664 213 224 180 169
L2/L3Control 491666442 41 34 33 26
Host Learning 5402859 0 0 0 0
L3 Fwd High 4 0 0 0 0
L3 Fwd Medium 65 0 0 0 0
L3 Fwd Low 721198011 45 43 33 32
L2 Fwd High 369852 0 0 0 0
L2 Fwd Medium 252 0 0 0 0
L2 Fwd Low 171320570 8 6 6 5
L3 Rx High 1718602 0 0 0 0
L3 Rx Low 413167674 12 8 9 9
RPF Failure 3310 0 0 0 0
ESMP seems to be sending the most traffic to the CPU.
What is ESMP, and how do I find out more about it?
I will be setting up a SPAN session to the CPU and sniffing that. Does anyone have any more clues now?
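For reference, a CPU SPAN session on the 4500 is configured roughly as below. The session number and destination port are examples, and support for the `cpu` source keyword depends on the IOS release, so check it is available on 12.2(31)SG before relying on this:

```
! Mirror CPU-bound traffic to a port with a sniffer attached
! (session number and destination interface are placeholders)
Switch(config)# monitor session 1 source cpu
Switch(config)# monitor session 1 destination interface GigabitEthernet2/48
```

With a packet capture running on the destination port during the weekly event window, you can see exactly which frames are being punted to the CPU.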
Roger
05-28-2010 06:00 PM
You may want to check with people and ask if there are any jobs at all running during that time frame
I'm not criticizing your opinion, Mike (as a matter of fact, this deserves a +5), but I want to be cynical about the "people" around the owner of the thread. Either something happens that they are unaware of or unwilling to disclose: some kind of backup, a server trying to "phone home", or a misconfiguration of a server or host. I have seen a misconfigured server that flooded the network with broadcasts of some sort every time it ran its backup.
Get NetFlow or run link-utilization reports so you'll know what is causing your situation. You can also check the CPU of your supervisor engine, and check the logs of your switch, which can sometimes help (just make sure you enable interface logging).
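A minimal classic NetFlow sketch for IOS, assuming a collector is already listening (the collector address, port, and VLAN interface below are placeholders; note that on a Sup IV, NetFlow typically requires the NetFlow services daughter card to be installed):

```
! Export version 5 flow records to an external collector
Switch(config)# ip flow-export version 5
Switch(config)# ip flow-export destination 192.0.2.10 2055
! Enable flow accounting on the interface carrying the suspect traffic
Switch(config)# interface Vlan10
Switch(config-if)# ip route-cache flow
```

Once flows are exporting, `show ip cache flow` on the switch during the event window will show the top talkers even without a collector.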
05-28-2010 07:34 PM
We had an issue once with a LAN guy who was ghosting several servers. The ghosting program was running in multicast mode, and let me tell you, a single server doing this can bring a 6500 Sup 720 to its knees when it's not set up for multicast. Look at your interfaces for high multicast counts. Just something to check when it happens.
05-28-2010 07:45 PM
Nice war story you have there Glen (+5).