05-28-2010 02:08 AM - edited 03-06-2019 11:19 AM
I am trying to troubleshoot an issue where one of our core 4506s has very slow response times for a couple of minutes once a week, at exactly the same time.
Response times go from <1ms to 1.5s and then return to normal. This affects some servers/services more than others, with some losing connectivity completely.
The logs show nothing, and the processor usage doesn't really change.
TAC have looked at logs and can find no obvious issues.
There are no big backup jobs starting at this time (and I doubt a backup would kill the switch even if there were).
Could anyone suggest possible causes? We are really scratching our heads!
C4506 running Sup IV
cat4500-entservicesk9-mz.122-31.SG.bin
05-28-2010 03:48 AM
Hi,
How many modules are connected to this switch? Do the users or servers connected to a single module experience the latency problem, or do users and servers on all modules experience it?
Try to rule out hardware issues with the modules on the switch first.
HTH
Hitesh Vinzoda
Please rate useful posts.
05-28-2010 04:48 AM
Since this happens once a week at the exact same time, my guess would be that something is taking up the bandwidth. I don't think this is a problem with the physical switch, but don't rule that out. A couple of things to try:
Start the day before this happens, a couple of minutes before the time it happens. Say the problem occurs on Tuesdays at 9:30am and lasts until 9:35am: on Monday at 9:25am, clear the counters on all interfaces, then at 9:40am capture the show interfaces output. This gives you a baseline. On Tuesday, clear the counters again at 9:25am, capture the show interfaces output again at 9:40am, and compare the two to see if one interface stands out as getting much more traffic than the day before.
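A minimal sketch of that baseline capture on the switch CLI (the include filter just trims the output down to the counters you care about; adjust it to taste):

```
! Monday, 9:25am - zero all interface counters
Switch# clear counters
! Monday, 9:40am - capture the baseline and save it off-box
Switch# show interfaces | include (line protocol|input rate|output rate|drops)
! Tuesday - repeat at the same times, then diff the two captures
Switch# clear counters
Switch# show interfaces | include (line protocol|input rate|output rate|drops)
```

Capturing via a terminal logging session (or `show interfaces` redirected to a TFTP server) makes the Monday/Tuesday comparison easy to diff offline.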
I know this is very basic but it might lead you down a path.
You may also want to ask around whether there are any jobs at all running during that time frame. We tracked down a very similar issue here a few years back this way. Every day at the same time during lunch, one person had a scheduled task to go out and download some reports from a website (it was work related). She had been doing it manually for a while, so it would not always happen at the same time; once she found out how to do it automatically, we could finally track it down, because we saw the spike in traffic on her interface.
Mike
05-28-2010 04:00 PM
I have been doing some reading, and it appears that something is hitting the CPU when it should be processed in hardware.
I have been reading this:
http://www.cisco.com/en/US/products/hw/switches/ps663/products_tech_note09186a00804cef15.shtml#tools
and got the following stats out of the switch:
Switch#show platform cpu packet statistics
Queue Total 5 sec avg 1 min avg 5 min avg 1 hour avg
---------------------- --------------- --------- --------- --------- ----------
Esmp 2668726664 213 224 180 169
L2/L3Control 491666442 41 34 33 26
Host Learning 5402859 0 0 0 0
L3 Fwd High 4 0 0 0 0
L3 Fwd Medium 65 0 0 0 0
L3 Fwd Low 721198011 45 43 33 32
L2 Fwd High 369852 0 0 0 0
L2 Fwd Medium 252 0 0 0 0
L2 Fwd Low 171320570 8 6 6 5
L3 Rx High 1718602 0 0 0 0
L3 Rx Low 413167674 12 8 9 9
RPF Failure 3310 0 0 0 0
ESMP seems to be sending the most traffic to the CPU.
What is ESMP, and how do I find out more about it?
I will be setting up a SPAN session to the CPU and sniffing that. Does anyone have any more clues now?
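For reference, a CPU SPAN session on the 4500 is configured roughly as below. The session number and destination port are examples, and support for the `cpu` source keyword depends on the IOS release, so check it is available on 12.2(31)SG before relying on this:

```
! Mirror CPU-bound traffic to a port with a sniffer attached
! (session number and destination interface are placeholders)
Switch(config)# monitor session 1 source cpu
Switch(config)# monitor session 1 destination interface GigabitEthernet2/48
```

With a packet capture running on the destination port during the weekly event window, you can see exactly which frames are being punted to the CPU.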
Roger
05-28-2010 06:00 PM
You may want to check with people and ask if there are any jobs at all running during that time frame
I'm not criticizing your opinion, Mike (as a matter of fact, this deserves a +5), but I want to be cynical about the "people" around the owner of the thread. Either something happens that they are unaware of or unwilling to disclose: some kind of backup, a server trying to "phone home", or a misconfiguration of a server or host. I have seen a misconfigured server that flooded the network with broadcasts of some sort every time it ran its backup.
Get NetFlow or run link-utilization reports so you'll know what is causing your situation. You can also check the CPU of your supervisor engine, and check the logs of your switch, which can sometimes help (just make sure you enable interface logging).
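A minimal classic NetFlow sketch for IOS, assuming a collector is already listening (the collector address, port, and VLAN interface below are placeholders; note that on a Sup IV, NetFlow typically requires the NetFlow services daughter card to be installed):

```
! Export version 5 flow records to an external collector
Switch(config)# ip flow-export version 5
Switch(config)# ip flow-export destination 192.0.2.10 2055
! Enable flow accounting on the interface carrying the suspect traffic
Switch(config)# interface Vlan10
Switch(config-if)# ip route-cache flow
```

Once flows are exporting, `show ip cache flow` on the switch during the event window will show the top talkers even without a collector.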
05-28-2010 07:34 PM
We had an issue once with a LAN guy who was ghosting several servers. The ghosting program was running in multicast mode, and let me tell you, a single server doing this can bring a 6500 Sup 720 to its knees when it's not set up for multicast. Look at your interfaces for high multicast counts. Just something to check when it happens.
05-28-2010 07:45 PM
Nice war story you have there Glen (+5).