Cisco C6509 Sup720 Latency every half hour

GR5516 · ‎01-22-2020

Hello,

Looking for some advice on how to trace alleged latency on our core network.

Network comprises 2x C6509 with Sup720's, approx 50 VLAN's - the 6509's configured in HSRP for each VLAN.

A particular VLAN which uses a specialised application brought to our attention an application which appears to disconnect every 30mins, exactly on the hour and half-hour. Looking at the app further, it does appear to do this, so we started looking at the core switches.

I realise that using ICMP to the L3 SVI interface is not a great indication to anything, nonetheless doing this showed up a strange behaviour - baseline ping to any L3 SVI appears to be in or around the 1-5ms range, however at the top of the hour this rises to 500-1000ms for a few seconds then drops back again. Using a batch file to save ICMP responses, I have noticed this behaviour on all L3 SVI's that I tested it on. Trying the same on a machine behind the SVI on an access switch shows, in general, just one dropped ICMP (on the hour and half-hour also) before going back to normal.

The application that is having the issue is old, and appears to have a maximum retry attempt of 3, so our theory right now is that this peak on the hour and half-hour on the cores might have something to do with it. Even if it does not, it would certainly be interesting to find out what exactly is causing this behaviour. Judging by the behaviour (see sample ICMP log attached), it must be man made, but my question is how should I go about tracing this.

I have, so far, tried the following:-

1. Disabled most SNMP logging on the core switches, and changed the SNMP access list to log SNMP packets - thought the culprit might be a large SNMP query - this does not appear to be the case.

2. Looked at the HSRP configuration and HSRP failover stats for each VLAN - no failovers or any other HSRP behaviour that happens at the exact time of the above.

3. Ran an ICMP to the virtual HSRP GW address, and both HSRP interface addresses in tandem - this exhibited the same behaviour across all three addresses on the hour and half hour.

4. Ran a ndebug netdr, but cannot see any different packets or activity at the exact time that could account for this.

I'm wary of debugging too much, as already lastnight I made the mistake of debugging HSRP terse on one of the cores, thinking that the buffered logging configured would save me from any high CPU, but my telnet to the core came under pressure so disabled this logging again.

It would appear to be some burst of traffic, and given the clockwork nature of the event I'd hope to be able to identify its source, but so far have not.

Could anybody outline a suitable troubleshooting strategy to determine the cause of this, or indeed any thoughts on it would be most welcome.

Kind regards

Ger

PS - log attached is a ping to the HSRP virtual address of one VLAN (on one of the cores)

Mark Malone · ‎01-22-2020

Hi
ICMP is controlled in COPP in the 6500 , if the switch is under pressure in any way it can be delayed or dropped thats why its not a valid test to run ICMP to an SVI as its logical int on the switch so testing through the switch end to end is recommended using something like ip sla

saying that though this still looks way to much of a jump so id say something happened here either on the switch or in path to it from the device running the ping , did you try an capture this as it was happening
22/01/2020 10:30:01 Reply from 10.121.173.254: bytes=32 time=539ms TTL=255

if though after your tests you think its burst traffic you can use wireshark to see it , if you dont have any LAN software that can graph issue like that , as an option

https://www.cisco.com/c/en/us/support/docs/lan-switching/switched-port-analyzer-span/116260-technote-wireshark-00.html

is the cpu spiking at all , what does the show proc cou sorted show and the show proc cpu history
You can run a script to capture if the cpu is getting hammered at times and dump it into flash something like the below but localized to the switch

event manager session cli username "MMALONE"
event manager applet HIGH_CPU
event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.3.1 get-type exact entry-op ge entry-val "80" poll-interval 30
action 0.0 syslog msg "High CPU Detected"
action 0.1 cli command "enable"
action 0.2 cli command "show clock | append bootflash:HIGH_CPU.txt"
action 0.3 cli command "debug platform packet all buffer | append bootflash:HIGH_CPU.txt "
action 0.4 cli command "show logging | append bootflash:HIGH_CPU.txt "
action 0.5 cli command "show process cpu sorted | append bootflash:HIGH_CPU.txt "
action 0.6 cli command "show process cpu history | append bootflash:HIGH_CPU.txt"
action 0.7 cli command "show buffers| append bootflash:HIGH_CPU.txt"
action 0.8 cli command "show platform health | append bootflash:HIGH_CPU.txt "
action 0.9 cli command "show platform cpu packet statistics all | append bootflash:HIGH_CPU.txt "
action 1.0 cli command "show platform cpu packet buffered | append bootflash:HIGH_CPU.txt "
action 1.1 cli command "debug platform packet all count | append bootflash:HIGH_CPU.txt "
action 1.2 cli command "show spanning-tree detail | append bootflash:HIGH_CPU.txt "
action 1.3 cli command "u all"

Check also that STP is stable and there is not a lot of changes occurring at layer 2 constantly aswell mass amounts of TCN changes could cause issues with switch processing traffic as its under pressure trying to rectify STP issues

show spanning-tree detail | inc ieee|occurr|from|is exec

Capture teh show tech while its high or issue is occurring and you could also run it through the Cisco Cli analyzer see if that sees anything

GR5516 · ‎01-23-2020

Hi Mark,

Thanks for the reply.

The CPU's run approx 50% presently, and there in deed could be a spike in processor, which I've looked for but cannot pin it on the time in question. The clockwork nature, and it happening every half-hour on the dot is the part that bothers me - it doesn't sound like CPU (although clearly it could).

There are indeed some TCN changes - I'm in the process of ironing these out, but I don't think their levels are such to explain this behaviour, so I'll probably look at doing a monitor session with wireshark for one of the VLAN's in question.

Many thanks again.

Ger

GR5516 · ‎01-23-2020

Mark,

Just a question regarding the script you listed in the previous post - I see it does a platform packet debug.

This being our core switches, I'm loath to do much in the way of loading it lest it fails; as we've redundant links to many of our closets, there are some that don't and I'd not want the switch to start really dropping packets. Is the script safe to run during production, or should a maintenance window be gotten instead?

Thanks,

Ger

Mark Malone · ‎01-23-2020

Thats just an example , you can change each command to what you need , there scripts i got from TAC and altered or saved when chasing strange cpu issues on some switches ,that was off an ios-xe 3k switch , this one below is from a ios switch , but you can insert whatever commands you want and remove what you dont need , in the actions section of the script , keep the process and logging commands though they will be useful

you could remove the platform commands and replace with that STP command , that would be good to collect while CPU is spiking , also you could take a show tech while the issue is hot , would be useful as Cisco cli analzyer can decode HW and SW cpu issues too , you can dump the file into it and it can give a report back , its free tool on page below

useful to check this before opening a TAC case if you cant pin point it

https://www.cisco.com/c/en/us/support/web/tools-catalog.html

event manager applet High_CPU
event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.4.1 get-type exact entry-op ge entry-val "70" exit-time 10 poll-interval 5
action 0.1 syslog msg "CPU Utilization is high"
action 0.2 cli command "enable"
action 0.4 cli command "show log | append flash:CPU_Profile.txt"
action 0.5 cli command "show process cpu sorted | append flash:CPU_Profile.txt"
action 0.6 cli command "show interfaces | append flash:CPU_Profile.txt"
action 0.7 cli command " show ip cef switching stat | append flash:CPU_Profile.txt"
action 0.8 cli command " show ip traffic | append flash:CPU_Profile.txt"
action 0.9 cli command " show int switching | append flash:CPU_Profile.txt"
action 1.0 cli command "no event manager applet High_CPU"
action 1.1 cli command "end"

GR5516 · ‎01-23-2020

Mark,

A quick final question regarding your attached script if I may. It appears I don't have some of the commands required in the script - for example I don't have 'sh platform cpu' or 'sh platform health'. The IOS currently running on the 720's is AdvancedIPServices 12.2(33)SXJ8. Are there alternatives for this version, or should I investigate upgrading the IOS?

Thanks again,

Ger

Mark Malone · ‎01-23-2020

no you dont need to upgrade , that was just an example i should of explained to change it to suit your needs yesterday , there mostly ios-xe

the main one you need is the processor cpu commands / show log and spanning tree

When cpu hits 70% it will run the commands and pull that info , from there we should be able to see what process is causing the spike , then the script can be removed when not needed

GR5516 · ‎01-23-2020

Mark,

Thank you for your comments - I'll run with some of that and see how I get on.

Thank you again for your time.

Regards

Ger

Mark Malone · ‎01-23-2020

ye no problem let me know how you get on if it captures anything useful in the proc cpu output

GR5516 · ‎01-30-2020

Hello,

Just a quick update on the above problem....

After implementing the applet to log CPU usage, there is indeed CPU spikes over 70% being logged with the ARP process going to approx 35%, and the IP process to 28%.

This appears to (mostly) correspond with the packet drops experienced.

As it is a large network, it is hard to get anything concrete from netdr, or even wireshark, but what I have seen is quite a lot of IPV6 multicast solicit broadcasts to 33.33.x.x.x.x during and around every hour and half-hour. I have tried to limit these through control plane policing but have as yet been unsuccessful in eradicating the overall problem. What I think is happening is that these packets are so numerous that other traffic is getting sidelined, for the 500ms or 1 second length they are being transmitted, and seeing as the netdr capture shows mostly ARP (0x806) and IPV6 (86DD) packets, it appears these are being punted to the rp CPU.

On our busiest vlan, the arp level is very high, and I'll next need to trace why this is - as per the vlan switching stats in the pic below.

I'll update here if/when I find anything further.

Thanks again for the assistance here, and any further suggestions most welcome.

Mark Malone · ‎01-30-2020

Hi
are your static routes pointing to a next hop ip address and not a interface , if its an int liek g1/0/1 this can cause huge ARP spikes in CPU

when you collected that show proc sorted , was the interrupt traffic high its this number below thats 1% or was that low , if its high could have alignment issues too or software issue which can also cause issues

sh proc cpu sorte
CPU utilization for five seconds: 9%/1%; one minute: 10%; five minutes: 10%

GR5516 · ‎01-30-2020

Hello Mark,

No - the interrupt utilisation stays pretty steady in and around the 10-15% mark - your script was returning figures such as 75%/14%, 83%/12%, and the like. The top 3 processes in every case of high processor were ARP and IP Process - cant recall offhand what the third was, but it was quite low - around the 4% mark.

There are certainly a lot of static routes on the core switches, as the government WAN we are connecting to has not fully updated their routing protocols as yet. Nonetheless, I have gone through each route on the cores, and none point to an interface - they all generally point to the HSRP gateway address on our edge routers.

Thank you

Mark Malone · ‎01-30-2020

Hi
Ok cool that sounds right in both points , have you tried to set the application packets into QOS to see if that prevents the app dropping off , even if CPU spikes to that high as long as it doesn't sit at it for days running at say 85% it shouldn't really cause a drop in traffic on the switch ,ive seen before when really high it can get sluggish but usually your mgmt traffic will be dropped first to make sure the prod traffic is ok and can get through

Just wondering if some OQS may allow the app tats sensitive to go un impeded , we have some very sensitive apps here too and we drop them in the AF41 video traffic queue as we used to issues when it was not prioritized , just an option if not already in place

GR5516 · ‎01-30-2020

Its a good point - I had a look at the application log already for the system that initially brought this up. It does time out on the half-hour, as described previously, and I can see the app trying to do retries. Looks like it will only do 3 retries - and we need maybe 4 to get over the hum of the spike. I'm going to talk to the application people to see if the retry time can be extended.

Nonetheless, I'm going to keep trying to get to the bottom of this spike, if nothing else, but to understand if possible where it is coming from. You are right in that the core CPU never really maxes out, except for the odd 99%, and with that the interrupt side is never past 20%, so the thing isn't on the edge, yet at least......

Taking your earlier comments into account, I have tried end to end traffic through the core, from one vlan to another - this does drop 1 to 2 packets on the half hour in tandem with the spike, but as ICM is low priority this could be expected. I'll try an actual application once I get a chance. Having said all that, with the exception of the app above, nobody else is really complaining so would assume all other apps are fine....

Thank you again.

Regards

Ger

Mark Malone · ‎01-31-2020

Hi

i agree i would say this is app specific and some tweaking on app side and switch side with some constant priority may stop the drop

when you span the port say to the app , at the half hour drop do you see and break in the application sequence , the app guys should be able to tell you what the app expects to see how it works and whats formation it should occur and the order of sequence , if thats out of order it may show why in the wireshark at that 30 minute drop , maybe apps not receiving something it should or receiving it in wrong formation, it may not be the switch at all and somethings happening at the app level , thats getting very detailed but thats why it may not have been seen yet as focusing on cpu issue that may or may not be the cause but doesn't look high enough or constant enough to cause it , if cpu on 6509 get hit hard usually everything takes a hit with it not 1 app , it cant really make that call at that stage as its screwed itself and mostly likely in some type hung state with processes

anway just thoughts of what else could try if getting to bottom of spikes doesn't help