01-22-2020 04:02 AM
Hello,
Looking for some advice on how to trace alleged latency on our core network.
The network comprises two C6509s with Sup720s and approximately 50 VLANs; the 6509s run HSRP for each VLAN.
Users on a particular VLAN running a specialised application reported that it disconnects every 30 minutes, exactly on the hour and half-hour. Looking at the app further, it does indeed do this, so we started looking at the core switches.
I realise that ICMP to an L3 SVI is not a great indicator of anything, but doing this nonetheless showed up some strange behaviour - baseline ping to any L3 SVI sits in the 1-5ms range, but at the top of the hour this rises to 500-1000ms for a few seconds before dropping back again. Using a batch file to save the ICMP responses, I have seen this behaviour on every L3 SVI I tested. Trying the same against a machine behind the SVI on an access switch shows, in general, just one dropped ICMP (also on the hour and half-hour) before returning to normal.
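Incidentally, the same measurement can be taken from the switch itself with an IP SLA probe, which avoids the batch file. A minimal sketch, assuming the 'ip sla' syntax on this train; 10.1.10.1 is just a placeholder for the peer core's SVI address:

ip sla 10
 icmp-echo 10.1.10.1
 frequency 5
!
ip sla schedule 10 life forever start-time now
!
! results: show ip sla statistics 10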
The application having the issue is old and appears to have a maximum of 3 retry attempts, so our current theory is that this peak on the hour and half-hour on the cores may have something to do with it. Even if it does not, it would certainly be interesting to find out exactly what is causing the behaviour. Judging by the pattern (see sample ICMP log attached), it must be something scheduled rather than organic - my question is how I should go about tracing it.
I have, so far, tried the following:
1. Disabled most SNMP logging on the core switches and changed the SNMP access list to log SNMP packets - I thought the culprit might be a large SNMP query, but this does not appear to be the case (config sketched below the list).
2. Looked at the HSRP configuration and HSRP failover stats for each VLAN - no failovers or any other HSRP behaviour that happens at the exact time of the above.
3. Ran an ICMP to the virtual HSRP GW address, and both HSRP interface addresses in tandem - this exhibited the same behaviour across all three addresses on the hour and half hour.
4. Ran a netdr capture (debug netdr - commands also sketched below), but cannot see any unusual packets or activity at the exact time that could account for this.
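For reference, the config and commands behind items 1 and 4 look roughly like this (a sketch - the community string and ACL number are placeholders):

! item 1 - log inbound SNMP queries matching the community ACL
access-list 10 permit 10.0.0.0 0.255.255.255 log
snmp-server community MYCOMMUNITY ro 10

! item 4 - netdr captures punted packets in software; regarded as safe on a Sup720
debug netdr capture rx
show netdr captured-packets
debug netdr clear-capture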
I'm wary of debugging too much: last night I made the mistake of running a terse HSRP debug (debug standby terse) on one of the cores, thinking the buffered logging configured would save me from any high CPU, but my telnet session to the core came under pressure, so I disabled that logging again.
It would appear to be some burst of traffic, and given the clockwork nature of the event I had hoped to be able to identify its source, but so far I have not.
Could anybody outline a suitable troubleshooting strategy to determine the cause of this, or indeed any thoughts on it would be most welcome.
Kind regards
Ger
PS - log attached is a ping to the HSRP virtual address of one VLAN (on one of the cores)
01-22-2020 05:56 AM
01-23-2020 01:55 AM
Hi Mark,
Thanks for the reply.
The CPUs run at approx 50% presently, and there could indeed be a processor spike; I've looked for one but cannot pin it to the time in question. The clockwork nature - it happening every half-hour on the dot - is the part that bothers me; it doesn't feel like general CPU load (although clearly it could be).
There are indeed some TCNs (topology change notifications) - I'm in the process of ironing those out, but I don't think their levels are enough to explain this behaviour, so I'll probably set up a monitor session with Wireshark for one of the VLANs in question (sketched below).
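Something along these lines is what I have in mind for the monitor session (VLAN 110 and Gi1/1 are placeholders for the VLAN in question and the port the Wireshark machine sits on):

monitor session 1 source vlan 110 rx
monitor session 1 destination interface GigabitEthernet1/1

! verify with: show monitor session 1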
Many thanks again.
Ger
01-23-2020 01:58 AM
Mark,
Just a question regarding the script you listed in the previous post - I see it does a platform packet debug.
These being our core switches, I'm loath to load them up in case something fails; while we have redundant links to many of our closets, some have none, and I'd not want the switch to start dropping packets in earnest. Is the script safe to run in production, or should I wait for a maintenance window?
Thanks,
Ger
01-23-2020 03:06 AM
01-23-2020 02:48 AM
Mark,
A quick final question regarding your attached script, if I may. It appears I don't have some of the commands the script requires - for example, 'sh platform cpu' and 'sh platform health' are not recognised. The IOS currently running on the 720s is Advanced IP Services 12.2(33)SXJ8. Are there alternatives for this version, or should I investigate upgrading the IOS?
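In the meantime, these seem to be the nearest equivalents I do have on SXJ (just from poking at the CLI, so I may be missing something):

show processes cpu sorted             ! total/interrupt figures plus top processes
show processes cpu history            ! 60-second/60-minute/72-hour CPU graphs
show platform hardware capacity cpu   ! SP and RP CPU utilisation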
Thanks again,
Ger
01-23-2020 04:14 AM
01-23-2020 07:12 AM
Mark,
Thank you for your comments - I'll run with some of that and see how I get on.
Thank you again for your time.
Regards
Ger
01-23-2020 08:53 AM
01-30-2020 05:22 AM
Hello,
Just a quick update on the above problem....
After implementing the applet to log CPU usage, there are indeed CPU spikes over 70% being logged, with the ARP process going to approx 35% and the IP Input process to 28%.
This appears to (mostly) correspond with the packet drops experienced.
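For anyone following along, the logging applet is along these lines (my paraphrase rather than the exact script - the CPU OID, the 70% threshold and the log file name are our choices):

event manager applet LOG-HIGH-CPU
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.6.1 get-type exact entry-op ge entry-val 70 poll-interval 5
 action 1.0 cli command "enable"
 action 2.0 cli command "show clock | append disk0:cpu-spike.txt"
 action 3.0 cli command "show processes cpu sorted | append disk0:cpu-spike.txt"
 action 4.0 syslog msg "CPU over 70 percent - process list appended to disk0:cpu-spike.txt"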
As it is a large network it is hard to get anything conclusive from netdr, or even Wireshark, but what I have seen is a lot of IPv6 neighbour-solicitation multicasts to MAC addresses in the 33:33:xx:xx:xx:xx range during and around every hour and half-hour. I have tried to limit these through control-plane policing but have so far been unable to eradicate the overall problem. What I think is happening is that these packets are so numerous that other traffic is sidelined for the 500ms-1s during which they are transmitted; since the netdr capture shows mostly ARP (0x0806) and IPv6 (0x86DD) packets, it appears these are being punted to the RP CPU. The CoPP attempt is sketched below.
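This is roughly the CoPP config I have been trying (a sketch - the policing rate is a guess I'm still tuning, and I've yet to confirm how much of the IPv6 matching the PFC handles in hardware versus software on this train):

ipv6 access-list IPV6-ND-SOLICIT
 permit icmp any ff02::1:ff00:0/104
!
class-map match-any COPP-IPV6-ND
 match access-group name IPV6-ND-SOLICIT
!
policy-map COPP
 class COPP-IPV6-ND
  police 512000 16000 conform-action transmit exceed-action drop
!
control-plane
 service-policy input COPP
!
! ARP punts can also be rate-limited separately, e.g.:
! mls qos protocol arp police 32000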
On our busiest VLAN the ARP level is very high, and tracing why will be my next step - see the VLAN switching stats in the attached image; a couple of the commands I plan to use are below.
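To put numbers on the ARP load I'm going to sample the ARP counters before and after a spike, along these lines (assuming the count filter is available on this train):

show ip traffic | begin ARP    ! ARP request/reply counters since boot
show ip arp | count Internet   ! rough size of the ARP table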
I'll update here if/when I find anything further.
Thanks again for the assistance here, and any further suggestions most welcome.
01-30-2020 06:04 AM
01-30-2020 06:49 AM
Hello Mark,
No - the interrupt utilisation stays pretty steady, in and around the 10-15% mark; your script was returning figures such as 75%/14%, 83%/12%, and the like (total/interrupt), which suggests the load is process-switched traffic rather than interrupt-level forwarding. The top two processes in every high-CPU sample were ARP Input and IP Input; I can't recall offhand what the third was, but it was quite low - around the 4% mark.
There are certainly a lot of static routes on the core switches, as the government WAN we connect to has not yet fully migrated its routing protocols. Nonetheless, I have gone through each route on the cores and none points at an interface - they all point at the HSRP gateway address on our edge routers - so the boxes shouldn't be ARPing for every destination on account of an interface route.
Thank you
01-30-2020 08:28 AM
01-30-2020 12:30 PM
It's a good point - I had already looked at the application log for the system that first raised this. It does time out on the half-hour, as described previously, and I can see the app attempting retries. It looks like it will only do 3 retries - and we'd need maybe 4 to get over the hump of the spike. I'm going to talk to the application people to see if the retry count or interval can be extended.
Nonetheless, I'm going to keep trying to get to the bottom of this spike, if nothing else than to understand where it is coming from. You are right that the core CPU never really maxes out, bar the odd 99%, and even then the interrupt side never passes 20%, so the box isn't on the edge - yet, at least.
Taking your earlier comments into account, I have tried end-to-end traffic through the core, from one VLAN to another - this does drop 1 or 2 packets on the half-hour, in tandem with the spike, but as ICMP is low priority to the CPU this is to be expected. I'll try an actual application once I get the chance. Having said all that, with the exception of the app above nobody else is really complaining, so I'd assume all other apps are fine.
Thank you again.
Regards
Ger
01-31-2020 03:52 AM
Hi
I agree - I would say this is app-specific, and some tweaking on the app side and the switch side (perhaps giving the app's traffic a constant priority) may stop the drops.
When you SPAN the port towards the app, do you see a break in the application's message sequence at the half-hour drop? The app team should be able to tell you what the application expects to see, how it works, and in what order its exchanges should occur. If something arrives out of order, the Wireshark capture at that 30-minute mark may show why - perhaps the app isn't receiving something it should, or is receiving it in the wrong form. It may not be the switch at all; something could be happening at the application level. That's getting very detailed, but it may be why this hasn't been spotted yet - the focus has been on a CPU issue that may or may not be the cause, and the CPU doesn't look high enough, or sustained enough, to explain it. When a 6509's CPU gets hit hard, usually everything takes a hit with it, not just one app - at that point the box can't really make that kind of selective call, as it has effectively wedged itself and most likely has processes in some kind of hung state.
Anyway, just some thoughts on what else to try if getting to the bottom of the spikes doesn't help.