Re: SG300-20 slow response to ICMP/SNMP directly after SNTP update

hanskbakke · ‎06-06-2013

Hello

I have a strange problem with an SG300-20 switch. I use check_mk (together with Icinga) to monitor this switch using both ICMP (the check_icmp test) and SNMP. The issue is the same even if only ICMP pings are used for monitoring, so the SNMP part is not important.

My issue is that at least once an hour this switch "fails": it takes several seconds to respond to the ICMP pings, which is more than the upper limit set for the check. check_icmp is quite aggressive and sends 5 pings with more or less no delay between them.

# The path (gigabit on all connections)

icinga -> gw_server -> Cisco SG300-20

Since none of my other monitored devices (no more SG300s), including a SB300-08 further out in the same subnet hanging off the SG300-20 and an old printer, have any issues with these checks, I began to investigate. The switching of traffic itself is unaffected; it seems to be just the management interface.

I enabled full traffic capturing using tcpdump on the gateway and correlated the traces in Wireshark with the alert timestamps to see what was really going on. There I could clearly see that every time the switch failed a check or timed out, it was directly after an SNTP update (the Icinga server is also the NTP server). Normally the switch responds to the 5 ICMP echo requests immediately, but if the check directly followed an SNTP update it would delay answering for several seconds.
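The correlation step described above can be sketched roughly as follows. This is only an illustration, assuming the NTP exchange times and the ping round-trip times have already been exported from the capture as plain numbers; the `Ping` type, the `slow_pings_after_ntp` function and the `window` and `rtt_limit` thresholds are made-up names and values, not part of check_icmp or Wireshark.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Ping:
    sent: float  # capture timestamp of the echo request, in seconds
    rtt: float   # time until the echo reply was seen, in seconds


def slow_pings_after_ntp(ntp_times, pings, window=5.0, rtt_limit=1.0):
    """Return (ping, follows_ntp) for every ping slower than rtt_limit.

    follows_ntp is True when the ping was sent at most `window` seconds
    after the most recent NTP exchange. Both thresholds are hypothetical
    and should be tuned to the actual check settings.
    """
    ntp_sorted = sorted(ntp_times)
    flagged = []
    for ping in pings:
        if ping.rtt < rtt_limit:
            continue  # normal reply, not interesting
        # Index of the first NTP exchange strictly after the ping was sent.
        idx = bisect_right(ntp_sorted, ping.sent)
        previous = ntp_sorted[idx - 1] if idx else None
        follows_ntp = previous is not None and ping.sent - previous <= window
        flagged.append((ping, follows_ntp))
    return flagged
```

If nearly every flagged ping comes back with `follows_ntp` set, that points at the same pattern seen by eye in Wireshark. It shows correlation only and says nothing about the underlying cause.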

There were no errors in the logs and the CPU usage on the switch is next to none (I have not been able to check whether there are any spikes for a few seconds just after the NTP update). The NTP server updates cause no issues with other devices, including the SB200-08.

My guess is that NTP processing is blocking the management CPU for a few seconds, but this is just a wild guess.

The firmware version on the SG300-20 when I first noticed this issue was 1.3.0.59; I upgraded to 1.3.0.62 with no improvement in behaviour.

hanskbakke · ‎06-06-2013

I forgot one crucial bit of information. Based on the findings in the trace I deactivated SNTP updates on the SG300-20. It has not failed a single check in 12 hours, whereas before it would fail at least once an hour.

In other words, there is no question that the sporadic issue is related to the SNTP updates on the switch.

Note: SB300-08 and SB200-08 are really SG200-08. It was just a bit too early in the morning for me.

Tom Watts · ‎06-07-2013

Hans, can you give a configuration example? Right now I have just set up a unicast SNTP server from the internet, 64.90.182.55. Would this be sufficient to reproduce your findings?

Edit: so far I've not had any packet loss or issues.

-Tom
Please mark answered for helpful posts

hanskbakke · ‎06-08-2013

After having tried everything, including a firmware upgrade, pulling the power, rebooting and so on, without success, I disabled all NTP settings and re-enabled the single unicast poll to the internal NTP server. I noticed the offset was quite large, so I disabled and re-enabled it several times, which seemed to reduce the offset.

However, the issue has not reappeared since. The only real difference is that I now use the FQDN of the host instead of the IP, but I doubt this is related. The packet capture looks the same and involves the same IPs, except that the switch no longer blocks for several seconds.

Could this have been a case of too large a time offset, which the switch struggled to close?

I will post here again if the issue reappears.

Tom Watts · ‎06-08-2013

Hi Hans, I do not have an answer for you yet. Using a random NTP server from the internet with very basic settings (SNTP enabled, the server defined and polling enabled), I have not dropped any packets since I posted yesterday. My server status is up and the clock status is synchronized.

Obviously the difference between us is that you have an internal time server and I do not. That shouldn't make any difference, of course, but it is worth noting since I can't reproduce the issue using a public time server over the web.

-Tom

hanskbakke · ‎06-08-2013

It could be a compatibility issue with the server, but I have used this setup in many scenarios without issue. The server has not been restarted between the previous test with issues and the currently working setup.

The server is a physical Debian Stable (Wheezy) server running ntp installed via apt-get. The /etc/ntp.conf rule controlling access for the switches is:

restrict -4 default kod notrap nomodify nopeer noquery

This is the default configuration for the ntp server in Debian Wheezy.

This is my intended (and currently running) configuration:

(Note that this was taken while things are working, as they currently are. Except for the server being given as the IP 10.0.7.10 rather than the FQDN, this _should_ be the same as when I had issues, but that is not guaranteed.)

clock timezone " " 1
clock summer-time web recurring eu
clock source sntp
sntp unicast client enable
sntp unicast client poll
sntp server ntp.proikt.com poll

sw1#show sntp status
Clock is synchronized, stratum 3, reference is ntp.proikt.com, unicast
Unicast servers:
Server : ntp.proikt.com
Source : Static
Stratum : 3
Status : up
Last Response : 17:03:21.0 web Jun 8 2013
Offset : 318.4142654 mSec
Delay : 0 mSec
Anycast server:
Broadcast:

sw1#show sntp configuration
SNTP destination port : 123 .
Polling interval: 1024 seconds.
No MD5 authentication keys.
Authentication is not required for synchronization.
No trusted keys.
Unicast Clients: Enabled
Unicast Clients Polling: Enabled
Server : ntp.proikt.com
Polling : Enabled
Encryption Key : Disabled
Broadcast Clients: disabled
Anycast Clients: disabled

I have added the actual packet capture showing the issue from the perspective of the gateway. 10.0.0.10 is the switch, and 10.0.7.10 is the monitoring host and NTP server. You will see that after an NTP request/response the echo request/reply pattern goes from alternating between the switch and the monitoring host to 5 requests, a couple of seconds of delay, and then the replies.

The capture might be useful to see if there is something there I don't see.

The tcpdump filter was 'host 10.0.0.10', so all traffic sent directly to or from the switch should be included, but nothing else.

hanskbakke · ‎06-08-2013

I have now changed it back to use only the IP just to check.

hanskbakke · ‎06-17-2013

I ran the system using only the IP, and then went back to using the FQDN again, and I have had no issues for the last week.

In other words clearing all NTP settings and ensuring the time difference was small seemed to clear all issues.

Sadly I have no conclusions to draw. Even if something was not configured correctly, it should not have behaved like it did. But as the issue can no longer be reproduced, I am happy.

hanskbakke · ‎11-04-2013

After adding another trunk interface on my gateway router, this switch started to misbehave again, once again directly after NTP requests. This time, after a day of intense troubleshooting and packet capturing, I found the actual cause of this strange behaviour.

It was actually caused by some (to me) strange default settings in the Linux kernel. This matters because the router/gateway that sits between the switch and the monitoring software is running a modern Debian installation.

The problem was caused by my having several interfaces on the Linux router that were, in a subtle way, in the same VLAN. One interface is a dedicated access port at both ends; this is the LAN on which the switch is reached. But I also have another interface on the router that is a trunk port. I am not using the default VLAN at all, so it is not routed or allowed through the firewall in any way. It isn't even added in the network configuration; it is just there as a consequence of the other active VLANs on the Linux interface.

Because I have legacy switches that do not support having their management IP on anything other than VLAN1, I have to keep VLAN1 as the management interface on the SG300, which is also the default (native) VLAN.

The fun part is that the switch would normally use the dedicated interface for management traffic to or from the other subnets, but when NTP was triggered it sent an ARP request that was answered by the default VLAN on the trunk interface of the Linux server, even though the address is only active on the dedicated access port. This of course made the NTP request fail, as this VLAN interface is not usable on the gateway, and it also sent the replies to the management traffic to the same dead-end interface for a short period, until the switch updated itself with the correct interface address again. This relatively short period of having the wrong ARP mapping is why the issue was only sporadic, and it is also why I saw the switch being unable to use NTP properly.

For some reason, announcing all IPs of all interfaces on all interfaces in the same VLAN/network is the standard behaviour of the Linux kernel, which I find very strange as it subtly breaks all but the most basic network configurations.

The fix is easy on the Linux router:

# Stop ARP from replying to and announcing all addresses on all interfaces (the kernel default)
net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.all.arp_announce=2
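To check which values actually apply on each interface of the router, something like the sketch below could be used. It assumes the usual /proc/sys/net/ipv4/conf layout on Linux and relies on the kernel documentation's statement that the larger of the "all" and per-interface values is used for both settings. The function name `effective_arp_settings` is made up for this example.

```python
from pathlib import Path


def effective_arp_settings(conf_root="/proc/sys/net/ipv4/conf"):
    """Return {interface: {"arp_ignore": n, "arp_announce": n}}.

    The kernel uses max(conf/all/X, conf/<interface>/X) for both
    arp_ignore and arp_announce, so that is what is reported here.
    """
    root = Path(conf_root)

    def read(iface, name):
        return int((root / iface / name).read_text().strip())

    all_ignore = read("all", "arp_ignore")
    all_announce = read("all", "arp_announce")

    settings = {}
    for entry in sorted(root.iterdir()):
        # "all" and "default" are pseudo entries, not real interfaces.
        if not entry.is_dir() or entry.name in ("all", "default"):
            continue
        settings[entry.name] = {
            "arp_ignore": max(all_ignore, read(entry.name, "arp_ignore")),
            "arp_announce": max(all_announce, read(entry.name, "arp_announce")),
        }
    return settings


if __name__ == "__main__" and Path("/proc/sys/net/ipv4/conf").is_dir():
    for iface, values in effective_arp_settings().items():
        print(iface, values)
```

With the two sysctl lines above in effect, every interface should report arp_ignore of at least 1 (reply only when the target address is configured on the interface that received the request) and arp_announce of 2 (always use the best local source address for the target).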

In other words, the SG300 was not to blame, even though it was where the symptoms became visible.

Tom Watts · ‎11-23-2013

Hi Hans, this bug was reproduced by the development and engineering teams and is confirmed fixed in the 1.3.5 software. If you upgrade to 1.3.5 and experience anything similar, please share.

-Tom
