We have a large number of branches with Cisco 887VAG routers. They are set up with two IP SLA monitors, which are tracked such that if both IP SLAs fail, the router switches from ADSL to its 3G connection. IP SLA 1 points to a device on our ISP's network and IP SLA 2 points to a device on the Internet. The idea is that a single failure does not necessarily indicate an IP connectivity issue, since that particular device could simply be down, but if both fail we can be more confident that there is a problem with connectivity over ADSL and that we should fail over to 3G.
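To make the failover rule concrete, here is a minimal Python sketch of the decision logic. It is only a model of the rule, not router configuration, and the names `ProbeResult`, `should_fail_over`, and the probe labels are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProbeResult:
    """Outcome of one ICMP probe (hypothetical model of an IP SLA monitor)."""
    name: str
    reachable: bool


def should_fail_over(probes: Sequence[ProbeResult]) -> bool:
    """Return True only when every probe has failed.

    The primary (ADSL) path is treated as healthy while at least one
    probe target still answers, so a single dead target is not enough
    to trigger a switch to the backup (3G) path.
    """
    if not probes:
        # No probes configured: no evidence of a fault, so stay on ADSL.
        return False
    return not any(p.reachable for p in probes)


# One target down, one still answering: stay on ADSL.
isp = ProbeResult("sla1-isp-device", reachable=True)
internet = ProbeResult("sla2-internet-device", reachable=False)
print(should_fail_over([isp, internet]))  # False
```

With this rule, the event described in this thread (only the second target going silent) would not, on its own, cause a failover.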
We also monitor the IP SLAs via our monitoring station (using SNMP). This morning, at exactly the same time, we noticed that IP SLA 2 started failing across a large number of devices (but by no means all of them).
Sure enough, on logging into the routers concerned, we could see that the IP SLA was showing a timeout, and attempts to ping the address used by the SLA monitor also timed out. Previously, this sort of thing has been an indication of a problem with our ISP's Internet transit. On this occasion, however, everything else seemed fine apart from ICMP communications with this particular address. We could ping other Internet addresses without issues and our sites were not reporting any comms issues.
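The reasoning here (one target silent while other Internet hosts still answer) can be written down as a rough triage rule. This is a minimal Python sketch under stated assumptions: the function name `triage_icmp_failure` and the labels it returns are hypothetical, and the result only narrows down where to look. It does not identify a cause.

```python
from typing import Sequence


def triage_icmp_failure(target_ok: bool, other_hosts_ok: Sequence[bool]) -> str:
    """Rough triage of a failed ICMP probe (hypothetical labels).

    target_ok      -- did the monitored target answer ping?
    other_hosts_ok -- ping results for unrelated Internet hosts
    """
    if target_ok:
        return "no fault"
    if not other_hosts_ok:
        # Nothing to compare against, so no conclusion can be drawn.
        return "inconclusive"
    if all(other_hosts_ok):
        # Only the one target is silent: look at the target itself or the
        # path to it rather than at Internet transit as a whole.
        return "target-specific"
    if not any(other_hosts_ok):
        return "general Internet reachability problem"
    return "partial reachability problem"


# The situation in this thread: the SLA target is silent, other hosts answer.
print(triage_icmp_failure(False, [True, True, True]))  # target-specific
```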
I logged the issue with the ISP, but they say they are not blocking any traffic.
I tried rebooting a few routers, but they still could not get an ICMP response from this particular Internet address.
I've ended up changing the address used by the IP SLA 2 monitor, but I'm worried the same thing could happen again, and I'm also curious as to what could be "blocking" the ICMP traffic in the first place. I can't see how anything at our end (or on the routers themselves) could be at fault but, at the same time, the ISP is insistent that there are no issues and that they are not interfering with the ICMP traffic.
Does anyone have any idea what the cause of such an issue could be? I guess it could be the Internet server that we are monitoring (we don't own it, so I guess technically/morally we shouldn't be using it!) that has decided to block ICMP from some of our addresses, but I just can't see them doing that (it's a publicly accessible DNS server, and I'm sure you can probably guess which one!).
I think you've just about covered your bases here. Why not use something that you own and manage on your network, such as your VPN hubs, for the IP SLA targets, so that you can guarantee connectivity to them? Surely that would be a better solution for you.
Thanks for the response. We have actually debated your suggestion internally a few times and could never really make up our minds over the best solution, but I will give you some background to elaborate on our thinking! In actual fact, our VPN hub uses the same ISP as the individual branches, and the IP SLA 1 target is really the "default gateway" for our ISP from the VPN hub. That is effectively the same thing as using our own VPN hub, though we have the hub configured not to respond to ICMP, so it was simpler just to use the ISP gateway as the target.
The main reason we've used an "Internet" device for IP SLA 2 is that we have had problems with this ISP's Internet transit in the past, and this always gave us a good indication of when that sort of problem was happening. We are becoming more and more dependent on Internet connectivity at the branch sites for various business activities, so it was always good to have a "heads up" on any problems in that area.
In fact, when we saw IP SLA 2 drop on our monitoring for all those devices this morning, we initially thought "oh no! Here we go again, Internet issues", but on investigating further, the only apparent problem this time seemed to be the lack of ICMP response from the IP SLA 2 target for those affected devices.
To confuse things further, all of a sudden the devices that were affected are now happily receiving ICMP responses from this particular target again. I contacted the ISP again, but they claim not to have taken any action and are still unable to see any sign of a problem.
It would just be nice to get an idea of why it might have happened but it looks like it will remain unsolved!