01-21-2019 06:13 AM - edited 01-21-2019 06:39 AM
Hi folks. We have had IP SLA turned on for ISP failover for a few years now. Lately we have been having issues where it fails over to the secondary ISP for several minutes or hours and then switches back to the primary. I have connected a workstation directly to the primary ISP and do not see any problems with pings. I have also had the vendor for the primary link in, and they can't find any problems with their link or equipment. The firewall is an ASA 5515, and the SLA configuration is below. I have turned on message 622001 so I can see when the tracked route goes up or down. My question: what is the best way to troubleshoot what is going on (i.e. which syslog messages I should track, other parameters to turn on, etc.)?
sla monitor 10
type echo protocol ipIcmpEcho 8.8.8.8 interface Eastlink
frequency 5
sla monitor schedule 10 life forever start-time now
track 1 rtr 10 reachability
route Eastlink 0.0.0.0 0.0.0.0 x.x.x.x 1 track 1
route Aliant 0.0.0.0 0.0.0.0 x.x.x.x 2
In the down state the 'show sla monitor operational-state' shows this:
Result of the command: "show sla monitor operational-state"
Entry number: 10
Modification time: 09:12:00.857 AST Mon Jan 21 2019
Number of Octets Used by this Entry: 2056
Number of operations attempted: 500
Number of operations skipped: 500
Current seconds left in Life: Forever
Operational state of entry: Active
Last time this entry was reset: Never
Connection loss occurred: FALSE
Timeout occurred: TRUE
Over thresholds occurred: FALSE
Latest RTT (milliseconds): NoConnection/Busy/Timeout
Latest operation start time: 10:35:10.857 AST Mon Jan 21 2019
Latest operation return code: Timeout
RTT Values:
RTTAvg: 0 RTTMin: 0 RTTMax: 0
NumOfRTT: 0 RTTSum: 0 RTTSum2: 0
Thanks. Grant.
01-22-2019 05:08 AM
Hi Grant,
It might be Google rate-limiting ICMP. Try a different target, such as the Eastlink DNS server or another reliable IP address on the internet.
01-22-2019 05:26 AM
Thanks, I have switched it to the OpenDNS IP, which I've heard is more reliable. I had tried the Eastlink DNS IP and the problem was still occurring. The odd part is the randomness of when it happens, and how sometimes it switches for 10 minutes and other times for 2-3 hours. I have syslog set up to send traps for 609001, 609002 and 622001. I am seeing the 622001 messages but not the others. I just ran 'debug sla monitor error', which I believe will turn these on, although I haven't seen them yet.
The other thing of note is that we use SourceFire. Maybe an update to that is causing the problem (the SLA worked fine for 3 years!)
Thanks. Grant.
01-22-2019 05:27 AM
You may want to fine-tune your IP SLA parameters. Create another IP SLA monitor to test, like:
sla monitor 123
type echo protocol ipIcmpEcho 8.8.8.8 interface Eastlink
num-packets 3
frequency 10
track 2 rtr 123 reachability
sla monitor schedule 123 life forever start-time now
Then run "show sla monitor operational-state" to check whether the result is better.
Your current IP SLA sends 1 packet every 5 seconds, and if it misses a single packet it will fail over, which might be too sensitive.
01-22-2019 06:17 AM
Thanks, I'll give that a shot. Interesting that the default num-packets is 1. I've been searching for a clearer understanding on this parameter. If you set it higher does it only failover if all of the packets fail?
01-22-2019 06:36 AM
That is correct.
01-22-2019 06:40 AM
Good tip, I think I'll increase the num-packets to 3 on the current sla to opendns. Stay tuned! Thanks.
01-22-2019 09:32 AM
Hi. I have the logging in debug mode which means I can see the 609001 and 609002 syslog messages. The problem has happened again even with the num-packets set to 3. The following is what I'm seeing in the log (Eastlink is the primary link.) I believe the duration of 0:00:02 is the issue as noted in another article but pings connected directly to the router are 21 msec.
7 | Jan 22 2019 | 13:23:31 | 609001 | 208.67.222.222 | Built local-host Aliant:208.67.222.222 |
7 | Jan 22 2019 | 13:23:31 | 609002 | 208.67.222.222 | Teardown local-host Aliant:208.67.222.222 duration 0:00:00 |
7 | Jan 22 2019 | 13:23:31 | 609001 | 208.67.222.222 | Built local-host Aliant:208.67.222.222 |
7 | Jan 22 2019 | 13:23:29 | 609002 | 208.67.222.222 | Teardown local-host Eastlink:208.67.222.222 duration 0:00:02 |
Not sure what to look for next to see what is causing the delay.
Thanks. Grant.
01-22-2019 09:50 AM
Looks like your ASA has the default icmp timeout.
"timeout icmp hh:mm:ss—The idle time for ICMP, between 0:0:2 and 1193:0:0. The default is 2 seconds (0:0:2)"
Can you increase the ICMP timeout to more than 5 seconds, like 10 or 30?
For 10 seconds: timeout icmp 00:00:10
For 30 seconds: timeout icmp 00:00:30
01-22-2019 10:21 AM
I set it to 5 and it still timed out. I just changed it to 15 and the timeout has stopped but it hasn't switched back to the primary. Is there a hidden timer somewhere which determines how long to wait before switching back?
Thanks. Grant.
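A note on the switch-back question: the tracked route is reinstalled once the track object reports reachability again, so it can help to check the track state directly rather than wait on the route. A minimal sketch, assuming track ID 1 and SLA entry 10 from the configuration posted earlier in this thread:

```
show track 1
show sla monitor operational-state 10
show route
```

If the track is still down and the latest return code is still Timeout, the probes themselves are failing, which would explain why the primary route has not come back.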
01-22-2019 10:31 AM - edited 01-22-2019 10:37 AM
show sla monitor configuration 10
show sla monitor operational-state 10
you can also do some debug commands:
debug sla monitor trace
debug sla monitor error
Can you change the target to 8.8.8.8?
01-22-2019 10:48 AM
I changed back to 8.8.8.8 (OpenDNS was a suggestion that was out there).
Below are the results of those commands. I have also opened a support ticket with SourceFire, as that is the only thing I can think of that would be doing automatic updates to the ASA.
Result of the command: "show sla monitor config 10"
IP SLA Monitor, Infrastructure Engine-II.
Entry number: 10
Owner:
Tag:
Type of operation to perform: echo
Target address: 8.8.8.8
Interface: Eastlink
Number of packets: 3
Request size (ARR data portion): 28
Operation timeout (milliseconds): 5000
Type Of Service parameters: 0x0
Verify data: No
Operation frequency (seconds): 5
Next Scheduled Start Time: Start Time already passed
Group Scheduled : FALSE
Life (seconds): Forever
Entry Ageout (seconds): never
Recurring (Starting Everyday): FALSE
Status of entry (SNMP RowStatus): Active
Enhanced History:
Result of the command: "show sla monitor operational 10"
Entry number: 10
Modification time: 14:42:05.794 AST Tue Jan 22 2019
Number of Octets Used by this Entry: 2056
Number of operations attempted: 24
Number of operations skipped: 24
Current seconds left in Life: Forever
Operational state of entry: Active
Last time this entry was reset: Never
Connection loss occurred: FALSE
Timeout occurred: TRUE
Over thresholds occurred: FALSE
Latest RTT (milliseconds): NoConnection/Busy/Timeout
Latest operation start time: 14:45:55.795 AST Tue Jan 22 2019
Latest operation return code: Timeout
RTT Values:
RTTAvg: 0 RTTMin: 0 RTTMax: 0
NumOfRTT: 0 RTTSum: 0 RTTSum2: 0
01-22-2019 11:26 AM
Your output still shows a timeout. I wonder whether adding an ACL that permits ICMP to 8.8.8.8 would solve the issue.
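One way to narrow this down is to capture the probe traffic on the primary interface and see whether echo requests leave and whether replies come back. A sketch, assuming the interface name Eastlink and the target 8.8.8.8 from this thread; the capture name slacap is arbitrary:

```
capture slacap interface Eastlink match icmp any host 8.8.8.8
show capture slacap
show running-config icmp
no capture slacap
```

Requests with no replies would suggest a problem upstream of the ASA, while replies that arrive but are still counted as timeouts would suggest something on the ASA itself. ICMP addressed to the ASA's own interfaces is governed by the icmp command rather than by interface ACLs, so the show running-config icmp line is there to reveal whether any such rule exists.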
01-22-2019 11:48 AM
The part that makes troubleshooting this issue difficult is that it eventually comes back and was working fine for about 3 years. We didn't need that ACL before and it will come back shortly.
I get "ERROR: % Invalid input detected at '^' marker." for the "logging monitor debugging" command. Logging ? just gives me a savelog option.
Show debug shows me that TRACE and ERROR debugging for the SLA Monitor are On.
01-22-2019 11:27 AM
Nothing shows up when I run debug sla monitor error and debug sla monitor trace from an SSH connection.
You would think I would see something...
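On the "logging monitor debugging" error mentioned above: that response is consistent with the command being entered at the privileged EXEC prompt, where logging only offers savelog. It is normally entered in global configuration mode. A sketch of the sequence for an SSH session, assuming logging is not already configured this way:

```
configure terminal
logging enable
logging monitor debugging
end
terminal monitor
debug sla monitor trace
debug sla monitor error
```

When finished, "undebug all" and "terminal no monitor" turn the output back off.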