IP SLA Troubleshooting

gmacdonald11
Level 1

Hi folks. We have had IP SLA turned on for ISP failover for a few years now. Lately we have been having issues where it fails over to the secondary ISP for several minutes or hours and then switches back to the primary. I have connected a workstation directly to the primary ISP and do not see any problems with ping. I have also had the primary ISP's vendor in, and they can't find any problems with their link or equipment. The firewall is an ASA 5515, and the SLA configuration is below. I have turned on message 622001 so I can see when the tracked route goes up or down. My question is: what is the best way to troubleshoot what is going on (i.e., syslog messages I should track, other parameters to turn on, etc.)?

 

sla monitor 10
type echo protocol ipIcmpEcho 8.8.8.8 interface Eastlink
frequency 5


sla monitor schedule 10 life forever start-time now

track 1 rtr 10 reachability

route Eastlink 0.0.0.0 0.0.0.0 x.x.x.x 1 track 1
route Aliant 0.0.0.0 0.0.0.0 x.x.x.x 2

 

In the down state the 'show sla monitor operational-state' shows this:

 

Result of the command: "show sla monitor operational-state"

Entry number: 10
Modification time: 09:12:00.857 AST Mon Jan 21 2019
Number of Octets Used by this Entry: 2056
Number of operations attempted: 500
Number of operations skipped: 500
Current seconds left in Life: Forever
Operational state of entry: Active
Last time this entry was reset: Never
Connection loss occurred: FALSE
Timeout occurred: TRUE
Over thresholds occurred: FALSE
Latest RTT (milliseconds): NoConnection/Busy/Timeout
Latest operation start time: 10:35:10.857 AST Mon Jan 21 2019
Latest operation return code: Timeout
RTT Values:
RTTAvg: 0 RTTMin: 0 RTTMax: 0
NumOfRTT: 0 RTTSum: 0 RTTSum2: 0

 

 

Thanks. Grant. 


superego
Level 1

Hi Grant,

 

Google may be rate-limiting ICMP. Try a different target, such as an Eastlink DNS server or another reliable IP address on the Internet.

Thanks, I have switched it to the OpenDNS IP, which I've heard is more reliable. I had tried the Eastlink DNS IP and the problem still occurred. The odd part is the randomness of when it happens, and how it sometimes switches for 10 minutes and other times for 2-3 hours. I have syslog set up to send traps for 609001, 609002 and 622001. I am seeing 622001 but not the others. I just ran 'debug sla monitor error', which I believe will turn these on, although I haven't seen them yet.

 

The other thing of note is that we use SourceFire.  Maybe an update to that is causing the problem (the SLA worked fine for 3 years!)

 

Thanks. Grant.

You may want to fine-tune your IP SLA parameters. Create another IP SLA monitor to test with, for example:

 

sla monitor 123
 type echo protocol ipIcmpEcho 8.8.8.8 interface Eastlink
 num-packets 3
 frequency 10

track 2 rtr 123 reachability

sla monitor schedule 123 life forever start-time now

Then run "show sla monitor operational-state" to check whether the result is better.
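
Because track 2 is not referenced by any route, this test monitor can run alongside the production one without affecting failover. A minimal sketch of comparing the two, assuming the entry numbers used above (10 for the production monitor, 123 for the test monitor):

```text
! Compare the production monitor with the test monitor
show sla monitor operational-state 10
show sla monitor operational-state 123
! Check the reachability state of each track object
show track 1
show track 2
```

If entry 123 keeps succeeding while entry 10 reports a timeout, that would point toward probe sensitivity or the probe target rather than the link itself.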

 

Your current IP SLA sends 1 packet every 5 seconds, and if it misses a single packet it will fail over. That might be too sensitive.

Thanks, I'll give that a shot.  Interesting that the default num-packets is 1.  I've been searching for a clearer understanding on this parameter.  If you set it higher does it only failover if all of the packets fail?

That is correct.

Good tip, I think I'll increase the num-packets to 3 on the current sla to opendns.  Stay tuned!  Thanks.

Hi.  I have the logging in debug mode which means I can see the 609001 and 609002 syslog messages.  The problem has happened again even with the num-packets set to 3.  The following is what I'm seeing in the log (Eastlink is the primary link.)  I believe the duration of 0:00:02 is the issue as noted in another article but pings connected directly to the router are 21 msec.

 

7 Jan 22 2019 13:23:31 609001 208.67.222.222       Built local-host Aliant:208.67.222.222
7 Jan 22 2019 13:23:31 609002 208.67.222.222       Teardown local-host Aliant:208.67.222.222 duration 0:00:00
7 Jan 22 2019 13:23:31 609001 208.67.222.222       Built local-host Aliant:208.67.222.222
7 Jan 22 2019 13:23:29 609002 208.67.222.222       Teardown local-host Eastlink:208.67.222.222 duration 0:00:02

 

Not sure what to look for next to see what is causing the delay.

Thanks. Grant.

Looks like your ASA has the default icmp timeout.

 

"timeout icmp hh:mm:ss—The idle time for ICMP, between 0:0:2 and 1193:0:0. The default is 2 seconds (0:0:2)"

https://www.cisco.com/c/en/us/td/docs/security/asa/asa93/configuration/firewall/asa-firewall-cli/conns-connlimits.pdf

 

Can you increase the ICMP timeout to more than 5 seconds, for example 10 or 30?

 

timeout icmp 00:00:10    (10 seconds)

timeout icmp 00:00:30    (30 seconds)

I set it to 5 and it still timed out.  I just changed it to 15 and the timeout has stopped but it hasn't switched back to the primary.  Is there a hidden timer somewhere which determines how long to wait before switching back?

 

Thanks. Grant.

show sla monitor configuration 10

 

show sla monitor operational-state 10

 

You can also run some debug commands:

debug sla monitor trace

debug sla monitor error

 

Can you change the target to 8.8.8.8?

 

I changed back to 8.8.8.8 (opendns was a suggestion that was out there.)

 

Below are the results of those commands.  I have also opened a support ticket with SourceFire as that is the only thing I can think of that would be doing automatic updates to the ASA.

 

Result of the command: "show sla monitor config 10"

IP SLA Monitor, Infrastructure Engine-II.
Entry number: 10
Owner:
Tag:
Type of operation to perform: echo
Target address: 8.8.8.8
Interface: Eastlink
Number of packets: 3
Request size (ARR data portion): 28
Operation timeout (milliseconds): 5000
Type Of Service parameters: 0x0
Verify data: No
Operation frequency (seconds): 5
Next Scheduled Start Time: Start Time already passed
Group Scheduled : FALSE
Life (seconds): Forever
Entry Ageout (seconds): never
Recurring (Starting Everyday): FALSE
Status of entry (SNMP RowStatus): Active
Enhanced History:

 

Result of the command: "show sla monitor operational 10"

Entry number: 10
Modification time: 14:42:05.794 AST Tue Jan 22 2019
Number of Octets Used by this Entry: 2056
Number of operations attempted: 24
Number of operations skipped: 24
Current seconds left in Life: Forever
Operational state of entry: Active
Last time this entry was reset: Never
Connection loss occurred: FALSE
Timeout occurred: TRUE
Over thresholds occurred: FALSE
Latest RTT (milliseconds): NoConnection/Busy/Timeout
Latest operation start time: 14:45:55.795 AST Tue Jan 22 2019
Latest operation return code: Timeout
RTT Values:
RTTAvg: 0 RTTMin: 0 RTTMax: 0
NumOfRTT: 0 RTTSum: 0 RTTSum2: 0

 

Your output still shows a timeout.  I wonder whether adding an ACL that permits ICMP to 8.8.8.8 would solve the issue.
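
The SLA probes originate from the ASA itself, so the echo replies are traffic addressed to the firewall's own interface. One way to permit them explicitly is the icmp command. This is a sketch only, assuming the interface name Eastlink from the configuration above. Note that once any icmp rule exists for an interface, ICMP that is not explicitly permitted on it is denied, so this may change how the interface answers other ICMP traffic:

```text
! Allow echo replies from the SLA target to reach the Eastlink interface
icmp permit host 8.8.8.8 echo-reply Eastlink
! Review the ICMP rules currently configured
show running-config icmp
```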

The part that makes troubleshooting this issue difficult is that it eventually comes back and was working fine for about 3 years.  We didn't need that ACL before and it will come back shortly.

 

I get "ERROR: % Invalid input detected at '^' marker." for the "logging monitor debugging" command. Logging ? just gives me a savelog option.

 

Show debug shows me that TRACE and ERROR debugging for the SLA Monitor are On.

Nothing shows up when I do (from a ssh connection): debug sla monitor error and debug sla monitor trace

 

You would think I would see something...
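
When the debugs stay silent, a packet capture on the primary interface is another way to see whether echo requests are leaving and replies are coming back during an outage. A minimal sketch, assuming a capture name of slacap (hypothetical) and the 8.8.8.8 target shown above:

```text
! Capture ICMP between the ASA and the SLA target on the primary interface
capture slacap interface Eastlink match icmp any host 8.8.8.8
! Display what has been captured so far
show capture slacap
! Remove the capture when finished
no capture slacap
```

Requests with no replies would suggest a problem upstream of the ASA, while no requests at all would suggest the probe is not being sent.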