Re: IP SLA Troubleshooting

gmacdonald11 · ‎01-21-2019

Hi folks. We have had IP SLA turned on for ISP failover for a few years now. Lately we have been having issues where it fails over to the secondary ISP for anywhere from several minutes to hours and then switches back to the primary. I have connected a workstation directly to the primary ISP and do not see any problems with ping. I have also had the primary ISP's vendor in, and they can't find any problems with their link or equipment. The firewall is an ASA 5515 and the SLA configuration is below. I have turned on message 622001 so I can see when the tracked route goes up or down. My question is: what is the best way to troubleshoot what is going on (i.e., syslog messages I should track, other parameters to turn on, etc.)?

sla monitor 10
 type echo protocol ipIcmpEcho 8.8.8.8 interface Eastlink
 frequency 5

sla monitor schedule 10 life forever start-time now

track 1 rtr 10 reachability

route Eastlink 0.0.0.0 0.0.0.0 x.x.x.x 1 track 1
route Aliant 0.0.0.0 0.0.0.0 x.x.x.x 2

In the down state the 'show sla monitor operational-state' shows this:

Result of the command: "show sla monitor operational-state"

Entry number: 10
Modification time: 09:12:00.857 AST Mon Jan 21 2019
Number of Octets Used by this Entry: 2056
Number of operations attempted: 500
Number of operations skipped: 500
Current seconds left in Life: Forever
Operational state of entry: Active
Last time this entry was reset: Never
Connection loss occurred: FALSE
Timeout occurred: TRUE
Over thresholds occurred: FALSE
Latest RTT (milliseconds): NoConnection/Busy/Timeout
Latest operation start time: 10:35:10.857 AST Mon Jan 21 2019
Latest operation return code: Timeout
RTT Values:
RTTAvg: 0 RTTMin: 0 RTTMax: 0
NumOfRTT: 0 RTTSum: 0 RTTSum2: 0

Thanks. Grant.

superego · ‎01-22-2019

Hi Grant,

It might be Google rate-limiting ICMP. Try a different target, such as the Eastlink DNS server or another reliable IP address on the internet.

gmacdonald11 · ‎01-22-2019

Thanks, I have switched it to the OpenDNS IP, which I've heard is more reliable. I had tried the Eastlink DNS IP and the problem was still occurring. The odd part is the randomness of when it happens, and how sometimes it switches for 10 minutes and other times for 2-3 hours. I have syslog set up to send traps for 609001, 609002 and 622001. I am seeing 622001 but not the others. I just ran 'debug sla monitor error', which I believe will turn these on, although I haven't seen them yet.

The other thing of note is that we use SourceFire. Maybe an update to that is causing the problem (the SLA worked fine for 3 years!)

Thanks. Grant.

superego · ‎01-22-2019

You could fine-tune your IP SLA parameters. Create another IP SLA monitor to test with, like:

sla monitor 123
 type echo protocol ipIcmpEcho 8.8.8.8 interface Eastlink
 num-packets 3
 frequency 10

track 2 rtr 123 reachability

sla monitor schedule 123 life forever start-time now

Then run "show sla monitor operational-state" to check whether the result is better.

Your current IP SLA sends 1 packet every 5 seconds, and if it misses one packet it will fail over, which might be too sensitive.

gmacdonald11 · ‎01-22-2019

Thanks, I'll give that a shot. Interesting that the default num-packets is 1. I've been searching for a clearer understanding of this parameter. If you set it higher, does it only fail over if all of the packets fail?

superego · ‎01-22-2019

That is correct.

gmacdonald11 · ‎01-22-2019

Good tip, I think I'll increase the num-packets to 3 on the current sla to opendns. Stay tuned! Thanks.

gmacdonald11 · ‎01-22-2019

Hi. I have the logging in debug mode which means I can see the 609001 and 609002 syslog messages. The problem has happened again even with the num-packets set to 3. The following is what I'm seeing in the log (Eastlink is the primary link.) I believe the duration of 0:00:02 is the issue as noted in another article but pings connected directly to the router are 21 msec.

Severity  Date         Time      Syslog ID  IP              Description
7         Jan 22 2019  13:23:31  609001     208.67.222.222  Built local-host Aliant:208.67.222.222
7         Jan 22 2019  13:23:31  609002     208.67.222.222  Teardown local-host Aliant:208.67.222.222 duration 0:00:00
7         Jan 22 2019  13:23:31  609001     208.67.222.222  Built local-host Aliant:208.67.222.222
7         Jan 22 2019  13:23:29  609002     208.67.222.222  Teardown local-host Eastlink:208.67.222.222 duration 0:00:02

Not sure what to look for next to see what is causing the delay.

Thanks. Grant.

superego · ‎01-22-2019

Looks like your ASA has the default icmp timeout.

"timeout icmp hh:mm:ss—The idle time for ICMP, between 0:0:2 and 1193:0:0. The default is 2 seconds (0:0:2)"

https://www.cisco.com/c/en/us/td/docs/security/asa/asa93/configuration/firewall/asa-firewall-cli/conns-connlimits.pdf

Can you increase the icmp timeout to more than 5 seconds, like 10 or 30?

For 10 seconds: timeout icmp 00:00:10

For 30 seconds: timeout icmp 00:00:30

gmacdonald11 · ‎01-22-2019

I set it to 5 and it still timed out. I just changed it to 15 and the timeout has stopped but it hasn't switched back to the primary. Is there a hidden timer somewhere which determines how long to wait before switching back?

Thanks. Grant.

superego · ‎01-22-2019

show sla monitor configuration 10

show sla monitor operational-state 10

You can also run some debug commands:

debug sla monitor trace

debug sla monitor error

Can you change the target to 8.8.8.8?
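To see whether the tracked route has actually been reinstalled, the tracking object and the routing table can be checked directly. A small sketch, assuming the track ID 1 from the configuration posted earlier:

```
! State of tracking object 1 and the SLA entry it follows
show track 1
! Which default route is currently installed in the routing table
show route
! The static routes as configured, including the tracked one
show running-config route
```

If "show track 1" still reports the object as down, the ASA is still counting the SLA operation as failed, and the Eastlink default route should not be expected to come back yet.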

gmacdonald11 · ‎01-22-2019

I changed back to 8.8.8.8 (opendns was a suggestion that was out there.)

Below are the results of those commands. I have also opened a support ticket with SourceFire as that is the only thing I can think of that would be doing automatic updates to the ASA.

Result of the command: "show sla monitor config 10"

IP SLA Monitor, Infrastructure Engine-II.
Entry number: 10
Owner:
Tag:
Type of operation to perform: echo
Target address: 8.8.8.8
Interface: Eastlink
Number of packets: 3
Request size (ARR data portion): 28
Operation timeout (milliseconds): 5000
Type Of Service parameters: 0x0
Verify data: No
Operation frequency (seconds): 5
Next Scheduled Start Time: Start Time already passed
Group Scheduled : FALSE
Life (seconds): Forever
Entry Ageout (seconds): never
Recurring (Starting Everyday): FALSE
Status of entry (SNMP RowStatus): Active
Enhanced History:

Result of the command: "show sla monitor operational 10"

Entry number: 10
Modification time: 14:42:05.794 AST Tue Jan 22 2019
Number of Octets Used by this Entry: 2056
Number of operations attempted: 24
Number of operations skipped: 24
Current seconds left in Life: Forever
Operational state of entry: Active
Last time this entry was reset: Never
Connection loss occurred: FALSE
Timeout occurred: TRUE
Over thresholds occurred: FALSE
Latest RTT (milliseconds): NoConnection/Busy/Timeout
Latest operation start time: 14:45:55.795 AST Tue Jan 22 2019
Latest operation return code: Timeout
RTT Values:
RTTAvg: 0 RTTMin: 0 RTTMax: 0
NumOfRTT: 0 RTTSum: 0 RTTSum2: 0

superego · ‎01-22-2019

Your output still shows a timeout. I wonder whether adding an ACL entry that permits ICMP to 8.8.8.8 would solve the issue.
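One thing to keep in mind: the echo replies for the SLA probe are addressed to the ASA's own Eastlink interface, so as far as I know they are controlled by the "icmp" interface rules rather than by an interface ACL. A minimal sketch, assuming no conflicting icmp rules already exist on that interface:

```
! List any ICMP rules for traffic addressed to the ASA itself
show running-config icmp
! Allow echo replies from any source to reach the Eastlink interface
icmp permit any echo-reply Eastlink
```

Be aware that once any icmp rule is configured for an interface, ICMP types that are not explicitly permitted on that interface are denied, so this can change what else the ASA responds to on Eastlink.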

gmacdonald11 · ‎01-22-2019

The part that makes troubleshooting this issue difficult is that it eventually comes back and was working fine for about 3 years. We didn't need that ACL before and it will come back shortly.

I get "ERROR: % Invalid input detected at '^' marker." for the "logging monitor debugging" command. Logging ? just gives me a savelog option.

Show debug shows me that TRACE and ERROR debugging for the SLA Monitor are On.

gmacdonald11 · ‎01-22-2019

Nothing shows up when I do (from a ssh connection): debug sla monitor error and debug sla monitor trace

You would think I would see something...