Problem with ASA active/standby set-up after migrating to new ISP circuits

mitchen · ‎10-30-2013

We have an Active/Standby ASA5540 firewall set-up with the Primary Active unit at our head office site (Site A) and the Secondary Standby unit at our DR site (Site B).

Both sites had their "outside" interfaces directly connected to our ISP (we connect the ASA outside interface to the provider's NTE at each site). This all seemed to work reasonably well - our active traffic would go through Site A and, in the event of a failure of the Site A firewall or interface, comms would fail over to Site B.

We recently decided to upgrade the bandwidth of our outside links to the ISP. This meant getting completely new circuits and new NTEs installed, but we requested that we keep the same IP addressing for the new circuits (we have a number of VPN connections, so we didn't want to have to change their configuration).

So, come time to move to the new circuits, we presumed it would just be a case of changing the interface speed on the ASA interface (from 10 to 100) and moving the cables across from the old NTE to the new NTE. Meanwhile, the ISP would activate the "new" ports on their network switch and shut down the "old" ports. We expected this could be carried out relatively quickly to minimise any disruption.

However, this is not how it panned out. It seems that when the ISP activates the new ports, Site B takes over as the Active firewall and the Site A firewall has its outside interface marked as "failed". The ISP had to shut down the Site B link in order to allow us to pass traffic through the Site A firewall and circuit again. We are left with a situation where we effectively DON'T have our Active/Standby set-up with automatic failover any longer! We can either have Site A active and passing traffic with Site B marked as "failed" on its outside interface, or vice versa.

I don't know too much about the ISP's set-up to be honest but, as far as I'm aware, the ISP connects both the circuits for Site A and Site B to the same network switch in their datacentre and to the same VLAN.

Can anyone suggest what the problem might be and how to resolve it? I'm assuming it has to be something at the ISP end, since I don't really understand what else could be necessary from our point of view (i.e. what else would we need to do other than move the cables and configure the new interface speed)? It's as if there is some sort of conflict on the ISP's network switch - I don't know if it is something to do with the way the standby ASA takes over the active ASA's IP and MAC addresses and that somehow gets the ISP network switch into a state of confusion?

Does anyone have any ideas/suggestions? Naturally we are a bit disappointed since we hoped this would be a relatively straightforward task to migrate to our new circuits with increased bandwidth!

Thanks.

barry · ‎10-31-2013

Hi

Where I've seen ASA interfaces, particularly outside ones, showing as "failed" is where they can't actually communicate with each other. I'm not sure if it's ICMP that is required between them, but I've certainly seen similar issues where the two ASA outside interfaces can't ping each other.

If you run a ping from ASA "B" to the outside address of ASA "A", does it work? I suspect not, and this is the root cause of your issue. If this is the case, then you'll need to get your ISP involved.

HTH

Barry Hesk
Intrinsic Network Solutions

barry · ‎10-31-2013

And just as another thought... here's a left field guess.

I reckon your old circuits were layer 2 tails (in the same VLAN) that terminate in your ISP's data centre, again on the same VLAN. This means that all devices in the same VLAN can always communicate with each other.

I reckon your new circuits are layer 3 tails, and only one will be routed over at any given time (the current active circuit). This would explain why the "standby" ASA - whichever one it is - always shows its outside card as failed.

Would explain the exact problem you are seeing.

As I say, bit left field, but I reckon there is logic there...

Barry Hesk
Intrinsic Network Solutions

mitchen · ‎10-31-2013

Thanks Barry - some very helpful suggestions, and your 2nd one in particular definitely sounds like a strong possibility. I will try to find out more and will let you know whether we get any closer to resolving the issue...

mitchen · ‎11-05-2013

Well, a quick update. We still haven't got this working successfully.

The ISP have confirmed that the new circuits ARE layer 2, so it seems that Barry's earlier suggestion (good though it was!) can't be the cause.

The ISP tried some manipulation of the spanning tree set-up on their switch(es) but to no avail - we can still only have one circuit active while the other one is marked as failed. We can't ping between the outside interfaces (I allowed ICMP first, so we should have got a response if all was in order!).

I can't see how the issue can be anything other than a switching issue in the ISP's network but, so far, they are at a loss to explain what the problem could be and we are left without automatic failover of our new circuits. The ISP are going to continue to investigate offline but, if anyone has any suggestions or has seen similar in the past then further advice would certainly be appreciated. Thanks.

jumora · ‎11-05-2013

OK, look: if your ISP made a change, it is not working, and you are sure of this, then why review the ASA? If the unit is in standby at this moment and the only interface that is affected is the outside, then it's ISP, ISP, ISP.

Value our effort and rate the assistance!

mitchen · ‎11-06-2013

Well, I'm fairly certain the issue is with the ISP but

a) there is no harm (in fact, some might consider it good practice) in ensuring all other bases are covered - just in case

b) it's entirely possible that someone out there in the vast Cisco networking world has come across the same sort of situations, particularly those who work for ISPs with similar customer set-up (or customers with this set-up who have had similar problems with their ISP!), and can give pointers as to how to resolve it - even if that is simply evidence to go back to beat up the ISP with. (Barry's suggestions above were very helpful indeed, for example, even if they may not ultimately have been the cause)

c) even if the problem is ultimately with the ISP, appreciating the dependencies etc can only help to gain a better understanding of the ASA devices themselves which is surely an aim of any technical forum?

jumora · ‎11-06-2013

I understand what you are saying and we are always happy to help, but when the equipment that affects connectivity is not under your management, that is where a support forum or a TAC case can't help. I would suggest calling the ISP and getting this escalated.

Value our effort and rate the assistance!

Jouni Forss · ‎11-06-2013

Hi,

You say that you have no connectivity between the ASAs' "outside" interfaces? Does your ISP have HSRP doing the gateway redundancy on their side? Can they confirm it's OK?

A very easy way to confirm complete connectivity would be to ping the "standby" IP address from the Active unit and then issue "show arp | inc outside" (replacing "outside" with the actual name of your external interface).

If you can't see the "standby" IP address in the "show arp" output, that means even ARP isn't working between your sites. At this point it should be up to your ISP to check where the traffic stops.

If you can see the "standby" IP address in the Active unit's ARP table then I am not sure what the problem is.
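If you end up repeating this check often, it can be scripted. Below is a minimal Python sketch, assuming the "show arp" output has been pasted or collected as plain text in the four-column form "interface IP MAC age". The function names and the sample addresses are illustrative placeholders, not anything the ASA provides.

```python
import re

# One line of "show arp" output, assumed to look like:
#   outside 1.1.1.2 0018.73d6.19e5 2
# (interface name, IP address, MAC in Cisco dotted form, age)
ARP_LINE = re.compile(
    r"^\s*(?P<iface>\S+)\s+(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<mac>[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4})\s+(?P<age>\d+)\s*$"
)


def parse_arp(text):
    """Return {(interface, ip): mac} for every recognisable ARP line."""
    entries = {}
    for line in text.splitlines():
        m = ARP_LINE.match(line)
        if m:
            entries[(m["iface"], m["ip"])] = m["mac"].lower()
    return entries


def has_arp_entry(text, iface, ip):
    """True if the given IP has an ARP entry on the given interface."""
    return (iface, ip) in parse_arp(text)


# Placeholder sample: a peer unit at 1.1.1.2 and a gateway at 1.1.1.1.
SAMPLE_ARP = """
outside 1.1.1.2 0018.73d6.19e5 2
outside 1.1.1.1 001a.e2e6.bdfa 295
"""

if __name__ == "__main__":
    print(has_arp_entry(SAMPLE_ARP, "outside", "1.1.1.2"))
```

Note that an ARP entry being present only tells you the table has the address; it does not by itself prove that IP traffic gets through, so a ping or capture is still worth doing.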

I think the Failover operation has its own "debug" command which is "debug fover" in addition to multiple different parameters. I am not sure how much output it generates but I would use the additional options after the "debug fover" if I were to use debug to help.

You should probably even be able to configure "capture" on your ASA before you do any checking. You could capture traffic between the primary and standby IP address of the interface and see if anything is actually happening. I guess you can even go as far to capture the ARP messages and see if there is anything visible.

- Jouni

mitchen · ‎11-07-2013

Hi Jouni, thanks for the good advice and suggestions - very much appreciated.

I don't know whether the ISP have HSRP doing the gateway/redundancy on their side but I don't think so. I can try to confirm but getting information out of them on their set-up is often difficult, although we continue to pursue them on this.

I am unable to ping the standby IP address from the Active unit (I allowed ICMP, so the firewalls themselves were definitely not blocking it). Indeed, I can't ping the Standby IP address from anywhere.

However, show arp | inc outside DOES show the "standby" IP address in the output so ARP seems to be working at least?

But if I can see the standby IP address in the Active unit's ARP table but can't seem to otherwise ping/communicate with the Standby unit over the outside interface then what could the problem be?

Jouni Forss · ‎11-07-2013

Hi,

Could you post the output of the below just to be sure

show run icmp

And perhaps also

show run access-group

I am actually not sure how the ASA handles ARP in a Failover pair. I rarely have to troubleshoot Failover. For the most part they seem to work flawlessly in our Datacenter environments.

The output of

show failover

Should list statistics at the bottom also related to ARP

I guess you could also take the following output from the Standby unit

show arp | inc outside

Again, this depends on what your external interface is named. It is just to make sure that the same information is shown there.

You could actually run that command from the Active unit with this

failover exec mate show arp | inc outside

It should send this command to the Standby unit through the Failover link and print its output on the Active unit's CLI. You can use that for other commands too if you want to do everything from a single ASA unit.

You could also configure a capture on the Active firewall and perhaps even the Standby. The capture configured on the Active unit only applies to it. I don't think it captures the traffic on the Standby unit.

The capture configuration could be

access-list OUTSIDE-CAP permit ip host host

capture OUTSIDE-CAP type raw-data access-list OUTSIDE-CAP interface outside buffer 1000000 circular-buffer

You should be able to configure the capture buffer to 33500000 also, which is almost the maximum allowed, if you want to run it for a long time. The "circular-buffer" at the end specifies that the ASA will overwrite old information IF the buffer is filled.

This could be done on both units if you want to make sure what traffic both see. You could then see the ICMP traffic. You would also catch the traffic that the ASA uses to monitor the Failover interface state. It uses protocol 105 (SCPS). This naturally requires that you are monitoring that interface. If I am not mistaken then a normal physical interface is monitored all the time but a logical interface requires the "monitor-interface" command.

To view if any traffic is captured you can use the command

show capture

To view the actual contents of the capture you can use the command

show capture OUTSIDE-CAP

Better yet, you could copy the capture to your computer with TFTP and open it with Wireshark to actually make sense of the output:

copy /pcap capture:OUTSIDE-CAP tftp:///OUTSIDE-CAP.pcap

This is an actual capture output from one of our Failover pairs viewed with Wireshark (though it doesn't contain much, and I removed the IPs as they are public, naturally). You should see this between the monitored interfaces of both units.

You can remove the capture (and its contents) with the command

no capture OUTSIDE-CAP

The created ACL you have to delete separately, of course.

With the captures on both units you would at least have the chance to confirm whether traffic is "disappearing" between the ASAs on the external interface.

I am not sure how the ISP has configured the L2 segment between the ASAs and the L3 gateway(s). I guess you could ask them to make sure they can see both units' MAC addresses all along the way.

I am not sure if any of this helps, but these are at least some thoughts on what to look for.

I personally have an easier time solving these for our customers as I have access to both the customer ASAs and the ISP core network.

- Jouni

mitchen · ‎11-07-2013

Hi Jouni,

some great suggestions and advice there, thanks very much (well worth 5 stars even if I still haven't solved my issue!)

I didn't know about the method to run commands on the standby unit from the active - very handy, thanks.

Some sample output:

sh run icmp

icmp unreachable rate-limit 1 burst-size 1

sh run access-group

access-group inbound in interface outside

access-group outbound in interface inside

("inbound" ACL also contains a permit icmp any any now but I'm not sure that is even needed)

sh failover ("real" IP addresses changed in output)

Failover On

Failover unit Primary

Failover LAN Interface: Failover GigabitEthernet0/1 (up)

Unit Poll frequency 1 seconds, holdtime 15 seconds

Interface Poll frequency 5 seconds, holdtime 25 seconds

Interface Policy 1

Monitored Interfaces 2 of 250 maximum

Version: Ours 7.2(5)10, Mate 7.2(5)10

Last Failover at: 18:57:46 GMT/BST Nov 5 2013

This host: Primary - Active

Active time: 17735733 (sec)

slot 0: ASA5540 hw/sw rev (1.1/7.2(5)10) status (Up Sys)

Interface outside (1.1.1.2): Normal (Waiting)

Interface DMZ (0.0.0.0): No Link (Not-Monitored)

Interface inside (192.168.20.5): Normal

Interface management (192.168.2.1): No Link (Not-Monitored)

slot 1: ASA-SSM-20 hw/sw rev (1.0/6.2(4)E4) status (Up/Up)

IPS, 6.2(4)E4, Up

Other host: Secondary - Failed

Active time: 589 (sec)

slot 0: ASA5540 hw/sw rev (1.1/7.2(5)10) status (Up Sys)

Interface outside (1.1.1.3): Failed (Waiting)

Interface DMZ (0.0.0.0): Normal (Not-Monitored)

Interface inside (192.168.20.6): Normal

Interface management (0.0.0.0): Normal (Not-Monitored)

slot 1: ASA-SSM-20 hw/sw rev (1.0/6.2(4)E4) status (Up/Up)

IPS, 6.2(4)E4, Up

Stateful Failover Logical Update Statistics

Link : Failover GigabitEthernet0/1 (up)

Stateful Obj xmit xerr rcv rerr

General 1248457245 1 2843085 22992

sys cmd 2366474 0 2366459 0

up time 0 0 0 0

RPC services 0 0 0 0

TCP conn 312664258 0 134902 9878

UDP conn 902624774 0 297382 13108

ARP tbl 271594 1 288 6

Xlate_Timeout 0 0 0 0

VPN IKE upd 889667 0 14158 0

VPN IPSEC upd 29640478 0 29896 0

VPN CTCP upd 0 0 0 0

VPN SDI upd 0 0 0 0

VPN DHCP upd 0 0 0 0

Logical Update Queue Information

Cur Max Total

Recv Q: 0 82 3652027

Xmit Q: 0 1024 12293902460
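For anyone who wants to pull the per-interface health out of "show failover" without reading the whole dump, here is a rough Python sketch. It assumes the output has been saved as text and that the "This host" / "Other host" and "Interface name (ip): status" lines look as they do above. The function names are my own, and the sample is a trimmed copy of the output in this post.

```python
import re

# "This host: Primary - Active" / "Other host: Secondary - Failed"
HOST_LINE = re.compile(r"^\s*(This|Other) host:\s*(\S+)\s*-\s*(.+?)\s*$")
# "Interface outside (1.1.1.3): Failed (Waiting)"
IFACE_LINE = re.compile(r"^\s*Interface (\S+) \(([^)]*)\):\s*(.+?)\s*$")


def parse_failover(text):
    """Return {"This": {...}, "Other": {...}} with role, state and interfaces."""
    units = {}
    current = None
    for line in text.splitlines():
        host = HOST_LINE.match(line)
        if host:
            current = {"role": host.group(2), "state": host.group(3), "interfaces": {}}
            units[host.group(1)] = current
            continue
        iface = IFACE_LINE.match(line)
        if iface and current is not None:
            current["interfaces"][iface.group(1)] = {
                "ip": iface.group(2),
                "status": iface.group(3),
            }
    return units


def failed_interfaces(units):
    """List (unit, interface, status) where the status starts with 'Failed'."""
    return [
        (unit, name, info["status"])
        for unit, data in units.items()
        for name, info in data["interfaces"].items()
        if info["status"].startswith("Failed")
    ]


# Trimmed copy of the output above (addresses already anonymised).
SAMPLE_FAILOVER = """
Failover LAN Interface: Failover GigabitEthernet0/1 (up)
Interface Poll frequency 5 seconds, holdtime 25 seconds
This host: Primary - Active
Interface outside (1.1.1.2): Normal (Waiting)
Interface inside (192.168.20.5): Normal
Other host: Secondary - Failed
Interface outside (1.1.1.3): Failed (Waiting)
Interface inside (192.168.20.6): Normal
"""

if __name__ == "__main__":
    for unit, name, status in failed_interfaces(parse_failover(SAMPLE_FAILOVER)):
        print(unit, name, status)
```

Run against the output above, this singles out the outside interface of the other (Secondary) unit, while both inside interfaces report Normal.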

On Standby:

show arp | inc outside

outside 1.1.1.2 0018.73d6.19e5 2

outside 1.1.1.1 001a.e2e6.bdfa 295

(1.1.1.1 being the "default gateway" to the ISP)

Also, good suggestion to capture traffic between the interfaces (and the circular buffer was also something new for me - previously, I had just been letting the buffer fill up with my captures then clearing manually!)

Now, the captures show that both Active and Standby units send the SCPS (105) packets but no replies ever come back. Similarly, when I attempt the pings - the captures show the ICMP packets being sent but no replies coming back.

Active Unit capture ("real" public IP addresses changed)

1239: 17:14:09.502888 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

1240: 17:14:14.502522 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

1241: 17:14:19.501911 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

1242: 17:14:19.749366 1.1.1.2 > 1.1.1.3: icmp: echo request

1243: 17:14:21.741706 1.1.1.2 > 1.1.1.3: icmp: echo request

1244: 17:14:23.741508 1.1.1.2 > 1.1.1.3: icmp: echo request

1245: 17:14:24.501408 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

1246: 17:14:25.741416 1.1.1.2 > 1.1.1.3: icmp: echo request

1247: 17:14:27.741096 1.1.1.2 > 1.1.1.3: icmp: echo request

1248: 17:14:29.500981 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

1249: 17:14:34.500447 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

1250: 17:14:39.499958 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

1251: 17:14:44.502659 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

1252: 17:14:49.498997 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

1253: 17:14:54.498494 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

1254: 17:14:59.498005 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

1255: 17:15:04.497517 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88

Standby Unit Capture ("real" public IP addresses changed)

1209: 17:16:51.784001 1.1.1.3 > 1.1.1.2: ip-proto-105, length 88

1210: 17:16:56.783574 1.1.1.3 > 1.1.1.2: ip-proto-105, length 88

1211: 17:17:01.783040 1.1.1.3 > 1.1.1.2: ip-proto-105, length 88

1212: 17:17:06.782567 1.1.1.3 > 1.1.1.2: ip-proto-105, length 88

1213: 17:17:08.073375 1.1.1.3 > 1.1.1.2: icmp: echo request

1214: 17:17:10.072292 1.1.1.3 > 1.1.1.2: icmp: echo request

1215: 17:17:11.782003 1.1.1.3 > 1.1.1.2: ip-proto-105, length 88

1216: 17:17:12.072078 1.1.1.3 > 1.1.1.2: icmp: echo request

1217: 17:17:14.072063 1.1.1.3 > 1.1.1.2: icmp: echo request

1218: 17:17:16.071666 1.1.1.3 > 1.1.1.2: icmp: echo request

1219: 17:17:16.781530 1.1.1.3 > 1.1.1.2: ip-proto-105, length 88

1220: 17:17:21.780980 1.1.1.3 > 1.1.1.2: ip-proto-105, length 88

1221: 17:17:26.780507 1.1.1.3 > 1.1.1.2: ip-proto-105, length 88

1222: 17:17:31.779988 1.1.1.3 > 1.1.1.2: ip-proto-105, length 88
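To make the "nothing comes back" observation easy to repeat on longer captures, here is a small Python sketch. It assumes the "show capture" text has the same line format as above (packet number, timestamp, source > destination, protocol info); the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed line format, as in the captures above:
#   1239: 17:14:09.502888 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88
#   1242: 17:14:19.749366 1.1.1.2 > 1.1.1.3: icmp: echo request
CAP_LINE = re.compile(
    r"^\s*\d+:\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<src>\S+)\s+>\s+(?P<dst>[^:\s]+):\s+(?P<info>.+?)\s*$"
)


def summarise(text):
    """Count packets per (source, destination, protocol)."""
    counts = Counter()
    for line in text.splitlines():
        m = CAP_LINE.match(line)
        if not m:
            continue
        # "ip-proto-105, length 88" -> "ip-proto-105"; "icmp: echo request" -> "icmp"
        proto = m["info"].split(",")[0].split(":")[0].strip()
        counts[(m["src"], m["dst"], proto)] += 1
    return counts


def one_way_flows(counts):
    """Flows for which no packet at all was seen in the reverse direction."""
    seen_pairs = {(src, dst) for (src, dst, _proto) in counts}
    return sorted(flow for flow in counts if (flow[1], flow[0]) not in seen_pairs)


# A few lines copied from the Active unit capture above.
SAMPLE_CAPTURE = """
1239: 17:14:09.502888 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88
1240: 17:14:14.502522 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88
1241: 17:14:19.501911 1.1.1.2 > 1.1.1.3: ip-proto-105, length 88
1242: 17:14:19.749366 1.1.1.2 > 1.1.1.3: icmp: echo request
1243: 17:14:21.741706 1.1.1.2 > 1.1.1.3: icmp: echo request
"""

if __name__ == "__main__":
    for flow in one_way_flows(summarise(SAMPLE_CAPTURE)):
        print("no return traffic seen for", flow)
```

If the capture ACL really is bidirectional, any flow this reports is one where the peer either never received the packets or its replies never made it back.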

I guess this confirms what we know i.e. that comms between the units on the outside interfaces aren't working but still doesn't explain why?

I think it has to be something on the ISP's switching network - could it be something as simple as their STP set-up detects a loop condition and "blocks" the standby unit, for some reason? And if that is a possibility, why might it be happening?

Very puzzling?

Any more suggestions and advice would be most welcome!

Jouni Forss · ‎11-07-2013

Hi,

So if I am reading this correctly, your Active unit is showing all the ARP information expected on the external link, BUT the Standby unit only shows ARP entries for itself and the gateway?

Also if you configured the capture ACL bidirectionally on both units (so that it captures sent and received information) then we can clearly see that no traffic from either unit gets to the other unit.

I would have to say that from my own perspective this is not something you should have to be tackling alone. The ISP should really help out troubleshooting the problem.

The information you have already gathered should be pretty good material to show the ISP that the connection between the sites simply is not working, and that they aren't really providing the service you are paying them for.

I am not sure what more can be done from the devices you manage. I at least feel that it's unreasonable for the ISP to expect you to solve/troubleshoot this alone.

If you want to capture ARP traffic on the "outside" interface you can probably use this command for capture.

capture ARP-CAPTURE ethernet-type arp interface outside

The other commands apply to this capture also. You can show it in the CLI and copy it to your computer.

- Jouni

mitchen · ‎11-07-2013

Hi Jouni,

I think the ARP info on both Active and Standby is as expected i.e. Active unit shows ARP entry for Standby unit and ISP default gateway. Standby unit shows ARP entry for Active unit and ISP default gateway. This is what they currently show and I'm assuming this is what should be expected (though not having gone into this level of detail on this side of things I'm not 100% sure?)

Captures definitely captured bidirectionally so confident that they show traffic from either unit not getting to the other.

I cleared the arp outside entries on the standby ASA and tried the ARP capture you suggested. Interestingly, the ARP table immediately shows the ARP entry for the active ASA (nothing got captured in my packet capture for it i.e. I didn't actually see any ARP requests go out?)

I then tried pinging the ISP's default gateway and the ARP capture shows ARP requests being sent for it but no replies? However, the ARP table does eventually show an entry for it, as before?

Again, I'm not too sure what I should be expecting to see here?

I definitely agree with you that I'm reaching the point of exhaustion on what I can do to look into the issue myself. In fairness, the ISP have said they are working on it but all they have really asked from me so far is for my ASA configs so I'm not sure they are looking in the right place as all the evidence would seem to point at the problem being with their set-up rather than with the ASAs?

Thanks again for all the assistance you have given me on this, it has been very useful and has helped me learn some more about the ASA interactions for one thing!

Jouni Forss · ‎11-07-2013

Hi,

I think I read your earlier post wrong. I thought your Standby ASA only had its own and the gateway's information in the ARP table, but it seems that both units have information about the other unit and the gateway in their ARP tables.

I am not sure if this is information transferred through the Failover link. This is 100% a guess on my part. Since we are not seeing any traffic reach the other unit on the external interface, I would guess that the ARP information is the combined information that the units themselves see and "tell each other" through the Failover link.

This would at least explain why the ARP capture didn't show anything until you sent ICMP to the gateway. But again, it's just a guess. I would assume, though, that if you see an ARP request in the capture you would also need to see a reply to it.

- Jouni