IPsec L2L tunnel hangs until cleared, has second SA

noemi.berry · ‎05-29-2015

We have a Cisco 5585X ASA firewall running 9.2(3), with L2L IPsec tunnels to 3 AWS sites.

We're having problems with one of the tunnels "hanging" -- that is, data stops flowing until the SA peer is cleared. Detailed troubleshooting has brought up more questions than it's answered.

Two out of our three tunnels show two SAs, with the 2nd SA being between the public-IP peers. Our 'hanging' tunnel to AWS Virginia is one of these (the other tunnel with the dual SAs isn't used much, so I'll leave it out). One tunnel to AWS Oregon never hangs, and it only shows one SA, to the private subnets behind the tunnels. On all the tunnels, the SAs to the private subnets in AWS look healthy -- even on the 'hanging' tunnel, the encaps/decaps counters on this SA keep incrementing.

Why would 2 out of 3 tunnels form a 2nd SA between the public peers? Also, this 2nd SA refers to a "temp" access-list, with "OO_temp" prepended to the name, that isn't in the configuration. The 3rd tunnel doesn't do this.

The crypto-map configs are all the same for the 3 tunnels, minus the IP specifics (subnets, peers etc). Each refers to its own "match address" ACL that has only one line, and the transform-sets are all the same. Yet 2 out of those 3 add a 2nd SA tied to an auto-generated ACL, and one of those tunnels is the hanging one.

A "clear ipsec sa peer" command fixes everything -- that is, the remote private subnet on AWS can be reached -- until it randomly hangs again, at an interval of about 3-10 days, though it's hard to say exactly.

We've heard anecdotally that IPSec tunnels between ASAs and AWS are problematic, but everything I can find is about establishing the tunnel to begin with, not finding it hanging -- nor a mystery 2nd SA.

Any explanation or help troubleshooting, or pointing me up the right tree to bark, would be greatly appreciated!

Cleaned-up config. The AWS peer is 1.1.1.1, my end is 4.4.4.4, the private subnet on AWS is 10.0.222.0.

crypto map MY-CRYPTO-MAP 201 match address acl-aws-va

crypto map MY-CRYPTO-MAP 201 set pfs

crypto map MY-CRYPTO-MAP 201 set connection-type originate-only

crypto map MY-CRYPTO-MAP 201 set peer 1.1.1.1 2.2.2.2

crypto map MY-CRYPTO-MAP 201 set ikev1 transform-set transform-aws

!

access-list acl-aws-va extended permit ip any 10.0.222.0 255.255.255.0

!

crypto ipsec ikev1 transform-set transform-aws esp-aes esp-sha-hmac

!

! My OUTSIDE interface is 4.4.4.4

!

interface GigabitEthernet0/0

nameif OUTSIDE

ip address 4.4.4.4 255.255.255.0

The 2nd SA on the problem-child tunnel has an access-list created by the ASA (apologies if I'm not using the term SA correctly here):

asa-5585x/pri/act# show crypto ipsec sa peer 1.1.1.1 | in access

access-list acl-aws-va extended permit ip any 10.0.222.0 255.255.255.0

access-list OO_temp_MY-CRYPTO-MAP201 extended permit ip host 4.4.4.4 host 1.1.1.1 <-- what is this?

The 2nd SA with the "OO_temp" access list shows no encaps, but the encaps/decaps counters for the SA to the private subnet were creeping up a little during the "hang":

asa-5585x/pri/act# show crypto ipsec sa peer 1.1.1.1 | in pkts

#pkts encaps: 193262, #pkts encrypt: 193262, #pkts digest: 193262

#pkts decaps: 90266, #pkts decrypt: 90266, #pkts verify: 90266

#pkts compressed: 0, #pkts decompressed: 0

#pkts not compressed: 193262, #pkts comp failed: 0, #pkts decomp failed: 0

#pkts encaps: 0, #pkts encrypt: 0, #pkts digest: 0 <--------- this is the SA between the public IPs

#pkts decaps: 221158, #pkts decrypt: 0, #pkts verify: 0

#pkts compressed: 0, #pkts decompressed: 0

#pkts not compressed: 0, #pkts comp failed: 0, #pkts decomp failed: 0

A "clear" command fixed the problem right away. The "2nd SA" between public-peers was there for a few moments after the clear, then wasn't, still isn't, and appears to be gone for now.

The complete output:

Hung, can't get to 10.0.222.0:

asa-5585x/pri/act# show crypto ipsec sa peer 1.1.1.1

peer address: 1.1.1.1

Crypto map tag: MY-CRYPTO-MAP, seq num: 201, local addr: 4.4.4.4

access-list acl-aws-va extended permit ip any 10.0.222.0 255.255.255.0

local ident (addr/mask/prot/port): (0.0.0.0/0.0.0.0/0/0)

remote ident (addr/mask/prot/port): (10.0.222.0/255.255.255.0/0/0)

current_peer: 1.1.1.1

#pkts encaps: 159284, #pkts encrypt: 159284, #pkts digest: 159284 <--------- these moved a little during

#pkts decaps: 73759, #pkts decrypt: 73759, #pkts verify: 73759 <--------- the hanging, not much

#pkts compressed: 0, #pkts decompressed: 0

#pkts not compressed: 159284, #pkts comp failed: 0, #pkts decomp failed: 0

#pre-frag successes: 0, #pre-frag failures: 0, #fragments created: 0

#PMTUs sent: 0, #PMTUs rcvd: 0, #decapsulated frgs needing reassembly: 0

#TFC rcvd: 0, #TFC sent: 0

#Valid ICMP Errors rcvd: 0, #Invalid ICMP Errors rcvd: 0

#send errors: 0, #recv errors: 0

local crypto endpt.: 4.4.4.4/0, remote crypto endpt.: 1.1.1.1/0

path mtu 1500, ipsec overhead 74(44), media mtu 1500

PMTU time remaining (sec): 0, DF policy: clear-df

ICMP error validation: disabled, TFC packets: disabled

current outbound spi: 2ECAE6B9

current inbound spi : 9159EA8E

inbound esp sas:

spi: 0x9159EA8E (2438589070)

transform: esp-aes esp-sha-hmac no compression

in use settings ={L2L, Tunnel, PFS Group 2, IKEv1, }

slot: 0, conn_id: 166686720, crypto-map: MY-CRYPTO-MAP

sa timing: remaining key lifetime (kB/sec): (4373242/987)

IV size: 16 bytes

replay detection support: Y

Anti replay bitmap:

0xFFFFFFFF 0xFFFFFFFF 0xFFFFFFFF 0xFFFFFFFF

outbound esp sas:

spi: 0x2ECAE6B9 (785049273)

transform: esp-aes esp-sha-hmac no compression

in use settings ={L2L, Tunnel, PFS Group 2, IKEv1, }

slot: 0, conn_id: 166686720, crypto-map: MY-CRYPTO-MAP

sa timing: remaining key lifetime (kB/sec): (4339138/987)

IV size: 16 bytes

replay detection support: Y

Anti replay bitmap:

0x00000000 0x00000000 0x00000000 0x00000001

Crypto map tag: MY-CRYPTO-MAP, seq num: 201, local addr: 4.4.4.4 <-------- What is this "2nd SA" ??

access-list OO_temp_MY-CRYPTO-MAP201 extended permit ip host 4.4.4.4 host 1.1.1.1 <--- why is this here?

local ident (addr/mask/prot/port): (4.4.4.4/255.255.255.255/0/0)

remote ident (addr/mask/prot/port): (1.1.1.1/255.255.255.255/0/0)

current_peer: 1.1.1.1

#pkts encaps: 0, #pkts encrypt: 0, #pkts digest: 0 <--------- never good, but this whole thing

#pkts decaps: 221158, #pkts decrypt: 0, #pkts verify: 0 shouldn't be here anyway?

#pkts compressed: 0, #pkts decompressed: 0

#pkts not compressed: 0, #pkts comp failed: 0, #pkts decomp failed: 0

#pre-frag successes: 0, #pre-frag failures: 0, #fragments created: 0

#PMTUs sent: 0, #PMTUs rcvd: 0, #decapsulated frgs needing reassembly: 0

#TFC rcvd: 0, #TFC sent: 0

#Valid ICMP Errors rcvd: 0, #Invalid ICMP Errors rcvd: 0

#send errors: 0, #recv errors: 205980

local crypto endpt.: 4.4.4.4/0, remote crypto endpt.: 1.1.1.1/0

path mtu 1500, ipsec overhead 74(44), media mtu 1500

PMTU time remaining (sec): 0, DF policy: clear-df

ICMP error validation: disabled, TFC packets: disabled

current outbound spi: 30D63B57

current inbound spi : 9F8D3D0A

inbound esp sas:

spi: 0x9F8D3D0A (2676833546)

transform: esp-aes esp-sha-hmac no compression

in use settings ={L2L, Tunnel, PFS Group 2, IKEv1, }

slot: 0, conn_id: 166686720, crypto-map: MY-CRYPTO-MAP

sa timing: remaining key lifetime (kB/sec): (4374000/1723)

IV size: 16 bytes

replay detection support: Y

Anti replay bitmap:

0x00000000 0x00000000 0x00000000 0x00000001

outbound esp sas:

spi: 0x30D63B57 (819346263)

transform: esp-aes esp-sha-hmac no compression

in use settings ={L2L, Tunnel, PFS Group 2, IKEv1, }

slot: 0, conn_id: 166686720, crypto-map: MY-CRYPTO-MAP

sa timing: remaining key lifetime (kB/sec): (4374000/1723)

IV size: 16 bytes

replay detection support: Y

Anti replay bitmap:

0x00000000 0x00000000 0x00000000 0x00000001

AFTER CLEAR. Now we can ping private addresses in 10.0.222.0, in our AWS VPC, and the "2nd SA" is gone.

asa-5585x/pri/act# clear ipsec sa peer 1.1.1.1

asa-5585x/pri/act# show crypto ipsec sa peer 1.1.1.1

peer address: 1.1.1.1

Crypto map tag: MY-CRYPTO-MAP, seq num: 201, local addr: 4.4.4.4

access-list acl-aws-va extended permit ip any 10.0.222.0 255.255.255.0

local ident (addr/mask/prot/port): (0.0.0.0/0.0.0.0/0/0)

remote ident (addr/mask/prot/port): (10.0.222.0/255.255.255.0/0/0)

current_peer: 1.1.1.1

#pkts encaps: 1258, #pkts encrypt: 1258, #pkts digest: 1258

#pkts decaps: 1298, #pkts decrypt: 1298, #pkts verify: 1298

#pkts compressed: 0, #pkts decompressed: 0

#pkts not compressed: 1260, #pkts comp failed: 0, #pkts decomp failed: 0

#pre-frag successes: 0, #pre-frag failures: 0, #fragments created: 0

#PMTUs sent: 0, #PMTUs rcvd: 0, #decapsulated frgs needing reassembly: 0

#TFC rcvd: 0, #TFC sent: 0

#Valid ICMP Errors rcvd: 0, #Invalid ICMP Errors rcvd: 0

#send errors: 0, #recv errors: 0

local crypto endpt.: 4.4.4.4/0, remote crypto endpt.: 1.1.1.1/0

path mtu 1500, ipsec overhead 74(44), media mtu 1500

PMTU time remaining (sec): 0, DF policy: clear-df

ICMP error validation: disabled, TFC packets: disabled

current outbound spi: 1744F38B

current inbound spi : E8159389

inbound esp sas:

spi: 0xE8159389 (3893728137)

transform: esp-aes esp-sha-hmac no compression

in use settings ={L2L, Tunnel, PFS Group 2, IKEv1, }

slot: 0, conn_id: 168267776, crypto-map: MY-CRYPTO-MAP

sa timing: remaining key lifetime (kB/sec): (4373447/3384)

IV size: 16 bytes

replay detection support: Y

Anti replay bitmap:

0xFFFFFFFF 0xFFFFFFFF 0xFFFFFFFF 0xFFFFFFFF

outbound esp sas:

spi: 0x1744F38B (390394763)

transform: esp-aes esp-sha-hmac no compression

in use settings ={L2L, Tunnel, PFS Group 2, IKEv1, }

slot: 0, conn_id: 168267776, crypto-map: MY-CRYPTO-MAP

sa timing: remaining key lifetime (kB/sec): (4373804/3383)

IV size: 16 bytes

replay detection support: Y

Anti replay bitmap:

0x00000000 0x00000000 0x00000000 0x00000001

<----------------------------------------------------- That's it, no access-list OO_temp...

asa-5585x/pri/act#

(The 2nd SA actually was there briefly after the clear, but disappeared quickly and hasn't been back.)
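The comparison made by eye in the output above (which SA entry refers to which access-list, and whether its counters look sane) can also be done mechanically. The following is a minimal sketch, assuming the output of "show crypto ipsec sa peer" has been saved as text; the function names and the trimmed SAMPLE are illustrative, and the "decaps without decrypt" check is only a heuristic based on the counters shown in this post, not a documented failure signature.

```python
import re

# Counters of interest, keyed by a short name; the labels match the CLI output above.
COUNTERS = {
    "encaps": re.compile(r"#pkts encaps: (\d+)"),
    "decaps": re.compile(r"#pkts decaps: (\d+)"),
    "decrypt": re.compile(r"#pkts decrypt: (\d+)"),
    "recv_errors": re.compile(r"#recv errors: (\d+)"),
}

def parse_ipsec_sas(text):
    """Return one dict per 'Crypto map tag' entry found in the output."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Crypto map tag:"):
            current = {"acl": None, **{name: 0 for name in COUNTERS}}
            entries.append(current)
            continue
        if current is None:
            continue  # skip the 'peer address' header before the first entry
        acl = re.match(r"access-list (\S+)", line)
        if acl and current["acl"] is None:
            current["acl"] = acl.group(1)
        for name, pattern in COUNTERS.items():
            found = pattern.search(line)
            if found:
                current[name] = int(found.group(1))
    return entries

def decaps_without_decrypt(entry):
    """True when packets arrived on the SA but none were decrypted."""
    return entry["decaps"] > 0 and entry["decrypt"] == 0

# Trimmed excerpt of the "hung" output from this post.
SAMPLE = """\
peer address: 1.1.1.1
Crypto map tag: MY-CRYPTO-MAP, seq num: 201, local addr: 4.4.4.4
access-list acl-aws-va extended permit ip any 10.0.222.0 255.255.255.0
#pkts encaps: 159284, #pkts encrypt: 159284, #pkts digest: 159284
#pkts decaps: 73759, #pkts decrypt: 73759, #pkts verify: 73759
#send errors: 0, #recv errors: 0
Crypto map tag: MY-CRYPTO-MAP, seq num: 201, local addr: 4.4.4.4
access-list OO_temp_MY-CRYPTO-MAP201 extended permit ip host 4.4.4.4 host 1.1.1.1
#pkts encaps: 0, #pkts encrypt: 0, #pkts digest: 0
#pkts decaps: 221158, #pkts decrypt: 0, #pkts verify: 0
#send errors: 0, #recv errors: 205980
"""

for sa in parse_ipsec_sas(SAMPLE):
    if decaps_without_decrypt(sa):
        print(sa["acl"], "decaps:", sa["decaps"], "recv errors:", sa["recv_errors"])
```

Run against the excerpt, only the entry tied to the OO_temp access-list is reported; the SA for the private subnet passes the check, which matches the observation that it looks healthy even while the tunnel is hung.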

I don't even know if this is related, but our other tunnel to AWS Oregon is configured exactly the same way, and has never shown that 2nd SA with the access-list OO_temp. AWS information is virtually nil, all it says is "State: UP."

Many thanks for any guidance!

Andres Villarroel · ‎05-29-2015

Hello Noemi,

Thank you for the interesting post. I have never seen this OO_temp SA, but I will take a look at it and let you know if I find any information.

In the meantime, I believe that you should set up a syslog server and keep an eye on this tunnel. If there is a phase 2 negotiation, you should be able to see who is initiating this second SA.

noemi.berry · ‎05-29-2015

We do have a syslog server, but there's so much that goes into it, I'm not sure what to look for (and honestly haven't combed through it), nor how to identify what's abnormal.

I've been watching what goes into syslog via ASDM to get a "feel" for what goes by -- and so I see that SAs are re-negotiated every so often as normal practice (surely at a regular interval).

Dumb question: Is it possible to tell from the "show crypto ipsec sa peer" output who initiated the 2nd SA? Or only from syslog? Any idea what to look for in syslog?

I've heard anecdotally many times that AWS doesn't do well with multiple SAs per peer -- and it seems that each <acl> line in the "crypto map <map-name> match address <acl>" config turns into a separate -- multiple -- SA per peer. We have another L2L to another of our Cisco ASAs, and that L2L matches address to an <acl> with 4 lines for 4 separate subnets and seems to have 4 SAs -- fine. Seems each separate ACL line/subnet gets its own SA. That one works fine, but it's ASA-ASA. It's our ASA-AWS tunnels with multiple SAs that seem to have problems.

Apparently Cisco said that all our IPsec tunnels to AWS had to be originated from our side -- so all our crypto maps are "connection-type originate-only" -- but now you're making me wonder if there isn't a setting on the AWS side to match this that we're missing.

Please, keep the ideas coming, and I will post the hard-won resolution when it happens!

Andres Villarroel · ‎05-30-2015

Hello Noemi,

I understand and it's true that the syslog server can be overwhelming; however, you can configure some event class filters and collect only the information that you are looking for. The following document can guide you through this configuration and remember that for VPN we will need logs with the highest level (debugging).

http://www.cisco.com/c/en/us/support/docs/security/pix-500-series-security-appliances/63884-config-asa-00.html#anc14

Now, with the syslog server configured, you need to look for everything related to VPN with the peer 1.1.1.1. For example, you could do this as soon as you find out that your VPN is stuck. In order to understand these logs, you need to understand how the IKE and IPsec exchange works so that you can determine what is normal and what is abnormal. Here is a good document about this message exchange.

http://www.cisco.com/c/en/us/support/docs/security/asa-5500-x-series-next-generation-firewalls/113574-tg-asa-ipsec-ike-debugs-main-00.html
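Once the logs are on the syslog server, narrowing them down to the peer in question can be scripted. This is a minimal sketch, assuming the log is plain text with one message per line; it relies only on the %ASA-<severity>-<message id> tag that ASA syslog messages carry. The function name and the sample lines are hypothetical, and the sample text is placeholder content rather than real ASA message wording.

```python
import re

# ASA syslog messages carry a tag of the form %ASA-<severity>-<message id>.
ASA_TAG = re.compile(r"%ASA-(\d)-(\d{6})")

def lines_for_peer(lines, peer_ip):
    """Return (severity, message_id, line) for tagged lines that mention peer_ip."""
    # Guard against partial matches such as 11.1.1.1 or 1.1.1.10.
    ip = re.compile(r"(?<![\d.])" + re.escape(peer_ip) + r"(?!\.?\d)")
    hits = []
    for line in lines:
        tag = ASA_TAG.search(line)
        if tag and ip.search(line):
            hits.append((int(tag.group(1)), tag.group(2), line.rstrip()))
    return hits

# Placeholder lines, NOT real ASA messages: only the %ASA tag shape is real.
SAMPLE_LOG = [
    "May 29 10:00:01 asa %ASA-7-000001: placeholder text, peer 1.1.1.1",
    "May 29 10:00:02 asa %ASA-6-000002: placeholder text, peer 11.1.1.10",
    "May 29 10:00:03 asa %ASA-5-000003: placeholder text, IP = 1.1.1.1, more text",
    "May 29 10:00:04 asa unrelated line mentioning 1.1.1.1 without a tag",
]

for severity, msg_id, line in lines_for_peer(SAMPLE_LOG, "1.1.1.1"):
    print(severity, msg_id, line)
```

Grouping the hits by message id, and reading the surrounding minutes whenever the tunnel is reported stuck, is one way to get a feel for which messages are routine rekeys and which are unusual.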

Regarding your question of whether "show cry ipsec sa" shows who initiated phase 2: this is not possible. You can see who initiated "phase 1" with "show cry isa sa", but with the "originate-only" command in the crypto map, the ASA should be the only initiator for phase 1.

AWS doesn't do well with multiple SAs and you can find this information in Amazon's documents. They only accept an "any" to their subnet or a single subnet to their subnet; however, I've only seen it work with "any" to their subnet.

! If you do not wish to use the "any" source, you must use a single access-list entry for accessing the VPC range.
! If you specify more than one entry for this ACL without using "any" as the source, the VPN will function erratically.
! The any rule is also used so the security association will include the ASA outside interface where the SLA monitor
! traffic will be sourced from.

http://docs.aws.amazon.com/AmazonVPC/latest/NetworkAdminGuide/Cisco_ASA.html

As I mentioned before, the "originate-only" only works for phase 1. Phase 2 would be bidirectional and both sites could initiate an SA. I have a question for you, do you need your end 4.4.4.4 to be able to communicate with AWS subnet? If not, would you consider a VPN-filter only for the needed traffic?

noemi.berry · ‎05-31-2015

>...do you need your end 4.4.4.4 to be able to communicate with AWS subnet?

I don't think so -- that entire construct is a mystery to me.

Our other tunnel to AWS-Oregon contains an SA only for the private subnets behind the tunnel -- no SA between the tunnel-endpoints themselves. No idea why a 2nd SA to AWS-VA is created between the public-IP peer endpoints themselves. Bug or feature?

Before looking in detail into a VPN filter to prevent this unwanted behavior, we need to know where it's coming from. What is different between our (stable) peer to AWS in Oregon, with only one SA ever observed, and our (unstable) peer to AWS in Virginia, with two SAs observed, including a mystery one between the tunnel endpoints' public-IP peers themselves?

Clarities you've brought up: A) The unknown SA occurs only in Phase 2; B) it's impossible to tell from the "show" output which side initiated it; C) The auto-generated 'OO_temp' prefix in the ACL that the 2nd SA refers to isn't a commonly known element, and could be either the ASA responding to a phase 2 erratically generated by the AWS side, or the ASA itself erratically initiating a phase 2 to AWS.

(Whoever wrote the code generating the 'OO_temp'-named ACL probably knows, but they're long gone!)

Lots of time trolling / hacking syslogs is due -- and this might still not answer our broader objective of "why is this tunnel appearing to hang" -- so any other brainstorms that come up, please inform! A call to Cisco TAC is likely due as well but it can take hours to get to the support-level that can deal with this, I think?

thanks so much,

noemi

Andres Villarroel · ‎06-01-2015

Hello Noemi,

It's possible to determine who is initiating the unwanted phase 2 from the logs, but you will need to follow the IKE packets to do this accurately.

I did further research and found the following defects that we could use as a reference for your current issue: CSCse30102 and CSCse18005. An easy way to check whether you are hitting this known defect is to run "show run all | inc OO_temp". If you get a result, try the workaround described in CSCse18005.

I hope this helps.

noemi.berry · ‎06-01-2015

Wow, this is interesting. This bug you found (CSCse18005) is almost what's going on, with two key differences: One, our dynamically created ACL appears only in the "show ipsec sa peer" output, not in the running-config.

asa-5585x/pri/act# show run all | in OO_temp

asa-5585x/pri/act#

asa-5585x/pri/act# show ipsec sa | in OO_temp

access-list OO_temp_MY-CRYPTO-MAP201 extended permit ip host 4.4.4.4 host 1.1.1.1

asa-5585x/pri/act#

And Two, apparently this dynamic ACL creation, with a "host SA between IPSEC peers", is a feature associated with "originate-only". Our tunnels aren't consistent though: One never has a problem and never shows the 'host SA between peers'. One sometimes has a problem and sometimes shows this SA. And the 3rd almost always has that 2nd/host/peer SA and just seems to flap a lot, but we don't use it much so I'm not sure how much stock to put into that.

So.... feature not a bug. So maybe the existence of this peer-SA is a red herring, and maybe in normal operations its counters aren't supposed to move. Any idea what it's for, why it's there, why we have a stable originate-only tunnel that doesn't show this peer-SA?

I see some other things too; we also have client SSL VPNs that are unhappy about "any any" rules. And some disturbing messages about QM FSM errors, and a LAN-to-LAN being rejected ..... hmm, on our one (stable) tunnel without the peer-SA. More digging needed, keep the ideas coming!

Andres Villarroel · ‎06-01-2015

Noemi,

What happens if you remove the "connection-type originate-only" configuration from your crypto map? Does the tunnel keep failing?

I'm not quite sure what the reason behind this temporary SA is, but I will try to dig more into this. In the meantime, if you are willing to remove the originate-only configuration, it would be a good test to see if the issue persists.

noemi.berry · ‎06-01-2015

>>What happens if you remove the "connection-type originate-only" configuration from your crypto map? Does the tunnel keep failing?

Unfortunately I don't have this testing luxury on a live system. Before my time, the tunnels didn't work at all *until* they were configured as "originate-only", at Cisco's insistence. As I understand it, this is a required config parameter for L2L tunnels between ASAs and AWS.

Also, our tunnels "not working" isn't reproducible, we learn about it from customer complaints. It's also hard to characterize. I can't tell from "show ipsec sa peer" or other commands that the "remote proxy" subnet isn't reachable -- the SAs are established, encaps/decaps counters creep up, the Duration of the tunnel increments. I'm sure there's a smoking-gun in there but I don't see it yet.
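One way to put numbers on "counters creep up" is to take two snapshots of the same SA and subtract them. The sketch below is illustrative only: the function names are hypothetical, and it assumes the two sets of values from the first post (the full listing, then the "| in pkts" listing) describe the same SA and were taken in that order. Counter movement alone does not prove the remote subnet is reachable, so this complements an end-to-end check rather than replacing it.

```python
def counter_deltas(before, after):
    """Difference per counter between two snapshots of the same SA."""
    return {name: after[name] - before[name] for name in before}

def was_reset(deltas):
    """A negative delta means the counters restarted (for example after a clear)."""
    return any(value < 0 for value in deltas.values())

# Values copied from the first post. Assumption: same SA, 'earlier' taken first.
earlier = {"encaps": 159284, "decaps": 73759}
later = {"encaps": 193262, "decaps": 90266}

deltas = counter_deltas(earlier, later)
print(deltas, "reset:", was_reset(deltas))

# After the clear the counters restart from a low value, which shows up as a reset.
after_clear = {"encaps": 1258, "decaps": 1298}
print("reset:", was_reset(counter_deltas(later, after_clear)))
```

Recording the time between snapshots as well would turn the deltas into a packet rate, which is easier to compare against a known-good period.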

If "originate-only" creates host-SAs between the IPSec peers (inbound and outbound), so be it, but I don't like this business of some tunnels having the host-SAs some of the time, or not at all. Is it needed or isn't it?

Definitely need syslog filters to direct just VPN events to a separate syslog too.

once again, thank you very much for any and all information!

slicerpro · ‎06-02-2015

What is the rationale for using 'any' for your local subnet(s) in the acl? Is it because you have several of them? If so, I would create a single network Object for them all and reference that object in the acl instead of 'any'

noemi.berry · ‎06-03-2015

Two answers to rationale: 1) None; 2) History (a consultant set it up). That source 'any' definitely needs to change.

Question: If we create an object with multiple subnets, and refer to it in one ACL in the crypto map, how many SAs should we expect to see?

Andres Villarroel · ‎06-06-2015

Each object combination is going to create an ACL line, meaning an SA.

The "any" source is common for VPN tunnels with AWS. In Amazon's documentation, I don't see that "originate-only" is a requirement and, according to what we have seen so far, this "OO_temp" SA is generated when using this feature. Let us know if removing originate-only fixes your issue.
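To picture what "each object combination" means, the expansion can be sketched as a cross product of source and destination entries. This is a rough illustration, assuming one potential SA per expanded ACL line as described above; the local subnets are hypothetical, CIDR notation is used only for brevity (it is not ASA syntax), and an SA would only actually form once traffic matches a given line.

```python
from itertools import product

def expand_acl(sources, destinations):
    """Return one (source, destination) pair per expanded ACL line."""
    return list(product(sources, destinations))

# Hypothetical local subnets; 10.0.222.0/24 is the AWS subnet from this thread.
local_subnets = ["192.168.10.0/24", "192.168.20.0/24", "192.168.30.0/24"]
aws_subnets = ["10.0.222.0/24"]

pairs = expand_acl(local_subnets, aws_subnets)
for src, dst in pairs:
    print(src, "->", dst)
print("potential SAs:", len(pairs))

# The current config uses a single "any" source, so there is one line.
print("potential SAs with 'any':", len(expand_acl(["any"], aws_subnets)))
```

With three local subnets in the object, the sketch yields three lines and therefore three potential SAs, which is the multi-entry situation the AWS note quoted earlier in the thread warns may behave erratically.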

noemi.berry · ‎06-08-2015

>Each object combination is going to create an ACL line meaning a SA

Object "combination..." -- I think you mean that each item in an object will create its own ACL line which gets its own SA. An "object" is just a handy configuration construct that allows for convenient grouping and reference, but in the end really comes down to separate subnets or hosts defined in the object, each of which get their own ACL line and SA?

>>The "any" source is common for VPN tunnels with AWS. In Amazon's documentation, I don't see that "originate-only" is a requirement and according to what we have seen so far, this "OO_temp" SA is generated when using this feature. Let us know if taking this originate_only fixes your issue

Unfortunately, we were instructed by Cisco to use originate-only, and it will not be easy to challenge that. What bothers me is the peer-SA. If it's part of what "originate-only" needs, then why isn't it always there? And what is it needed for? If its decaps counter stops moving, is this a problem? And worse, is it the cause of our tunnels no longer passing data on the subnet-SAs?

And why DID Cisco recommend originate-only? (this was before my time and the consultants who told us that are long gone).

I'll post with updates if any grand insights come about, especially if someone else finds themselves having this problem.