Re: ASA 5505 L2L VPN drops SSH, RDP, and SMB traffic

cisco_nmillc · ‎08-24-2012

Two ASA5505s with functioning IPsec L2L VPN. Well, almost functioning.

Most "connectionless" traffic, including ICMP traffic, DNS queries, DNS zone transfers, HTTP(S) and ASDM, works fine between the joined LANs.

SSH and FreeNX (X over SSH) sessions establish and run for 20-100 seconds before they are terminated. SSHing into a remote LAN host and pinging back to where I'm coming from on four successive attempts made it to ping 70, 83, 84, and 93 respectively before receiving a "connection lost" message. The connection can be immediately reestablished with no noticeable delay or latency, but drops again after one to two minutes.

RDP wouldn't establish at all, but that's because a recent install of personal security software obliterated the previous rule permitting tcp/3389 incoming to the target host--so we'll just pretend I didn't mention it.

Informational SMB traffic (e.g. net view) takes a _long_ time to complete--several attempts took 60+ seconds (and this is not a DNS issue).

Most of the discussions I've read point toward fragmentation when a VPN tunnel comes up but traffic is unreliable. However, the SSH packets are tiny and don't appear to be triggering fragmentation. I've also tried sysopt connection tcpmss 1300 on both devices but to no avail.

Any suggestions?

Karsten Iwen · ‎08-24-2012

What are the corresponding log-messages on both ASAs?


cisco_nmillc · ‎08-25-2012

There are no relevant log messages. These are BUN-K9s (10 user limit), but the host limit has not been exceeded and no hosts have been denied.

There is some additional information: the VPN is more stable in one direction than the other. An SSH session from the branch office to the data center was up continuously for 8h55m.

user pts/4 Fri Aug 24 11:58 - 20:54 (08:55) host.branch.company.com

Sessions from the workstation segment of the data center to the branch survived no longer than 21m (and the connections all dropped on the client end after 1-4 minutes; the 16-21m duration of these connections is how long it took the server to realize the connection was dead):

user pts/2 Fri Aug 24 21:48 - 22:10 (00:21) host.datacenter.company.com

user pts/5 Fri Aug 24 21:26 - 21:43 (00:17) host.datacenter.company.com

user pts/4 Fri Aug 24 21:20 - 21:37 (00:17) host.datacenter.company.com

user pts/2 Fri Aug 24 21:17 - 21:34 (00:16) host.datacenter.company.com

user pts/2 Fri Aug 24 20:44 - 21:01 (00:17) host.datacenter.company.com

user pts/5 Fri Aug 24 19:31 - 19:48 (00:17) host.datacenter.company.com

Sessions from the server segment of the branch appear to survive indefinitely although I haven't proved that yet.

This would seem to imply an infrastructure issue on the workstation segment of the data center network. The workstation segment consists of an AP1041 and several wired workstations connected to some Catalyst 2924XL switches with an IBM x300 (2003 vintage!) server running OpenBSD 4.8 as a firewall.

Don't let this stop you from making suggestions. The information above is suggestive but inconclusive. Other information strongly suggests the problem is related to the VPN tunnel. For example, mapping the same hosts at either end to public addresses and connecting with SSH over the Internet is reliable. Only over the IPsec L2L VPN are SSH connections between the workstation datacenter segment and the branch office unreliable.

david.tran · ‎08-25-2012

Here are my suggestions:

1- If possible, capture traffic on the internal interface of both ASAs for a particular SSH connection. Do NOT use the stupid capture commands on the Pix. What you want to do is SPAN the port on the internal interface of the ASA and mirror that traffic to a Linux box running tcpdump, writing the capture to a file so that you can analyze it in more detail. That way, you can verify whether the traffic makes it back and forth over the VPN tunnel for a particular session.

2- Enable SSH keep-alive on the workstation that makes the SSH connection to the server in your branch. Set the keepalive to 5 seconds. If you run Wireshark on the workstation in the data center, you will see a PUSH and ACK between the workstation and the SSH server in your branch office even when you're not typing anything on that SSH connection. That will maintain the connectivity. The session should NOT time out. If it does, then you can confirm the issue is with the VPN tunnel.

cisco_nmillc · ‎08-26-2012

David,

Thanks for the suggestions.

SSH keep-alive is enabled on all my SSH sessions, though the timeout is 30 seconds. The issue is not an SSH timeout. From my original post:

SSHing into a remote LAN host and pinging back to where I'm coming from on four successive attempts made it to ping 70, 83, 84, and 93 respectively before receiving a "connection lost" message.

Even if I wasn't sending traffic back to the source, the connection persists for 2+ times the current keep-alive interval.

It's possible to sniff; I was just hoping to avoid it if this was a problem easily answered by the CSC. Then again, if it was easily answered I would have found the answer already. You're absolutely correct that using a tool external to the Cisco device is preferable to using the capture, though Cisco capture has served me well in the past.

As for Karsten's suggestion to increase the logging level:

SSH Connection initiated, resulting in ASDM messages (the two build messages are possibly interesting):

6  Aug 26 2012  07:29:09  10.2.0.200  60652  10.10.0.110  22  Built outbound TCP connection 35731 for outside:10.10.0.110/22 (10.10.0.110/22) to inside:10.2.0.200/60652 (10.2.0.200/60652)

6  Aug 26 2012  07:29:09  10.2.0.200  60653  10.10.0.110  22  Built outbound TCP connection 35732 for outside:10.10.0.110/22 (10.10.0.110/22) to inside:10.2.0.200/60653 (10.2.0.200/60653)

Connection drops within +/-90 seconds: NO ASDM messages relevant to connection.

After 360+/- seconds, NO ASDM messages relevant to the connection. In other words, the connection drops but as far as the ASA is concerned it's still alive. This is consistent with my earlier posting that the target of the SSH connection shows the connection being alive for 15-20 minutes whereas the client perceives a drop after 60-90 seconds.

Interestingly, I have apparently confirmed my previous observation that SSH connections from other DC segments to the same branch host survive indefinitely: one such session made it to 52990 pings with 0% loss before I stopped it (1 second per ping).

Karsten Iwen · ‎08-26-2012

the two build messages are possibly interesting

The teardown messages would have been more interesting, as the teardown reason is visible there (for example, a RST from inside). But you answered what I wanted to see: if there is no teardown message, then the ASA is probably not the cause of the problem. The exception would be if a capture on one ASA's inside interface shows an incoming packet that is not seen leaving the other ASA's inside interface.

--
Don't stop after you've improved your network! Improve the world by lending money to the working poor:
http://www.kiva.org/invitedby/karsteni

Karsten Iwen · ‎08-26-2012

There are no relevant log messages.

Then you need to increase the Log-level. Every connection (setup and tear-down) creates a log-message.

As per David's advice, a capture would also help. In my opinion, it's fine to do that on the ASA itself, which is much easier than spanning to an external device. As the connection is established through the ASA, all needed information should be seen there.


cisco_nmillc · ‎08-26-2012

PROBLEM (possibly) IDENTIFIED or "Windows strikes again"

Sessions from the server segment of the branch appear to survive indefinitely although I haven't proved that yet. [...] This would seem to imply an infrastructure issue on the workstation segment of the data center network.

SSH sessions initiated from other DC segments to the branch survive indefinitely (or at least for more than 12 hours with no packet loss).

Workstation DC segment is unique because there is a firewall (10.2.0.1) and the ASA (10.2.0.2). DHCP-supplied default gateway for hosts on this segment is 10.2.0.1, which in turn has a route to 10.10.0.0/24 via 10.2.0.2. This generates an ICMP REDIRECT on the first connection attempt to 10.10.0.110 (OBSERVED on non-Windows 10.2 host).
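To make the redirect condition concrete, here is a rough Python sketch of the decision the firewall at 10.2.0.1 is effectively making. The /24 mask on the workstation segment is an assumption (the thread only gives host addresses), the function names are made up for illustration, and the test is a simplification of what a real router checks:

```python
import ipaddress

# Assumption: the workstation segment is 10.2.0.0/24 (mask not stated in the thread).
LAN = ipaddress.ip_network("10.2.0.0/24")

# Static route held by the firewall (10.2.0.1): the branch LAN is reached via the ASA.
ROUTES = {
    ipaddress.ip_network("10.10.0.0/24"): ipaddress.ip_address("10.2.0.2"),
}


def next_hop(dst):
    """Longest-prefix match over the static routes; None means 'use the upstream default'."""
    dst = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if dst in net]
    if not matches:
        return None
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]


def should_send_redirect(src, dst):
    """Simplified test: sender and next hop sit on the same segment,
    so the sender could have reached the next hop directly."""
    hop = next_hop(dst)
    if hop is None:
        return False
    return ipaddress.ip_address(src) in LAN and hop in LAN


# Workstation 10.2.0.200 opening SSH to 10.10.0.110 triggers a redirect toward 10.2.0.2.
print(should_send_redirect("10.2.0.200", "10.10.0.110"))
```

Traffic to anything without a matching static route leaves via the firewall's upstream path, so no redirect is generated for it in this sketch.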

SSH Connection initiated, resulting in ASDM messages (the two build messages are possibly interesting)

This appears to be the issue when initiating SSH from a Windows box: The initial session request is passed through 10.2.0.1 but generates an ICMP REDIRECT. The Windows box then _duplicates_ the session request and sends it to 10.2.0.2. That is why there are two build messages for a single SSH session request.

Manually adding a route on the Windows box to 10.10.0.0/16 via 10.2.0.2 and starting an SSH session to 10.10.0.110 appears to work--the current SSH session has been established for over 10 minutes, more than 3x as long as any previous attempt.

I've attached two packet traces: one is for the SSH setup with only the DHCP-supplied default gateway on the workstation, which generates ICMP redirects and causes duplicate connection setups with lifetimes of 60-90 seconds (blast-oxygen-20120826-redirect.pdf). The other is with a static route added on the Windows workstation pointing to the ASA for the 10.10.0.0/24 network. Connections appear to be stable at least out to 2 hours (blast-oxygen-20120826-static-route.pdf).

I realize that this is _not_ an issue with the ASA, but I'm curious if anyone has any suggestions. It's been 14 years since I've done much bit-level network analysis, but this seems to be incorrect handling of the ICMP redirect. I recognize that the standard wisdom is that ICMP redirects should never be seen on a LAN segment, but that wisdom is flawed since it presumes there is only one egress route from a LAN. In this case there are quite correctly two egress routes from the LAN.

Message was edited by: Andrew Robinson

cisco_nmillc · ‎08-26-2012

PROBLEM RESOLVED

This problem was resolved through my own efforts and insights. Input from the CSC was appreciated but not helpful in this instance. Unfortunately, I can't mark my own solution as the correct answer--so maybe someone with such power can do so.

The solution is to add DHCP option 121 (classless route) to the options provided by the DHCP server. Under most real operating systems, this is done in /etc/dhcpd.conf as follows:

option option-121 18:0a:0a:00:0a:02:00:02;

The value of the option in this case is in the encoded format specified in RFC 3442 (more recent DHCPD servers may offer a mnemonic and a friendlier format). The encoding breaks down as follows:

- The first octet is the length of the prefix: '18'x = 24 bits

- The next three octets are the network to be routed: '0a0a00'x = 10.10.0

- The next four octets are the router: '0a020002'x = 10.2.0.2

- Multiple routes can be concatenated one after other--ugly, but functional
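
For anyone who would rather not hand-encode the hex, the layout described above can be sketched in a few lines of Python. This is only an illustration of the RFC 3442 encoding for IPv4 routes; the function names are mine and not part of any DHCP package:

```python
import ipaddress


def encode_route(destination, router):
    """One route: prefix length, the significant octets of the network, then the router."""
    net = ipaddress.IPv4Network(destination)
    significant = (net.prefixlen + 7) // 8  # octets needed to hold the prefix
    return (bytes([net.prefixlen])
            + net.network_address.packed[:significant]
            + ipaddress.IPv4Address(router).packed)


def encode_option_121(routes):
    """Multiple routes are simply concatenated one after another."""
    return b"".join(encode_route(dest, router) for dest, router in routes)


def decode_option_121(data):
    """Reverse of the above; returns a list of (network/prefix, router) strings."""
    routes, i = [], 0
    while i < len(data):
        prefixlen = data[i]
        significant = (prefixlen + 7) // 8
        end = i + 1 + significant + 4
        if prefixlen > 32 or end > len(data):
            raise ValueError("malformed classless static route data")
        dest = data[i + 1:i + 1 + significant] + bytes(4 - significant)
        router = data[i + 1 + significant:end]
        routes.append((f"{ipaddress.IPv4Address(dest)}/{prefixlen}",
                       str(ipaddress.IPv4Address(router))))
        i = end
    return routes


def to_dhcpd_hex(data):
    """Colon-separated hex, as used in the dhcpd.conf line above."""
    return ":".join(f"{b:02x}" for b in data)


print(to_dhcpd_hex(encode_option_121([("10.10.0.0/24", "10.2.0.2")])))
# prints 18:0a:0a:00:0a:02:00:02
```

Running decode_option_121 on the same bytes gives back 10.10.0.0/24 via 10.2.0.2, which is a handy sanity check before pasting a value into the DHCP configuration.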

The ASA is absolutely not at fault in this case. As usual, Windows' behavior is the source of all evil. The only relationship between this problem and the ASA was that the ASA represented a second egress point from a particular segment. Everything goes to 10.2.0.1 by default, but 10.2.0.2 is the ASA which in turn provides the tunnel to the target network 10.10.0.0/24. You don't have to agree with this configuration, you just have to accept that it is both valid and necessary in this case.

david.tran · ‎08-26-2012

The problem could have been resolved by turning off ICMP redirects on the OpenBSD firewall, with something like "net.ipv4.conf.all.accept_redirects = 0"

That way, the OpenBSD box will do the routing for the workstation instead of redirecting it.

cisco_nmillc · ‎08-26-2012

David: That would work, but as all non-Windows hosts respond correctly to the ICMP REDIRECT (either by ignoring it or respecting it), I see no reason to reward Windows' broken behavior.

For the sake of posterity it is not OpenBSD accepting redirects that triggers the problem, but generating them (which a device configured as a router is supposed to do). The correct commands to inhibit ICMP REDIRECTs on OpenBSD would be:

sysctl net.inet.ip.redirect=0

sysctl net.inet6.ip6.redirect=0

Newer OpenBSD kernels by default do not accept redirects, but the commands to turn this off explicitly would be:

sysctl net.inet.icmp.rediraccept=0

sysctl net.inet6.icmp6.rediraccept=0

Obviously permanent changes go in /etc/sysctl.conf.

david.tran · ‎08-26-2012

In your original post, you stated: server running OpenBSD 4.8 as a firewall.

Now you're saying: For the sake of posterity it is not OpenBSD accepting redirects that triggers the problem, but generating them (which a device configured as a router is supposed to do).

See the contradiction? A firewall is not the same as a router.

The firewall should have ICMP redirects disabled. A router, IMHO, should have ICMP redirects disabled as well, unless you're using WCCP, to avoid situations like the one you had originally.

cisco_nmillc · ‎08-26-2012

I apologize if my terminology was not precise.

As for how to handle ICMP REDIRECTS, the tendency to want to disable ICMP REDIRECT is the result of too many people not understanding why it was considered a security problem in the first place. ICMP REDIRECT is not a credible security threat on a properly configured internal network.

ICMP REDIRECT should most certainly be generated by routers, and it should most certainly be handled properly by client workstations.

The correct handling by the router is to see a local network device that can better handle the traffic, send a redirect, and forward the original packet to the better device.

The correct handling by the workstation is to either ignore the redirect (continuing to send packets to the original router will work, it just generates lots of unnecessary redirects), or to honor the redirect by updating the routing table but _not_ otherwise altering the protocol flow.
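
As a rough illustration of what "honor the redirect" means on the host side, here is a small Python sketch. The class and method names are hypothetical, the /24 LAN mask is an assumption, and the two sanity checks (redirect must come from the current first hop, new gateway must be on-link) follow my reading of the host requirements in RFC 1122:

```python
import ipaddress


class Host:
    """Toy host routing table: a default gateway plus routes learned from redirects."""

    def __init__(self, lan, default_gw):
        self.lan = ipaddress.ip_network(lan)
        self.routes = {ipaddress.ip_network("0.0.0.0/0"): ipaddress.ip_address(default_gw)}

    def lookup(self, dst):
        """Longest-prefix match; the default route always matches."""
        dst = ipaddress.ip_address(dst)
        best = max((net for net in self.routes if dst in net),
                   key=lambda net: net.prefixlen)
        return self.routes[best]

    def handle_redirect(self, sender, dst, new_gw, honor=True):
        """Either ignore the redirect or install a host route. Nothing else changes:
        existing connections keep their state and simply use the new next hop."""
        if not honor:
            return False  # ignoring is also valid; the router keeps forwarding
        sender, new_gw = ipaddress.ip_address(sender), ipaddress.ip_address(new_gw)
        if sender != self.lookup(dst):
            return False  # not from the gateway currently used for dst
        if new_gw not in self.lan:
            return False  # new gateway is not directly reachable
        self.routes[ipaddress.ip_network(f"{dst}/32")] = new_gw
        return True


host = Host("10.2.0.0/24", "10.2.0.1")
host.handle_redirect("10.2.0.1", "10.10.0.110", "10.2.0.2")
print(host.lookup("10.10.0.110"))  # next hop is now the ASA, 10.2.0.2
```

Note that only the next hop for 10.10.0.110 changes; nothing in the sketch resends or duplicates the connection request, which is the behavior the Windows box appeared to get wrong.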

But we're getting far afield--the problem is resolved. Your solution will work, but I think it's no more and possibly less correct than my solution. Unless you want to have the last word, I appreciate all the help and feedback and this discussion is finis! ;-)