SSH sessions over client-VPN freeze randomly

noemi.berry
Level 1

Hard problem!

 

We have two user VPNs configured on our Cisco ASA 5585X firewall (running 9.2(3) in an HA cluster) that have worked for years.  Users connect with a variety of VPN clients (some Mac, some old Cisco VPN clients, some AnyConnect), then typically ssh from their laptops into any number of Linux servers behind the firewall.  The user VPNs are configured with tunnel-groups using mostly default configs, and ACLs allowing them in on the public-facing "OUTSIDE" interface.

A few weeks ago, users started to complain that their ssh sessions to servers over VPN were hanging randomly for 5-15 seconds, several times an hour.  After a freeze ends, everything the user typed (usually a lot of annoyed <CR>s) is echoed back.  The VPN itself never drops -- just the ssh across the VPN hangs momentarily.

 

Troubleshooting revealed the following about these "ssh freezes":

- A continuous ping from the VPN client to the server doesn't stop or freeze during an 'ssh freeze'.

- A continuous ping from the server to the VPN client doesn't stop or freeze during an 'ssh freeze' (however its *output* freezes momentarily if it's being monitored over the VPN, then recovers and shows no loss/latency).

- No packet loss or unusual latency appears in ICMP pings in either direction during an 'ssh freeze'.

- No packet loss appears in TCP/22 'pings' from VPN client to server (TCP/22 'ping' = nc -z <server> 22) (server->client not tested)

- No 'ssh freezes' occur when the same servers are reached through a different interface on the same firewall (that is, through a backdoor management interface -- not over VPN)

- 'ssh freeze' is also observed to the CLIs of the Nexus switches between the firewall and the servers, over the same VPN

    -- but --

- 'ssh freeze' is NOT  observed when ssh'ing to the ASA firewall CLI *itself*, same VPN

- 'ssh freeze' is also seen from a VPN client to an AWS-based server reached over an L2L VPN connection (crypto-map config to AWS) terminated on the same firewall -- so the traffic never hits any 'zone' or interface ACL on the firewall, but still passes through the VPN and the associated NAT-bypass functions.  (A packet-tracer sketch for walking this kind of flow through the box follows this list.)
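For anyone wanting to check the same path: packet-tracer on the ASA can walk a synthetic TCP/22 flow through every phase it hits (ACL, NAT-exempt, VPN, any module redirect).  A minimal sketch, with hypothetical addresses (172.16.20.5 = a VPN client, 10.10.10.20 = a server):

    asa-5585x-fw1/pri/act# packet-tracer input OUTSIDE tcp 172.16.20.5 50000 10.10.10.20 22 detailed
    ! expect the UN-NAT phase to cite the NAT-bypass rule and the final
    ! result to be "action: allow"; a service-policy redirect to a module
    ! (e.g. CX) shows up as its own phase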

 

All the basic network troubleshooting is covered:

-- No packet loss, frame loss, drops or overruns on any network interface in the path from client to server

-- A packet sniffer on the network interfaces to/from the firewall shows no traffic interruption or unusual activity

-- Internet connections show no interruption

-- ssh to servers through a different (non-VPN) interface on the same firewall never shows freezes.

-- Monitoring the firewall's syslogs shows nothing unusual.  User-VPNs aren't dropping or being re-keyed.  Syslog is set to debug for all VPN-related classes (e.g. "logging class vpn trap debugging")
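Also worth noting for anyone in the same boat: captures on the ASA itself, and the accelerated-path drop counters, can catch drops that never reach syslog.  A generic sketch, with hypothetical names/addresses:

    asa-5585x-fw1/pri/act# capture sshcap interface OUTSIDE match tcp host 172.16.20.5 any eq 22
    asa-5585x-fw1/pri/act# show capture sshcap
    asa-5585x-fw1/pri/act# show asp drop
    ! remove with "no capture sshcap" when done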

 

The obvious question is: what changed recently?  Answer: nothing that would *appear* to be related.  E.g., we added a new zone, and copied static NAT-bypass statements to let the VPN into the new zone without NAT'ing, e.g.:

    nat (NEWZONE,OUTSIDE) source static any any destination static USERVPN-SUBNET USERVPN-SUBNET no-proxy-arp route-lookup

These types of statements already existed for the other existing zones, with no problems.
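One sanity check for a rule like that is the hit counters in plain "show nat" -- a sketch, with hypothetical addresses (10.50.0.10 = a NEWZONE server, 172.16.20.5 = a VPN client):

    asa-5585x-fw1/pri/act# show nat
    ! confirm translate_hits / untranslate_hits on the (NEWZONE,OUTSIDE)
    ! rule increment while a VPN user talks to NEWZONE, and that no
    ! earlier manual-NAT rule shadows it
    asa-5585x-fw1/pri/act# packet-tracer input NEWZONE tcp 10.50.0.10 22 172.16.20.5 50000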

This "freeze" issue is correlated in time with adding the new zone (and various other seemingly innocuous changes, cleaning up ACLs and such), but for the life of me, can't find any indication that a new zone is *causing* the freeze.

Users are also complaining about http/s, but those reports aren't reliable, and I haven't applied the same detailed troubleshooting to http/s yet.  It doesn't much matter, because ssh isn't working either way, and it's not clear any new information would be gained from setting up HTTP tests too.  It's strange enough that ICMP doesn't blink while ssh/TCP/22 does.

I'd like to fail over our firewall cluster from Primary / Active so the Secondary takes over as Active, to give "state" a kick in the pants, but we have a wiring issue to resolve first, and it's a desperation tactic anyway.
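(When we do get to it, the failover itself is just the standard command from the current active unit:)

    asa-5585x-fw1/pri/act# no failover active
    ! or, from the standby unit:  failover active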

Having ruled out everything (easy) in switching and routing, I suspect something related to statefulness and/or NAT on our firewall, but I'm out of ideas for what to troubleshoot or look for.

  What "could" cause pings to show no loss or latency while ssh randomly hangs for 5-15 seconds over a client-VPN?

  What more can we do to troubleshoot this?  What am I missing?

It's a huge problem affecting many users, becoming more urgent every day that we can't figure it out.   There are absolutely no symptoms other than "it appears to hang while I'm ssh'd to a server."   Any hints, theories, speculation, questions welcome!  Facts and experience even better!

thanks much!

1 Reply

noemi.berry
Level 1

Just in case anyone reads this and is interested in the resolution --

-- This was an extremely hard problem because it had no symptoms.  In the end, it was not a client-VPN problem; it affected any ssh that passed through our OUTSIDE interface.

Deepening the mystery, we failed over to make the Secondary unit Active, and the ssh-hang problem disappeared.   Failed back over to Primary / Active, and the ssh-hang problem re-appeared.  That pointed to something stateful that wasn't getting copied or looked up right between the two firewall units (?), but we couldn't find anything looking at the NAT or xlate tables.  TCP state-table bypass didn't help either.   Other than different junk files lying around on disk0, we couldn't find anything different about the two units.
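(For anyone curious, "TCP state-table bypass" means the standard tcp-state-bypass MPF config, roughly as below -- the names and the VPN pool subnet are hypothetical.  In our case it made no difference:)

    access-list tcp_bypass extended permit tcp 172.16.20.0 255.255.255.0 any eq ssh
    class-map tcp_bypass
     match access-list tcp_bypass
    policy-map tcp_bypass_policy
     class tcp_bypass
      set connection advanced-options tcp-state-bypass
    service-policy tcp_bypass_policy interface OUTSIDE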

But finally, failing back and forth yielded a difference.  When we went back to Secondary / Active, the Primary's failover state was now "Failed," due to "sw" on a CX module.  This hadn't shown up before we failed over to Secondary (ssh-hangs go away), back to Primary (ssh-hangs re-appear), then back to Secondary (because users were screaming).

    asa-5585x-fw1/pri/stby# show failover
    ...
            This host: Primary - Failed
    ...
                    slot 1: ASA5585-SSP-CXSC-A hw/sw rev (1.2/9.3.1.1) status (Up/Down)
                      ASA CX, 9.3.1.1, Up

            Other host: Secondary - Active
    ...
                    slot 1: ASA5585-SSP-CXSC-A hw/sw rev (1.2/9.2.1.1) status (Up/Up)
                      ASA CX, 9.2.1.1, Up

It turns out both firewalls had CX modules, left over from a past failed attempt at IDS, but everyone believed they'd been disabled.  I'd noticed from packet-tracer that the flows passed through the CX module, but it was configured as fail-open and had run that way for years, so it was one of those no-ops we didn't think to look at.
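(The leftover redirect looked roughly like this -- the class name is hypothetical, and "show module" is where the module's status and software version appear:)

    policy-map global_policy
     class cx_class
      cxsc fail-open

    asa-5585x-fw1/pri/act# show module
    ! slot 1 listed the ASA5585-SSP-CXSC-A and its sw version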

From the failover state, we noticed that the software revision on the CX module in each unit was slightly different -- newer (9.3.1.1) on the Primary unit that showed the ssh-hang problem, and older (9.2.1.1) on the Secondary unit.  Called TAC, and they said that neither revision was compatible with our main code base of 9.2(3)!

So we removed the leftover service policy that sent traffic through the CX module, disabled the hardware module on each unit, then failed back over to Primary / Active, and finally, after weeks of struggle, the ssh-hang problem was gone.
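(Concretely, the cleanup was along these lines -- again the class name is made up, but the commands are standard:)

    policy-map global_policy
     class cx_class
      no cxsc fail-open

    asa-5585x-fw1/pri/act# hw-module module 1 shutdown
    ! repeat on the other unit, then fail back to Primary / Active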

Why did we run for years with the CX modules without this problem showing up?  Hard to say, but we'd recently added some new zones and NATs, the state table had gotten more complex, and perhaps that tickled a bug in the newer CX code.

It's not easy to glean troubleshooting lessons from this.  Next time I see one protocol acting strangely while the others march on without interruption, I won't spend as much time ruling out other problems in the network; I'll go straight to the firewall.  I'll also take a much closer look at Primary / Secondary differences.  And I'll be more skeptical about "oh, that's been there forever, it couldn't be the problem...."

fyi!