We have two user VPNs configured on our Cisco ASA 5585-X firewall (running 9.2(3) in an HA cluster) that have worked for years. Users connect with a variety of VPN clients (some Mac, some old Cisco VPN clients, some AnyConnect), then typically ssh from their laptops into any number of Linux servers behind the firewall. The user VPNs are configured with tunnel-groups with mostly default configs, and ACLs allowing them on the public-facing "OUTSIDE" interface.
A few weeks ago, users started to complain that their ssh sessions to servers over VPN were hanging randomly for 5-15 seconds, several times an hour. After a freeze ends, everything the user typed (usually a lot of annoyed <CR>s) is echoed back. The VPN itself never drops; only the ssh session across the VPN hangs momentarily.
Troubleshooting revealed the following about these "ssh freezes":
- A continuous ping from the VPN client to the server doesn't stop or freeze during an 'ssh freeze'.
- A continuous ping from the server to the VPN client doesn't stop or freeze during an 'ssh freeze' (however its *output* freezes momentarily if it's being monitored over the VPN, then recovers and shows no loss/latency).
- No packet loss or unusual latency appears in ICMP pings in either direction during an 'ssh freeze'.
- No packet loss appears in TCP/22 'pings' from VPN client to server (TCP/22 'ping' = nc -z <server> 22) (server->client not tested)
- No 'ssh freeze' occurs when the same servers are reached through a different interface on the same firewall (that is, a backdoor through a management interface -- not over VPN)
- 'ssh freeze' is also observed when ssh'ing to the CLIs of the Nexus switches that sit between the firewall and the servers, same VPN
-- but --
- 'ssh freeze' is NOT observed when ssh'ing to the ASA firewall CLI *itself*, same VPN
- 'ssh freeze' is also seen from VPN client to an AWS-based server reached over an L2L VPN connection (crypto map config to AWS) terminated on the same firewall -- so, never hitting any 'zone' or interface or ACL on the firewall, but still passing through the VPN and associated NAT-bypass functions.
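For reference, the TCP/22 'ping' above can be approximated with a timed connect loop, which records how long each handshake takes rather than just whether it succeeds. This is a minimal sketch, not the exact test we ran (that was plain nc -z); the function names, the 0.5-second "slow" threshold and the probe interval are illustrative assumptions. Note that it only exercises new connection setup, so it says nothing about stalls inside an already-established session.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Open and close one TCP connection; return the connect time in
    seconds, or None if the connection failed or timed out."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def probe_loop(host, port, count, interval=1.0, timeout=2.0):
    """Run `count` probes, `interval` seconds apart.
    Returns a list of (wall_clock_time, connect_time_or_None)."""
    results = []
    for _ in range(count):
        results.append((time.time(), tcp_probe(host, port, timeout)))
        time.sleep(interval)
    return results


def summarize(results, slow_threshold=0.5):
    """Count failed probes and probes slower than `slow_threshold` seconds."""
    failures = sum(1 for _, rtt in results if rtt is None)
    slow = sum(
        1 for _, rtt in results if rtt is not None and rtt > slow_threshold
    )
    return {"probes": len(results), "failures": failures, "slow": slow}
```

Running something like `summarize(probe_loop("<server>", 22, count=600))` from a VPN client while a freeze is in progress would show whether new handshakes slow down or fail at the same moment; the wall-clock timestamps make it possible to line the results up with user reports.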
All the basic network troubleshooting is covered:
-- No packet loss, frame loss, drops or overruns on any network interface in the path from client to server
-- Packet sniffer on the network interfaces to/from firewall show no traffic interruption or unusual activity
-- Internet connections show no interruption
-- ssh to servers through a different (non-VPN) interface on the same firewall never shows freezes.
-- Monitoring the firewall's syslogs shows nothing unusual. User VPNs aren't dropping or re-keying. Logging for all VPN-related classes is set to debug level (e.g. "logging class vpn trap debugging")
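To make "no traffic interruption" in a capture less of an eyeball judgment, the packet timestamps for a single ssh flow can be scanned for silences that match the reported 5-15 second freezes. The helper below is a rough sketch under assumptions: `find_gaps` is a hypothetical name, the timestamps are assumed to be epoch seconds already exported from the sniffer for one flow, and the 5-second threshold simply mirrors the shortest freeze users report.

```python
def find_gaps(timestamps, threshold=5.0):
    """Return (gap_start, gap_end, duration) for every pair of consecutive
    packet timestamps that are more than `threshold` seconds apart."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur, cur - prev))
    return gaps
```

Comparing the gaps found on the OUTSIDE-side capture with those on the inside-side capture for the same flow would show on which side of the firewall the packets stop moving. An idle interactive session also produces long silences, so this is only meaningful while the user is actively typing or a command is producing steady output.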
The obvious question is: what changed recently? Answer: nothing that appears to be related. For example, we added a new zone and copied static NAT-bypass statements to let the VPN into the new zone without NAT'ing:
nat (NEWZONE,OUTSIDE) source static any any destination static USERVPN-SUBNET USERVPN-SUBNET no-proxy-arp route-lookup
Statements of this type already existed for the other zones and caused no problems.
This "freeze" issue is correlated in time with adding the new zone (and various other seemingly innocuous changes, such as cleaning up ACLs), but for the life of me, I can't find any indication that the new zone is *causing* the freeze.
Users are also complaining about http/s, but these reports aren't reliable, and I haven't applied the same detailed troubleshooting to http/s yet. It doesn't much matter, because ssh isn't working anyway and it's not clear that setting up HTTP tests would yield any new information. It's strange enough that ICMP doesn't blink while ssh/TCP/22 does.
I'd like to fail over our firewall cluster so the secondary takes over from the primary as active, to give "state" a kick in the pants, but we have a wiring issue to resolve first, and this is a desperation tactic anyway.
Having ruled out everything (easy) in switching and routing, I suspect something related to statefulness and/or NAT on our firewall, but am out of ideas for what to troubleshoot or look for.
What "could" cause pings to show no loss or latency while ssh randomly hangs for 5-15 seconds over a client-VPN?
What more can we do to troubleshoot this? What am I missing?
It's a huge problem affecting many users, becoming more urgent every day that we can't figure it out. There are absolutely no symptoms other than "it appears to hang while I'm ssh'd to a server." Any hints, theories, speculation, questions welcome! Facts and experience even better!