This week one of our site-to-site IPsec VPNs dropped at random with no clear cause.
While troubleshooting I verified that all the config was correct, re-created the VPN and debugged the life out of it, but could not fix it.
What I did notice through debug crypto isakmp was that the problem was on only one of the two firewalls.
Effectively that firewall (let's call it FW1) had just stopped listening to ISAKMP messages. Nothing was even logged.
The working firewall (FW2) would log that it had received the initiation request and respond accordingly, but FW1 never logged that it had received any requests.
I tested this theory by deleting the VPN config on FW2 so that FW1 would send the initiation, and FW2's debug happily logged that it was receiving messages. When I reversed this and deleted the config on FW1 so that FW2 would send the initiation, FW1 showed nothing in debug.
In the end I restarted FW1 and immediately the VPN resumed. No settings were changed and suddenly debug was full of messages again.
Can the VPN subsystem just lock up in this way? Is there a way to clear or restart the VPN subsystem without a full firewall reboot? On this occasion it was late and I had time to debug and restart, but if this happened during working hours that might not be possible.
I tried everything I could think of to kick-start the VPN/ISAKMP side: disabling ISAKMP on all interfaces, deleting all crypto config lines, re-creating everything from scratch. Nothing would restart it. In the end, as I say, I rebooted the firewall and it worked immediately, even though it came back up on the same start-up config and nothing had changed.
Very strange, and if anyone can shed some light it would be appreciated.
Couple of thoughts:
1) Get a sniffer trace. IKE not processing anything typically means it's not receiving the packets.
2) If you tell me "one side can initiate without problems" and "when the other side initiates, nothing is happening in debugs", that typically indicates a stateful device (firewall) in the path. Depending on configuration and platform, a UDP session might not send anything for a long time, so the state that device holds for it can expire.
3) Assuming it's not the above and it's a memory leak or block depletion, actually clearing the condition might not be as easy as clearing the IKE or IPsec sessions. Upgrade the "affected" firewall to the latest release and re-test, or check in with TAC.
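
On (1), once you have the trace, one quick check is whether the UDP 500 payloads reaching FW1 are well-formed IKE at all. Below is a rough Python sketch of decoding the fixed 28-byte ISAKMP header (layout per RFC 2408) from a raw UDP payload. It is only an illustration: parse_isakmp_header and the sample bytes are names and values I made up, it assumes plain IKEv1 on UDP 500 (NAT-T on 4500 puts a 4-byte marker in front of the header, which this does not strip), and it is not anything the firewall itself runs.

```python
import struct

# IKEv1 exchange types as listed in RFC 2408, section 3.1
EXCHANGE_TYPES = {
    1: "Base",
    2: "Identity Protection (Main Mode)",
    3: "Authentication Only",
    4: "Aggressive",
    5: "Informational",
}

# Initiator cookie (8), responder cookie (8), next payload, version,
# exchange type, flags (1 byte each), message ID (4), total length (4).
HEADER = struct.Struct("!8s8sBBBBII")


def parse_isakmp_header(udp_payload: bytes) -> dict:
    """Decode the fixed ISAKMP header at the start of a UDP 500 payload."""
    if len(udp_payload) < HEADER.size:
        raise ValueError("payload shorter than the 28-byte ISAKMP header")
    (init_cookie, resp_cookie, _next_payload, version,
     exchange, _flags, _message_id, length) = HEADER.unpack_from(udp_payload)
    return {
        "initiator_cookie": init_cookie.hex(),
        "responder_cookie": resp_cookie.hex(),
        "version": f"{version >> 4}.{version & 0x0F}",
        "exchange": EXCHANGE_TYPES.get(exchange, f"unknown ({exchange})"),
        # An all-zero responder cookie marks the very first packet of a negotiation.
        "is_first_message": resp_cookie == bytes(8),
        "length": length,
    }


# Made-up sample: header of a first Main Mode packet, with no payloads attached.
sample = HEADER.pack(bytes.fromhex("0123456789abcdef"), bytes(8), 1, 0x10, 2, 0, 0, 28)
info = parse_isakmp_header(sample)
print(info["exchange"], info["is_first_message"])
```

If packets like this show up in a capture taken on FW1's outside interface while debug stays silent, that points at the box itself (your point 3 territory). If they never show up in the capture, look at the path.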