11-12-2003 06:16 AM
I have a PIX 515 serving as a VPN server for 6 PIX 501's. All the 501's are running OS 6.3(1) and our 515 is running 6.3.3.106 (a special release from Cisco to fix a problem we were having with the 515 crashing). Occasionally the VPN connections drop completely and do not come back up for about 30 seconds to a minute. I have tried everything I know of to make the VPN tunnels stay up:
1) Increased the SA lifetime to the max
2) ISAKMP keepalive every 60 seconds
3) PIX 501's configured to do NTP across the VPN tunnel
4) Syslog to a server across the VPN tunnel
5) Monitoring server on PIX 515 side pinging inside interfaces of remote 501's across the VPN tunnel every 60 seconds
6) 30 second keep-alives in software that uses the VPN tunnel
Typically these 501's stay up. They might stay up for 4 or 5 days in a row then they will go down 3 or 4 times in a row, and it is usually all of them that go down. It makes me wonder if it is not a connection issue with our ISP...who of course says they are not having any problems. Our 515 log shows nothing, 501 logs show nothing. I have checked the memory usage/cpu usage of our 515 when the problems occur and both are well within normal ranges. CPU usage is almost always less than 30% when the outage occurs.
Checking the interface counters on both our PIX 515's outside interface and our perimeter router, I see some errors, including lost carrier and carrier transitions, which I think would be enough to disrupt the traffic flow. I can't see timestamps for when these errors occurred, though, so I have no way of knowing whether they coincide with the VPN going down.
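Since the counters themselves carry no timestamps, one workaround is to sample them on a schedule and record when each value was read. The following is a minimal sketch in Python, assuming the counter values have already been collected somehow (by hand, by script, or over SNMP) as (timestamp, count) pairs. The function names, the two-minute window, and the sample data are hypothetical and only illustrate the correlation step.

```python
from datetime import datetime, timedelta

def counter_increments(samples):
    """Return (timestamp, delta) for each sample where the counter rose.

    `samples` is a time-ordered list of (datetime, int) pairs, e.g. the
    carrier-transition count read from the interface at regular intervals.
    """
    events = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur > prev:
            events.append((ts, cur - prev))
    return events

def outages_near_events(outages, events, window=timedelta(minutes=2)):
    """Return the outage times that fall within `window` of a counter increase."""
    return [o for o in outages
            if any(abs(o - ts) <= window for ts, _ in events)]

# Hypothetical data: counter sampled once a minute, two observed VPN drops.
t0 = datetime(2003, 11, 12, 6, 0)
samples = [(t0, 5), (t0 + timedelta(minutes=1), 5), (t0 + timedelta(minutes=2), 7)]
outages = [t0 + timedelta(minutes=3), t0 + timedelta(minutes=30)]

events = counter_increments(samples)
matched = outages_near_events(outages, events)
print(events)
print(matched)
```

If most outages end up in the matched list, the carrier errors are a likely suspect. If none do, the link-level errors are probably unrelated to the VPN drops.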
I have run tracerts to the external IPs of the 501's while the outages are occurring and can successfully reach those IPs.
The IPsec counters on the 515 show send errors to the various 501's. I was thinking maybe our monitoring server was polling the 501's while the SAs were renewing, but that isn't the case.
I don't know what else to do to pinpoint the problem. If anyone could provide some help I would very much appreciate it. I am thinking of downgrading our 515 back to 6.3(1) to see if that helps, but then we risk running into the issue that forced us to get the special release from Cisco in the first place.
Are there any other diagnostic steps I can take before I downgrade the 515, and then open a TAC case if that doesn't solve the problem?
Thanks
11-12-2003 12:18 PM
UPDATE:
I raised the logging level on our 515 from 3 to 4 and got the following syslog message:
402101: decaps: rec'd IPSEC packet has invalid spi for destaddr=(Removed), prot=esp, spi=0x8c48
I put this through the PIX Error Decoder on TAC's site, and it said this is basically a timing differential between the SAs on the two devices. Does anyone know why this occurs or what steps I can take to keep it from happening? I have cleared the SAs on both devices numerous times, both after I read this and long before it. It doesn't seem to help. Any clues? I have changed a couple of the 501's over to EasyVPN and so far have not had any problems with them, even when the other 501's go down.
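To see which peers produce the invalid SPI message and how often, the syslog output can be filtered and counted per destination. The following is a minimal sketch in Python, assuming the log lines contain the 402101 text in the same form as the message quoted above. The function names are hypothetical, and the pattern is based only on that one sample line, so other software versions may word it differently.

```python
import re
from collections import Counter

# Pattern built from the single 402101 sample quoted in this thread.
INVALID_SPI = re.compile(
    r"402101: decaps: rec'd IPSEC packet has invalid spi for "
    r"destaddr=(?P<destaddr>[^,]+), prot=(?P<prot>\w+), "
    r"spi=(?P<spi>0x[0-9a-fA-F]+)"
)

def parse_invalid_spi(line):
    """Return the fields of a 402101 message as a dict, or None if no match."""
    m = INVALID_SPI.search(line)
    if m is None:
        return None
    return {
        "destaddr": m.group("destaddr"),
        "prot": m.group("prot"),
        "spi": int(m.group("spi"), 16),  # SPI as an integer
    }

def count_by_destaddr(lines):
    """Count 402101 messages per destination address."""
    parsed = (parse_invalid_spi(line) for line in lines)
    return Counter(p["destaddr"] for p in parsed if p is not None)

sample = ("402101: decaps: rec'd IPSEC packet has invalid spi for "
          "destaddr=(Removed), prot=esp, spi=0x8c48")
record = parse_invalid_spi(sample)
counts = count_by_destaddr([sample, "some unrelated syslog line", sample])
print(record)
print(counts)
```

A count that is concentrated on one or two peers would point at those sites, while an even spread across all the 501's would point back at the 515 or its uplink.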
02-04-2004 02:26 AM
I have exactly the same problem. After 3 to 4 days the traffic between the two sides, 515 (hub) and 501 (spoke), disappears. When I enter the command sh isakmp sa the status is IDLE, but no traffic is able to pass.
Have you already solved your issue?
Jefferson
02-04-2004 12:39 PM
The best way to track down VPN problems is with debugging, especially ISAKMP debugging, but for intermittent issues that can take a while and produce a lot of output that you need to wade through. However, once you've gone through the other more obvious steps, such as making sure *all* the parameters are the same on all the PIXes, debugging is about all you have left. In the more recent versions of software the VPN debugging messages are actually pretty useful, but you still have to sift through them carefully.
Good luck!
Dana
02-05-2004 03:10 PM
We got exactly the same error message. We finally found that it was a bandwidth issue at HQ.
We studied the traffic stats for every remote site. Whenever the total amount of traffic was high, the error occurred.
Basically, the invalid SPI means that one end is still using an old SPI while the other end is already using a new one. The SPI exchange is very time sensitive, and the heavy traffic destined for HQ was delaying it.
The problem went away as soon as we found that and upgraded the bandwidth at HQ. Hope this helps.
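A comparison like the one described above can be checked with a few lines of code. The following is a minimal sketch in Python, assuming you already have the total traffic per time interval and know which intervals contained the invalid SPI error. The function name, the interval numbering, and the numbers are hypothetical and are not figures from our network.

```python
from statistics import mean

def mean_traffic_split(traffic_by_interval, error_intervals):
    """Return (mean traffic in error intervals, mean traffic in the rest).

    `traffic_by_interval` maps an interval id to total traffic in that
    interval; `error_intervals` is the set of interval ids in which the
    invalid SPI error was logged. A side with no intervals yields None.
    """
    with_err = [v for k, v in traffic_by_interval.items() if k in error_intervals]
    without = [v for k, v in traffic_by_interval.items() if k not in error_intervals]
    return (mean(with_err) if with_err else None,
            mean(without) if without else None)

# Hypothetical example: traffic in Mbit per interval, errors seen in 1 and 3.
traffic = {0: 10, 1: 90, 2: 20, 3: 80}
busy, quiet = mean_traffic_split(traffic, {1, 3})
print(busy, quiet)
```

A large gap between the two averages suggests the errors track load on the link. Similar averages suggest looking elsewhere.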