05-10-2011 06:43 AM
I intermittently see one of two particular tunnels go down. The applications crossing those tunnels get disconnected -- including the ping streams which the monitoring stations employ, whereupon they send us pages -- and users start calling. The tunnel comes back to life either via administrative intervention (clearing the association) or by waiting for the rekey timer to expire.
We can mitigate the problem by increasing the 'lifetime kilobytes' parameter and by decreasing the 'lifetime seconds' parameter. However, this only reduces the frequency of the issue; it doesn't eliminate it.
Through tests, we have persuaded ourselves of the following:
(a) If one side or the other counts to ~75% of its 'lifetime seconds' parameter, it initiates rekeying; rekeying occurs, the tunnel stays up, and everyone is happy.
(b) If one side or the other counts to 100% of its 'lifetime kilobytes' parameter, the tunnel goes down and stays down until one side or the other encounters condition (a).
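To make (a) and (b) concrete, here is a rough Python sketch of the outage window they imply. It only models the behaviour observed in our tests, not any vendor's documented logic; the function name, the 0.75 default, and the example numbers are all illustrative assumptions.

```python
def outage_window(lifetime_seconds, seconds_to_volume_limit, soft_fraction=0.75):
    """Return (down_at, up_at) in seconds since the SA was created, or None.

    Assumes the behaviour described in (a) and (b): a time-based rekey fires
    at soft_fraction of lifetime_seconds, and hitting 100% of the volume
    limit stalls the tunnel until that time-based rekey happens.
    """
    time_rekey_at = soft_fraction * lifetime_seconds
    if seconds_to_volume_limit >= time_rekey_at:
        # The time-based rekey wins and resets both counters first.
        return None
    return (seconds_to_volume_limit, time_rekey_at)


# Example values only: a 3600 s lifetime with the volume limit hit after 900 s.
print(outage_window(3600, 900))  # (900, 2700.0)
# With 'lifetime seconds 600' the time-based rekey fires at 450 s, before 900 s.
print(outage_window(600, 900))   # None
```

Under these assumptions, shortening 'lifetime seconds' helps only as long as 75% of it stays below the time needed to exhaust the volume limit.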
We believe that 4GB is the maximum we can set for 'lifetime kilobytes'. A couple of our tunnels are carrying increasing amounts of traffic (medical images); during peak times, we believe that we can blow through 4GB in ~15 minutes. We expect traffic volumes to increase.
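The arithmetic behind that estimate can be sketched in Python. This assumes the 'kilobytes' counter uses 1024-byte units (I have not confirmed that), and the 100 Mbit/s figure is a hypothetical future load, not a measurement.

```python
BYTES_PER_KB = 1024  # assumption: the 'kilobytes' counter uses 1024-byte units


def implied_mbit_per_s(lifetime_kb, seconds):
    """Average throughput needed to consume lifetime_kb in the given time."""
    return lifetime_kb * BYTES_PER_KB * 8 / seconds / 1e6


def seconds_to_exhaust(lifetime_kb, mbit_per_s):
    """Time to consume lifetime_kb at a sustained throughput."""
    return lifetime_kb * BYTES_PER_KB * 8 / (mbit_per_s * 1e6)


LIFETIME_KB = 4194300  # the value set on crypto map medvpn 5 in this post

# Exhausting the limit in ~15 minutes implies roughly this sustained rate.
print(round(implied_mbit_per_s(LIFETIME_KB, 15 * 60), 1))    # 38.2
# Minutes until exhaustion at a hypothetical 100 Mbit/s.
print(round(seconds_to_exhaust(LIFETIME_KB, 100) / 60, 1))   # 5.7
```

So a sustained rate under 40 Mbit/s is already enough to hit the volume limit in a quarter of an hour, and the window shrinks in proportion as traffic grows.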
On one end, we employ an IPSec SPA module (SPA-IPSEC-2G on C6K running 12.2(33)SXI3) to terminate tunnels (~20 tunnels, only two of which pass significant volumes of traffic). We have replicated this issue using those two heavily utilized tunnels, one of which is terminated by a Juniper ISG 1000; the other of which is terminated by an ASA5550.
I've been poking through RFC5996; section '2.8 Rekeying' seems to be the relevant section. I see no mention of 'data lifetime' (aka 'lifetime kilobytes') as an initiator for rekeying ... but then I suppose this is an implementation detail, not a protocol detail (i.e. I suppose the protocol is agnostic on what event initiates rekeying; it just specifies how rekeying gets done).
I can see variants of this issue mentioned in this forum over the years:
https://supportforums.cisco.com/message/3030926#3030926
https://supportforums.cisco.com/message/3216115#3216115
The latter post claims that 'lifetime seconds' must be the same on both sides of the tunnel; we haven't tried this, but I'm skeptical -- given the problems which can occur when both sides initiate a rekey simultaneously, this sounds like a bad idea ... and in any case, the protocol recommends instituting 'jitter' in the rekey timers, to reduce the frequency of precisely this event.
Would anyone have insights to shed on this? Seems to me that I should be able to configure 'lifetime seconds' and 'lifetime kilobytes' to anything I like on either side, and the implementation should just work: whichever side sees its 'seconds' or 'kilobytes' counter expire first should initiate rekeying, and life should be good. Obviously, however, my imagination outstrips reality.
Config snippets from the tunnel hub:
ip access-list extended x-y-impac
 deny ip host a.b.c.d a.b.c.0 0.0.0.255
 deny ip host a.b.c.e a.b.c.0 0.0.0.255
 permit ip x.y.z.0 0.0.0.255 a.b.c.0 0.0.0.255
 deny ip any any log
!
crypto ipsec transform-set impacset esp-aes esp-sha-hmac
!
crypto map medvpn 5 ipsec-isakmp
 set peer a.b.c.f
 set security-association lifetime kilobytes 4194300
 set security-association replay disable
 set transform-set impacset
 match address x-y-impac
 reverse-route remote-peer j.k.l.m
!
ip access-list extended x-pacs
 [...~15 permit statements, mostly /32, a few /24 ...]
 deny ip any any log
!
crypto ipsec transform-set pacsset esp-aes 256 esp-sha-hmac
!
crypto map medvpn 70 ipsec-isakmp
 set peer a.b.c.d
 set security-association lifetime seconds 600
 set security-association replay disable
 set transform-set pacsset
 match address x-pacs
 reverse-route remote-peer j.k.l.m
Suggestions welcome.
--sk
Stuart Kendrick
FHCRC
05-13-2011 11:52 AM
Hello there. I'm experiencing the same problem. Have you been able to resolve it?
05-13-2011 12:15 PM
I have a TAC case open; here's what I think I know.
VPN terminators initiate rekey based on two parameters: 'lifetime seconds' and 'lifetime kilobytes'. If the counter tracking time gets close to zero first, then the terminator initiates a rekey. Once the rekey is done, both counters reset and start counting down again. Or, if the 'lifetime kilobytes' counter gets close to zero first, then the terminator initiates a rekey, same deal.
However. Some devices, when their 'lifetime kilobytes' counter (aka 'volume rekey timer') counts down to zero, will (a) quit using the tunnel, and (b) /not/ initiate a rekey. This breaks the tunnel. This is also a bug, but apparently a common one. The usual work-around is to disable the 'lifetime kilobytes' timer.
In theory, this exposes the tunnel to more risk -- the more data which flows across a tunnel using a single set of keys, the more chance an attacker has of guessing those keys. For the moment, let's say that we work in an environment which wants to keep tunnels up more than it wants to resist this particular type of attack.
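As a way of pinning down the difference between the expected behaviour and the buggy one, here is a small Python sketch of the decision a terminator makes for one SA. It is a toy model, not any vendor's implementation: the function name, the return strings, and the use of the same 75% soft threshold for the volume counter are my assumptions.

```python
def next_action(elapsed_s, kb_used, lifetime_s, lifetime_kb,
                soft_fraction=0.75, volume_rekey_works=True):
    """Decide what a terminator does with an SA at this instant (toy model)."""
    if elapsed_s >= soft_fraction * lifetime_s:
        return "rekey (time)"
    if volume_rekey_works and kb_used >= soft_fraction * lifetime_kb:
        return "rekey (volume)"
    if kb_used >= lifetime_kb:
        # Buggy device: stops using the SA but never starts a rekey.
        return "stalled"
    return "ok"


# Healthy device: the volume counter triggers a rekey before the limit.
print(next_action(300, 3_500_000, 3600, 4_194_300))
# Buggy device: same traffic, limit reached, nothing rekeys until time runs down.
print(next_action(300, 4_194_300, 3600, 4_194_300, volume_rekey_works=False))
```

In this model the buggy device only recovers once the time branch fires, which matches the outages described above.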
The command we want is:
set security-association lifetime kilobytes disable
crypto map medvpn 5 ipsec-isakmp
 set peer a.b.c.f
 set security-association lifetime kilobytes disable
 set security-association replay disable
 set transform-set impacset
 match address x-y-impac
[I've disabled replay on account of another popular bug, this one on the far end device, which intermittently loses count of the sequence numbers used in this feature and rejects frames. But, for the purposes of this discussion, it is the 'set security-association lifetime kilobytes disable' command which is relevant.]
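For anyone with many crypto map entries to check, a quick audit script along these lines might help. It is a minimal sketch that assumes the config text looks like the snippets in this thread; the function name and the "default" label are mine, and it does not handle dynamic maps or globally configured lifetimes.

```python
import re


def volume_rekey_settings(config_text):
    """Map each (crypto map name, sequence) to its 'lifetime kilobytes' setting.

    Entries with no explicit 'lifetime kilobytes' line are reported as
    "default". Assumes 'set security-association ...' lines only follow the
    crypto map entry they belong to, as in the snippets in this thread.
    """
    settings = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        m = re.match(r"crypto map (\S+) (\d+) ipsec-isakmp", line)
        if m:
            current = (m.group(1), int(m.group(2)))
            settings[current] = "default"
            continue
        m = re.match(r"set security-association lifetime kilobytes (\S+)", line)
        if m and current is not None:
            settings[current] = m.group(1)
    return settings


sample = """
crypto map medvpn 5 ipsec-isakmp
 set peer a.b.c.f
 set security-association lifetime kilobytes disable
 set security-association replay disable
crypto map medvpn 70 ipsec-isakmp
 set peer a.b.c.d
 set security-association lifetime seconds 600
"""
print(volume_rekey_settings(sample))
# {('medvpn', 5): 'disable', ('medvpn', 70): 'default'}
```

Any entry that comes back as "default" or with a number still has volume rekey active and is a candidate for the outage described here.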
I'm starting the change management ball rolling, to try out this fix -- in one particular case, I have devices on both ends which support disabling volume rekey (IPSec SPA on one end and Juniper ISG on the other).
However, I have other tunnels terminated on ASA, and ASA does not support disabling volume rekey. TAC believes that at least some ASA versions are prone to this bug. [I can attest to v8.0(5) being prone to it; I haven't tried any others.]
Next on my plate:
(a) Search the bug database to see if ASA fixes this volume rekey bug in some more recent version
(b) Ask my SE to submit an RFE for disabling volume rekey under ASA.
hth,
--sk