Solved: Switchport fail open to current VLAN?

Josh Morris · ‎05-08-2019

I am trying to come up with some resiliency options in case of a massive ISE failure. I have a critical VLAN set up on all switchports in case ISE is unreachable. My question is: how do I deal with switchports that use dynamically assigned VLANs? If ISE were to fail, the switchport would fail to the critical VLAN, which would be different from the dynamically assigned VLAN, so the client would stop working.

My only idea is to write a script that would go out daily, get the current dynamic VLAN, and change that port's critical VLAN to the currently assigned VLAN.

Thanks.

Aravind Ravichandran · ‎05-08-2019

IBNS 2.0 can handle critical failure very well. Please refer to this link

-Aravind

howon · ‎05-08-2019

That is a common misconception regarding the critical VLAN. When a critical event occurs, endpoints that are already authenticated stay in whatever state they were in, whether that is a VLAN, an ACL, etc. So clients will keep working unless a specific client disconnects and reconnects to the interface. Only clients trying to authenticate during the critical state will be dropped into the critical VLAN.

Josh Morris · ‎05-08-2019

Then it sounds like I am causing myself further issues by having the switchport control the reauthentication timer, because if the timer expires during a critical event, the endpoint will fail and be put on the critical VLAN. Changing the reauth timer to be on the ISE server instead of the switchport is something else I've been investigating. Thanks.

howon · ‎05-08-2019

The reauth timer should pause while in the critical state, which avoids that issue, as long as you are using RADIUS keepalive on the Catalyst device to check the RADIUS server status. The switch will not force endpoints to reauth when it knows that no RADIUS servers are available.
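For reference, moving the reauthentication interval from the switchport to ISE, as Josh mentions, might look roughly like the sketch below. The interface name is hypothetical, and this assumes the legacy (non-IBNS 2.0) interface syntax. The ISE authorization profile would then need to return the interval (Session-Timeout, RADIUS attribute 27) along with Termination-Action (attribute 29).

```
! Hypothetical access port; adjust to your environment
interface GigabitEthernet1/0/10
 ! Enable periodic reauthentication on the port
 authentication periodic
 ! Take the reauth interval from the RADIUS server instead of a local value
 authentication timer reauthenticate server
```

This only changes where the timer value comes from. It does not replace the RADIUS keepalive discussed in this thread.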

Josh Morris · ‎05-08-2019

Thanks howon. I enjoyed your top 10 mistakes document.

I have a different experience, however, regarding the critical VLAN. I had a critical ISE failure with 2.2 patch 10 which caused my PSNs to stop authenticating. The PSNs were up, but wouldn't auth. My PSNs are behind an f5, and the VIP went down when all PSNs stopped responding on 1812. All my switchports are set to a two hour re-auth time. My outage lasted longer than two hours, so when the devices that were set to dynamically change VLAN tried to reauth and the NAD couldn't reach ISE, I found that they failed auth and went to the critical VLAN, which meant they couldn't communicate at all.

Here is my switch radius config.

aaa group server radius ISE_RADIUS
 server name ISE_PSN_VS
aaa server radius dynamic-author
 client 10.200.44.20 server-key xxx
 client 10.200.0.25 server-key xxx
 client 10.200.0.26 server-key xxx
ip radius source-interface Vlan4000
snmp-server enable traps trustsec-server radius-server provision-secret
radius-server attribute 6 on-for-login-auth
radius-server attribute 6 support-multiple
radius-server attribute 8 include-in-access-req
radius-server attribute 25 access-request include
radius-server dead-criteria time 30 tries 3
radius server ISE_PSN_VS
 address ipv4 10.200.44.20 auth-port 1812 acct-port 1813
 pac key xxx

I should also point out that my switchports are configured for multi-auth, so wouldn't the attached image indicate that the clients would change VLANs in a critical outage?

howon · ‎05-08-2019

You are missing RADIUS keepalive. Without it, actual authentications are used to detect that the RADIUS server is down, and previously connected endpoints are impacted, as in your example. You can use the following command:

radius server ISE_PSN_VS
 address ipv4 10.200.44.20 auth-port 1812 acct-port 1813
 automate-tester username RAD-TEST ignore-acct-port (probe-on)

Good catch. Yes, with the reinitialize keyword, existing devices on the same interface will be put into the critical VLAN. However, that may no longer need to be the case on newer switches that can assign multiple VLANs on a single interface. The 2960X, 3650, 3850, and all 9K models can assign multiple VLANs on a multi-auth port, so you may want to try multi-auth with the authorize keyword.
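As a rough illustration of the two keywords, the legacy interface commands would look something like the following. The interface name and VLAN 999 are placeholders, not values from this thread, and whether authorize is accepted on a multi-auth port depends on the platform and software version, so treat this as a sketch to test in a lab.

```
! Hypothetical access port and critical VLAN ID
interface GigabitEthernet1/0/10
 authentication host-mode multi-auth
 ! reinitialize: existing sessions on the port are re-initialized into the critical VLAN
 ! authentication event server dead action reinitialize vlan 999
 ! authorize: only hosts that try to authenticate while the server is dead
 ! are placed in the critical VLAN
 authentication event server dead action authorize vlan 999
 ! Re-initialize sessions once a RADIUS server is reachable again
 authentication event server alive action reinitialize
```

Only one of the two "server dead" lines would be active at a time, which is why the reinitialize variant is commented out here.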

Josh Morris · ‎05-10-2019

@howon Thanks. Are you saying that the 'automate-tester' command is the actual keepalive? So if the probe fails, that's when the dead criteria come into play?

Would this still be necessary since I have my PSNs behind an F5 that is also performing a probe? The F5 is authenticating regularly, and if the probe fails, it marks the PSNs DOWN. At that point, the switch would not get an 1812/1813 response from the F5 VIP, so the server goes DOWN (but comes back ALIVE quickly because I don't have the dead timer extended).

howon · ‎05-10-2019

Yes, that is the keepalive. However, the keepalive is not the only criterion. When a new endpoint tries to connect or a reauth timer expires, those real user authentications are also used to mark the server down. The fact that you don't have deadtime configured makes matters worse, since the servers are marked alive again immediately. Because of that, the switch moves endpoints to the critical VLAN one by one as their reauth timers expire, cycling the server through dead > alive > auth timeout > dead > alive > auth timeout > dead...

With keepalive, you avoid using real user authentications to detect the failure, and so avoid impacting those users: the keepalive establishes that the server is down, so no actual user authentication is needed.

So in addition to keepalive, make sure to have deadtime configured for 10 - 15 minutes.
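Putting the two recommendations together, the RADIUS portion of the config might end up looking like this. It is a sketch based on the config posted earlier in the thread: the server name, address, and dead-criteria values are Josh's, RAD-TEST is the test username suggested above, and the 15-minute deadtime is simply one value from the suggested range. The probe-on keyword is written without parentheses on the assumption that it is wanted.

```
radius server ISE_PSN_VS
 address ipv4 10.200.44.20 auth-port 1812 acct-port 1813
 ! Keepalive: periodically send a test authentication to the server
 automate-tester username RAD-TEST ignore-acct-port probe-on
 pac key xxx
!
! Mark the server dead after 30 seconds without a response and 3 failed tries
radius-server dead-criteria time 30 tries 3
! Keep a dead server marked dead for 15 minutes before trying it again
radius-server deadtime 15
```

The test user does not normally need to exist in ISE, since any RADIUS response, including an Access-Reject, tells the switch that the server is alive.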