Re: Backup ACS server not used by switch.

ulineosan · ‎01-02-2012

I am experiencing a strange issue: During a primary ACS failure, our switches are not resorting to the backup ACS for login authentication, except for enable mode. This means we can only use the emergency local login, but once logged in we cannot enable due to the switch attempting to authenticate that to the backup ACS.

Once I created the local user in the backup ACS I was able to log in, and after I removed and then re-added the primary server as a TACACS host, it worked as expected, using the backup only. I can't help but think there is some minor command I am missing so that the switches will recognize the failure of the primary ACS.

What am I missing that a failure of an ACS server does not cause the switches to use other configured servers?

camejia · ‎01-03-2012

Hello,

From your description it seems that your Cisco IOS switches are not contacting the secondary/backup ACS server when the primary is down in order to authenticate the SSH/Telnet management access username/password credentials through TACACS+. However, the IOS devices are contacting the secondary server when trying to access "Enable Mode" when the primary server is down.

Can you please add your IOS configuration? If not all, can you include the AAA configuration?

Also, if possible, can you enable "debug aaa authentication", "debug aaa authorization", "debug aaa accounting" and "debug tacacs" and recreate the issue? Please share the outputs as well.

If you have your IOS switches configured as follows:

tacacs-server host 1.1.1.1 key xxxxx

tacacs-server host 1.1.1.2 key xxxxx

The IOS devices should automatically perform the "failover" to the second TACACS+ server configured with the "tacacs-server" command. No additional command is needed other than the server entries defined on the IOS configuration in order for the IOS to contact additional servers in case of a failure.

Will be waiting for your response.

Regards.

ulineosan · ‎01-03-2012

You described my problem perfectly. Here is some additional info, and thanks for taking the time to look at it.

Switch model: WS-C3750-48TS

IOS version: 12.2(55)SE3

This form is not letting me paste, so please see the attached file for debugging output.

camejia · ‎01-03-2012

Richard,

I have reviewed the information. However, the debugs are not clear enough, as the only outputs displayed other than Accounting logs are the following lines:

012697: Jan 3 22:37:16.866 GMT: AAA/AUTHEN/LOGIN (0000094B): Pick method list 'default'

012698: Jan 3 22:37:24.743 GMT: AAA/AUTHEN/LOGIN (0000094B): Pick method list 'default'

There are known issues with IOS devices not triggering the fallback/failover to the secondary ACS/TACACS+ server when the primary returns an "ERROR" response. "ERROR" refers to a process failure on the server side dropping the request and would not be the same as User Invalid or Bad Password responses which are failures referring to the Authentication information and not the process itself.
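To make the distinction concrete, here is a minimal Python sketch of the server-selection logic as I understand it. It is only an illustrative model, not IOS code: the `Reply` names, the `authenticate` function and the `failover_on_error` flag are all made up for this example.

```python
from enum import Enum


class Reply(Enum):
    PASS = "pass"        # credentials accepted
    FAIL = "fail"        # bad username or password: a definitive answer
    ERROR = "error"      # server-side process failure, no verdict on the user
    TIMEOUT = "timeout"  # no reply at all (server unreachable)


def authenticate(servers, replies, failover_on_error=True):
    """Walk the server list in order and return (server, reply) for the
    first definitive answer, or (None, None) if no server gave one.

    Hypothetical model: 'replies' maps each server name to the Reply it
    would give. failover_on_error=False imitates the behavior described
    in the bug report, where an ERROR stops the search.
    """
    for server in servers:
        reply = replies[server]
        if reply in (Reply.PASS, Reply.FAIL):
            return server, reply
        if reply is Reply.ERROR and not failover_on_error:
            return server, reply
        # TIMEOUT (or ERROR with failover enabled): try the next server.
    return None, None


servers = ["primary", "backup"]
replies = {"primary": Reply.ERROR, "backup": Reply.PASS}

expected = authenticate(servers, replies)
buggy = authenticate(servers, replies, failover_on_error=False)
print(expected)
print(buggy)
```

In this model a FAIL from the primary ends the attempt without touching the backup, which is why a Bad Password response never counts as a failover condition.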

Would it be possible for you to collect a capture on the Secondary ACS switchport while the primary is down in order to determine if the IOS device is reaching the secondary server at all?

Known issue:

http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCsd48175

Symptoms

AAA does not failover to the backup tacacs server defined when it receives ERROR from the primary server.

Conditions

Occurs when tacacs is configured for authentication, and backup servers are configured. When the primary server returns error due to csauth not running on the primary server, in that case authentication request does not fail over to secondary server.

Frequency:

Not a common scenario.

Workaround:

None

NOTES

1) If you have an ACS for Windows (3.x or 4.x) then you can install Wireshark on the Windows Server and collect the capture.

2) If you have an ACS Appliance (3.x or 4.x) or an ACS 5.x you might need to configure a SPAN session on the switch.

After collecting the capture you can use Wireshark > Edit > Preferences > Protocols > TACACS+ > TACACS+ Encryption Key > type the shared secret value. This will allow you to review the unencrypted packets.

You can filter the capture as well using ip.addr==x.x.x.x where x.x.x.x is the IOS device IP address.

Feel free to share the capture with me as well along with the shared secret key. I would gladly review the information.

NOTE: If the capture shows no traffic going to the secondary unit a useful test would be to configure the "Secondary" server as the primary on the IOS and verify if it works that way.

NOTE: If possible, a capture on the primary server switchport while it is down might also be useful, in order to verify how the IOS is determining that the primary server is down, as I do not see it trying to contact the primary either. We should see at least timeouts when contacting the primary ACS.

Regards.

camejia · ‎01-03-2012

Richard,

One more detail: can you confirm that you enabled "debug tacacs" on the IOS device when testing? I find it quite unexpected that the logs contain only AAA debugs and no TPLUS outputs.

If not, please enable "debug tacacs" and do not enable "debug aaa accounting" and test again. Share the outputs one more time. The TACACS+ debugs should show the "ERROR" as well if that is the case so we can avoid the captures and analyze the TACACS+ debugs first.

Regards.

ulineosan · ‎01-03-2012

I don't think it is that bug you mentioned because it was supposedly fixed in 12.2(44)SE, while we are running a newer version and it seems to be all devices affected, not just the 3750s.

I ran Wireshark on the backup ACS server during a login attempt to a different 3750 (the previous one has been fixed by removing and then re-adding the malfunctioning TACACS host), which is the same model and software version. I could not find any packets from the switch in the packet capture, except when using the enable command after logging in with the local emergency account. I am not surprised by the result since the logs within ACS also show the same thing.

I believe I missed entering the "debug tacacs" command previously and have, as you asked, included it in this test. In this test, I attempted to log in as myself, then used the local emergency account and subsequent enable mode. I then attempted several more logins as myself.

camejia · ‎01-04-2012

Richard,

I have performed a quick lab recreation with a 3750 running 12.2(55)SE3 and I am not seeing the same behavior. I have two servers configured (.21 as primary and .20 as secondary). I stopped the ACS services on the primary and the switch was able to contact the secondary server with your same configuration:

*Mar 1 00:13:58.969: TPLUS: Authentication start packet created for 9()

*Mar 1 00:13:58.969: TPLUS: Using server x.x.250.21

*Mar 1 00:13:58.969: TPLUS(00000009)/0/NB_WAIT/5F8E138: Started 5 sec timeout

*Mar 1 00:13:58.978: TPLUS(00000009)/0/NB_WAIT: write to x.x.250.21 failed with errno 257((ENOTCONN))

*Mar 1 00:13:58.978: TPLUS: Authentication start packet created for 9()

*Mar 1 00:13:58.978: TPLUS: Choosing next server x.x.250.20

As you can see, the switch was able to move to the next configured server. I then tried with an invalid TACACS+ server as primary and got the same results. The invalid address was 1.1.1.1:

*Mar 1 00:12:37.399: TPLUS: Authentication start packet created for 8()

*Mar 1 00:12:37.399: TPLUS: Using server 1.1.1.1

*Mar 1 00:12:37.399: TPLUS(00000008)/0/NB_WAIT/5E8CE00: Started 5 sec timeout

*Mar 1 00:12:42.407: TPLUS(00000008)/0/NB_WAIT/5E8CE00: timed out

*Mar 1 00:12:42.407: TPLUS: Choosing next server x.x.250.20
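If you collect a lot of "debug tacacs" output, a small script can summarize which servers the switch selected and whether it moved on. The following is a rough Python sketch that assumes the log lines look like the ones above ("Using server" and "Choosing next server"); the function names are my own and other IOS versions may word these lines differently.

```python
import re

# Sample taken from the lab output above.
SAMPLE = """\
*Mar 1 00:12:37.399: TPLUS: Authentication start packet created for 8()
*Mar 1 00:12:37.399: TPLUS: Using server 1.1.1.1
*Mar 1 00:12:37.399: TPLUS(00000008)/0/NB_WAIT/5E8CE00: Started 5 sec timeout
*Mar 1 00:12:42.407: TPLUS(00000008)/0/NB_WAIT/5E8CE00: timed out
*Mar 1 00:12:42.407: TPLUS: Choosing next server x.x.250.20
"""

# Matches the two line types that name a server being selected.
SERVER_RE = re.compile(r"TPLUS: (Using server|Choosing next server) (\S+)")


def servers_tried(debug_text):
    """Return the servers in the order the switch selected them."""
    return [m.group(2) for m in SERVER_RE.finditer(debug_text)]


def failed_over(debug_text):
    """True if at least one 'Choosing next server' line is present."""
    return any(m.group(1) == "Choosing next server"
               for m in SERVER_RE.finditer(debug_text))


tried = servers_tried(SAMPLE)
print(tried)
print(failed_over(SAMPLE))
```

If the output from an affected switch never shows a "Choosing next server" line, or shows no TPLUS lines at all, that would match what the capture on the backup ACS suggests.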

At this point, one more test to perform would be to disable single-connection on your IOS configuration and on the ACS AAA Client entry as well.

no tacacs-server host x.x.x.x single-connection

no tacacs-server host x.x.x.y single-connection

tacacs-server host x.x.x.x

tacacs-server host x.x.x.y

We should test at this point.

Also, run "show tcp brief" and verify that there are not multiple TCP port 49 connections in CLOSEWAIT or any state other than ESTABLISHED. If there are multiple TCP port 49 connections in an unexpected state, there is something else wrong.
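If you need to check this on many switches, the output can be screened with a short script. This is only a sketch in Python: the sample output below is invented for illustration (addresses and TCB values are hypothetical), and it assumes the usual four-column layout with "address.port" fields and ESTAB as the abbreviation for an established session.

```python
# Hypothetical "show tcp brief" output, for illustration only.
SAMPLE = """\
TCB       Local Address               Foreign Address             (state)
05A1B2C0  10.0.0.5.23                 10.0.0.99.51515             ESTAB
05A1C3D4  10.0.0.5.11002              10.0.0.21.49                CLOSEWAIT
05A1D4E8  10.0.0.5.11003              10.0.0.20.49                ESTAB
"""

TACACS_PORT = "49"


def suspicious_tacacs_sessions(show_tcp_brief):
    """Return (server, state) for each session to TCP port 49 that is
    not in the ESTAB state."""
    flagged = []
    for line in show_tcp_brief.splitlines():
        fields = line.split()
        # Data rows have exactly four columns; the header row has more.
        if len(fields) != 4:
            continue
        _tcb, _local, foreign, state = fields
        # The port is the part after the last dot of "address.port".
        host, _, port = foreign.rpartition(".")
        if port == TACACS_PORT and state != "ESTAB":
            flagged.append((host, state))
    return flagged


flagged = suspicious_tacacs_sessions(SAMPLE)
print(flagged)
```

Any entry in the result points at a TACACS+ session that is stuck, which would be worth mentioning if you end up opening a TAC case.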

If none of the above suggestions work, I think the next step would be to open a TAC case in order to perform deeper troubleshooting on your side.

Will be waiting for your response.

Regards.