Possible cause of higher than normal radius authentication failure rate on standby (HAGR/ICSR) chassis

David Damerjian · 01-24-2012

This article explains that the RADIUS/AAA authentication failure rate on a standby HAGR/ICSR chassis can be abnormally high, even when the chassis carries no traffic, because failed chassis login attempts are (unexpectedly) counted toward the authentication failure rate.

You might wonder why alarms are raised for the RADIUS/AAA authentication failure rate on a standby HAGR/ICSR chassis when no subscriber traffic is being processed on that chassis at the time the alarm is triggered. The cause can be failed logins to the chassis itself, which are counted as failed authentications. The tricky part is that these failures are not reflected in the "show radius counters all/summary" commands, because each login is a "local" authentication handled by a special instance (#231) of the aaamgr process, which resides on the SPC/SMC card and is used for administration (non-subscriber) purposes only. The "show session subsystem" command, however, does count all login attempts, successful or failed, and these statistics are combined with live subscriber authentication attempts to compute the overall authentication failure rate. On a standby HAGR/ICSR chassis, which carries no live traffic most of the time, chassis logins are the only source of authentications, so just a few failures can easily produce a failure rate higher than would be expected with normal subscriber traffic.
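The effect of a small sample size can be illustrated with a short Python sketch. It assumes the rate is simply failures divided by total attempts, which is consistent with the 80% figure in the example in this article; the exact calculation used on the chassis is not documented here. The function name and the "active chassis" numbers are hypothetical.

```python
from typing import Optional


def failure_rate_percent(successes: int, failures: int) -> Optional[float]:
    """Return failures as a percentage of all attempts, or None if there were none.

    Assumption: rate = failures / (successes + failures).
    """
    total = successes + failures
    if total == 0:
        return None
    return 100.0 * failures / total


# Standby chassis: the only authentications are a handful of chassis logins.
standby = failure_rate_percent(successes=1, failures=4)

# Hypothetical active chassis: the same 4 failed logins are diluted by
# thousands of successful subscriber authentications.
active = failure_rate_percent(successes=9996, failures=4)

print(standby)  # 80.0
print(active)   # 0.04
```

The same four failed logins that are invisible on a busy chassis are enough to cross a 50% threshold on an idle one.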

Below is an example of CLI output showing the issue. In this example, the failure rate threshold is 50%, the poll interval is 5 minutes (300 seconds), and 4 failed logins and 1 successful login occurred during the 6:40 – 6:45 poll interval, which triggered the alarm. A failed login can result from either a wrong username or a wrong password.

Sun Jun 07 06:40:33 2009 Internal trap notification 143 (LoginFailure) session type CLI ttyname /dev/pts/0 remote ip address 10.2.64.49
Sun Jun 07 06:40:35 2009 Internal trap notification 143 (LoginFailure) session type CLI ttyname /dev/pts/0 remote ip address 10.2.64.49
Sun Jun 07 06:40:38 2009 Internal trap notification 143 (LoginFailure) session type CLI ttyname /dev/pts/0 remote ip address 10.2.64.49
Sun Jun 07 06:40:40 2009 Internal trap notification 143 (LoginFailure) session type CLI ttyname /dev/pts/0 remote ip address 10.2.64.49
Sun Jun 07 06:40:49 2009 Internal trap notification 52 (CLISessStart) user staradmin privilege level Security Admini ttyname /dev/pts/1
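When reviewing a longer trap history, it can help to tally login results per poll interval. The following Python sketch does this for the trap lines above. It assumes that each LoginFailure trap is one failed authentication and each CLISessStart trap is one successful one, that poll intervals are aligned to 5-minute clock boundaries, and that the timestamp occupies the first 24 characters of each line. The function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

POLL_SECONDS = 300  # poll interval from the example

TRAP_LOG = """\
Sun Jun 07 06:40:33 2009 Internal trap notification 143 (LoginFailure) session type CLI ttyname /dev/pts/0 remote ip address 10.2.64.49
Sun Jun 07 06:40:35 2009 Internal trap notification 143 (LoginFailure) session type CLI ttyname /dev/pts/0 remote ip address 10.2.64.49
Sun Jun 07 06:40:38 2009 Internal trap notification 143 (LoginFailure) session type CLI ttyname /dev/pts/0 remote ip address 10.2.64.49
Sun Jun 07 06:40:40 2009 Internal trap notification 143 (LoginFailure) session type CLI ttyname /dev/pts/0 remote ip address 10.2.64.49
Sun Jun 07 06:40:49 2009 Internal trap notification 52 (CLISessStart) user staradmin privilege level Security Admini ttyname /dev/pts/1
"""


def interval_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its poll interval (clock-aligned)."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    floored = seconds - seconds % POLL_SECONDS
    return ts.replace(hour=floored // 3600, minute=(floored % 3600) // 60, second=0)


def count_logins(log_text: str) -> Counter:
    """Count login failures and successes per poll interval."""
    counts = Counter()
    for line in log_text.splitlines():
        if "(LoginFailure)" in line:
            kind = "failure"
        elif "(CLISessStart)" in line:
            kind = "success"
        else:
            continue
        # %a and %b are locale-dependent; an English locale is assumed.
        ts = datetime.strptime(line[:24], "%a %b %d %H:%M:%S %Y")
        counts[(interval_start(ts), kind)] += 1
    return counts


counts = count_logins(TRAP_LOG)
for (start, kind), n in sorted(counts.items()):
    print(start.strftime("%H:%M"), kind, n)
```

For the five trap lines shown, this yields 4 failures and 1 success in the interval starting at 06:40.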

The threshold in question is configured in global config mode as follows:

threshold aaa-auth-failure-rate 50 clear 25
threshold poll aaa-auth-failure-rate interval 300 (default)
threshold monitoring aaa-auth-failure

and is viewed with the following command run during the failure period:

[local]CSE2# show threshold

Threshold operation model: ALARM

Configured thresholds:

        Name:             aaa-auth-failure-rate
        Config Scope:     SYSTEM
        Threshold:        50%
        Clear Threshold: 25%

Active thresholds:

        Name:             aaa-auth-failure-rate
        Config Scope:     SYSTEM
        Threshold:        50%
        Clear Threshold: 25%
        Poll Interval:    300Seconds
        Next Poll Time:   2009-Jun-07+06:45:00

Enabled threshold groups: (name, scope)
aaa-auth-failure SYSTEM

No non-default poll interval

No outstanding alarm

After the failure period, the triggered alarm shows an 80% failure rate, which corresponds to the 4 failed attempts out of 5 total attempts mentioned above:

[local]CSE2# show alarm outstanding
Sev Object Event
--- ---------- --------------------------------------------------------------------------------------------------------------------
MN Chassis <24:aaa-auth-failure-rate> has reached or exceeded the configured threshold <50%>, the measured value is <80%>. It is detected at <System>.

The outstanding alarm section of “show threshold” also shows this:

Outstanding alarms:
        Threshold Name:    aaa-auth-failure-rate
        Alarm Source:      System
        Last Measured:     80%
        Raise Time:        2009-Jun-07+06:45:00

The corresponding SNMP trap is also triggered at the end of the poll interval:

Sun Jun 07 06:45:00 2009 Internal trap notification 218 (ThreshAAAAuthFailRate) threshold 50 measured value 80

The “show session subsystem facility aaamgr instance 231” command, run after the counters had been cleared with “clear session subsystem”, shows the failed login attempts:

[local]CSE2# show session subsystem facility aaamgr instance 231
AAAMgr: Instance 231
       5 Total aaa requests                  2 Current aaa requests
       5 Total aaa auth requests             0 Current aaa auth requests
       0 Total aaa auth probes               0 Current aaa auth probes
       0 Total aaa auth keepalive            0 Current aaa auth keepalive
       0 Total aaa acct requests             2 Current aaa acct requests
       0 Total aaa acct keepalive            0 Current aaa acct keepalive
       1 Total aaa auth success              4 Total aaa auth failure
       0 Total aaa auth purged               0 Total aaa auth cancelled
       0 Total auth keepalive success        0 Total auth keepalive failure
       0 Total auth keepalive purged
       0 Total aaa auth DMU challenged
       0 Total radius auth requests          0 Current radius auth requests
       0 Total radius auth requests retried
       5 Total local auth requests           0 Current local auth requests
       0 Total pseudo auth requests          0 Current pseudo auth requests
       0 Total null-username auth requests (rejected)
       0 Total aaa acct completed            0 Total aaa acct purged
       0 Total acct keepalive success        0 Total acct keepalive timeout
       0 Total acct keepalive purged
       0 Total aaa acct cancelled
       0 Total radius acct requests          2 Current radius acct requests
       0 Total radius acct requests retried
       0 Total gtpp acct requests            0 Current gtpp acct requests
       0 Total null acct requests            0 Current null acct requests
       0 Total aaa acct sessions             0 Current aaa acct sessions
       0 Total aaa acct archived             0 Current aaa acct archived
       0 Current recovery archives           0 Current valid recovery records
       0 Total aaa sockets opened            1 Current aaa sockets open
       0 Total aaa requests pend socket open
       2 Current aaa requests pend socket open
       0 Total radius requests pend server max-outstanding
       0 Current radius requests pend server max-outstanding
       0 Total aaa radius coa requests       0 Total aaa radius dm requests
       0 Total aaa radius coa acks           0 Total aaa radius dm acks
       0 Total aaa radius coa naks           0 Total aaa radius dm naks
       0 Total radius charg auth             0 Current radius charg auth
       0 Total radius charg auth succ        0 Total radius charg auth fail
       0 Total radius charg auth purg        0 Total radius charg auth cancel
       0 Total radius charg acct             0 Current radius charg acct
       0 Total radius charg acct succ        0 Total radius charg acct purg
       0 Total radius charg acct cancel
       0 Total gtpp charg                    0 Current gtpp charg
       0 Total gtpp charg success            0 Total gtpp charg failure
       0 Total gtpp charg cancel             0 Total gtpp charg purg
       0 Total prepaid online requests       0 Current prepaid online requests
       0 Total prepaid online success        0 Current prepaid online failure
       0 Total prepaid online retried        0 Total prepaid online cancelled
       0 Current prepaid online purged
       0 Total aaamgr purged requests
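The counters of interest in the output above are "Total aaa auth success", "Total aaa auth failure", and "Total local auth requests". As a minimal sketch, the following Python snippet extracts counters from captured output and recomputes the failure rate. It assumes the layout shown above, where each counter is a number followed by its name and the two columns are separated by at least two spaces; other releases may format the output differently. The function name is hypothetical, and the sample is an excerpt of the output above.

```python
import re

SAMPLE = """\
AAAMgr: Instance 231
       5 Total aaa auth requests             0 Current aaa auth requests
       1 Total aaa auth success              4 Total aaa auth failure
       0 Total radius auth requests          0 Current radius auth requests
       5 Total local auth requests           0 Current local auth requests
"""


def parse_counters(output: str) -> dict:
    """Map each counter name to its integer value.

    Assumes "<number> <name>" fields separated by two or more spaces.
    """
    counters = {}
    for line in output.splitlines():
        for field in re.split(r"\s{2,}", line.strip()):
            value, _, name = field.partition(" ")
            if value.isdigit() and name:
                counters[name] = int(value)
    return counters


counters = parse_counters(SAMPLE)
ok = counters["Total aaa auth success"]
bad = counters["Total aaa auth failure"]
rate = 100.0 * bad / (ok + bad)

print(ok, bad, rate)  # 1 4 80.0
print(counters["Total radius auth requests"])  # 0
```

A nonzero "Total local auth requests" alongside zero "Total radius auth requests" is the pattern that points to chassis logins rather than subscriber authentications.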

Finally, for the alarm to clear, at least one authentication attempt must be made during a poll interval, and the overall authentication failure rate for that interval must be less than the clear threshold, which is 25%. In this case, one successful login was made at 6:54:45, during the 6:50 – 6:55 poll interval, resulting in a 0% failure rate for that interval. The alarm is therefore cleared at the end of the interval, as shown by the SNMP trap:

Sun Jun 07 06:54:45 2009 Internal trap notification 52 (CLISessStart) user staradmin privilege level Security Admini ttyname /dev/pts/1

Sun Jun 07 06:55:00 2009 Internal trap notification 219 (ThreshClearAAAAuthFailRate) threshold 25 measured value 0
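The raise and clear behavior described in this article can be modeled as a simple two-threshold check evaluated once per poll interval. The Python sketch below is an illustration only, not the chassis implementation. It assumes the alarm is raised when the rate reaches or exceeds 50%, cleared when the rate is below 25%, and left unchanged when an interval has no attempts. The function name is hypothetical, and the empty 6:45 – 6:50 interval is an assumption made to match the clear time of 06:55:00.

```python
RAISE_PERCENT = 50  # threshold aaa-auth-failure-rate 50
CLEAR_PERCENT = 25  # ... clear 25


def next_alarm_state(alarm_active: bool, successes: int, failures: int) -> bool:
    """Return the alarm state at the end of one poll interval."""
    total = successes + failures
    if total == 0:
        # Assumption: with no attempts there is nothing to measure,
        # so the alarm state does not change.
        return alarm_active
    rate = 100.0 * failures / total
    if not alarm_active and rate >= RAISE_PERCENT:
        return True
    if alarm_active and rate < CLEAR_PERCENT:
        return False
    return alarm_active


# (interval, successful logins, failed logins)
intervals = [
    ("06:40-06:45", 1, 4),  # from the trap log: 80% failure rate
    ("06:45-06:50", 0, 0),  # assumed: no login attempts
    ("06:50-06:55", 1, 0),  # one successful login at 06:54:45
]

alarm = False
history = []
for label, ok, bad in intervals:
    alarm = next_alarm_state(alarm, ok, bad)
    history.append(alarm)
    print(label, "alarm active" if alarm else "no alarm")
```

Under these assumptions the alarm is raised at 06:45, stays outstanding through the idle interval, and clears at 06:55.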

Imported from Starent Networks Knowledgebase Article # 10480
