I have an issue that affects only my 3750G switches, which are running 15.0(2)SE9 IP Base. Routing looks good all the way to the NPS server, every other device on the same subnet can communicate with NPS just fine, and packet captures from the firewall where the NPS server resides show packets going in and out. The 3750G can ping the NPS server and vice versa.
Running a test aaa group radius command against the server yields this:
002576: Jun 29 14:30:16.681 EDT: AAA/AUTHEN/LOGIN (00000000): Pick method list 'default'
002577: Jun 29 14:30:16.689 EDT: RADIUS/ENCODE(00000000): send packet; FAIL
002578: Jun 29 14:30:16.689 EDT: RADIUS/ENCODE(00000000):Orig. component type = Invalid
002579: Jun 29 14:30:16.689 EDT: RADIUS(00000000): Config NAS IP: *****
002580: Jun 29 14:30:16.689 EDT: RADIUS(00000000): Config NAS IPv6: ::
002581: Jun 29 14:30:16.689 EDT: RADIUS(00000000): sending
002582: Jun 29 14:30:16.689 EDT: RADIUS(00000000): Send Accounting-Request to *******:1646 id 1646/4, len 70
002584: Jun 29 14:30:16.689 EDT: RADIUS: User-Name  10 "******"
002585: Jun 29 14:30:16.689 EDT: RADIUS: Acct-Status-Type  6 Watchdog 
002586: Jun 29 14:30:16.689 EDT: RADIUS: Acct-Session-Id  10 "00000000"
002587: Jun 29 14:30:16.689 EDT: RADIUS: Acct-Authentic  6 RADIUS 
002588: Jun 29 14:30:16.689 EDT: RADIUS: Service-Type  6 Framed 
002589: Jun 29 14:30:16.689 EDT: RADIUS: NAS-IP-Address  6 *****
002590: Jun 29 14:30:16.689 EDT: RADIUS: Acct-Delay-Time  6 0
002591: Jun 29 14:30:16.689 EDT: RADIUS(00000000): Sending a IPv4 Radius Packet
002592: Jun 29 14:30:16.689 EDT: RADIUS(00000000): Started 2 sec timeout
002593: Jun 29 14:30:16.748 EDT: RADIUS: Received from id 1646/4 *****:1646, Accounting-response, len 20
002595: Jun 29 14:30:16.748 EDT: RADIUS/DECODE(00000000): There is no General DB. Reply server details may not be recorded
The packet capture on the firewall confirms that this test reaches the server and that the server replies. Yet I still get a "send packet; FAIL" error.
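For anyone comparing debugs across several switches, here is a minimal Python sketch that pulls the key RADIUS events out of debug output like the above. The function name and the regular expressions are my own and are based only on the lines quoted in this thread, so treat them as assumptions; other IOS versions may format these messages differently.

```python
import re

# Patterns derived only from the debug lines quoted in this thread (assumption:
# other IOS releases may word these messages differently).
FAIL_RE = re.compile(r"RADIUS/ENCODE\((?P<session>[0-9A-Fa-f]+)\):\s*send packet; FAIL")
SEND_RE = re.compile(r"RADIUS\([0-9A-Fa-f]+\): Send (?P<kind>[\w-]+) to (?P<host>\S+):(?P<port>\d+) id")
RECV_RE = re.compile(r"RADIUS: Received from id \S+ (?P<host>\S+):(?P<port>\d+), (?P<kind>[\w-]+),")


def summarize_radius_debug(text):
    """Collect encode failures, packets sent, and packets received from debug text."""
    summary = {"encode_failures": [], "sent": [], "received": []}
    for line in text.splitlines():
        match = FAIL_RE.search(line)
        if match:
            # Record the AAA session id that reported the failure.
            summary["encode_failures"].append(match.group("session"))
            continue
        match = SEND_RE.search(line)
        if match:
            # Record the packet type and destination UDP port.
            summary["sent"].append((match.group("kind"), int(match.group("port"))))
            continue
        match = RECV_RE.search(line)
        if match:
            # Record the reply type and the server's source UDP port.
            summary["received"].append((match.group("kind"), int(match.group("port"))))
    return summary
```

Feeding it a saved debug capture (for example `summarize_radius_debug(open("debug.txt").read())`, where debug.txt is a hypothetical file name) makes it easier to see at a glance whether a given attempt produced only an encode failure or also put a packet on the wire.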
If I attempt to log in to the switch using a RADIUS login, I get ZERO traffic to the firewall. It's like the switch doesn't even attempt to send the packet at all. I get this debug output on the attempt:
002596: Jun 29 14:33:58.020 EDT: AAA/BIND(000002D2): Bind i/f
002597: Jun 29 14:33:58.020 EDT: AAA/AUTHEN/LOGIN (000002D2): Pick method list 'default'
002598: Jun 29 14:33:58.020 EDT: RADIUS/ENCODE(000002D2): send packet; FAIL
002599: Jun 29 14:34:03.070 EDT: %SEC_LOGIN-4-LOGIN_FAILED: Login failed [user: ] [Source: *****] [localport: 22] [Reason: Login Authentication Failed] at 14:34:03 EDT Wed Jun 29 2016
002600: Jun 29 14:34:03.120 EDT: AAA/AUTHEN/LOGIN (000002D2): Pick method list 'default'
002601: Jun 29 14:34:03.120 EDT: RADIUS/ENCODE(000002D2): send packet; FAIL
This configuration works for every other IOS device in my environment, so I'm really not sure what the problem is. I've tried rolling back to earlier versions and I've tried upgrading to the ipservices image, and it still occurs, but only on the 3750G switches. I can't help but feel that I'm missing something basic on these switches.
Attaching sanitized config for review.
15.0(2)SE9 is not available for a Cisco 3750G. What model number is reported from "show inventory"?
Your config looks correct to me. I would say you are facing a software bug.
I agree it looks like software:
WS-C3750G-48PS-S from sh inv
Using this bin file: c3750-ipbasek9-mz.150-2.SE9.bin
Cisco IOS Software, C3750 Software (C3750-IPBASEK9-M), Version 15.0(2)SE9, RELEASE SOFTWARE (fc1) from sh ver
I'm showing that version as the latest for that model under the latest tab:
I verified the MD5 checksum of the image against that download page. I should add that I'm using this image on all of my 3750G switches, so a software bug seems to be the most likely answer. I'll have to look up the rollback plan to go down to 12.2, but I was trying to avoid dropping the code back to version 12.
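For reference, a minimal Python sketch of checking an image's MD5 on a workstation before copying it to the switch. The expected hash has to come from the download page; the function names and the chunk size are my own choices, not anything Cisco provides.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as image:
        # Read fixed-size chunks so a large image is never loaded whole.
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches(path, expected_md5):
    """Compare the file's MD5 with the published value, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Usage would be something like `image_matches("c3750-ipbasek9-mz.150-2.SE9.bin", published_hash)`, where `published_hash` is the value copied from the download page. Once the file is in flash, the IOS verify /md5 command should report the same hash.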
Bummer, you are right, that is the most recent software version for the 3750G. You could also try going back one or two patch levels to SE8 or SE7. Slow process though.
I already tried SE7, so I'm thinking it's either a memory shortage (though I'm not seeing any errors for that) or a bug in version 15 for this switch. I picked the image because it was in MD status, but I guess it's not as stable as I thought.
Luckily the switch in question is in a stack, so I can probably downgrade this without an outage (switches are in prod and it's only affecting auth to the switch but will affect dot1x in the future).
Can I use the process described here for a downgrade as well? Switch Stack Upgrade
If I were just doing management I might, but I've got everything tied to AD and already set up, including different priv levels. I'm not really worried about it until I start pushing dot1x.
Yes, I know this, but what's to say TACACS isn't going to do the same thing? I'm not going to build a server to service 6 switches when I have a high-availability pair of NPS servers already doing the job.
I am facing the same issue. My switch model is a 3850 stack running 3.6.5. The only solution I have right now is to reboot the stack. Did you ever find a solution for your issue?
I am getting the same debug output, and the Cisco development team has not been able to reach any conclusion in the last 4 months.
I've got a stack in my lab and I'm still trying to find the right code level that doesn't cause this issue. I'm almost 100% sure it's a firmware issue.
I can get it to work briefly after a firmware wipe and reload, but once it's powered off it loses the ability to communicate with NPS again.
Several. Haven't found a good version yet on 15, but I'm going to try rolling back to older code until I find one that works. Just haven't spent any time on it.