Dynamic VLAN assignment works randomly on 2960-X after upgrading to 15.2(7)E1

AURELIEN MERE
Level 1

Hello

 

After upgrading our 2960-X stacks from 15.2(6)E3 to 15.2(7)E1, dynamic VLAN assignment through MAB works randomly.

We sometimes encounter this kind of error:

%PM-3-INTERNALERROR: Port Manager Internal Software Error (vlan > 0 && vlan < PM_MAX_VLANS: ../switch/pm/pm_vlan.c: 878: pm_vlan_test_portlist)

The port stays in VLAN 1 and the MAC address on the port is in the "Drop" state in the MFT.

We have the problem only with 2960-X stacks on 15.2(7)E1; all others are working fine.

RADIUS debugs show that the Tunnel-Private-Group-Id is correctly received on the switch, but it is somehow discarded.
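
For reference, the affected ports use a fairly standard MAB template; the snippet below is a simplified example of the port configuration, with the interface name and VLAN numbers as placeholders rather than our exact values:

interface GigabitEthernet1/0/1
 switchport mode access
 switchport voice vlan 750
 authentication host-mode multi-domain
 authentication port-control auto
 authentication order mab dot1x
 authentication priority dot1x mab
 authentication port-control auto
 mab
 dot1x pae authenticator
 spanning-tree portfast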

Have you encountered the same issue?

Thanks for your help

 

Aurélien

Accepted Solution

AURELIEN MERE
Level 1

We have also been running 15.2(7)E4 on a small production environment since its release, without any problems on 2960X/C/CX.

The bug finally seems to be fixed.

By the way, the release is now "suggested" (yellow star) on Cisco Downloads; we are currently planning the upgrade of all production switches.


27 Replies

pieterh
VIP

I found this bug that is not exactly your issue, but...

Can you check whether there is any relation with assigning the voice VLAN?

Thanks for your quick answer. The symptoms are clearly similar.

It -seems- to happen only on the data VLAN, whether a voice device is present on the port or not.

I will check whether I can reproduce it without "switchport voice vlan XX" configured on the port.
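
To test that, the plan is simply to remove the voice VLAN from one affected port and retest, along these lines (taking Gi6/0/40 from the debug below as the example port):

conf t
 interface GigabitEthernet6/0/40
  no switchport voice vlan
  shutdown
  no shutdown
 end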

 

Here is the debug output from a failure case, showing a successful authentication and an Access-Accept reply with the VLAN attributes, but ending with the internal error:

11706: Jan 7 11:09:40.906: RADIUS/ENCODE: Best Local IP-Address 192.168.248.24 for Radius-Server <RADIUS IP>
11707: Jan 7 11:09:40.906: RADIUS: Message Authenticator encoded
11708: Jan 7 11:09:40.906: RADIUS(00000000): Send Access-Request to <RADIUS IP>:1812 onvrf(0) id 1645/242, len 261
11709: Jan 7 11:09:40.906: RADIUS: authenticator 1E F0 36 F8 9F 21 A2 41 - 0B A5 09 C6 11 7E ED 84
11710: Jan 7 11:09:40.906: RADIUS: User-Name [1] 14 "b083feafc7e1"
11711: Jan 7 11:09:40.906: RADIUS: User-Password [2] 18 *
11712: Jan 7 11:09:40.906: RADIUS: Service-Type [6] 6 Call Check [10]
11713: Jan 7 11:09:40.906: RADIUS: Vendor, Cisco [26] 31
11714: Jan 7 11:09:40.906: RADIUS: Cisco AVpair [1] 25 "service-type=Call Check"
11715: Jan 7 11:09:40.906: RADIUS: Framed-MTU [12] 6 1500
11716: Jan 7 11:09:40.909: RADIUS: Called-Station-Id [30] 19 "38-ED-18-92-83-A8"
11717: Jan 7 11:09:40.909: RADIUS: Calling-Station-Id [31] 19 "B0-83-FE-AF-C7-E1"
11718: Jan 7 11:09:40.909: RADIUS: Message-Authenticato[80] 18
11719: Jan 7 11:09:40.909: RADIUS: F3 14 AA AA 86 2B C3 AD 79 68 0A A2 E2 22 74 3E [ +yh"t>]
11720: Jan 7 11:09:40.909: RADIUS: EAP-Key-Name [102] 2 *
11721: Jan 7 11:09:40.909: RADIUS: Vendor, Cisco [26] 49
11722: Jan 7 11:09:40.909: RADIUS: Cisco AVpair [1] 43 "audit-session-id=C0A8F818000005F502F03F4F"
11723: Jan 7 11:09:40.909: RADIUS: Vendor, Cisco [26] 18
11724: Jan 7 11:09:40.909: RADIUS: Cisco AVpair [1] 12 "method=mab"
11725: Jan 7 11:09:40.909: RADIUS: NAS-IP-Address [4] 6 192.168.248.24
11726: Jan 7 11:09:40.909: RADIUS: NAS-Port-Id [87] 23 "GigabitEthernet6/0/40"
11727: Jan 7 11:09:40.909: RADIUS: NAS-Port-Type [61] 6 Ethernet [15]
11728: Jan 7 11:09:40.909: RADIUS: NAS-Port [5] 6 50640
11729: Jan 7 11:09:40.909: RADIUS(00000000): Sending a IPv4 Radius Packet
11730: Jan 7 11:09:40.909: RADIUS(00000000): Started 5 sec timeout
11731: Jan 7 11:09:40.951: RADIUS: Received from id 1645/242 <RADIUS IP>:1812, Access-Accept, len 37
11732: Jan 7 11:09:40.951: RADIUS: authenticator D8 F5 04 A9 FA F7 78 91 - CC 1E F8 F3 02 84 6B B6
11733: Jan 7 11:09:40.951: RADIUS: Tunnel-Type [64] 6 00:VLAN [13]
11734: Jan 7 11:09:40.955: RADIUS: Tunnel-Medium-Type [65] 6 00:ALL_802 [6]
11735: Jan 7 11:09:40.955: RADIUS: Tunnel-Private-Group[81] 5 "700"
11736: Jan 7 11:09:40.955: RADIUS(00000000): Received from id 1645/242
11737: Jan 7 11:09:40.958: %MAB-5-SUCCESS: Authentication successful for client (b083.feaf.c7e1) on Interface Gi6/0/40 AuditSessionID C0A8F818000005F502F03F4F
11738: Jan 7 11:09:40.962: %PM-3-INTERNALERROR: Port Manager Internal Software Error (vlan > 0 && vlan < PM_MAX_VLANS: ../switch/pm/pm_vlan.c: 878: pm_vlan_test_portlist)
1739: -Traceback= 7E23BCz 373FC34z 37B26D4z 37883A8z 3754364z 378DE78z 31440E4z 3153CB4z 315591Cz 3172634z 31755ECz 1D97F10z 1D8E594z 1D929F4z 29580E4z 1D92C78z

 

Sometimes it works correctly on the same switch and same port after several shut/no shut cycles, sometimes not, with the exact same debug output except that the port comes up instead of hitting the internal error.
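
For what it's worth, after each shut/no shut we check the result with commands like these (interface and VLAN taken from the debug above; adjust to your setup):

show authentication sessions interface GigabitEthernet6/0/40
show vlan brief | include 700
show mac address-table interface GigabitEthernet6/0/40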

 

Thanks

Aurélien

After testing, the problem is still present in 15.2(7)E2.

Apr 22 16:15:03.574: %PM-3-INTERNALERROR: Port Manager Internal Software Error (vlan > 0 && vlan < PM_MAX_VLANS: ../switch/pm/pm_vlan.c: 878: pm_vlan_test_portlist)

It seems that the RADIUS reply packet is no longer interpreted correctly. Has anybody encountered this kind of issue?

Any resolution to this? I have the exact same error (PM-3-INTERNALERROR: Port Manager Internal Software Error (vlan > 0 && vlan < PM_MAX_VLANS: ../switch/pm/pm_vlan.c: 878: pm_vlan_test_portlist)) after upgrading our fleet of 2960X-48FPD-L switches to 15.2(7)E2. The port seems to randomly fail wired dot1x and revert to MAB; ISE does not show any failures. On its own it will switch back to dot1x and proceed successfully. I've run that code on my test/lab switch for some time without issue, so I thought this release was good.
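
For what it's worth, when a port misbehaves I check the switch-side state with something like the following; the interface name here is just an example:

show authentication sessions interface GigabitEthernet1/0/1
show dot1x interface GigabitEthernet1/0/1 details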

Hello

I have a TAC case open, but for the moment no workaround has been found.

It seems they found a bug related to the combined use of 802.1X and IP Device Tracking (IPDT), but as it is not possible to disable the entire IPDT feature, we weren't able to validate that this is the real problem.
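
The closest we could get to checking that theory ourselves was basic visibility into what IPDT has learned on an affected port, for example:

show ip device tracking all
show ip device tracking interface GigabitEthernet6/0/40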

If you can, I would consider opening a TAC case to provide more input, as the random nature of the impact makes it difficult to gather relevant data. I will keep you informed if a solution is found.

Best regards

Aurélien

I do have a TAC case open. We ended up reverting our switches back to what they were running before, 15.2(2)E3.

 

I had a hard time replicating it on my lab switch. I was finally able to after adding "authentication mac-move permit", which was on our production switches, to my lab switch. Now I can trigger it fairly consistently. I noticed it doesn't always correlate with the traceback posted above, though. In fact, more often than not the traceback isn't shown, so that may or may not be a separate issue.

I can confirm the same on 15.2(7)E2: the port seems to randomly fail wired dot1x. We reverted back to 15.2(7)E0a.

Dan - do you have authentication mac-move permit turned on? Try turning it off if so.
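
If it helps, checking for it and turning it off is just the following (mac-move is denied by default, so the "no" form simply reverts to the default):

show running-config | include mac-move
conf t
 no authentication mac-move permit
 end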

 

I ended up having to close my TAC case. We found an (unrelated) issue: STP TCNs from nearby access-layer switches were reaching the switches connecting VMware hosts (and consequently ISE) to our network when they shouldn't have been. This caused the ISE machines' MACs to be constantly flushed and re-learned in the table. There were never any reachability issues between ISE and the NADs during tests with pings. Once this issue was resolved, I couldn't replicate the problem anymore on our lab switch, even with mac-move turned on.

 

Coincidence or not, I can't tell... but it seems that, combined with our short timers (we had to set the dot1x timeout to 1 second for PXE boot to work), the MAC learning issue was causing this. Also, watch the ISE version you're using... we are still on 2.1, which officially only supports up to 15.2(2)E4 on 2960-Xs.
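
For context, "dot1x timeout to 1s" in our case means roughly the following on the access ports; the exact values are from memory and far too aggressive for most environments, so treat them as an example only:

interface GigabitEthernet1/0/1
 dot1x timeout tx-period 1
 dot1x max-reauth-req 1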

lostomania
Level 1

We are experiencing the same issues on our Catalyst 1000 switches.

The conditions are the same: the switches are stacked and we are running dot1x with dynamic VLAN assignment from a RADIUS server.

It seems the problem only appears when we connect an endpoint to one of the slave switches. The master switch seems to be OK.

The problem is resolved if we "break" the stack and run all the switches standalone.

This makes me wonder if the RADIUS messages get corrupted across the stack links.
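
For anyone trying to reproduce this, the stack roles (master vs. members) and which member owns a given port can be checked with the usual stack commands:

show switch
show switch detail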

Here are my logs for reference:

 

000105: Sep 30 13:40:17.909 BST: %PM-3-INTERNALERROR: Port Manager Internal Software Error (vlan > 0 && vlan < PM_MAX_VLANS: ../switch/pm/pm_vlan.c: 878: pm_vlan_test_portlist)
000115: Sep 30 14:07:33.418 BST: %PM-3-INTERNALERROR: Port Manager Internal Software Error (vlan > 0 && vlan < PM_MAX_VLANS: ../switch/pm/pm_vlan.c: 878: pm_vlan_test_portlist)
000122: 000026: Sep 30 15:18:09.462 BST: %PM-3-INTERNALERROR: Port Manager Internal Software Error (vlan > 0 && vlan < PM_MAX_VLANS: ../switch/pm/pm_vlan.c: 878: pm_vlan_test_portlist)
000126: Sep 30 15:18:42.032 BST: %PM-3-INTERNALERROR: Port Manager Internal Software Error (vlan > 0 && vlan < PM_MAX_VLANS: ../switch/pm/pm_vlan.c: 878: pm_vlan_test_portlist)

 

 

We need a hotfix for this ASAP.

 

 

The problem you describe is something completely different from the original post.
It does look like a software problem, but it has nothing to do with ACLs.
-> You had better start a new thread.

 

>>> We need a hotfix for this ASAP <<<
If that's the case, your action is to open a Cisco TAC case, not to reply to a community thread from some time ago.

Be aware that the community is made up of volunteers who want to help and may have experienced the issue before.

It is not Cisco's professional helpdesk, and the community cannot write hotfixes.

 

Sorry for not explaining properly, but I disagree with your assessment that this is "completely different".

We already have a TAC case going for this, and TAC has linked it to the symptoms of CSCvf23606, where the conditions are the same: stacked switches using dynamically assigned VLANs from RADIUS.

The only difference is that we are running Catalyst 1000 switches.

 

If we split the stack, the problems disappear, and I thought that would be helpful for the other people in this thread who don't want to configure all of the VLANs manually. It could be that this workaround only applies to the Catalyst 1000, but that's what community threads are for, right?

Hello,

 

Sounds like it could be a problem with spanning tree and BPDUs. Can you post the configs of two connected stack switches, as well as the output of:

 

debug spanning-tree bpdu receive

 

from both of those switches?
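
If you do capture that, something along these lines keeps it manageable (stop the debug as soon as you have a few BPDUs):

debug spanning-tree bpdu receive
! capture for a short while, then:
undebug all
show logging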

Melantrix
Level 1

Hello,

 

I can also confirm that 15.2(7)E2 still has this issue.
We recently upgraded from 15.2(7)E0a to 15.2(7)E3, but that completely crashed multiple stacked switches.


We reverted to 15.2(7)E2 and then encountered the issues described here.
Also, STP errors came up, there were lots of err-disabled ports (and port-channels, unfortunately), four 2960XR switches in a stack of 7 crashed completely, and more and more weird errors appeared.

It all looks like the result of this bug which made everything say 'nope'.

 

STP and error-disabled ports... that makes it sound like a loop somewhere. This reminded me of a side effect we hit when we went through this upgrade/downgrade ordeal. In our original config, all our access ports had spanning-tree portfast applied. When upgrading to 15.2(7) it was automatically changed to spanning-tree portfast edge. The problem is that when downgrading again, the "edge" keyword isn't supported, so the entire line got removed. This caused access ports to send TCNs, triggering other issues. That took a while to figure out. Luckily you can apply portfast to access ports with one global command.
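
For reference, the global command I mean is the one below; the keyword differs between trains, which is exactly what bit us on the downgrade:

! 15.2(7) and later trains:
spanning-tree portfast edge default
! older trains:
spanning-tree portfast default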

 

Incidentally, if you're running ISE, what version? TAC pointed out that when we went through this, the ISE version we were on only supported up to a particular 2960-X code; 15.2(7) was newer than what was listed.
