Solved: Re: DOT1X_SWITCH messages causing high response times

EduardR · ‎08-26-2019

Hello fellow experts,

We are facing an issue on our Catalyst switch stacks that makes them unresponsive to administration. We currently have 802.1X configured against a third-party vendor, and sometimes we get a burst of messages of this type:

%DOT1X_SWITCH-5-ERR_VLAN_EQ_VVLAN: Data VLAN <<>> on port <<PORT>> cannot be equivalent to the Voice VLAN AuditSessionID 2

The issue does not affect all ports, only 2 or 3 of them. When it appears, the switch logs fill up quickly and the switches start to respond with several seconds of delay, making them inaccessible via VTY or even SNMP. We think the issue is caused by a wrong configuration on some clients (IP phones) and we are investigating that, but we need to stop the blocking behaviour. It appears that the logging rate is so high that the system cannot write all the logs, and that I/O bottleneck is causing the switch to get stuck.

Is there any way to stop those logs from being written to flash, or at least reduce how often they appear? Or, better, is it possible to configure an err-disable policy that blocks those ports the moment the misconfiguration appears? We have dug through the Cisco documentation and the only related guidance we found was "Change either the voice VLAN or the IEEE 802.1x-assigned VLAN on the interface so that they are not the same.", but that is not enough to stop the blocking.

Any idea will be appreciated.

Best regards.

EduardR · ‎09-26-2019

Hello all, and thank you for your responses. We now have a workaround in place and everything is working well, but we still have some doubts and want to share them with you, in case someone finds them useful.

We checked our configuration and procedures and found a race condition triggered by one of our change procedures, as follows:

If we execute this command sequence:
```
switch(config)#interface GigabitEthernetX/Y/Z
switch(config-if)#no switchport voice vlan A
switch(config-if)#switchport voice vlan B
```
The error pops up immediately and the switch starts to slow down:
```
%DOT1X_SWITCH-5-ERR_VLAN_EQ_VVLAN: Data VLAN ^A on port <<PORT>> cannot be equivalent to the Voice VLAN AuditSessionID 2
```
We think this is due to some logic in IOS that lacks validation or is tightly coupled to another data structure: when we issue "no switchport voice vlan", some variable is left empty and another process misbehaves because of it (the VLAN field is empty and a non-numeric character, "^A", appears instead).
At this point, the only way to recover is to shut down the port, default its configuration, and reconfigure all the parameters. If this is not done, the port keeps generating the error, even though the configuration looks correct when you check it (with the new VLAN data).
The poor response times occur because the logging is so excessive that it causes a lock in the memory-writing process:
```
%PARSER-6-WMLRETRY: Write memory lock currently held by pid <<PID>>, automatic retry
```

To solve this, we changed the procedure to use one of these two sequences:

Defaulting the port and then reapplying all the configuration:
```
switch(config)#default interface GigabitEthernetX/Y/Z
```

Removing the authentication command for voice before changing the voice VLAN, and restoring it afterwards:

switch(config)# interface GigabitEthernetX/Y/Z
switch(config-if)# no authentication event server dead action authorize voice
switch(config-if)# no switchport voice vlan A
switch(config-if)# switchport voice vlan B
switch(config-if)# authentication event server dead action authorize voice
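
If this change is scripted rather than typed by hand, the second sequence could be generated by a small helper. The following is only a sketch in Python: the function name, the example interface, and the VLAN numbers are made up for illustration, while the command strings are the ones listed above.

```python
from typing import List


def voice_vlan_change_commands(interface: str, old_vlan: int, new_vlan: int) -> List[str]:
    """Build the config-mode lines for the 'remove auth command first' sequence.

    Hypothetical helper: it only assembles the command strings from the post,
    it does not connect to or validate anything against a real switch.
    """
    if old_vlan == new_vlan:
        raise ValueError("old and new voice VLAN are the same; nothing to change")
    dead_action = "authentication event server dead action authorize voice"
    return [
        f"interface {interface}",
        f"no {dead_action}",                     # remove the voice dead-server action first
        f"no switchport voice vlan {old_vlan}",  # then remove the old voice VLAN
        f"switchport voice vlan {new_vlan}",     # set the new voice VLAN
        dead_action,                             # finally restore the dead-server action
    ]


if __name__ == "__main__":
    # Illustrative values only.
    for line in voice_vlan_change_commands("GigabitEthernet1/0/1", 20, 30):
        print(line)
```

Keeping the order in one place makes it harder for a change procedure to fall back to the plain "no switchport voice vlan" / "switchport voice vlan" pair that triggered the error.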

And we were happy with it... BUT Cisco TAC was also looking into the issue and told us that the whole problem was caused by a bad practice in the port configuration, and that the solution was to increase the dot1x timeout tx-period to 10 seconds. This proved to work too, but we found that answer rather thin. Maybe it really was that simple. The configuration is:

switch(config)# interface GigabitEthernetX/Y/Z
switch(config-if)# dot1x timeout tx-period 10

It is up to you which one to use.

Thank you all for the ideas and the collaboration; we hope this post helps somebody else out there.

Mark Elsen · ‎08-27-2019

- You may also want to look at the current software version being used on your platform. Sometimes such issues are caused by bugs.

M.

EduardR · ‎08-28-2019

We upgraded our switches on the recommendation of our Cisco channel; at the moment we are running Version 16.8.1r [FC4], and we have not found any bug related specifically to this. In any case, we will ask Cisco directly about it. Thank you.

Georg Pauwen · ‎08-27-2019

Hello,

if you want to get rid of the log messages, until you have found the cause of the problem, you can configure a logging discriminator:

switch(config)#logging discriminator DOT1X mnemonics drops ERR_VLAN_EQ_VVLAN
switch(config)#logging buffered discriminator DOT1X
switch(config)#logging console discriminator DOT1X
switch(config)#logging monitor discriminator DOT1X

EduardR · ‎08-28-2019

Thank you! We will check that and I'll let you know how it goes.

Thomas Bille Joergensen · ‎08-27-2019

Please supply the port configuration of a port with the problem, along with the model of the attached IP phone, and the same for a port without the problem.

Also run "sh proc cpu sorted" to see if a specific process is causing high CPU.

EduardR · ‎08-28-2019

Hi, the port configuration is "standard" across the whole stack and looks like this:

interface <<INTERFACE>>
 switchport access vlan <<DATAVLAN>>
 switchport mode access
 switchport voice vlan <<VOICEVLAN>>
 authentication event fail action authorize vlan <<PREAUTHVLAN>>
 authentication event server dead action authorize vlan <<DATAVLAN>>
 authentication event server dead action authorize voice
 authentication event no-response action authorize vlan <<PREAUTHVLAN>>
 authentication host-mode multi-domain
 authentication order mab dot1x
 authentication priority dot1x mab
 authentication port-control auto
 authentication periodic
 authentication timer reauthenticate 28800
 authentication violation restrict
 mab
 storm-control broadcast level 0.50
 storm-control multicast level 0.50
 storm-control action shutdown
 dot1x pae authenticator
 dot1x timeout tx-period 1
 spanning-tree portfast
 spanning-tree bpduguard enable
end
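
A static check for the condition named in the log message (a data or 802.1X-assigned VLAN equal to the voice VLAN) could be scripted against a saved running configuration. The sketch below is in Python and is only an illustration, not a Cisco tool: the function name and the sample configuration are invented, and it assumes numeric VLAN IDs and the command forms shown in the port configuration above. It inspects the configured values only, so it would not detect a port whose configuration looks correct while the error is still being logged.

```python
import re
from typing import Dict

VOICE_RE = re.compile(r"switchport voice vlan (\d+)")
# Lines that assign a data VLAN, statically or as an authentication fallback.
DATA_RE = re.compile(
    r"(?:switchport access vlan|authentication event .+ action authorize vlan) (\d+)"
)


def find_vlan_clashes(config_text: str) -> Dict[str, int]:
    """Return {interface: vlan} for ports whose voice VLAN is also used as a data VLAN."""
    clashes: Dict[str, int] = {}
    iface, voice, data = None, None, set()
    # The trailing sentinel line forces the last real stanza to be evaluated.
    for raw in config_text.splitlines() + ["interface <end-of-input>"]:
        line = raw.strip()
        if line.startswith("interface "):
            if iface is not None and voice is not None and voice in data:
                clashes[iface] = voice
            iface, voice, data = line.split(None, 1)[1], None, set()
            continue
        match = VOICE_RE.fullmatch(line)
        if match:
            voice = int(match.group(1))
            continue
        match = DATA_RE.fullmatch(line)
        if match:
            data.add(int(match.group(1)))
    return clashes


if __name__ == "__main__":
    # Invented sample: the second port reuses VLAN 20 for data and voice.
    sample = """\
interface GigabitEthernet1/0/1
 switchport access vlan 10
 switchport voice vlan 20
 authentication event server dead action authorize vlan 10
!
interface GigabitEthernet1/0/2
 switchport access vlan 20
 switchport voice vlan 20
end
"""
    print(find_vlan_clashes(sample))
```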

The IP phone is a Cisco CP-7841.

The "sh proc cpu sorted" output could not be checked, because the stack became unresponsive even from the console. We had to power cycle the stack to recover management access.

EduardR · ‎08-28-2019

We have an update.

We checked the last logs before power cycling the stack and found this message:

%PARSER-6-WMLRETRY: Write memory lock currently held by pid '504', automatic retry.

Sadly, we could not check the PID owner at that moment, and PID 504 is currently held by NTP (we don't think it is related).

Thomas Bille Joergensen · ‎08-29-2019

I think I would try swapping phones between ports to see if the issue follows the phone. I would also try simplifying the configuration on a port, disabling dot1x, to see how it behaves as a normal access port with voice enabled. Another possibility is to break the stack and see how it behaves then. These are divide-and-conquer troubleshooting approaches, to try to limit the scope.

It would be nice to have the CPU process information, but I can see that this is difficult to get. Perhaps check the flash for crash dumps that could reveal more information. Is a syslog server configured? Perhaps logs are still being sent to it while the console is unresponsive.
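
If a syslog server did capture the messages, a short script could show which ports are generating them and how many each one produced. This is a rough Python sketch under assumptions: the message layout is taken from the log line quoted in this thread, the port is assumed to be a single token without spaces, and the function name and sample lines are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Layout copied from the message in this thread; the data VLAN field may be
# empty or contain a control character, so it is matched loosely.
PATTERN = re.compile(
    r"%DOT1X_SWITCH-5-ERR_VLAN_EQ_VVLAN: Data VLAN (.*?) on port (\S+) "
)


def count_by_port(lines: Iterable[str]) -> Counter:
    """Count ERR_VLAN_EQ_VVLAN messages per port name."""
    counts: Counter = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative lines only; port names are invented.
    tail = " cannot be equivalent to the Voice VLAN AuditSessionID 2"
    sample = [
        "%DOT1X_SWITCH-5-ERR_VLAN_EQ_VVLAN: Data VLAN \x01 on port Gi1/0/5" + tail,
        "%DOT1X_SWITCH-5-ERR_VLAN_EQ_VVLAN: Data VLAN  on port Gi1/0/5" + tail,
        "%DOT1X_SWITCH-5-ERR_VLAN_EQ_VVLAN: Data VLAN \x01 on port Gi2/0/7" + tail,
    ]
    for port, hits in count_by_port(sample).most_common():
        print(port, hits)
```

Sorting by count points at the 2 or 3 ports worth swapping phones on first.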

Cisco CLI Analyzer could be used to check the configuration.
