Re: Nexus 7000 trunking issue vlan being removed

pwilliams86 · ‎01-19-2018

Hi,

We have a Nexus 7000

Software
BIOS: version 3.1.0
kickstart: version 6.2(16)
system: version 6.2(16)
BIOS compile time: 02/27/2013
kickstart image file is: bootflash:///n7700_s2_kickstart.6.2.16.bin
kickstart compile time: 1/27/2016 9:00:00 [04/05/2016 21:11:19]
system image file is: bootflash:///n7700_s2_dk9.6.2.16.bin
system compile time: 1/27/2016 9:00:00 [04/05/2016 22:15:44]

Hardware
cisco Nexus7700 C7706 (6 Slot) Chassis ("Supervisor Module-2")
Intel(R) Xeon(R) CPU with 32745060 kB of memory.
Processor Board ID JAE184403VZ

We recently connected a switch to port e1/48. The uplink is set as a trunk at both ends and all devices on the switch are in VLAN 10. However, the config on the Nexus was as follows:

interface Ethernet1/48
description link-id-10
switchport
switchport mode trunk
no shutdown

So there was no specific switchport trunk allowed vlan.... command.

The link came up and we were able to ping devices on the switch. Then, 5 minutes later, we noticed the pings dropping 100%. #show cdp neighbor still showed the switch at the other end, but #show int trunk showed that no VLANs were allowed on int e1/48. I then did a #show run int e1/48:

interface Ethernet1/48
description link-id-10
switchport
switchport mode trunk
switchport trunk allowed vlan none
no shutdown

I then specifically added vlan 10 to the interface trunk, shut, no shut, but the layer 2 link wouldn't come back up and we had to back out of the change (this is a live environment). There were no logs for this either. I know it is best practice to state explicitly which VLANs are allowed, and normally we do, but what would cause the link to come up and work, only to stop working a few minutes later?

Many thanks,

P

Peter Paluch · ‎01-19-2018

Hi,

To the best of my knowledge, there is no automatic component of NX-OS that would, based on its own decision or whims, add the switchport trunk allowed vlan none command to an interface.

Is it possible to revisit the show accounting log output? Personally, I would check very carefully whether the switchport trunk allowed vlan none is mentioned there, and if so, what was the user ID with which this command was added.

Also, have there been any logs, no matter how apparently insignificant, recorded in show logging log around the time of configuring the interface, and the unexpected addition of that command?

Best regards,
Peter

pwilliams86 · ‎01-19-2018

Thank you. The last config changes made to that port (before we patched into it today) were a month ago:

Fri Dec 15 14:28:32 2017 interface Ethernet1/48 ; no switchport (REDIRECT)
Fri Dec 15 14:28:32 2017 interface Ethernet1/48 ; no switchport (SUCCESS)
Fri Dec 15 14:28:43 2017 interface Ethernet1/48 ; switchport (REDIRECT)
Fri Dec 15 14:28:43 2017 interface Ethernet1/48 ; switchport (SUCCESS)
Fri Dec 15 14:28:47 2017 interface Ethernet1/48 ; switchport mode trunk (REDIRECT)
Fri Dec 15 14:28:47 2017 interface Ethernet1/48 ; switchport mode trunk (SUCCESS)
Fri Dec 15 14:28:57 2017 interface Ethernet1/48 ; switchport trunk allowed vlan add 10 (REDIRECT)
Fri Dec 15 14:28:57 2017 interface Ethernet1/48 ; switchport trunk allowed vlan add 10 (SUCCESS)

So when we patched into it there was actually a command saying allowed vlan add 10 (although the running config didn't show it). No one has added a switchport trunk allowed vlan none command at any point. It seems the Nexus has added it between us patching the fibre in and the pings then dropping (the only two people who can log into the Nexus are myself and a colleague who was sat next to me, both scratching our heads).

If a trunk is being set up from scratch, like above, should the command initially be switchport trunk allowed vlan 10 and not switchport trunk allowed vlan add 10? Possibly it hasn't liked that? Although I can't believe Cisco would not have fixed that years ago.
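For anyone who wants to repeat this check on a longer log, here is a rough Python sketch that filters accounting-log lines for allowed-VLAN commands on one port. It assumes the trimmed line layout pasted above (timestamp, context, " ; ", command, status in parentheses); the real show accounting log output carries more fields, and the names parse_line, SAMPLE, and vlan_cmds are made up for this example.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above (an assumption, not the full NX-OS format):
#   <timestamp> <context> ; <command> (<STATUS>)
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<body>.+) \((?P<status>\w+)\)$"
)


def parse_line(line):
    """Split one log line into time, context, command and status, or return None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The command is whatever follows the last " ; " separator.
    context, _, command = m.group("body").rpartition(" ; ")
    return {
        "time": datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y"),
        "context": context,
        "command": command,
        "status": m.group("status"),
    }


# Sample lines copied from the excerpt above (SUCCESS entries only).
SAMPLE = """\
Fri Dec 15 14:28:32 2017 interface Ethernet1/48 ; no switchport (SUCCESS)
Fri Dec 15 14:28:43 2017 interface Ethernet1/48 ; switchport (SUCCESS)
Fri Dec 15 14:28:47 2017 interface Ethernet1/48 ; switchport mode trunk (SUCCESS)
Fri Dec 15 14:28:57 2017 interface Ethernet1/48 ; switchport trunk allowed vlan add 10 (SUCCESS)
"""

entries = [e for e in map(parse_line, SAMPLE.splitlines()) if e]

# Keep only successful allowed-VLAN commands entered under Ethernet1/48.
vlan_cmds = [
    e["command"]
    for e in entries
    if e["status"] == "SUCCESS"
    and "Ethernet1/48" in e["context"]
    and "allowed vlan" in e["command"]
]
print(vlan_cmds)
```

On the December entries this leaves only the add 10 command, which matches the observation that nobody had typed allowed vlan none at that point.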

Many thanks,

P

Peter Paluch · ‎01-19-2018

Hello,

Thank you for the clarification!

The effect of configuring switchport trunk allowed vlan add 10 on a trunk that already has all VLANs allowed is that the configuration does not change: you only allow a VLAN that is already allowed. Configuring it would not cause any harm, and it is also expected that, under the circumstances, this command would not show in the interface configuration, as it is implicitly (by means of all VLANs being allowed) already there.
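The difference between the add form and the plain form can be pictured as set operations on the allowed list. The following Python sketch is only an illustrative model, not anything the switch runs: the function name apply_allowed_vlan and the keyword labels are invented for the example, and "all VLANs" is simplified to 1-4094 even though real platforms reserve some IDs internally.

```python
# Simplified model: "all VLANs" is taken to be 1-4094 (an assumption; real
# platforms reserve some IDs for internal use).
ALL_VLANS = frozenset(range(1, 4095))


def apply_allowed_vlan(current, keyword, vlans=()):
    """Return the allowed-VLAN set after one 'switchport trunk allowed vlan' command.

    keyword is one of: "set" (plain VLAN list), "add", "remove", "all", "none".
    """
    vlans = frozenset(vlans)
    if keyword == "set":
        return vlans                       # the plain form replaces the whole list
    if keyword == "add":
        return frozenset(current) | vlans  # union with what is already allowed
    if keyword == "remove":
        return frozenset(current) - vlans
    if keyword == "all":
        return ALL_VLANS
    if keyword == "none":
        return frozenset()
    raise ValueError(f"unknown keyword: {keyword}")


# A trunk with no explicit allowed list permits everything, so 'add 10' is a no-op.
fresh_trunk = ALL_VLANS
after_add = apply_allowed_vlan(fresh_trunk, "add", [10])
print(after_add == fresh_trunk)

# The plain form narrows the trunk to VLAN 10 only.
after_set = apply_allowed_vlan(fresh_trunk, "set", [10])
print(sorted(after_set))
```

In this model the add form leaves a wide-open trunk untouched, while the plain form prunes every VLAN except the one listed.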

This is getting interesting. By any chance, do you have config-sync configured between this switch and a peer N7K?

Also, you said that after you noticed the "allowed vlan none" command added, you "specifically added vlan 10 to the interface trunk, shut, no shut, but the layer 2 link wouldnt come back up". What does this mean exactly? Can you elaborate on this?

Thank you!

Best regards,
Peter

pwilliams86 · ‎01-19-2018

Hi Peter,

Excitement over, I think. You were right about checking what config had been entered.

The fibre was patched in, link came up and was working:
2018 Jan 19 09:30:04 KHM22N04-KGH-MDP2 %ETHPORT-5-IF_HARDWARE: Interface Ethernet1/48, hardware type changed to 1G

Then a few minutes later my colleague tried to tidy up the trunk with:
Fri Jan 19 09:32:25 2018 Ethernet1/48 ; switchport trunk allowed vlan 10 (SUCCESS)
Firstly this wasn't needed, secondly it had the effect that the link stopped working - that I can't explain, should adding vlan 10 to a trunk that already has vlan 10 added cause an issue?

With devices now not pingable he has tried to back out with:
Fri Jan 19 09:35:30 2018 Ethernet1/48 ; no switchport trunk allowed vlan 10 (SUCCESS)
To remove a vlan from a trunk the command is switchport trunk allowed vlan remove 10, not no switchport.... I think that is where the Nexus has added switchport trunk allowed vlan none. Does that sound right?

I have then logged on to the Nexus and seen switchport trunk allowed vlan none.

Am I right?

I can then see that after reconfiguring the port to allow vlan 10, I bounced the interface for good measure. Between me re-enabling the interface and us backing out of the change was 38 seconds. That should have been long enough for a 1Gbps link to start passing traffic again(?) but pings were still dropping. Downtime had been 8 minutes so we couldn't really hold on any longer. I wonder if we had left it for 60 seconds it would have sorted itself out.

Peter Paluch · ‎01-19-2018

Hello,

You're spot on: Under NX-OS for N7K, removing the whole switchport trunk allowed vlan vlan-list command using the no keyword will result in the switch disabling ALL VLANs on the trunk, and understandably placing the switchport trunk allowed vlan none into the port's configuration. This is different from IOS behavior on Catalyst switches, and also from other Nexus platforms, such as N9K.

"Then a few minutes later my colleague tried to tidy up the trunk with [switchport trunk allowed vlan 10]. Firstly this wasn't needed, secondly it had the effect that the link stopped working - that I can't explain, should adding vlan 10 to a trunk that already has vlan 10 added cause an issue?"

To VLAN 10, there should have been no perceptible change whatsoever. Understandably, however, all other VLANs would become blackholed on this trunk. Was the connectivity truly provided in VLAN 10, or could it have been in a different VLAN that got pruned off this trunk as a result of the added command?

I also wonder if STP could have had anything to do with this. What kind of STP are you running - is it Rapid-PVST+, or is it MST?

"With devices now not pingable he has tried to back out with no switchport trunk allowed vlan 10"

Right - and that would have caused the switch to take the unexpected action and disable all VLANs on the trunk. I have just now tested the behavior on an N7K running 7.2(2)D1(2) and I can 100% confirm the behavior: Removing the whole switchport trunk allowed vlan vlan-list command with the no keyword is equivalent to configuring switchport trunk allowed vlan none. I can also confirm that an N9K running NX-OS 7.0(3)I7(1) behaves in the expected way - it reverts back to allowing all VLANs. Definitely a quirk of the N7K NX-OS in particular.
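To make the sequence of events concrete, here is a small Python replay of the three states e1/48 went through on 19 January. It is a sketch under stated assumptions: the per-platform result of the no form is taken purely from the test results reported in this thread (other releases may differ), "all VLANs" is simplified to 1-4094, and the names NO_FORM_RESULT, replay, and carries are hypothetical.

```python
ALL_VLANS = frozenset(range(1, 4095))  # simplification: ignores reserved VLAN IDs

# Result of 'no switchport trunk allowed vlan <list>' as reported in this
# thread for the releases that were tested; not a statement about all releases.
NO_FORM_RESULT = {
    "n7k": frozenset(),  # behaves like 'switchport trunk allowed vlan none'
    "n9k": ALL_VLANS,    # reverts to allowing all VLANs
}


def replay(platform):
    """Replay the command sequence on e1/48 and return (label, allowed set) pairs."""
    states = []
    allowed = ALL_VLANS                 # bare 'switchport mode trunk'
    states.append(("initial", allowed))
    allowed = frozenset({10})           # switchport trunk allowed vlan 10
    states.append(("allowed vlan 10", allowed))
    allowed = NO_FORM_RESULT[platform]  # no switchport trunk allowed vlan 10
    states.append(("no allowed vlan 10", allowed))
    return states


def carries(allowed, vlan):
    """True if the trunk would still forward the given VLAN."""
    return vlan in allowed


for platform in ("n7k", "n9k"):
    final = replay(platform)[-1][1]
    print(platform, "VLAN 10 allowed after back-out:", carries(final, 10))
```

The model shows why the attempted back-out made things worse on the N7K: the no form empties the list there instead of restoring the default, so remove, add, or an explicit list is the safer way to adjust a trunk on that platform.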

"I can then see that after reconfiguring the port to allow vlan 10, I bounced the interface for good measure. Between me re-enabling the interface and us backing out of the change was 38 seconds. That should have been long enough for a 1Gbps link to start passing traffic again(?) but pings were still dropping. Downtime had been 8 minutes so we couldn't really hold on any longer. I wonder if we had left it for 60 seconds it would have sorted itself out."

There seems to be something else going on. 38 seconds would be enough for the link to become forwarding in STP even if STP had to go through the full Discarding->Learning->Forwarding sequence (that would take 30 seconds if you are running on default STP timers). I wonder if the logs on the device attached to e1/48 could shed some more light on this mystery.
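The 30-second figure is simply two forward-delay intervals at the IEEE default of 15 seconds each. A minimal Python check of that arithmetic against the 38 seconds reported earlier, assuming default timers on all devices (the constant and function names are invented for the example):

```python
FORWARD_DELAY_S = 15  # IEEE default forward delay, assumed unchanged on all devices


def worst_case_forwarding_delay(forward_delay_s=FORWARD_DELAY_S):
    """Timer-driven path: one forward delay before Learning, one more before Forwarding."""
    return 2 * forward_delay_s


observed_wait_s = 38  # time between 'no shut' and backing out, from the thread
needed_s = worst_case_forwarding_delay()

print("worst case (s):", needed_s)
print("waited long enough:", observed_wait_s >= needed_s)
```

With default timers the port should have been forwarding about 8 seconds before the change was backed out, which is why STP convergence alone does not explain the outage.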

Let me ask a set of questions:

1. What kind of device is connected to your e1/48? I believe you've mentioned it is a switch. Can you clarify what exact type and perhaps what OS version that device is running?
2. Once again, what kind of STP are you running between this N7K and the switch connected to e1/48?
3. Is there any chance that the connectivity was actually occurring in a different VLAN than VLAN 10?
4. Are you using VTP, and VTP Pruning in particular?

Thank you!

Best regards,
Peter

pwilliams86 · ‎01-19-2018

Hi Peter,

What kind of device is connected to your e1/48? I believe you've mentioned it is a switch. Can you clarify what exact type and perhaps what OS version that device is running?
It is a Cisco WS-C3750G-24TS-1U running:
Cisco IOS Software, C3750 Software (C3750-IPBASEK9-M), Version 15.0(2)SE4, RELEASE SOFTWARE (fc1)

It is a remote site switch. It runs through an ISP but it is a layer 2 link, hence being able to see the switch from the Nexus using #show cdp neighbor even when the VLAN stuff was not working. What is strange is that to get the link working we have to piggyback VLAN 10 through another trunk to an end user switch stack (C3850s), and then we connect into the NTU of our provider using gig1/1/2. What we want to do is connect the Nexus directly to the NTU. The interface config for the 3850 is:

interface GigabitEthernet1/1/2
  switchport trunk allowed vlan 10
  switchport mode trunk

and that is it (apart from a description). It can't be a VLAN mismatch because otherwise it would also fail on that interface.

Both sides, in fact all three devices, are running RSTP, so the 38 seconds is being even more generous. By chance I had done a packet capture on the link the other day and I can see that the STP messages are correct for the configuration; it's not like the remote switch is failing.

I don't think we are using VTP. It's been years since I configured it, but #show vtp status said disabled for pruning and #show vtp counters showed 0 for all interfaces.

Finally, there are no logs on the remote switch for this work and no logs on the Nexus for any VLAN issues.

I appreciate your help with this,

Paddy