Troubleshooting EtherChannel!

Eric R. Jones
Level 4

Hello all, we are having a devil of a time with some EtherChannel shenanigans.

We have a VSS pair of 6509's connecting to some 3850's.

We have port-channeling enabled between the edges and the 6500's.

We are making our changes to the virtual/logical port-channel as required.
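For context, here is a minimal sketch of what the edge side of one of these links looks like (the interface and channel-group numbers are placeholders, not our exact config); we make the changes on the Port-channel interface and they are inherited by the member ports:

! logical interface - per-channel changes are made here
interface Port-channel1
 switchport mode trunk
!
! physical member ports bundled with LACP
interface range GigabitEthernet1/1/1 - 2
 switchport mode trunk
 channel-group 1 mode active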

We recently updated 22 3850's to 16.9.1 from 16.8.1a.

The switches ran for about 12-15 hours before they dropped connectivity to the 6500's.

They did this intermittently, meaning one edge (3850) was still running but we could not manage it, and users were disconnected from the network for about 4 hours. After 4 hours the switch regained its connection without us doing anything to either end. It remained up for 3 hours, then dropped off the network for 5 minutes or so and came right back up.

We found a switch with no users and began troubleshooting it.

In no particular order, we methodically moved connections on the edge from one Gig port to another.

We moved SFP modules, changed the ports used in the port channel, and rebuilt the port channels from scratch on both ends of the link.

Each time we would see the link attempt to come up but ultimately fail.

What got us limping along was changing the physical port on the edge side, disabling the channel-group on that link, and shutting down the other port.

This got us connected, but single-threaded.
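In CLI terms, the workaround amounted to roughly this (a sketch; Gi1/1/1 and Gi1/1/2 stand in for whichever member ports were actually involved):

! run the surviving link as a plain trunk outside the bundle
interface GigabitEthernet1/1/2
 no channel-group 1
 no shutdown
!
! shut the other member so it cannot keep flapping the channel
interface GigabitEthernet1/1/1
 shutdown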

Later we tried swapping the fiber, then swapping it back, and finally swapping the SFP modules.

At this point we regained sync on the port-channel and all was well.

We returned the next day to try to replicate the failure and the fix, but it failed and we are currently single-threaded again.

I have been reading up on EtherChannel and how to troubleshoot this, but has anyone ever faced this issue?

Yes, we compared configs between the failing edges and the ones not affected. They are the same.

We haven't rolled back to the old IOS due to security issues that made us roll forward in the first place.

We also haven't rolled forward into the next IOS version.

We have opened a case with TAC but I thought I would reach out and see what ideas everyone else has.

ej

5 Replies

Leo Laohoo
Hall of Fame
First, I'm going to say that the IOS version, 16.9.1, could be the culprit.
Next, I'm also going to say that it is probably a known bug.
Just want to make sure you're on a 1 Gbps link and not a 10 Gbps/SFP+, right? And the SFP module is a dedicated SFP module (C3850-NM-4-1G) and not an SFP/SFP+ (C3850-NM-4-10G)?
Can you try downgrading to, say, 16.3.7?
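Both are quick to confirm from the CLI (assuming Gi1/1/1 is one of the uplink ports):

show inventory
show interfaces GigabitEthernet1/1/1 transceiver
show interfaces GigabitEthernet1/1/1 status

The inventory output lists the NM module and SFP part numbers, and the transceiver/status output confirms the optic type and negotiated speed.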

Interesting that you say this could be related to 16.9.1 and could possibly be a known bug. We have found nothing in general Google searches, nor has Cisco admitted to a possible bug yet...

That being said, we would have to get permission to downgrade, as we are on a tight IAVA/IAVM watch and we went to 16.9.1 to mitigate an IAVM related to 16.8.1a.

As for the modules, we are using 1G SFPs, and some edges use a different SFP type that allows a single strand of fiber to carry both directions. None of the edges with that newer SFP appear to have an issue. We have the NM-4-1G module and tried swapping it with another one, but no change. The one constant we do see is that g1/1/1 seems to be the port that refuses to link under port-channel conditions. This failure happened when we swapped the Gig module.

ej

 

 

Update - We found that on all the switches where 16.9.1 was installed, one of the port-channels was suspended. Of the 22 switches, only 6 had failures affecting both ports in the port-channel; the others just failed on one side. We experimented with On-On and Desirable-Desirable modes. In both cases the connections came up properly; however, we do see "unknown protocol drops". We reviewed the release notes, and nothing is mentioned about port channels except that MACsec is now allowed on L2 (our configuration) and L3 port-channel configurations.
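For reference, the mode experiments were nothing more exotic than re-applying the channel-group on the member ports (the group number and interface range below are placeholders), then watching the counters:

interface range GigabitEthernet1/1/1 - 2
 no channel-group 1
 channel-group 1 mode on
! repeated with "channel-group 1 mode desirable" for the PAgP test

show etherchannel summary
show interfaces Port-channel1 | include protocol drops

The summary output is where the suspended member shows up (the "s" flag), and the "unknown protocol drops" counter is part of the standard show interfaces output.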

ej

We finally got a WebEx session going with Cisco TAC and confirmed with them that it's not a configuration issue on our side, and that traffic flows between access and distribution switches after changing from LACP Active/Active to On/On, PAgP Desirable/Desirable, and modes in between. They stated that it appears to be a software issue and have gone to the lab to see what they can do. I still haven't run across anyone else with this issue, but you would have to be doing port-channels with dual homing to your distribution switch to see it.
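For anyone wanting to check the same thing on their own gear, the per-protocol state on each end is visible with the usual commands (run on both the access and distribution side):

! when the group is running LACP
show lacp neighbor
! when the group is running PAgP
show pagp neighbor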

We are awaiting the results.

 

ej

Hi,

 

I have had the same issue when I upgraded one of our C3850s to 16.9.1 - it broke the port channel. I was asked by TAC to check the config on both sides and make sure they are the same. It was the same.

 

That port channel is still broken. I had been in touch with TAC and they gave me some ideas, but before I could arrange a downtime to test some more, IOS 16.9.2 was released.

 

I installed 16.9.2 on a couple of my spare C3850s yesterday and had a C3750X connected to these via fibre on a 1G SFP and a 1G module.

 

I tried to break the EtherChannel but could not recreate the problem. With the limited time I get to play around with this, I don't think I will be able to downgrade one of these switches to 16.9.1 and see if I can recreate the problem, but I will try.

 

My EtherChannel port config is as follows:

description XX
switchport trunk allowed vlan X,X
switchport mode trunk
switchport nonegotiate
storm-control broadcast level 0.25
storm-control action trap
channel-group 1 mode active
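For completeness, the corresponding Port-channel interface in a setup like this would normally carry the same switchport settings; a sketch using the placeholders above:

interface Port-channel1
 switchport trunk allowed vlan X,X
 switchport mode trunk
 switchport nonegotiate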

 

Regards,

 

Sheikh

 
