Cisco 3850 16.12.3a POE issues

AdamF1 · ‎09-01-2020

Good morning,

Is anyone else running 16.12.3a IOS on 3850 switches?

Here is an issue we are facing, but I cannot find any documentation of a bug in this code, and it's still the recommended release to move to...

Issue: PoE stops functioning on random ports but works on others. Affected ports will not power Avaya phones, cameras, Cisco phones, or Cisco APs (3602, 3702, 3802).

Workaround: Reboot the switch, downgrade, or find a port that still provides PoE.

We upgraded and tested on several stacks for a month or two with no issues before deploying to approximately 30 stacks of 3850s. After the mass deployment we began to see PoE issues on switches, seemingly triggered by removing or adding a PoE device. Once the condition has been triggered it does not go away until the switch is rebooted or downgraded. The logs state "Controller port error, interface x/x/x, power given, but machine power good wait timer timed out."

I have found similar issues or bugs in older code versions. Have we regressed?

pieterh · ‎09-01-2020

Did this happen after the upgrade to 16.12.3a?
In my opinion PoE is a hardware function, not directly OS related, but...
there may be a change in the CDP/LLDP defaults that could trigger your issue.

I did encounter an incompatibility between Cisco PoE+ and a (heated) external camera housing.
It was not related to the IOS version but behaved differently on different platforms.
The initial PoE current draw was too high, so the Cisco switch shut the port down temporarily, and this process looped.
The vendor claimed that how to handle this is unspecified behaviour within the PoE+ standard.

AdamF1 · ‎09-01-2020

Yes, it was after upgrading. I suspected hardware at first and actually RMA'd the switch, but the issue started popping up weeks later across multiple sites.

Downgrading to 16.9.5 or upgrading to 16.12.4 or rebooting resolves the issue. I have not run either code long enough yet to see if it will happen again. I have had to do this to about 8 switch stacks so far.

These are similar bugs. They are listed as fixed in prior releases, but as we all know a bug can regress and reappear in later code.

CSCvj76259

CSCvd46008

Leo Laohoo · ‎09-01-2020

The 3650/3850 has some design defects, and one of them is a hardware bug known as the MOSFET issue (which is what the two bugs you've mentioned refer to).

Hard reboot (pull the power) is only a workaround.

The only way to fix this is to RMA the appliance.

AdamF1 · ‎09-01-2020

Leo,

Thanks for the response.

I do not consider this a hardware issue when a reboot or a downgrade/upgrade resolves it. If it were a hardware issue, it should pop right back up after performing the steps above. I would also have expected this bug to show up in the many 3.x.x upgrades we have performed on these switches over the past 5 years. The issue has appeared on 3850 switches manufactured from 2013 through 2018, and only after upgrading to 16.12.3a. This code version also has a lot of SNMP issues that prevent you from polling certain OIDs.

I am going to deploy 16.12.4 to a larger sample pool to see if the issue re-occurs or can be triggered.

AdamF1 · ‎09-30-2020

Well it was deployed to a larger sample and the issue appeared in a stack running 16.12.4 after 4 weeks.

Is no one else experiencing this issue?

What code is everyone else running on their 3850s?

Art Astafiev · ‎12-17-2020

This is a known issue on the 3850 across the entire 16.12.x train, up to the latest release, 16.12.4. We have around 180 3850 switches (3 different models) and see this a lot. We are currently on 16.12.3a, but other people have reported the issue on 16.12.4. The issue is rare, but we see 1-2 switches per week doing this. The only fix is a reboot. Yesterday we opened a new TAC case, and we are keeping one switch in this state until TAC brings in the developers to debug everything to a final resolution. This is important to fix because the 16.12.x train is on bug-fix support for only another 6 months, which means nobody will bother to fix anything in 2 months.

AdamF1 · ‎12-17-2020

I wouldn't say it was a known issue for several months, as there didn't seem to be any information on it or other people experiencing it. There was no documentation of this issue in 16.12.3 or 16.12.4 for several months, and I still haven't seen a bug ID assigned to it. I was shocked to see I was the only one on the forums reporting this issue besides one other person who messaged me, and even he had trouble getting TAC to call it a bug or give it an ID. Our SA was digging through cases and could not find anything but RMAs for "power" issues. The 16.12.x train is extremely buggy for the 3K line; I was told it was geared more toward the 9K line. When this thread was started, 16.12.3a was the recommended code for 3850s, but they quickly rolled that back to the 16.9.x train, and the recommended code is now 16.9.6. I have been running 16.9.5 on over 100 switches with no PoE issues and have started rolling out 16.9.6 for testing. The 16.12.4 code works great on the 9Ks as well. Gotta love Cisco and their codes...

Did TAC give you a bug ID for this issue?

Art Astafiev · ‎12-18-2020

I just received two responses on my TAC case. I will list them one after the other.

-------------------------------------------

Response 1

---------------------------------------

I’ve been researching this issue. It has already been identified that there are some PoE issues on the 16.12.x train; this issue has been recreated and fixed. The fix will come with the 16.12.5 code release, which is tentatively due out on Jan 22, 2021.

Please check this out:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvv50628

Cat3850 : PoE doesn't work - Power given, but State Machine Power Good wait timer timed out

CSCvv50628

Description

Symptom:

Switches and/or stack running versions Gibraltar 16.12.3 and Gibraltar 16.12.3a stop providing PoE on certain ports. This issue is seen after the following log is seen:

%ILPOWER-3-CONTROLLER_PORT_ERR: Controller port error, Interface Gix/0/y: Power given, but State Machine Power Good wait timer timed out.

The impacted ports could experience the following scenarios:

a) The PoE device will power up for a few seconds (5-45) and then it dies, there have been cases where the device powers up for up to 5 minutes.
c) No power at all is seen.
Disconnecting/reconnecting the cable or a shut/no shut the impacted port does not resolve the issue.

Conditions:

It looks like it is a matter of time (weeks) when the issue is seen.

Workaround:

So far the only workaround found is to reload the impacted stack/switch

Further Problem Description:

Once a single port reaches this state, all ports are likely to experience the same issue. That is, if a single port in a stack has this issue and any other port is disconnected and reconnected, that port will experience the same issue.
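Since the bug description ties the failure to one specific log message, collected syslog output can be scanned for it to see which stack members have already hit the condition. The following is a minimal Python sketch, not anything supplied by TAC. The function name `affected_ports`, the sample timestamps, and the assumption that interface names look like Gi<member>/<module>/<port> are illustrative only.

```python
import re
from collections import defaultdict

# Matches the message quoted in the bug description. The interface is captured
# up to (not including) the colon that follows it in the log line.
PORT_ERR = re.compile(
    r"%ILPOWER-3-CONTROLLER_PORT_ERR: Controller port error, "
    r"Interface (?P<intf>[^\s:]+): Power given, but State Machine Power Good "
    r"wait timer timed out"
)

def affected_ports(log_lines):
    """Return {stack member number: sorted interface names} for matching lines."""
    hits = defaultdict(set)
    for line in log_lines:
        match = PORT_ERR.search(line)
        if not match:
            continue
        intf = match.group("intf")
        # Assumption: the first number in <member>/<module>/<port> is the
        # stack member. Interfaces that do not fit are grouped under None.
        member = re.search(r"(\d+)/\d+/\d+$", intf)
        key = int(member.group(1)) if member else None
        hits[key].add(intf)
    return {key: sorted(names) for key, names in hits.items()}

# Hand-written sample lines; timestamps and port numbers are made up.
SAMPLE_LOG = [
    "Dec 17 10:01:02: %ILPOWER-3-CONTROLLER_PORT_ERR: Controller port error, "
    "Interface Gi1/0/12: Power given, but State Machine Power Good wait timer timed out.",
    "Dec 17 10:01:05: %LINK-3-UPDOWN: Interface GigabitEthernet1/0/12, changed state to down",
    "Dec 17 10:03:40: %ILPOWER-3-CONTROLLER_PORT_ERR: Controller port error, "
    "Interface Gi2/0/7: Power given, but State Machine Power Good wait timer timed out.",
]

if __name__ == "__main__":
    for member, ports in affected_ports(SAMPLE_LOG).items():
        print(f"switch {member}: {', '.join(ports)}")
```

Because the description says other ports tend to fail once they are disconnected and reconnected, a single hit on a stack is probably enough reason to schedule a reload for it.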

Alternatively, a software patch (SMU) for 16.12.3a is tentatively planned for Oct 23, 2020

The SMU fix for this PoE issue is cat3k_caa-universalk9.16.12.03a.CSCvv28324.SPA.smu.bin. Please note that this patch is for the 16.12.3a release only:

https://software.cisco.com/download/home/284455380/type/286308587/release/Gibraltar-16.12.3a

More info regarding SMU upgrade procedures in general can be found here:

https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst3850/software/release/16-12/configuration_guide/sys_mgmt/b_1612_sys_mgmt_3850_cg/software_maintenance_upgrade.html

--------------

Response 2

-----------------

I was looking at the TAC case. Per the TAC engineer, the fix will be in version 16.12.5, with a tentative release date of Jan 22, 2021. Also, a software maintenance upgrade (SMU) has been released to fix this issue; the download link follows. The SMU's description mentions that it is for SNMP, but it has been working fine for the PoE issue as well. https://software.cisco.com/download/home/284455429/type/286308587/release/Gibraltar-16.12.4

File name: “cat3k_caa-universalk9.16.12.04.CSCvv28324.SPA.smu.bin”

This SMU will solve the issue with the %ILPOWER-3-CONTROLLER_PORT_ERR: Controller port error, Interface Gix/0/y: Power given, but State Machine Power Good wait timer timed out

Please see below the steps to install the SMU:

Copy the SMU to the switch.
Then use #install add file flash:<filename> activate commit
Example: #install add file flash:cat3k_caa-universalk9.16.12.04.CSCvv28324.SPA.smu.bin activate commit

It will ask you to reload the switch.
After the reload the SMU will be installed.
Use show install summary and check that the state is "C" = Activated & Committed.
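The last step above (confirming state "C") can be checked across many stacks by parsing saved `show install summary` output. The following is a rough Python sketch under stated assumptions: the row layout "Type, St, Filename/Version" and the sample text are hand-written for illustration and may differ from what a given release prints, and `smu_states` and `is_committed` are hypothetical helper names.

```python
import re

def smu_states(summary_text):
    """Map each SMU filename found in the summary text to its state letter.

    Assumed row layout: "SMU <state> <filename>", where the state is one of
    I, U, C, D. In a stack, the same filename listed under several members
    keeps the last state seen.
    """
    states = {}
    for line in summary_text.splitlines():
        match = re.match(r"\s*SMU\s+([IUCD])\s+(\S+)", line)
        if match:
            states[match.group(2)] = match.group(1)
    return states

def is_committed(summary_text, smu_name):
    """True if an SMU whose filename ends with smu_name is in state C."""
    return any(
        name.endswith(smu_name) and state == "C"
        for name, state in smu_states(summary_text).items()
    )

# Hand-written sample, not captured from a real switch.
SAMPLE_SUMMARY = """\
[ Switch 1 ] Installed Package(s) Information:
State (St): I - Inactive, U - Activated & Uncommitted,
            C - Activated & Committed, D - Deactivated & Uncommitted
--------------------------------------------------------------------------------
Type  St   Filename/Version
--------------------------------------------------------------------------------
SMU   C    flash:cat3k_caa-universalk9.16.12.04.CSCvv28324.SPA.smu.bin
IMG   C    16.12.4.0.4480
"""

if __name__ == "__main__":
    smu = "cat3k_caa-universalk9.16.12.04.CSCvv28324.SPA.smu.bin"
    print("committed" if is_committed(SAMPLE_SUMMARY, smu) else "not committed")
```

Matching on the end of the filename avoids depending on whether the switch prints the `flash:` prefix.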

AdamF1 · ‎12-18-2020

Those bug IDs and SMU fixes are for SNMP issues according to the links and descriptions.

Art Astafiev · ‎12-28-2020

It looks like this SMU doesn't fix the bug. We haven't upgraded our switches yet (waiting for 16.12.5), but I received a response on my other post covering the same issue. A person who did upgrade said the bug came back after two weeks. Here is the link to the other post. Now I have doubts that 16.12.5 will fix this issue.

https://community.cisco.com/t5/cisco-bug-discussions/cscvf33653-controller-port-error-power-given-but-state-machine/m-p/4261063#M11715

markayash · ‎10-12-2020

I never had issues until I upgraded either. I am backing out this upgrade to fix that and the SNMP issues.

AdamF1 · ‎10-13-2020

Thanks, Markayash, for your feedback. I hope Cisco admits to these major issues in 16.12.x soon for the 3850 platform.

Leo Laohoo · ‎02-04-2021

CSCvv54912

Art Astafiev · ‎02-05-2021

16.12.5 was released. We will upgrade next week, fingers crossed. I will update with results. In our environment I should see the issue within a few weeks if it was not fully fixed.