Cisco 3850 VTY (SSH) lines hung by Prime - bug

cammaher
Level 1

Hi,

It seems we have run into this bug with our deployment of the 3850 switches. Bug ID is: CSCuv84149

We are currently running IOS XE version 3.03.03 and have had 3 switch stacks hit this bug in the last 2 weeks, and we are now concerned that it will continue to happen.

What experience have other people had with this issue? Did you do an upgrade, or did you find another way to resolve it? Which version did you upgrade to: 3.7.1E or the recommended release 3.6.3E?
We are hesitant to upgrade as it is a major disruption to our clients, and we have lots of switch stacks, which makes it a large parcel of work.

I look forward to hearing what other people have done.


Thanks,
Cam

14 Replies

Reza Sharifi
Hall of Fame

Hi,

Here is what I recommend. Open a ticket with TAC to confirm that 3.7.2 will in fact resolve the issue. If yes, load the image on one of your stacks and run it for a while to make sure the issue has been resolved before you upgrade the rest of your infrastructure. If the issue is not resolved with this version, ask TAC for further help.

HTH

We are running 3.6.3 and we are still having this issue. We run about 400 3850 stacks in the network and see about 6-12 per month where we have to manually power cycle the stack to resolve this.

Does anyone have a concise resolution for this issue? Long story short, I've been through 3-4 IOS upgrades over the past 18 months and experience VTY exhaustion on 3.7.3 from Prime 3.0.x every 10 weeks or so. I've disabled the wireless configuration audit background task in Prime, which hasn't helped. I'm getting the runaround from TAC and can't trust their work, as the root cause has never been found.

I need to move off 3.7.3 as I'm seeing high CPU on core 0. I'm unsure whether downgrading to the current TAC-recommended IOS 3.6.6 is worthwhile or whether to just move to 3.7.4.

Regards, Casey

We just saw it in both 3.6.3 and 3.7.2.  TAC told us to go to 3.6.5 and/or 3.7.5.

Brad Walker
Level 1

This popped up on two 03.06.03E 3850 switch stacks recently. The "show run" command causes hung sessions. PI apparently polls configs often.

Apparently when the VTYs become exhausted, EEM scripts fail.

2015-10-08T20:01:18.306 10.10.10.10 <187>29620: Oct 8 2015 20:01:17.307 : %HA_EM-3-FMPD_CLI_CONNECT: Unable to establish CLI session: 'Embedded Event Manager' detected the 'fatal' condition 'no tty lines available, minimum of 2 required by EEM'
2015-10-08T20:01:18.307 10.10.10.10 <187>29621: Oct 8 2015 20:01:17.307 : %HA_EM-3-FMPD_ERROR: Error executing applet default-acl-recovery statement 1.0

Brad Walker
Level 1

Update:

See CSCuw53025. Workaround: monitor syslog for "Error, ECI has run out of event blocks" and pull affected systems out of NMSs likely to run "show run". This buys time for a remote reload.
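If it helps, here is a minimal EEM sketch of that monitoring idea (the applet name and message text are placeholders, adjust to taste). It only uses a syslog action, which as far as I know does not need a VTY, so it should still fire once the lines are exhausted:

event manager applet ECI-BLOCKS-WATCH
 ! fires when the ECI exhaustion message hits the log
 event syslog pattern "Error, ECI has run out of event blocks"
 ! raise a high-priority syslog so the NMS/on-call team sees it; avoid CLI actions here,
 ! since they need a free VTY and will fail once the lines are hung
 action 1.0 syslog priority critical msg "ECI event blocks exhausted - pull this stack from NMS config polling and plan a reload"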

We have also run into this bug; however, it wasn't Prime causing the problem but APIC-EM chewing up all the VTY lines.

As some of our switches are far away, we found that SNMP was still working and we were able to reboot the switches using snmpset commands (as we have a write community set up).

Here is a good explanation of how to do it:

http://www.ciscozine.com/send-cisco-commands-via-snmp/
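For reference, the equivalent with the standard net-snmp tools looks roughly like this (the IP and community string are placeholders; if I remember right, the OID is tsMsgSend from OLD-CISCO-TS-MIB and a value of 2 requests a reload, and "snmp-server system shutdown" must already be configured on the switch):

snmpset -v2c -c RW-COMMUNITY 10.10.10.10 .1.3.6.1.4.1.9.2.9.9.0 i 2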

The following command may help throttle the number of simultaneous VTYs your policies/applets/scripts pull from the pool:

 

event manager scheduler applet thread class default number 1
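If I read the documentation right, this caps the default EEM scheduler class at a single concurrent applet thread, so policies queue up and run one at a time instead of each grabbing its own VTY.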

 

Brad Walker
Level 1

Update: I've identified the trigger of this condition on our network as PI's "Wireless Configuration Audit" background task which executes on 3850 stacks. All "Error, ECI has run out of event blocks" syslog events over months happened while that task executed. It was our only NMS task scheduled for that timeframe. When the task was rescheduled, the syslog events started appearing during the new timeframe. Disabling the task should prevent triggering the condition, regardless of IOS code versions.

Other NMS tasks that exercise WLC functionality on 3850s (e.g. APIC-EM?) may also trigger this.

Feds
Level 1

Hello, we have this issue too on ver 3.7.0. Another bug ID related to this issue is CSCuv69297. Cisco TAC is telling us to downgrade to 3.6.3. Apparently 3.7.2 is also not affected.

In our case Rancid seems to trigger the bug.

I'm trying to find a way to prevent SSH sessions or VTY lines from hanging by using a timeout or something similar; I'm not sure if it's possible. Do you have any clue? Apparently "exec-timeout" and "timeout login response" in the vty config are not useful.
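For anyone searching later, this is roughly the line config I tried (the values are just examples; "session-timeout" is another idle timer that might be worth checking, though I haven't confirmed it helps here). As noted above, it did not clear the sessions hung by this bug:

line vty 0 15
 exec-timeout 5 0
 session-timeout 10
 timeout login response 30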

Also, the fact that both "disconnect ssh vty .." and "clear tcp tcb ..." fail to clear the hung sessions is annoying and may be another bug in itself.

Physically power cycling the Active member of the stack clears all the lines, which is slightly better than reloading the entire stack. But this is not always possible, since it requires being onsite and disconnecting the stack power cables as well. I'm not sure if reloading only the Active member from the CLI clears the lines. I'll try next time.

Cheers

F

Hello,

I can confirm that soft reloading the Active member (reload slot #) clears the lines.

However, since upgrading all 200+ stacks to 3.7.2 (and some to 3.7.4) a couple of months ago, this issue has disappeared.

Cheers
F.

robertbrink1
Level 1

I've also seen this problem a few times (on 3.6.1 and 3.6.2).

Since I am not able to run out to the different locations and pull the power off and on, we use snmpset to reload the switch. "snmp-server system shutdown" has to be enabled to get this working. You can turn this on via the web GUI CLI if your switch refuses connections on telnet/SSH. :)

"clear line" and "clear tcp tcb" do not work.

Tool used: https://www.snmpsoft.com/cmd-tools/snmp-set/

Doc: http://www.cisco.com/c/en/us/support/docs/ip/simple-network-management-protocol-snmp/26010-faq-snmpios.html#qa4

snmpset.exe -r:IP -c:"RW-COMMUNITY" -o:.1.3.6.1.4.1.9.2.9.9.0 -tp:int -val:2
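On the switch side, the minimum config is roughly this (the community name is a placeholder; the RW community and "snmp-server system shutdown" both need to be present for the set to be accepted):

snmp-server community RW-COMMUNITY RW
snmp-server system shutdown
! optionally lock the RW community down to the management station:
! snmp-server community RW-COMMUNITY RW 10
! access-list 10 permit host <nms-ip>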

Nice one robertbrink1, I'll give it a try next time it happens.

Tausif Gaddi
Level 1
The issue is seen on Catalyst 3850 switches running the 3.7.1 release.

Workaround:
Reloading the switch/stack is a temporary solution and the issue may reoccur.

The issue has not reoccurred after upgrading the C3850 to 3.7.2.
Refer -
https://bst.cloudapps.cisco.com/bugsearch/bug/CSCuv84149/?referring_site=bugquickviewclick