I am having some very strange issues on my 3850 stack. Certain IPs can access all networks locally but cannot access anything on remote networks. These IPs seem to have no relation to each other: 10.2.3.54 has the issue while .49 does not, and .20 has the issue while .21 does not.
Another strange thing is that while the issue is happening, the affected clients cannot ping remote gateways that are on the same network. For instance, a client can access all hosts and the gateway at 192.168.100.1 but cannot access the WAN gateway at .235.
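To make the point about "the same network" concrete, here is a minimal sketch that checks whether two addresses share a subnet. The /24 prefix length is an assumption (the post does not give the mask), and `same_subnet` is a hypothetical helper name. If both addresses are in one subnet, the client should reach the WAN gateway at layer 2 without any routing decision, which is why the failure is so odd.

```python
import ipaddress

def same_subnet(addr_a, addr_b, prefix_len):
    """Return True if addr_b falls inside addr_a's subnet.

    prefix_len is supplied by the caller; the /24 used below is an
    assumption, since the original post does not state the mask.
    """
    network = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in network

# Local gateway and WAN gateway from the post, assuming a /24 mask.
print(same_subnet("192.168.100.1", "192.168.100.235", 24))
```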
So far this has mostly hit clients intermittently, but now we have 2 servers on a different subnet that are affected as well. Their issue is also intermittent, and they show the exact same symptoms.
I ruled out this being a computer issue by changing known good machines to use the same suspect IPs, and they all have the same issue. I feel like this is some sort of PBR gone crazy.
Has anyone heard of these issues with 3850s? It just seems like when the issue happens the packets cannot leave the switch.
Looks like I am having the same kind of issues on version 03.02.02.SE.
Some users can work but others cannot within the same subnet. Did the upgrade fix the issues? I also noticed the following errors in syslog.
Feb 5 20:58:07.023: %NGWC_COMMON-1-WDOG_CPUHOG: 1 fed: CPU usage time exceeded 369330000 msecs. -Traceback=1#f01c936775c649e7aee9def72bf33d1d pthread:2EA89000+C450 :10000000+801478 :10000000+7657C0 :10000000+5A8F70 :10000000+9E4E74 :10000000+9DAC98 :10000000+9DB4A0 :10000000+9DBA68 :10000000+9DC300 :10000000+9F2680
The setup worked perfectly fine for 3 months on 03.02.02.SE.
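For anyone who wants to check their own logs for the same message, here is a rough sketch that scans saved syslog text for WDOG_CPUHOG entries. The pattern is based only on the single message quoted above, so the field layout is an assumption. Reading the leading "1" as a switch/member number and "fed" as the process name is also an assumption, and `find_cpuhogs` is a hypothetical helper name.

```python
import re

# Pattern derived from the one message quoted in this thread. Other
# releases may format the line differently (assumption).
CPUHOG_RE = re.compile(
    r"%NGWC_COMMON-1-WDOG_CPUHOG: (?P<unit>\d+) (?P<process>\S+):\s+"
    r"CPU usage time exceeded\s+(?P<msecs>\d+) msecs"
)

def find_cpuhogs(log_text):
    """Return a list of (unit, process, msecs) tuples, one per CPUHOG message."""
    # Collapse all whitespace so messages wrapped at a word boundary
    # by the terminal still match.
    flat = " ".join(log_text.split())
    return [
        (int(m.group("unit")), m.group("process"), int(m.group("msecs")))
        for m in CPUHOG_RE.finditer(flat)
    ]

sample = (
    "Feb 5 20:58:07.023: %NGWC_COMMON-1-WDOG_CPUHOG: 1 fed: CPU usage time exceeded\n"
    "369330000 msecs. -Traceback=1#f01c936775c649e7aee9def72bf33d1d"
)
print(find_cpuhogs(sample))
```

A hit only tells you the message is present in the log. Whether it is related to the forwarding problem is something TAC would have to confirm.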
Sorry I did not follow up with the resolution. I ended up calling Cisco TAC on this issue. I did not get any responses on the forums for a couple of weeks, but I do see someone essentially said the same thing TAC said, which is that the 3.02.0 code was very buggy. TAC said they had seen this type of strangeness but that it was not documented as a bug...yet.
What happened to us was that we deployed the 3850s and, like you, had them up and running with no issues except for some PBR problems. A few months later I had users that could not seem to get out of the switch. They could get to anything that was on that switch but could not get to neighboring switches on the same VLAN or exit via the router to external sites. It was very weird. It was also just 1 server one day, then another server another day. Then the user stack started having the same issues, beginning with 1 user and spreading from there.
The upgrade fixed all issues. There is a new version out since I did our upgrades, so you may want to research it. The new version is 3.03.01.
First of all, thank you for responding so fast. We are going ahead with the upgrade to 3.3.1 in the hope that this will sort out the issues. I am having a hard time convincing myself that this is a switch code level issue.
The only explanation I have is that the 3850s are not responding to ARP queries properly, resulting in such strange behaviour. I will come back and post my story as soon as the upgrade is done.
Hello Jason and Dinesh,
The symptoms that you are seeing are due to bug CSCug87540, which is a major bug in 3.2.2 and below. I highly recommend getting up to the latest release, which is 3.3.1 at the moment (3.3.2 should be out this month). The fix for 87540 is integrated into 3.2.3, but again, I would recommend moving to the latest stable release. The only workaround for your issue is a reload once it's in the broken state. Also, just a reminder: 3.3.0 is a major release (feature release) which includes new features such as HSRP, 9-switch stacking, and embedded packet capture, to name a few.
If you have any questions or concerns, please feel free to post them here or message me. Thanks
3850: traffic L3 routed on 1 switch/member fails for newly added devices
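As a quick way to triage a list of switches against the release information given above, here is a small sketch. It encodes only what this reply states (3.2.2 and below affected, fix integrated in 3.2.3). The function names are hypothetical, and it is not a substitute for checking the bug details with Cisco.

```python
import re

# From the reply above: CSCug87540 affects 3.2.2 and below, and the fix
# is integrated into 3.2.3.
FIRST_FIXED = (3, 2, 3)

def parse_release(text):
    """Turn strings such as '03.02.02.SE' or '3.3.1' into a tuple of ints."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised release string: {text!r}")
    return tuple(int(part) for part in match.groups())

def is_affected(text):
    """True if the release is older than the first release carrying the fix."""
    return parse_release(text) < FIRST_FIXED

for release in ("03.02.02.SE", "3.2.3", "03.03.01SE"):
    print(release, "affected" if is_affected(release) else "has the fix")
```

Tuple comparison handles the leading zeros and the trailing "SE" suffix, since only the three numeric fields are kept.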
Thanks for the response and clarification on the bug details.
I did the upgrade a few days back and that fixed all the issues I was facing. I upgraded to version 3.3.1.
A couple of questions on the roadmap for the 3850, though.
1. Is there a roadmap for ISSU for the 3850? I have a completely redundant network where all my servers are dual-homed to my 3850 stack. Is there any way I can avoid a full stack reload and minimize the downtime during an upgrade?
2. Is there a roadmap for Flexible NetFlow on VLAN interfaces for both input and output traffic?
Richard, this thread sounds exactly like the issue I've been working on with a customer for weeks. However, their 3850 stacks are running 03.03.04.
I am running 03.03.01SE and have not had any issues since the upgrade. The uptime on these devices is:
uptime is 1 year, 5 weeks, 5 days, 20 hours
The issue, however, was one of the weirdest I have encountered in my career. Many devices within the same VLAN work while others don't get an ARP response from the 3850s. I am not sure if you see the same symptoms in Wireshark on the newer code. It's very unlikely for the same bug to be resolved and then reappear in a later release.