07-29-2018 04:51 PM - edited 03-08-2019 03:46 PM
Hi all,
We've spent days troubleshooting a very odd problem on our network and we're completely out of ideas, which has brought me here for more suggestions. Our network consists of a core stack of Cisco 3850s and an edge that's mostly 2960s of some variety; there are also Cisco 5520 wireless controllers in the mix. The software versions at the edge vary, but the 3850s run Denali v16.3.5b.
The problem is simply that some IP addresses intermittently don't work. The address either isn't able to get out of its own subnet, or in some cases can communicate with some hosts outside its subnet and not others. Usually, the IP address is in this useless state for the larger part of a day and then starts working again.
We've spent ages troubleshooting this, just trying to narrow it down to something specific. Here's what we've eliminated: It happens regardless of DHCP or a static IP. It's independent of OS (affects Windows and Linux). It's independent of switch (if you move a problematic host to a different switch, say from the edge to the core, the problem follows it). It's not due to an ACL on the 3850s; there are a number of ACLs, but the affected IPs aren't listed in the config at all. The IPs aren't contiguous and aren't limited to one VLAN/subnet; it affects at least two subnets. It's not due to duplicate IPs. It affects wired and wireless clients. The logging on the switches shows nothing about these hosts, either by mention of IP or MAC. I don't think it's a problem with the ARP cache, or at least clearing the ARP cache on the switches doesn't seem to help.
The only thing we've found so far that does help is to shut/no shut the gateway interface on the 3850s for the affected VLAN. The affected IP works straight away after this, but new bad IPs show up soon after.
I hope I've given enough information. I can post switch configs, but haven't yet, as I'd have to review them first for security reasons.
Our network isn't hugely complicated: just the physical topology I described, with very basic configs on the edge switches. All routing and ACLs are handled by the 3850 stack.
Any help would be greatly appreciated. We're so stumped we don't even know how to troubleshoot it any further. Please let me know if you need any further info. I've tried to avoid turning this post into a novel, so I may have missed something.
Thanks,
Hal.
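One way to double-check the "no pattern" observation from the post above is to bucket the affected addresses by subnet and look at how they spread. The following is only a minimal sketch: the function name group_by_subnet is hypothetical, the /24 prefix length is an assumption (the thread never states the masks), and the sample addresses in the usage note are illustrative rather than taken from the real network.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(addresses, prefix_len=24):
    """Group IPv4 address strings by the subnet they fall into.

    prefix_len is an ASSUMPTION (the thread does not state subnet masks);
    pass the real prefix length for your VLANs.
    """
    groups = defaultdict(list)
    for addr in addresses:
        # strict=False lets us derive the network from a host address.
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[str(net)].append(addr)
    # Sort numerically within each subnet so gaps and clusters are visible.
    return {
        net: sorted(ips, key=ipaddress.ip_address)
        for net, ips in groups.items()
    }
```

For example, calling group_by_subnet(["10.3.0.100", "10.4.0.17", "10.3.0.12"]) with those made-up addresses returns two buckets, one per /24, which makes it easy to see whether bad IPs cluster in a range or are scattered.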
07-29-2018 07:05 PM
07-29-2018 07:32 PM
07-29-2018 10:35 PM
Thanks for the pointer on ARP debugging. We had a bad IP or two to test today, so we enabled it and, if I understand the logging correctly, the ARP requests are being replied to:
8232723: Jul 30 15:19:04.527 AEST: IP ARP: rcvd req src 10.3.0.100 1803.73c9.bec0, dst 10.3.0.254 Vlan3
8232724: Jul 30 15:19:04.528 AEST: IP ARP: sent rep src 10.3.0.254 bc67.1c1b.fee7, dst 10.3.0.100 1803.73c9.bec0 Vlan3
10.3.0.100 is an example IP with our problem.
Am I understanding the debugging correctly?
Thanks.
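When there is a lot of ARP debug output, pairing requests with replies by eye gets tedious. The sketch below counts, per host, how many requests the switch received and how many replies it sent back, and lists hosts with more requests than replies. It is not a Cisco tool: the regular expression is written only against the two debug lines quoted above, and the function names summarize_arp and unanswered are hypothetical.

```python
import re
from collections import defaultdict

# Written against lines shaped like the quoted debug output, e.g.
# "IP ARP: rcvd req src 10.3.0.100 1803.73c9.bec0, dst 10.3.0.254 Vlan3".
# Other IOS versions may format these lines differently (assumption).
ARP_RE = re.compile(
    r"IP ARP: (?P<dir>rcvd|sent) (?P<kind>req|rep) "
    r"src (?P<src_ip>\d+\.\d+\.\d+\.\d+) (?P<src_mac>[0-9a-f.]+), "
    r"dst (?P<dst_ip>\d+\.\d+\.\d+\.\d+)"
)


def summarize_arp(log_text):
    """Count ARP requests received from, and replies sent to, each host IP."""
    counts = defaultdict(lambda: {"requests": 0, "replies": 0})
    for m in ARP_RE.finditer(log_text):
        if m["dir"] == "rcvd" and m["kind"] == "req":
            # Request arriving from a host: key on the requester's IP.
            counts[m["src_ip"]]["requests"] += 1
        elif m["dir"] == "sent" and m["kind"] == "rep":
            # Reply leaving the switch: key on the host it is addressed to.
            counts[m["dst_ip"]]["replies"] += 1
    return dict(counts)


def unanswered(counts):
    """Return host IPs that sent more requests than they got replies."""
    return sorted(ip for ip, c in counts.items() if c["requests"] > c["replies"])


SAMPLE = (
    "8232723: Jul 30 15:19:04.527 AEST: IP ARP: rcvd req src 10.3.0.100 "
    "1803.73c9.bec0, dst 10.3.0.254 Vlan3\n"
    "8232724: Jul 30 15:19:04.528 AEST: IP ARP: sent rep src 10.3.0.254 "
    "bc67.1c1b.fee7, dst 10.3.0.100 1803.73c9.bec0 Vlan3\n"
)
```

Run against SAMPLE, summarize_arp reports one request and one reply for 10.3.0.100 and unanswered returns an empty list, which matches the reading that the gateway is answering ARP for that host.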
07-30-2018 12:14 PM
07-29-2018 07:14 PM
Hi,
Does this problem occur with specific hosts only, or is it dynamic?
Do you have hubs or switches connected to the edge switches?
Do you have one 3850 or two?
Is the 3850 the root bridge for all your VLANs?
Can you share an access port config?
07-29-2018 07:39 PM
Luciano,
Thanks for the quick reply. In answer to your questions:
No, it's dynamic; there is no pattern to the hosts it affects.
There are no other switches or hubs connected to the edge switches (remember this affects wireless clients too).
It's a stack of 3850s: 3 units sharing one config.
Yes, the 3850s are the root for the VLANs.
An access port config looks like:
interface GigabitEthernet1/0/10
 switchport access vlan 4
 switchport voice vlan 20
 srr-queue bandwidth share 10 10 60 20
 queue-set 2
 priority-queue out
 mls qos trust device cisco-phone
 mls qos trust cos
 auto qos voip cisco-phone
 spanning-tree portfast
 service-policy input AutoQoS-Police-CiscoPhone
Many thanks,
Hal.
07-30-2018 12:03 AM
A few things that I can think of:
If it is happening frequently, I suggest running a Wireshark capture so you can see the packets going in and out both before and during the issue.
Also, check port status and security.
07-30-2018 12:03 PM - edited 07-30-2018 12:05 PM
Hi,
"The only thing we've found so far that does help is to shut/no shut the gateway interface on the 3850's for the affected VLAN. The affected IP works straight away after this, but new bad IP's show up soon after."
(Assuming there are no bugs in the code that may cause this issue).
> When the issue occurs, are you able to ping the affected hosts from other hosts within the same VLAN?
> Can you post the VLAN interface configs from the core switch for a couple of the VLANs that show this issue?
Thanks,
MS
07-10-2019 07:00 AM
Has a solution been identified for this issue? It seems we are seeing the same behaviour. We are running code 16.3.7.