Re: Bizarre IP Address Failures

speculator · ‎07-29-2018

Hi all,

We've spent days troubleshooting a very odd problem on our network and we're completely out of ideas, which has brought me here for some more suggestions. Our network consists of a core stack of Cisco 3850's and an edge that's mostly 2960's of some variety; there are also Cisco 5520 wireless controllers in the mix. The software versions at the edge vary, but the 3850's run Denali 16.3.5b.

The problem is simply that some IP addresses intermittently don't work. The address either isn't able to get out of its own subnet, or in some cases can communicate with some hosts outside its subnet and not others. Usually, the IP address is in this useless state for the larger part of a day and then starts working again.

We've spent ages troubleshooting this, just trying to narrow it down to something specific. Here's what we've eliminated: it happens regardless of DHCP or a static IP; it's independent of OS (affects Windows and Linux); and it's independent of switch (if you move a problematic host to a different switch, say from the edge to the core, the problem follows it). It's not due to an ACL on the 3850's: there are a number of ACL's, but the affected IP's aren't listed in the config at all. The IP's aren't contiguous and aren't limited to one VLAN/subnet; it affects at least two subnets. It's not due to duplicate IP's. It affects wired and wireless clients. The logging on the switches shows nothing about these hosts, either by IP or by MAC. I don't think it's a problem with the ARP cache, or at least clearing the ARP cache on the switches doesn't seem to help.
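
One caveat on the ACL check: an ACL entry can match a host through a wildcard mask without the address ever appearing literally in the config, so a text search for the IP isn't conclusive on its own. A minimal sketch of that test in Python, assuming entries written as address plus wildcard mask; the entries and the sample address below are hypothetical placeholders, not values from the real config:

```python
import ipaddress

def ace_matches(host: str, ace_addr: str, wildcard: str) -> bool:
    """Return True if host falls inside an ACL entry given as address + wildcard mask."""
    h = int(ipaddress.IPv4Address(host))
    a = int(ipaddress.IPv4Address(ace_addr))
    w = int(ipaddress.IPv4Address(wildcard))
    care = ~w & 0xFFFFFFFF  # bits set in the wildcard are "don't care"
    return (h & care) == (a & care)

# Hypothetical (address, wildcard) pairs for illustration only.
entries = [("10.3.0.96", "0.0.0.7"), ("10.4.0.0", "0.0.0.255")]
affected = "10.3.0.100"  # illustrative address

hits = [e for e in entries if ace_matches(affected, *e)]
print(hits)
```

Any entry returned in `hits` covers the address even though the address itself is never spelled out in the ACL.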

The only thing we've found so far that does help is to shut/no shut the gateway interface on the 3850's for the affected VLAN. The affected IP works straight away after this, but new bad IP's show up soon after.

I hope I've given enough information. I can post switch configs but haven't yet, as I'd have to review them first for security reasons.

Our network isn't hugely complicated: just the physical topology I described, with very basic configs on the edge switches. Any routing and ACL's are handled by the 3850 stack.

Any help would be greatly appreciated. We're so stumped we don't even know how to troubleshoot it any further. Please let me know if you need any further info; I've tried to avoid turning this post into a novel, so I may have missed something.

Thanks,

Hal.

Francesco Molino · ‎07-29-2018

Hi

Can you please try upgrading the switch to 16.3.6? I'm thinking of a bug where ARP requests are not answered.

When the issue occurs again, can you run a debug arp to see if you're facing the same issue?

Thanks
Francesco
PS: Please don't forget to rate and select as validated answer if this answered your question

speculator · ‎07-29-2018

Hi Francesco, do you have a reference for that bug? I haven't seen anything on it.

speculator · ‎07-29-2018

Thanks for the pointer on ARP debugging. We had a bad IP or two to test today, so we enabled it, and if I understand the logging correctly, the ARP requests are being replied to:

8232723: Jul 30 15:19:04.527 AEST: IP ARP: rcvd req src 10.3.0.100 1803.73c9.bec0, dst 10.3.0.254 Vlan3
8232724: Jul 30 15:19:04.528 AEST: IP ARP: sent rep src 10.3.0.254 bc67.1c1b.fee7,
                 dst 10.3.0.100 1803.73c9.bec0 Vlan3

10.3.0.100 is an example IP with our problem.
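
For a longer capture, checking request/reply pairs by eye gets tedious. Here is a minimal Python sketch that lists ARP requests with no matching reply, assuming the debug output always has the same shape as the two entries above (including the wrapped continuation line); the function names are just illustrative:

```python
import re

REQ = re.compile(r"IP ARP: rcvd req src (\S+) (\S+), dst (\S+) (\S+)")
REP = re.compile(r"IP ARP: sent rep src (\S+) (\S+),\s+dst (\S+) (\S+) (\S+)")

def join_wrapped(lines):
    """Merge continuation lines (those starting with whitespace) into the previous entry."""
    entries = []
    for line in lines:
        if line[:1].isspace() and entries:
            entries[-1] += " " + line.strip()
        elif line.strip():
            entries.append(line.rstrip())
    return entries

def unanswered_requests(log_text):
    """Return (client_ip, gateway_ip) pairs for requests with no later reply in the log."""
    pending = []
    for entry in join_wrapped(log_text.splitlines()):
        m = REQ.search(entry)
        if m:
            pending.append((m.group(1), m.group(3)))
            continue
        m = REP.search(entry)
        if m:
            # A reply from the gateway (src) to the client (dst) answers that request.
            key = (m.group(3), m.group(1))
            if key in pending:
                pending.remove(key)
    return pending

sample = """\
8232723: Jul 30 15:19:04.527 AEST: IP ARP: rcvd req src 10.3.0.100 1803.73c9.bec0, dst 10.3.0.254 Vlan3
8232724: Jul 30 15:19:04.528 AEST: IP ARP: sent rep src 10.3.0.254 bc67.1c1b.fee7,
                 dst 10.3.0.100 1803.73c9.bec0 Vlan3
"""
print(unanswered_requests(sample))
```

An empty list means every request in the pasted log was answered, which is what the two lines above show for 10.3.0.100.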

Am I understanding the debugging correctly?

Thanks.

Francesco Molino · ‎07-30-2018

Yes, your understanding is correct.
Here is the bug I was thinking of: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvg37755

Also, can you share your design, please? Just to make sure: you said that you're facing this problem even if the host is connected directly to the core switch?

Thanks
Francesco

luciano_dj · ‎07-29-2018

Hi,

Does this problem occur with specific hosts only, or is it dynamic?

Do you have hubs or switches connected to the edge switches?

Do you have one 3850 or two?

Is the 3850 the root bridge for all your VLANs?

Can you share an access port config?

speculator · ‎07-29-2018

Luciano,

Thanks for the quick reply. In answer to your questions:

No, it's dynamic, there is no pattern to the hosts it affects.

There are no other switches/hubs connected to the edge switches (remember this affects wireless clients too).

It's a stack of 3850's - 3 units sharing one config.

Yes, the 3850's are the root of the VLAN's.

An access port config looks like:

interface GigabitEthernet1/0/10
 switchport access vlan 4
 switchport voice vlan 20
 srr-queue bandwidth share 10 10 60 20
 queue-set 2
 priority-queue out
 mls qos trust device cisco-phone
 mls qos trust cos
 auto qos voip cisco-phone
 spanning-tree portfast
 service-policy input AutoQoS-Police-CiscoPhone

Many thanks,

Hal.

Vince · ‎07-30-2018

A few things that I can think of:

If it is happening frequently, I suggest running a Wireshark capture so you can see the packets coming in and going out both before and during the issue.

Also, check the port status and port security.

mvsheik123 · ‎07-30-2018

Hi,

"The only thing we've found so far that does help is to shut/no shut the gateway interface on the 3850's for the affected VLAN. The affected IP works straight away after this, but new bad IP's show up soon after."

(Assuming there are no bugs in the code that may cause this issue).

> When the issue occurs, are you able to ping the affected hosts from other hosts within the same VLAN?

> Can you post the VLAN interface configs from the core switch for a couple of the VLANs that show this issue?

Thanks,

MS

rojesara.prashant · ‎07-10-2019

Has any solution been identified for this issue? It seems we are seeing the same behaviour. We are running code 16.3.7.