brnhornt · ‎05-25-2016

Hi Folks,

Interesting issue I am troubleshooting. Back story:

Layer 3 Core switch with single fiber trunk ports heading out to multiple Layer 2 access/idf switches.

Hosts connected to various access/IDF switches are complaining about losing connectivity for 10-20 seconds at a time.

Pinging the end user's IP address is successful from all of the access switches; however, pings fail from the core switch. After some seconds, pings from the core return to normal as well. So an access switch on a different floor, which has to traverse the core switch, can ping the address just fine during times that the core cannot.

No logs or issues about Spanning Tree. During an "outage" the spanning tree on the core remains the same as during successful pings.

No issues with MAC addresses or ARP. During an "outage" the endpoint's MAC address and ARP entry are still correct in the core switch.

Any ideas on what to check next? This problem appears to be impacting only a handful of users in a site of 200-some.

Iulian Vaideanu · ‎05-25-2016

Do all those handful of users suffer from the issue simultaneously? Are they connected to the same switch, or in the same vlan? When the issue occurs, can they ping one another?

Is the addressing scheme something like "affected host IP A1.B1.C1.D1 on vlan V1 with gateway A1.B1.C1.1 on the core" and "access switch IP A2.B2.C2.D2 on (management) vlan V2 with gateway A2.B2.C2.1 also on the core"?

brnhornt · ‎05-25-2016

Good question that I should have addressed first. No, they do not suffer the outage at the same time. Users are connected to various IDF switches (same VLAN), but I am focused on two at the moment. When the issue occurs they appear to be able to ping all addresses in their VLAN.

Address scheme is simple:

Core vlan1 - 172.16.204.1 255.255.252.0

Host 1 - 172.16.207.147 255.255.252.0 gateway 172.16.204.1
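For anyone double-checking the addressing, here is a quick sketch using Python's standard `ipaddress` module to confirm that the host and the core SVI fall in the same /22. The addresses and mask are the ones posted above; the variable names are only illustrative.

```python
import ipaddress

# Addresses and mask as posted; 255.255.252.0 is a /22.
core = ipaddress.ip_interface("172.16.204.1/255.255.252.0")
host = ipaddress.ip_interface("172.16.207.147/255.255.252.0")

# If the host sits inside the core interface's network, traffic between
# them is on-link and depends on ARP resolution, not on routing.
same_subnet = host.ip in core.network

print(core.network, same_subnet)  # 172.16.204.0/22 True
```

Since both are on-link in 172.16.204.0/22, the mask itself is not the problem here.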

vantipov · ‎05-25-2016

You probably already ruled this out, but just in case, I would look at the fiber link interface statistics on both ends (the core and the access layer switch). You might be dealing with unidirectional drops, or with drops that only affect packets larger than a certain size.

brnhornt · ‎05-25-2016

uplink interfaces look good/clear on both ends

paul driver · ‎05-25-2016

Hello

Are these affected users on different VLANs?

When you ping from the core, are you sourcing from different L3 interfaces, and if so, does it time out on every interface or just a particular one?

Are you using any static addressing?

What's the CPU/memory utilization of the core?

Are you pruning the trunks?

What type of core is it: stacked, VSS, etc.?

Are you running any first hop redundancy protocol (HSRP, VRRP, GLBP)?

Could you please post a running config of the core and an affected closet switch?

res

paul

Please rate and mark as an accepted solution if you have found any of the information provided useful.
This then could assist others on these forums to find a valuable answer and broadens the community’s global network.

Kind Regards
Paul

brnhornt · ‎05-25-2016

No...same vlan.

Yes...core cannot ping no matter what vlan (only 4) it is sourced from

Not on this vlan

11% cpu 363408580 free memory

No trunk pruning

Two stacked 3750X's

No routing protocols

paul driver · ‎05-25-2016

Hello

So between hosts in the same VLAN there is no L3 routing involved.

All stats look clean as well, you say.

I would start looking at the hosts: software firewall, AV scanning, infection, etc.

Have you tried disabling all non-MS services on a host to see whether that has an effect?

res

paul

brnhornt · ‎05-25-2016

More digging via Wireshark shows that the host is receiving the ICMP requests from the core but not replying. However, during the "outage" the host and many other devices are issuing ARP broadcasts trying to resolve the default gateway (the IP of vlan 1 on the core). After several (seemingly random) seconds (up to 30 or so), I see the core reply to the ARP request, and then pings and traffic flow normally.
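To put numbers on how long the gateway takes to answer, something like the rough sketch below could be run over ARP events exported from the capture. The function name, the `(time, kind)` event format, and the sample values are all made up for illustration; they are not taken from the actual capture.

```python
def arp_resolution_delays(events):
    """Return the delay between the first unanswered ARP request for the
    gateway and the next reply, for each request/reply cycle.

    events: (time_seconds, kind) tuples sorted by time, where kind is
    "request" or "reply". This format is a hypothetical export layout.
    """
    delays = []
    first_request = None
    for t, kind in events:
        if kind == "request" and first_request is None:
            first_request = t  # start of an unanswered run
        elif kind == "reply" and first_request is not None:
            delays.append(t - first_request)
            first_request = None
    return delays


# Illustrative values only: one prompt answer, then a long stall.
sample = [
    (0.0, "request"), (0.5, "reply"),
    (10.0, "request"), (11.0, "request"), (12.0, "request"), (38.0, "reply"),
]
delays = arp_resolution_delays(sample)
print(delays)  # [0.5, 28.0]
```

Retries during a stall are folded into one cycle, so each delay is measured from the first request that went unanswered.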

Iulian Vaideanu · ‎05-26-2016

Is any broadcast storm control configured on the switches?

brnhornt · ‎05-31-2016

After finding the ARP issues via Wireshark we decided to stop burning time troubleshooting. A reboot of the switch resolved the issue.

Access switches can ping endpoint, core cannot.