
Catalyst 3850 - lost ARP Queries

rtrsaravanan
Level 1

Hi,

We are seeing this issue on one of our customers' networks. Occasional ARP queries from servers go unanswered by the 3850 access switches they connect to, which causes the server interfaces to fail over.

- Servers are using ARP queries every 500ms for link monitoring. 

- The servers are connected to 3850 switches with an IP Base license, which were running 3.6.8E until now. We just upgraded them to 16.3.7 to see if that fixes the issue.

- We see this occurring at random times, no specific trigger we could point at. 

- All interfaces are 1 GE copper, and all the attached servers carry control traffic only, so there is no way the traffic exceeds line rate.

- Sometimes it just switches over, sometimes it flaps back and forth a few times.

Has anyone else seen this in their network?

Detailed information below.

 

The network originally had a pair of standalone 3750s connecting to about 10 rack-mount servers.

Each server connects one uplink to each of the two switches, with its interfaces configured as active-standby. On the switch side, these are normal access ports towards the servers: no access lists, no QoS, no security features.

 

When we tried to replace the 3750s with 3850s, we started seeing HSRP and MAC flaps.

 

We fixed the HSRP flaps by changing the timers. They were originally configured as 1 and 3 seconds. We changed them to 3 and 10 (the default) after reading about the known defect on the 3850 platform with 1 s and 3 s HSRP timers.

 

With the MAC flaps, we found that the cause was the servers' active-standby interfaces flapping. Some background might help here. The 2 x 3850s connect 1) to about 7 servers downstream, 2) to each other, and 3) to the uplink for the L3 gateway. Some VLANs are used purely locally between the servers, so they aren't configured on the uplink towards the routers. However, for these local VLANs, SVIs are configured on the access 3850 switches in question. The servers then ARP for these local VLAN SVIs to determine the status of the link connecting to the switches. ARP queries are sent every 500 ms. Sometimes a server doesn't see a response to three consecutive queries and switches over to its standby interface. We don't see any logs on the switches.
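To make the failure condition concrete, here is a minimal Python sketch of the link-monitoring logic as I understand it: one ARP probe every 500 ms, and a switchover after three consecutive unanswered probes. The function name and the reset-after-switchover behaviour are my own assumptions for illustration, not the servers' actual implementation.

```python
from typing import Iterable, List

PROBE_INTERVAL_S = 0.5  # from the description: one ARP query every 500 ms
MISS_THRESHOLD = 3      # from the description: three consecutive missed replies


def failover_times(replies: Iterable[bool],
                   interval_s: float = PROBE_INTERVAL_S,
                   threshold: int = MISS_THRESHOLD) -> List[float]:
    """Return the send times (in seconds) of the probes that trip a switchover.

    replies[i] is True if probe i (sent at i * interval_s) was answered.
    """
    events = []
    misses = 0
    for i, answered in enumerate(replies):
        if answered:
            misses = 0          # any reply clears the miss counter
            continue
        misses += 1
        if misses == threshold:
            events.append(i * interval_s)
            misses = 0          # assumption: counting restarts after a switchover
    return events


# Two isolated misses are tolerated; three in a row trigger a switchover.
print(failover_times([True, False, False, True, False, False, False, True]))
```

With these numbers, the switch only has to miss replies for roughly 1 to 1.5 seconds before a server gives up on the link, which is why even short gaps in ARP handling show up as interface flaps.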

 

We can't explain why the queries are being missed. What makes it more difficult to troubleshoot is that we don't see this problem when we plug the old 3750s back into the network with exactly the same configuration.

 

We don't see any drops in the CoPP policer statistics:

sc002-in-sw03#$rm hardware fed switch 1 qos que stat internal cpu policer
                                                  (default)  (set)
QId  PlcIdx  Queue Name                  Enabled  Rate       Rate   Drop
------------------------------------------------------------------------
0    11      DOT1X Auth                  No       1000       1000   0
1    1       L2 Control                  No       500        500    0
2    14      Forus traffic               No       1000       1000   0
3    0       ICMP GEN                    Yes      200        200    0
4    2       Routing Control             Yes      1800       1800   0
5    14      Forus Address resolution    No       1000       1000   0
6    3       Punt Copy to ICMP Redirect  No       500        500    0
7    6       WLESS PRI-5                 No       1000       1000   0
8    4       WLESS PRI-1                 No       1000       1000   0
9    5       WLESS PRI-2                 No       1000       1000   0
10   6       WLESS PRI-3                 No       1000       1000   0
11   6       WLESS PRI-4                 No       1000       1000   0
12   0       BROADCAST                   Yes      200        200    0
13   16      Learning cache ovfl         No       100        1000   0
14   13      Sw forwarding               Yes      1000       1000   0
15   8       Topology Control            No       13000      13000  0
16   12      Proto Snooping              No       500        500    0
17   16      DHCP Snooping               No       1000       1000   0
18   9       Transit Traffic             Yes      500        500    0
19   10      RPF Failed                  Yes      100        100    0
20   15      MCAST END STATION           Yes      2000       2000   0
21   13      LOGGING                     Yes      1000       1000   0
22   7       Punt Webauth                No       1000       1000   0
23   10      Crypto Control              Yes      100        100    0
24   10      Exception                   Yes      100        100    0
25   3       General Punt                No       500        500    0
26   10      NFL SAMPLED DATA            Yes      100        100    0
27   2       Low Latency                 Yes      1800       1800   0
28   10      EGR Exception               Yes      100        100    0
29   16      Stackwise Virtual Ctrl      No       1000       1000   0
30   9       MCAST Data                  Yes      500        500    0
31   10      Gold Pkt                    Yes      100        100    0
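As a rough sanity check, the ARP probe load can be compared against the "Forus Address resolution" rate shown above. This is a back-of-the-envelope sketch in Python, assuming the Rate columns are packets per second; the `vlans_per_server` parameter is hypothetical, since the number of monitored VLANs per server is not stated here, and other ARP traffic reaching the CPU is not counted.

```python
def arp_probe_pps(servers: int, probe_interval_s: float,
                  vlans_per_server: int = 1) -> float:
    """Aggregate ARP probe rate (packets per second) seen by one switch."""
    return servers * vlans_per_server / probe_interval_s


POLICER_PPS = 1000  # "Forus Address resolution" rate from the output above

# About 10 servers, each probing every 500 ms (one VLAN each is an assumption).
load = arp_probe_pps(servers=10, probe_interval_s=0.5)
print(f"probe load: {load:.0f} pps, {load / POLICER_PPS:.1%} of the policer rate")
```

Under these assumptions the probes alone are a small fraction of the configured rate, which is consistent with the zero Drop column in this output.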

 

 

However, in the output below, I see the drop counters for queues 4 and 5 increasing continuously.

sc002-in-sw03#$ch active qos queue stats internal port_type egress-cpu asic 1

.

.

.

CPU Port:0 Drop Counters
-------------------------------
Queue  Drop-TH0     Drop-TH1     Drop-TH2     SBufDrop     QebDrop
-----  -----------  -----------  -----------  -----------  -----------
0      0            0            0            0            0
1      0            0            0            0            0
2      0            0            0            0            0
3      0            0            0            0            0
4      0            0            26511251     0            0
5      0            0            258438928    0            0
6      0            0            0            0            0
7      0            0            0            0            0
8      0            0            0            0            0
9      0            0            0            0            0

.

.

.

Can anyone help me understand how to interpret these outputs?
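Since these counters are cumulative, one way to see how fast they grow is to capture the drop-counter table twice and diff the two snapshots. Below is a minimal Python sketch of that idea; the parsing assumes rows of six whitespace-separated integers as in the table above, and the "after" values in the example are made up purely for illustration.

```python
from typing import Dict, List


def parse_drop_counters(text: str) -> Dict[int, List[int]]:
    """Map queue id -> [Drop-TH0, Drop-TH1, Drop-TH2, SBufDrop, QebDrop]."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows are a queue id plus five counters, all plain integers;
        # header and separator lines fail the isdigit() check and are skipped.
        if len(fields) == 6 and all(f.isdigit() for f in fields):
            rows[int(fields[0])] = [int(f) for f in fields[1:]]
    return rows


def drop_deltas(before: Dict[int, List[int]],
                after: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Return per-queue counter increases, keeping only queues that changed."""
    deltas = {}
    for queue, new in after.items():
        old = before.get(queue, [0] * len(new))
        diff = [n - o for n, o in zip(new, old)]
        if any(diff):
            deltas[queue] = diff
    return deltas


first = """Queue Drop-TH0 Drop-TH1 Drop-TH2 SBufDrop QebDrop
4 0 0 26511251 0 0
5 0 0 258438928 0 0
6 0 0 0 0 0"""
# Hypothetical second snapshot, values invented for the example.
second = """Queue Drop-TH0 Drop-TH1 Drop-TH2 SBufDrop QebDrop
4 0 0 26511300 0 0
5 0 0 258439000 0 0
6 0 0 0 0 0"""

print(drop_deltas(parse_drop_counters(first), parse_drop_counters(second)))
```

Dividing each delta by the time between the two captures gives a drop rate per queue, which could then be lined up against the times the servers report missed ARP replies.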

 

Update - 12/12

Attaching a high-level topology diagram (it was like this when we got there, not designed this way by choice).

The boxes in red are the ones this problem description is about. There are more servers attached than shown in the figure.

Capture.JPG

 

 Thanks !


Hello


@rtrsaravanan wrote:

With the MAC flaps, we found that the cause was the servers' active-standby interfaces flapping. Some background might help here. The 2 x 3850s connect 1) to about 7 servers downstream, 2) to each other, and 3) to the uplink for the L3 gateway. Some VLANs are used purely locally between the servers, so they aren't configured on the uplink towards the routers. However, for these local VLANs, SVIs are configured on the access 3850 switches in question. The servers then ARP for these local VLAN SVIs to determine the status of the link connecting to the switches. ARP queries are sent every 500 ms. Sometimes a server doesn't see a response to three consecutive queries and switches over to its standby interface. We don't see any logs on the switches.


Can you post a topology of this, please?


Please rate and mark as an accepted solution if you have found any of the information provided useful.
This then could assist others on these forums to find a valuable answer and broadens the community’s global network.

Kind Regards
Paul

Hi Paul,
I just updated the original problem description to include a high-level diagram as well. The problem switches, sw-01 & 02, are the 3850s in question. They have SVIs configured for the local VLANs. For everything else, L3 is just above the core switch stack.