I have several Nexus 3000-series switches deployed (a combination of 3048s and 3064s), all of them running either NXOS 9.2(1) or 9.2(3).
I recently noticed a strange issue where clients connected to the switch experience about 0.75 seconds of outbound packet loss once every 30 minutes (exactly 1800 seconds). Has anyone else experienced an issue like this before?
To verify my theory and get precise measurements, I wrote a simple client/server script that sends a UDP packet every 25 ms. Each packet contains a unique, incrementing counter as well as a timestamp and some random data to increase its size. From this, I could see that there is a period of about 0.75 seconds of packet loss once every 30 minutes.
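For anyone who wants to reproduce the measurement, here is a minimal Python sketch of such a probe. It is not the original script: the wire format, the padding size, and all of the function and class names are assumptions made for illustration, and "TS Difference" is assumed to be the difference between the send timestamps of the two packets on either side of the gap.

```python
import os
import socket
import struct
import time

# Assumed wire format (not from the original script): 8-byte counter,
# 8-byte float send timestamp, then random padding.
HEADER = struct.Struct("!Qd")
PADDING_BYTES = 200    # hypothetical payload size
SEND_INTERVAL = 0.025  # one packet every 25 ms, as described in the post


def build_packet(counter, send_ts):
    """Pack the counter and send timestamp, followed by random padding."""
    return HEADER.pack(counter, send_ts) + os.urandom(PADDING_BYTES)


def parse_packet(data):
    """Return (counter, send_ts) from a received datagram."""
    counter, send_ts = HEADER.unpack_from(data)
    return counter, send_ts


class LossDetector:
    """Tracks the next expected counter and reports gaps."""

    def __init__(self):
        self.expected = None
        self.last_ts = None

    def observe(self, counter, send_ts):
        """Return (expected, received, ts_difference) on a gap, else None."""
        gap = None
        if self.expected is not None and counter > self.expected:
            gap = (self.expected, counter, send_ts - self.last_ts)
        # Late or duplicate packets (counter < expected) are ignored.
        if self.expected is None or counter >= self.expected:
            self.expected = counter + 1
            self.last_ts = send_ts
        return gap


def run_sender(host, port):
    """Send one packet every SEND_INTERVAL seconds, forever."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    counter = 0
    next_send = time.monotonic()
    while True:
        sock.sendto(build_packet(counter, time.time()), (host, port))
        counter += 1
        next_send += SEND_INTERVAL
        time.sleep(max(0.0, next_send - time.monotonic()))


def run_receiver(port):
    """Receive packets forever and print a line for every detected gap."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    detector = LossDetector()
    while True:
        data, _ = sock.recvfrom(2048)
        gap = detector.observe(*parse_packet(data))
        if gap:
            expected, received, ts_diff = gap
            print(f"[ {time.time():.4f} ] Packet loss detected. "
                  f"Expected: {expected}, Received: {received}, "
                  f"TS Difference: {ts_diff:.4f}")
```

To use it, call `run_receiver(port)` on a host upstream of the switch and `run_sender(host, port)` on a client behind it, with a port of your choosing.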
This test was performed with multiple different systems on multiple different switches with the same result, leading me to believe that the issue lies within the Nexus switch or NXOS itself.
[ 1564497484.6109 ] Packet loss detected. Expected: 9868, Received: 9900, TS Difference: 0.8309
[ 1564499285.5818 ] Packet loss detected. Expected: 81316, Received: 81349, TS Difference: 0.8564
[ 1564501086.6171 ] Packet loss detected. Expected: 152763, Received: 152798, TS Difference: 0.9072
(Output above from my script showing a 0.8-0.9s gap of packet loss)
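As a quick sanity check on those numbers, the sketch below parses the three log lines and works out how many packets were lost in each event and how far apart the events are. The regular expression and helper names are mine, and it assumes the counter gap equals the number of lost packets at one packet per 25 ms.

```python
import re

LOG = """\
[ 1564497484.6109 ] Packet loss detected. Expected: 9868, Received: 9900, TS Difference: 0.8309
[ 1564499285.5818 ] Packet loss detected. Expected: 81316, Received: 81349, TS Difference: 0.8564
[ 1564501086.6171 ] Packet loss detected. Expected: 152763, Received: 152798, TS Difference: 0.9072
"""

PATTERN = re.compile(
    r"\[ (\d+\.\d+) \] Packet loss detected\. "
    r"Expected: (\d+), Received: (\d+), TS Difference: (\d+\.\d+)"
)
SEND_INTERVAL = 0.025  # seconds between packets, per the post


def parse_events(text):
    """Return one dict per loss event found in the log text."""
    events = []
    for match in PATTERN.finditer(text):
        when, expected, received, ts_diff = match.groups()
        events.append({
            "when": float(when),
            "lost": int(received) - int(expected),
            "ts_diff": float(ts_diff),
        })
    return events


def summarize(events):
    """Return (lost, estimated_gap_seconds, seconds_since_previous) rows."""
    rows = []
    previous = None
    for event in events:
        since = None if previous is None else event["when"] - previous["when"]
        rows.append((event["lost"], event["lost"] * SEND_INTERVAL, since))
        previous = event
    return rows


for lost, gap, since in summarize(parse_events(LOG)):
    since_text = "n/a" if since is None else f"{since:.1f}s"
    print(f"lost={lost} est_gap={gap:.3f}s since_previous={since_text}")
```

This gives 32, 33 and 35 lost packets (roughly 0.8 to 0.9 seconds at 25 ms per packet, in line with the reported TS differences), with the events about 1801 seconds apart.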
All of the switches are configured to be L3 routed with SVI:
interface Ethernet1/1
  switchport access vlan 10

interface Vlan10
  no shutdown
  ip address 10.0.0.1/29

ip route 0.0.0.0/0 <isp gateway>
None of these switches are heavily loaded; they're all pushing <1 Gbps and the upstream links are multiple 10G LACP. The fact that the packet loss happens at such a regular interval (once every 30 minutes) leads me to believe that it has something to do with an internal scheduler function in NXOS disrupting the data flow.
If anyone could test this in their own lab using the same hardware/software, that would be greatly appreciated. Alternatively, if anyone has insight into further debugging this problem, please do let me know.
So I spent a bit more time experimenting with the various features/configuration of NX-OS and it appears that this issue is related to port-security. Here's an example configuration:
feature port-security

interface Ethernet1/1
  switchport access vlan 100
  switchport port-security aging time 30
  switchport port-security violation restrict
  switchport port-security
When the port-security aging time was changed from 30 minutes to 5 minutes, the packet loss started happening once every 5 minutes:
[ 1564540925.8576 ] Packet loss detected. Expected: 957931, Received: 957944, TS Difference: 0.3511
[ 1564541226.5920 ] Packet loss detected. Expected: 969906, Received: 969935, TS Difference: 0.7523
When port-security was disabled entirely, the packet loss stopped.
I upgraded one of my Nexus 3048TP lab switches from 9.2(3) to 7.0(3)I7(6), since that's the latest stable recommended release. This issue still appears in 7.0(3)I7(6).
Software
  BIOS: version 5.0.0
  NXOS: version 7.0(3)I7(6)
  BIOS compile time: 06/06/2018
  NXOS image file is: bootflash:///nxos.7.0.3.I7.6.bin
  NXOS compile time: 3/5/2019 13:00:00 [03/05/2019 17:04:55]
For now, I'll leave port-security disabled, but has anyone used this feature on the Nexus 3000-series switches without this kind of packet loss happening?
I believe this is due to the aging type used. There are two aging types when it comes to port-security:
- Absolute, where the MAC address is deleted once the aging time has elapsed.
- Inactivity, where the MAC address is deleted only if it has been inactive for the configured time.
I believe you are using the absolute one, so every 30 (or 5) minutes the MAC gets deleted from the allowed entries. This creates a black hole while the MAC is learned again and permitted.
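To illustrate the difference between the two types, here is a toy Python model. It is only a sketch of the idea, not how NX-OS implements aging: the function name and parameters are made up, time is counted in milliseconds, and relearning is treated as instantaneous (on a real switch it evidently takes a fraction of a second, which is the loss seen in this thread).

```python
def expiry_times(aging_type, aging_ms, packet_interval_ms, duration_ms):
    """Return the times (ms) at which the secure MAC entry is aged out.

    The host sends a packet every packet_interval_ms, so it is never idle.
    """
    learned_at = 0  # when the entry was (re)learned
    last_seen = 0   # when traffic from the MAC was last seen
    expirations = []
    for now in range(0, duration_ms + 1, packet_interval_ms):
        if aging_type == "absolute":
            # Age counts from the moment the address was learned.
            expired = now - learned_at >= aging_ms
        elif aging_type == "inactivity":
            # Age counts from the last time the address was seen.
            expired = now - last_seen >= aging_ms
        else:
            raise ValueError(f"unknown aging type: {aging_type}")
        if expired:
            expirations.append(now)
            learned_at = now  # entry is relearned from this packet
        last_seen = now
    return expirations


# A host sending every 25 ms for two hours with a 30-minute aging time.
THIRTY_MIN_MS = 30 * 60 * 1000
TWO_HOURS_MS = 2 * 60 * 60 * 1000
print(expiry_times("absolute", THIRTY_MIN_MS, 25, TWO_HOURS_MS))
print(expiry_times("inactivity", THIRTY_MIN_MS, 25, TWO_HOURS_MS))
```

With absolute aging the entry is dropped every 30 minutes even though the host never stops sending; with inactivity aging it is never dropped while traffic keeps flowing.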
You should try running the command "switchport port-security aging type inactivity". That should solve the problem.
Well done with your troubleshooting steps though!! You almost nailed it.
That does seem to be the issue. Unfortunately, it looks like the Nexus 3000-series switches don't support inactivity-based timers, only absolute.
Yep, the documentation confirms it:
The device ages MAC addresses learned by the dynamic method and drops them after the age limit is reached. You can configure the age limit on each interface. The range is from 0 to 1440 minutes, where 0 disables aging.
The method that the device uses to determine that the MAC address age is also configurable. The only method of determining address age is:
The length of time after the device learned the address. This is the default aging method; however, the default aging time is 0 minutes, which disables aging.