Nexus 7010 intermittently dropping packets for certain SRC/DST IP

seabird505 · ‎02-12-2012

Hi,

We have a pair of Nexus 7010s acting as the data centre gateways. They connect to a pair of Nexus 5000 aggregators (using VPC), which in turn serve the end devices. The N7Ks have separate VRFs for the outside world (named EXTERNAL) and for the data centre (named INNER); there are other VRFs, but they are not relevant to this discussion. Inter-VRF routing is done by a firewall, which implements the data centre traffic policies (what's allowed and what's not). A traffic flow from a client to the server looks like this:

Client --intranet--> AGW --L3--> N7K-02 --L2Trunk--> FW --L2Trunk--> N7K-01 --L2Trunk--> N5K-02 --L2Trunk--> Server

The FW hangs directly off both N7Ks via multiple trunk links (which are in a VPC on the N7K end). As shown, a packet from the Access Gateway (AGW) hits the first N7K, gets routed by the FW and then reaches the second N7K. Via the VPC link, it reaches the first N7K. It then goes through the second N5K and reaches the virtualized UCS server. This is the forward traffic path only.

Now the problem: intermittently, a SYN packet from the client to the server is dropped at the trailing N7K-01. I say dropped because it is not captured on N7K-01 on the link towards N5K-02, and a capture on N5K-02 confirms that it is not receiving any. Over 100 iterations of the client making a complete TCP connection to the server, about 5-8% of the connections suffer this fate. The client is configured with a very long TCP connect() timeout, so we sometimes see one, two, three or even more SYNs dropped before that particular iteration succeeds. Mostly it is one SYN that gets dropped, but in one of the earlier reported cases the client reported a transaction time of 189 seconds, indicating that 6 SYNs of the same session (with exponential TCP connect() backoff) were lost. Other packet types may also be getting dropped, but not in large numbers.
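
As a rough check of the 189-second figure, the sketch below adds up the waits between SYN retransmissions. It assumes an initial SYN timeout of 3 seconds that doubles on every retry; that value is not stated above, it is simply the one that makes six lost SYNs add up to 189 seconds. The second function is only a sanity check under the (probably wrong) assumption that each SYN is dropped independently with the same probability; the function names are made up for illustration.

```python
# Assumption: 3 s initial SYN timeout, doubling after every unanswered SYN.
INITIAL_RTO_S = 3.0


def delay_after_lost_syns(lost, initial_rto=INITIAL_RTO_S):
    """Seconds that pass before the first SYN that gets through,
    if the first `lost` SYNs of the session were dropped."""
    return sum(initial_rto * 2 ** i for i in range(lost))


def prob_consecutive_losses(p, lost):
    """Chance of `lost` drops in a row if every SYN were dropped
    independently with probability `p` (an assumption, not a measurement)."""
    return p ** lost


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, "lost SYN(s) ->", delay_after_lost_syns(n), "s")
    print("P(6 in a row at 8%) =", prob_consecutive_losses(0.08, 6))
```

With these assumptions, six lost SYNs give 3 + 6 + 12 + 24 + 48 + 96 = 189 seconds. Six independent drops in a row at an 8% drop rate would be extremely rare, which hints that the drops are tied to some state on the device rather than being random.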

While this may at first suggest a network congestion or error issue, we don't have congestion issues or packet loss in general in the data centre. This only happens from certain clients to certain servers, and only intermittently. The same client going to a different IP address on the same server is always successful, 100% of the time. Also, a different client going to the same server IP is always successful. Also, after upgrading the FW last week, which required a reboot of the device and severed its links with the N7Ks, the N7Ks now seem to exhibit this intermittent behaviour for a different set of client/server IP combinations.

Any help will be greatly appreciated. Thanks for your time.

Client app used in troubleshooting: tnsping (other protocols may suffer too, but we haven't done any testing on that)

Server: TNS Listener on TCP port 1521
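
A minimal sketch of how the 100-iteration test could be repeated without tnsping, assuming that a plain TCP connect to the listener port is enough to show the problem. The function names, the 2.5-second "slow" threshold (which assumes a first SYN retransmission after about 3 seconds) and the example address are all hypothetical.

```python
import socket
import time


def probe(host, port, attempts=100, timeout=200.0):
    """Open and close `attempts` TCP connections one after another.
    Returns the connect time in seconds for each attempt, or None on failure."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection() completes the TCP handshake; the socket
            # is closed again as soon as the with-block exits.
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:
            results.append(None)
    return results


def summarize(results, slow_threshold=2.5):
    """Count attempts that were slow (likely at least one SYN retransmitted)
    and attempts that failed outright."""
    slow = [r for r in results if r is not None and r >= slow_threshold]
    failed = [r for r in results if r is None]
    return {"total": len(results), "slow": len(slow), "failed": len(failed)}
```

For example, `summarize(probe("192.0.2.10", 1521))` would report how many of 100 connects took long enough to suggest a dropped SYN (192.0.2.10 is a placeholder address, not one of ours). Running it from a "bad" client and a "good" client against the same server IP should make the difference visible.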

! On Nexus 7010 
# show version
Software
  BIOS:      version 3.22.0
  kickstart: version 5.1(3)
  system:    version 5.1(3)
  BIOS compile time:       02/20/10
  kickstart image file is: bootflash:///n7000-s1-kickstart.5.1.3.bin
  kickstart compile time:  12/25/2020 12:00:00 [03/11/2011 18:42:56]
  system image file is:    bootflash:///n7000-s1-dk9.5.1.3.bin
  system compile time:     1/21/2011 19:00:00 [03/11/2011 19:37:35]
Hardware
  cisco Nexus7000 C7010 (10 Slot) Chassis ("Supervisor module-1X")
  Intel(R) Xeon(R) CPU         with 4115812 kB of memory.
  Processor Board ID JAF1414AADD
  Device name: <<<device-name>>>
  bootflash:    2000880 kB
  slot0:              0 kB (expansion flash)
plugin
  Core Plugin, Ethernet Plugin

Regards, Rashid.

seabird505 · ‎02-20-2012

For the benefit of others, here is what we found. The N7K was hitting the bug CSCtg95381.

Symptom:
Nexus 7000 may punt traffic to CPU; so that the traffic may experience random delay or drop. Further looking, ARP is learned and FIB adjacency is in FIB adjacency table.
Conditions:
The problem is caused by race condition. Some hosts have not responded to the ARP refresh sent by N7k which in turn trigger to delete ARP entry due to expiry. As a result the route delete notification is sent to URIB from the process. However there is still traffic coming to given IP address as a result the next packet that hit glean resulting in triggering ARP and hope ARP is learnt from the host this time.
Workaround(s):
clear ip route <host>

This does not fully explain why it was working for certain client-server combinations, but the workaround is holding well for the end-points where it has been implemented.

There was no host route for the destination server in the adjacency manager on N7K-01. The only thing there was the subnet route pointing towards the VLAN gateway address. After implementing the workaround, a new /32 route can now be seen in the adjacency manager for the server.
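
To illustrate the difference between the two states, here is a small sketch that does a longest-prefix match over a hand-copied list of prefixes and reports whether the best match for the server is a host (/32) route or only the subnet route. It does not talk to the switch; the prefix lists and addresses are made-up placeholders and the function names are hypothetical.

```python
import ipaddress


def longest_match(prefixes, host):
    """Return the most specific prefix that contains `host`, or None."""
    addr = ipaddress.ip_address(host)
    networks = [ipaddress.ip_network(p) for p in prefixes]
    matches = [n for n in networks if addr in n]
    return max(matches, key=lambda n: n.prefixlen, default=None)


def has_host_route(prefixes, host):
    """True only if the best match is a full-length (host) prefix."""
    best = longest_match(prefixes, host)
    return best is not None and best.prefixlen == best.max_prefixlen


if __name__ == "__main__":
    # Placeholder data: subnet route only, then subnet plus host route.
    before = ["192.0.2.0/24"]
    after = ["192.0.2.0/24", "192.0.2.10/32"]
    print("before workaround:", has_host_route(before, "192.0.2.10"))
    print("after workaround: ", has_host_route(after, "192.0.2.10"))
```

In the broken state only the subnet prefix matches, so traffic for the server falls back to the subnet entry; after the workaround the /32 becomes the best match.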

The bug is fixed in releases starting with 5.1(5). We are planning to upgrade to 5.2(3a).

Regards, Rashid.

david.tran · ‎02-20-2012

We ran into a similar situation last week as well. I was ready to blame the other vendor until my boss saw the same behavior as you did and confirmed that the 7K was the root cause.

The Nexus product is not yet ready for prime time.

Khoa Pham · ‎05-30-2012

Do you have any update on this? We experienced the same issue and have been waiting more than 3 weeks now for Cisco TAC to respond.

https://supportforums.cisco.com/thread/2148874

jadh · ‎09-01-2014

Hi, we are running into a similar problem: packets with a different source address are getting dropped by the Nexus 7K. We are running version 6.2.2 of the code and we have confirmed that it is the Nexus 7K that is dropping the packets.

A few packets sent to the same destination but with a different source address get dropped. Do you think clearing the IP route for the destination would work here?

Any comments or suggestions greatly appreciated!