01-23-2016 04:58 PM - edited 03-01-2019 08:09 AM
Hello Expert,
I have a new implementation in which I am installing Nexus 9504 switches as the core layer for the data center, Nexus 93128TX as ToR switches for servers, and Nexus 9396PX as aggregation for devices other than servers (routers, edge/floor switches, etc.). All of them are running 7.0(3)I1(3). The setup runs in standalone NX-OS mode (not ACI).
The topology is shown in diagram 1.
On random occasions, connectivity between the core switches and other devices in the network (i.e. the AGG switches) goes down. A ping to the core from one of the directly connected AGG switches does NOT work (pinging the directly connected SVIs, both physical and virtual IPs). EIGRP also goes down, leaving no EIGRP adjacency between the core switches and the WAN router. In other words, connectivity goes down both inside the LAN (servers and internal users in different VLANs) and outside to the branches via the WAN router.
The outage lasts 30-40 minutes, then all of a sudden the problem goes away and the setup forwards traffic fine. After a few minutes or hours, the same problem repeats.
To isolate the issue, we simplified the setup by shutting down core B so that each leaf is connected only to core A. The setup then looks as shown in diagram 2:
I have the same problem with this setup (diagram 2) as well.
I found the following bug, but it affects the N7K:
CSCus91417: Nexus 7000 fails to transmit RSTP BPDUs consistently
Symptom:
- STP Instability in layer 2 domain with N7K (Rapid PVST).
- This issue is seen in large scale setups with approximately 6000 RSTP sessions running. The issue is independent of the number of VDCs.
Conditions:
Packet drops were seen in the inband soft VOQ Transmit side on the SUP. This issue is seen on N7000 SUP2 supervisors. Packet drops were NOT seen on N7000 SUP2E and N7700 SUP2 Supervisors.
Workaround:
If the customer is running RSTP at this scale, they can see this issue. Peer-switch has generally been shown to be an effective workaround in the field. However, if the customer has a lot of orphan (singly connected) ports, peer-switch does not help. Additionally, as scale increases toward the RSTP limit, peer-switch becomes slightly less effective since BPDUs are dropped from both sides more often.
Further Problem Description:
Inband packet drops on the transmit side of the Supervisor were seen in large Scale setups. The packet drops were noticed on "show hardware internal sup-fc0 voq-stats".
Could the above bug be causing the outage? Is anyone aware of whether it also hits the N9000? Or could there be another reason, such as a misconfiguration?
Any help/feedback is highly appreciated.
Best Regards,
Mohammad Taamneh
01-24-2016 04:46 PM
You are not hitting CSCus91417. This bug is very specific to the Nexus 7K.
Your symptoms indicate a possible control-plane issue where the CPU is probably inundated with some traffic, which causes other protocols to be starved. Check CoPP (show policy-map interface control-plane) on the 9K and see if you notice anything weird or any drops. This may have to be monitored at the time of the event.
Also see if you can SPAN the CPU of the 9K. This may give clues about CPU-bound traffic.
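One way to monitor this is to save the output of show policy-map interface control-plane before and during an outage and compare the drop counters per class. The Python below is only a rough sketch of that comparison. It assumes each class starts with a "class-map <name>" line and that drops appear as "dropped <N> bytes;" or "violated <N> bytes;" lines, which may not match every NX-OS release. The helper names and the sample text are made up for illustration and are not output from the switches in this thread.

```python
import re

# Assumed line shapes (not taken from this thread): a "class-map <name> ..."
# header, followed by per-module "dropped <N> bytes;" or "violated <N> bytes;".
CLASS_RE = re.compile(r"^\s*class-map\s+(\S+)")
DROP_RE = re.compile(r"\b(?:dropped|violated)\s+(\d+)\s+(?:bytes|packets)",
                     re.IGNORECASE)


def parse_copp_drops(output):
    """Sum the drop counters per CoPP class across all modules."""
    drops = {}
    current = None
    for line in output.splitlines():
        header = CLASS_RE.match(line)
        if header:
            current = header.group(1)
            drops.setdefault(current, 0)
            continue
        counter = DROP_RE.search(line)
        if counter and current is not None:
            drops[current] += int(counter.group(1))
    return drops


def drop_increase(before, after):
    """Return only the classes whose drop counter grew between snapshots."""
    return {name: count - before.get(name, 0)
            for name, count in after.items()
            if count > before.get(name, 0)}


# Hypothetical snapshots, hand-written to match the assumed format.
BASELINE = """\
class-map copp-system-p-class-critical (match-any)
  module 1 :
    transmitted 1000 bytes;
    dropped 0 bytes;
class-map copp-system-p-class-normal (match-any)
  module 1 :
    transmitted 500 bytes;
    dropped 200 bytes;
"""

DURING_OUTAGE = """\
class-map copp-system-p-class-critical (match-any)
  module 1 :
    transmitted 9000 bytes;
    dropped 4096 bytes;
class-map copp-system-p-class-normal (match-any)
  module 1 :
    transmitted 700 bytes;
    dropped 200 bytes;
"""

if __name__ == "__main__":
    growth = drop_increase(parse_copp_drops(BASELINE),
                           parse_copp_drops(DURING_OUTAGE))
    for name, delta in sorted(growth.items()):
        print(f"{name}: +{delta} dropped")
```

A class whose counter grows only while the outage is in progress would be the first place to look. A class that already showed drops in the baseline and did not change is less likely to be related.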
-Raj