01-10-2012 07:41 AM - edited 03-07-2019 04:15 AM
Hi,
This has been bugging me for some time. We have VMware ESXi hosts connected in vPC mode to a pair of N5Ks (through FEX). Dozens of times per day we were seeing the following error:
2011 Nov 18 16:24:34 Canal_auber_5548_6258 %FWM-2-STM_LOOP_DETECT: Loops detected in the network among ports Po100 and Po40 vlan 395 - Disabling dynamic learn notifications for 180 seconds
This used to happen only on two ESXi hosts running a VDI payload (where a lot of VMs are instantiated). Since this was causing a lot of disruption to other servers connected to the N5Ks, we decided to take both ESXi hosts out until we know why this happens.
Then we enabled mac-move notification to see whether the problem was still there. Although we no longer get the LOOP message, we still get this (again on an ESXi host running a VDI payload):
Nov 20 07:07:08 canal_auber_5548-6258 : 2011 Nov 20 07:07:08 CET: %FWM-6-MAC_MOVE_NOTIFICATION: Host 0050.5693.0416 in vlan 395 is flapping between port Po100 and port Po31
What I don't get is why the N5K would complain about a MAC address flapping between a vPC member port and the vPC peer link. (I expect to see virtual machine MACs on both sides, since the ESXi is load balancing based on IP hash across both sides of the vPC.)
Here is part of the configuration (the same on both N5Ks; Po100 is the vPC peer link, Po40 is the vPC to one of the ESXi hosts, and all ESXi hosts have the same configuration):
interface Ethernet104/1/1
description Slot40-A1 ESX-vmnic
switchport mode trunk
switchport trunk allowed vlan 15,18,65,71,200,312-314,317-321,325-326,328,330,332-341,343,349-350,352-357,363,369,374,376-381,383-385,390-401,411-412,440,460,462,468-469,475,996-999,2024,2026,2701,2801
spanning-tree port type edge trunk
channel-group 40
interface port-channel40
description Slot40 ESX
switchport mode trunk
vpc 40
switchport trunk allowed vlan 15,18,65,71,200,312-314,317-321,325-326,328,330,332-341,343,349-350,352-357,363,369,374,376-381,383-385,390-401,411-412,440,460,462,468-469,475,996-999,2024,2026,2701,2801
spanning-tree port type edge trunk
speed 10000
interface port-channel100
description VPC Link
switchport mode trunk
vpc peer-link
spanning-tree port type network
speed 10000
And some show command output:
N5K# sho vpc brief
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po100 up 1,13,15,18,65,71,200,312-314,317-321,325-326,328,3
30,332-341,343,349-350,352-357,363,369,374,376-386
,390-401,411-412,440,460,462,468-469,475,996-999,1
002-1005,2024,2026,2701,2801
vPC status
----------------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
------ ----------- ------ ----------- -------------------------- -----------
--- snip ---
40 Po40 up success success 15,18,65,71
,200,312-31
4,317-321,3
25-326,328,
330,332....
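As a sanity check on the output above, the VLAN lists can be compared programmatically. The following is a minimal Python sketch, assuming the wrapped lines of the Po40 `allowed vlan` statement and of the Po100 "Active vlans" column are re-joined as shown; the function and variable names are illustrative and not part of any Cisco tool.

```python
def expand_vlans(spec):
    """Expand an NX-OS style VLAN list such as '15,18,312-314' into a set of IDs."""
    vlans = set()
    for part in spec.replace(" ", "").split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


# Allowed VLANs from the Po40 configuration (wrapped line re-joined).
HOST_TRUNK = (
    "15,18,65,71,200,312-314,317-321,325-326,328,330,332-341,343,"
    "349-350,352-357,363,369,374,376-381,383-385,390-401,411-412,"
    "440,460,462,468-469,475,996-999,2024,2026,2701,2801"
)
# Active VLANs on Po100 from 'show vpc brief' (wrapped lines re-joined).
PEER_LINK = (
    "1,13,15,18,65,71,200,312-314,317-321,325-326,328,330,332-341,343,"
    "349-350,352-357,363,369,374,376-386,390-401,411-412,440,460,462,"
    "468-469,475,996-999,1002-1005,2024,2026,2701,2801"
)

host = expand_vlans(HOST_TRUNK)
peer = expand_vlans(PEER_LINK)

print("VLAN 395 on host trunk:", 395 in host)
print("VLAN 395 on peer link:", 395 in peer)
print("On host trunk but not on peer link:", sorted(host - peer))
print("On peer link but not on host trunk:", sorted(peer - host))
```

With these inputs, VLAN 395 is carried on both the host vPC and the peer link, and every VLAN allowed on the host trunk is also active on the peer link, so a simple VLAN mismatch does not look like the cause.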
Any idea would be greatly appreciated.
Regards,
Vincent.
01-16-2012 06:45 AM
I suspect you are running into bug CSCts68887, which is duplicated by CSCto34674. The NX-OS release you are running, 5.0(3)N1(1a), is deferred, and you might want to consider upgrading to 5.0(3)N2(2b).
01-10-2012 08:26 AM
Hi Vincent,
First of all, you should tell us what host 0050.5693.0416 is and exactly how it is connected to your switches.
Also, what is Po31 and where is it connected?
Riccardo
01-10-2012 08:36 AM
Hi Riccardo, thanks for your reply,
Po31 is a vPC connected to another ESXi (a.k.a. ESX28), configured the same way as the ESX26 connected to Po40. The MAC address mentioned belongs to one of the virtual machines running on this ESXi. The error message is not specific to this MAC address; I get it for virtually all VMs running on this host.
Here's the configuration pertaining to ESX28:
interface port-channel31
description Slot31 ESX28
switchport mode trunk
vpc 31
switchport trunk allowed vlan 15,18,65,71,200,312-314,317-321,325-326,328,330,332-341,343,349-350,352-357,363,369,374,376-381,383-385,390-401,411-412,440,460,462,468-469,475,996-999,2024,2026,2701,2801
spanning-tree port type edge trunk
speed 10000
interface Ethernet103/1/5
description Slot31-B1 ESX28-5
switchport mode trunk
switchport trunk allowed vlan 15,18,65,71,200,312-314,317-321,325-326,328,330,332-341,343,349-350,352-357,363,369,374,376-381,383-385,390-401,411-412,440,460,462,468-469,475,996-999,2024,2026,2701,2801
spanning-tree port type edge trunk
channel-group 31
Vincent.
01-10-2012 10:36 AM
Hi Vincent,
It seems that the ESX is sending frames on both links.
How did you configure the channel on the ESX side?
Are you sure that both ports belong to the same channel?
Also, do you use some kind of load-balancing scheme? If you have an active/active configuration and the ESX uses a virtual MAC for both NICs, the message you see is expected: frames with the same source MAC will be learned by the N5K from two different ports, triggering the message and temporarily disabling MAC learning.
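The learning behavior described here can be illustrated with a toy model. This is a minimal Python sketch of a single switch, assuming a MAC table that learns per logical port and counts a "move" whenever a known MAC shows up on a different logical port; the class name and the port names `Eth1/1` and `Eth1/2` are hypothetical, and vPC peer synchronization is deliberately ignored.

```python
class MacTable:
    """Toy MAC table: learns per logical port and counts MAC moves."""

    def __init__(self, bundles=None):
        # Maps a physical port to its port-channel, e.g. {"Eth1/1": "Po31"}.
        self.bundles = bundles or {}
        self.entries = {}
        self.moves = 0

    def learn(self, mac, physical_port):
        # A bundled port is represented by its port-channel in the table.
        logical = self.bundles.get(physical_port, physical_port)
        previous = self.entries.get(mac)
        if previous is not None and previous != logical:
            self.moves += 1
        self.entries[mac] = logical
        return logical


# The host alternates between its two NICs for the same source MAC.
frames = [("0050.5693.0416", "Eth1/1"),
          ("0050.5693.0416", "Eth1/2"),
          ("0050.5693.0416", "Eth1/1")]

# Case 1: the two switch ports are independent, so every change is a move.
plain = MacTable()
for mac, port in frames:
    plain.learn(mac, port)

# Case 2: both ports belong to the same channel, so the logical port never changes.
bundled = MacTable({"Eth1/1": "Po31", "Eth1/2": "Po31"})
for mac, port in frames:
    bundled.learn(mac, port)

print("moves without a channel:", plain.moves)
print("moves with a channel:", bundled.moves)
```

In this model the flapping only appears when the switch does not treat the two ports as one channel, which is why the channel membership matters.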
Can you check and let me know?
Riccardo
01-11-2012 12:51 AM
Hello Riccardo,
Each ESXi has two 10 Gb NICs, one connected to 5K1 and the other to 5K2 on the corresponding port. For example, NIC1 of ESX28 is connected to 5K1-eth103/1/5 and NIC2 to 5K2-eth103/1/5. Both NICs form vPC 31 through Po31 using 802.3ad as per the configuration above.
On the host side, the VMware vDS is configured to use the "IP Hash" algorithm, which is VMware's name for 802.3ad (cf. the Cisco white paper). VMware vDS does not support LACP.
So yes, it's an active/active configuration and it's expected that a virtual machine MAC address is seen on both sides. It's a very basic setup that we have for ~30 ESXi hosts on the same pair of N5Ks, and it usually works perfectly except for this glitch on VLAN 395, as if for some reason the N5K did not consider vPC 31 to be a vPC although it claims it does...
NB: I triple-checked that the NICs were part of the same channel, both physically and through CDP.
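To illustrate why one VM MAC can legitimately appear on both uplinks under an IP-hash policy, here is a minimal Python sketch. It assumes a simplified model in which the uplink is chosen as the XOR of the source and destination IPv4 addresses modulo the number of uplinks; the exact computation used by VMware may differ, and the function name and IP addresses are purely illustrative.

```python
import ipaddress


def ip_hash_uplink(src_ip, dst_ip, uplink_count):
    """Pick an uplink index from a source/destination IPv4 pair (simplified model)."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count


# One VM (one source MAC and IP) talking to two different destinations.
vm_ip = "10.0.0.10"
for peer_ip in ("10.0.0.20", "10.0.0.21"):
    print(vm_ip, "->", peer_ip, "uses uplink", ip_hash_uplink(vm_ip, peer_ip, 2))
```

Under this model the two flows leave on different uplinks, so each N5K sees frames from the same VM MAC on its own member port of the vPC.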
Vincent.
01-11-2012 05:45 AM
Hi Vincent,
there is something which is puzzling me.
You wrote that ESX28 is dual-homed to 5K1 and 5K2. However, you said that its NICs are connected to eth103/1/5. According to the interface name, they are connected to a FEX. What I don't understand is how you connected to the FEX (or FEXes). The so-called Dual-Homed FEX Topology (Active/Active FEX Topology) requires that both server NICs are connected to the same FEX, as in the picture below.
This implies that your server is connected to two different ports, so it cannot be eth103/1/5 only, as that is a single link.
If you have this topology, I would expect your server to be connected to 103/1/5 and, for instance, 103/1/6.
So I wonder whether you instead have a FEX Straight-Through Topology (Host vPC), with the server dual-homed to two different FEXes, as in the picture below.
But here, too, I don't get how the two ports can have the same name, as if they were connected to the same FEX.
It would make more sense if your ports were connected to 103/1/5 and, for instance, 104/1/5.
Which topology do you have?
Can you print 'show fex' from both N5k please?
The topology is important to determine the type of LB you can have.
With FEX (2148) in straight through mode you can have up to two ports in a vPC per server with each port terminating on a different FEX. So FEX straight-through = active/active on server.
With FEX in Active/Active topology you cannot have host vPC. You can, however, run the servers in Active/Standby or TLB (Transmit Load Balancing) configurations.
If your FEX is Active/Active but you configure vPC mode on the server you might have MAC flapping between VPCx and VPC peer-link on the Nexus.
Riccardo
01-11-2012 06:15 AM
Riccardo,
I'm sorry I was not clear enough: my 2232PP FEXes are attached in a straight-through topology, and the port numbers are the same on both sides because the FEX numbers are the same on both sides:
N5K1# show fex
FEX     FEX          FEX     FEX
Number  Description  State   Model             Serial
------------------------------------------------------------------------
101     FEX0101      Online  N2K-C2232PP-10GE  JAF1443AFNB
102     FEX0102      Online  N2K-C2232PP-10GE  JAF1444BTBA
103     FEX0103      Online  N2K-C2232PP-10GE  JAF1444BSSR
104     FEX0104      Online  N2K-C2232PP-10GE  JAF1444BTAD
105     FEX0105      Online  N2K-C2248TP-1GE   JAF1523DEPG
106     FEX0106      Online  N2K-C2248TP-1GE   JAF1523DECR
N5K2# show fex
FEX     FEX          FEX     FEX
Number  Description  State   Model             Serial
------------------------------------------------------------------------
101     FEX0101      Online  N2K-C2232PP-10GE  JAF1443AFHR
102     FEX0102      Online  N2K-C2232PP-10GE  JAF1444BSKR
103     FEX0103      Online  N2K-C2232PP-10GE  JAF1443AFNP
104     FEX0104      Online  N2K-C2232PP-10GE  JAF1443AFKD
105     FEX0105      Online  N2K-C2248TP-1GE   JAF1523CPBN
106     FEX0106      Online  N2K-C2248TP-1GE   JAF1523DETE
AFAIK, it's best practice to have the same numbers on both sides. This makes the N5K configuration much simpler as it's the same on both N5K.
NB: As of a recent NX-OS release, active/active dual-homed servers are now supported on the Dual-Homed FEX topology, even when plugged into two different FEXes.
Vincent.
01-11-2012 06:55 AM
Hi Vincent,
OK, I see, thanks for clarifying. Without a proper understanding of the topology, it is quite difficult to determine why a given behavior occurs. You don't have vPCs between the N5Ks and the FEXes, right? I guess you really have the second topology I posted, correct?
Also, which release do you run on the N5Ks?
A few questions about the flapping messages:
1) Is the flapping MAC always flapping between host vPC and vPC link or do you see it flapping between other ports too?
2) How often do you see a given MAC flapping during the day?
3) Do you see the flapping on one N5k only or on both of them?
4) Do you have orphan ports on any N5k in the flapping vlan? If yes where? Also is the orphan port connected to the same server by any chance?
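Questions 1 and 2 can be answered from collected syslog. Below is a minimal Python sketch, assuming the log lines follow the `%FWM-6-MAC_MOVE_NOTIFICATION` format quoted earlier in this thread and that the peer link is Po100; the function name and the sample list are illustrative.

```python
import re
from collections import Counter

MAC_MOVE = re.compile(
    r"%FWM-6-MAC_MOVE_NOTIFICATION: Host (?P<mac>[0-9a-fA-F.]+) "
    r"in vlan (?P<vlan>\d+) is flapping between port (?P<a>\S+) and port (?P<b>\S+)"
)


def summarize_moves(lines, peer_link="Po100"):
    """Count MAC moves per (MAC, VLAN) and per port pair, and collect the
    lines where neither port is the peer link."""
    per_mac = Counter()
    per_pair = Counter()
    without_peer_link = []
    for line in lines:
        match = MAC_MOVE.search(line)
        if not match:
            continue
        pair = tuple(sorted((match["a"], match["b"])))
        per_mac[(match["mac"], int(match["vlan"]))] += 1
        per_pair[pair] += 1
        if peer_link not in pair:
            without_peer_link.append(line)
    return per_mac, per_pair, without_peer_link


# Sample line taken from the first post of this thread.
sample = [
    "Nov 20 07:07:08 canal_auber_5548-6258 : 2011 Nov 20 07:07:08 CET: "
    "%FWM-6-MAC_MOVE_NOTIFICATION: Host 0050.5693.0416 in vlan 395 "
    "is flapping between port Po100 and port Po31",
]

per_mac, per_pair, without_peer_link = summarize_moves(sample)
for (mac, vlan), count in per_mac.most_common():
    print(mac, "vlan", vlan, "moves:", count)
for pair, count in per_pair.most_common():
    print(" <-> ".join(pair), "moves:", count)
print("moves not involving the peer link:", len(without_peer_link))
```

Running it separately on the syslog of each N5K would also give a rough answer to question 3.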
I was trying to think of the possible cases where the flapping is between the vPC peer link and the host vPC and, considering the way MACs are learned in the vPC implementation, I can only imagine it happening if that MAC comes from some orphan port in the same VLAN.
Then, just to be sure we are on the same page, can you print, from both switches, the configuration of the host vPC (31), the vPC peer link (I have it already, but I would like everything in one shot) and the link (or channel) between the N5K and the FEXes?
thanks,
Riccardo
01-12-2012 06:39 AM
Hello Riccardo,
Riccardo Simoni wrote:
You don't have VPCs between n5k and FEXs right? I guess you really have the second topology I posted, correct?
Also, which release you run on N5k ?
That's right, no vPC for attaching the FEX, second topology.
We are running version 5.0(3)N1(1a)
Riccardo Simoni wrote:
1) Is the flapping MAC always flapping between host vPC and vPC link or do you see it flapping between other ports too?
2) How often do you see a given MAC flapping during the day?
3) Do you see the flapping on one N5k only or on both of them?
4) Do you have orphan ports on any N5k in the flapping vlan? If yes where? Also is the orphan port connected to the same server by any chance?
can you print, FROM both switches, the configuration of the host vPC (31), the vpc peer link (I have it already but I would like everything in one shot) and the link (or channel) between n5k and FEX's.
You'll find stripped-down configurations attached with everything relevant to the host (Po31), the corresponding FEX (Po103), the vPC peer link (Po100) and the uplink to the core (Po1000).
Thanks for your help,
Vincent.
01-12-2012 11:55 AM
Vincent,
I'm too busy today.
Will try to continue with your issue tomorrow.
meanwhile let's see if somebody else can add some other idea.
Riccardo
01-13-2012 05:22 AM
Vincent,
The config looks ok to me.
Can you confirm that your ESXi does not have another NIC connected to some other switch in VLAN 395, sending traffic from the same MAC through that link?
I am asking because, in my opinion, you see this type of flapping for one of the following causes (I cannot think of others):
1- Traffic in VLAN 395 from that MAC reaches the core through a non-vPC link and then gets flooded back from N5K2 onto its peer link. In this case N5K1 will re-learn the MAC from the vPC peer-link port-channel.
2- The vPC peer-link has a faulty link causing issues with CFS (the protocol which takes care of MAC learning and synchronization via the vPC peer-link).
3- A software issue on 5K1 which erroneously tries to learn the MAC on the vPC peer-link.
4- Another host in VLAN 395 using the same MAC address.
I guess that, after verifying that the ESXi is not connected to anything other than the FEXes, we should stop here and continue in a TAC case. There are lots of other things to check in multiple internal logs (l2fm logs to see the reason for MAC learning on N5K1, for instance) which can help identify the root cause.
I think we are already far beyond the depth a CSC thread should go to.
Riccardo
01-13-2012 08:36 AM
Hello Riccardo,
The ESXi has only two NICs in VLAN 395, and they are both attached to this vPC. Besides, the problem occurred on two other ESXi hosts running the same VDI payload.
Regarding your 4 possible causes:
Thanks for the time you dedicated to this.
So you think we should convert this thread into a TAC case?
Cheers,
Vincent.
01-13-2012 10:26 AM
So you think we should convert this thread into a TAC case?
I do. If you let me know the SR number, I will have a look at how it ends up, as now I am curious.
Maybe it is a trivial thing that we are completely overlooking (it would not be the first time); however, a TAC case where an engineer can spend some time carefully checking show tech and other logs is the way to go.
Take care
Riccardo
01-16-2012 07:05 AM
Hi Riccardo,
FYI TAC case 620366529 was opened but not by myself so I can't follow it directly.
I'll post there when I know the happy ending!
Thanks again for your help on this.
Cheers,
Vincent.