Solved: Re: %FWM-6-MAC_MOVE_NOTIFICATION: MAC flapping between vPC host port and vPC peer-link

mahbvh · ‎01-10-2012

Hi,

This has been bugging me for some time. We have VMware ESXi hosts connected in vPC mode to a pair of N5Ks (through FEX). Dozens of times per day we were seeing the following error:

2011 Nov 18 16:24:34 Canal_auber_5548_6258 %FWM-2-STM_LOOP_DETECT: Loops detected in the network among ports Po100 and Po40 vlan 395 - Disabling dynamic learn notifications for 180 seconds

This used to happen only on 2 ESXi hosts running a VDI payload (where a lot of VMs are instantiated). Since this was causing a lot of disruption to other servers connected to the N5Ks, we decided to take both ESXi hosts out until we know why this happens.

Then we enabled mac-move notification to see whether the problem was still there. Although we no longer get the LOOP message, we still have this (still on an ESXi host running a VDI payload):

Nov 20 07:07:08 canal_auber_5548-6258 : 2011 Nov 20 07:07:08 CET: %FWM-6-MAC_MOVE_NOTIFICATION: Host 0050.5693.0416 in vlan 395 is flapping between port Po100 and port Po31

What I don't get is why the N5K would complain about seeing a MAC address flapping between a vPC member port and the vPC peer link. (I expect to see virtual machine MACs on both sides, since the ESXi host is load balancing based on IP hash across both sides of the vPC.)

Here is part of the configuration (same on both N5Ks). Po100 is the vPC peer link and Po40 is the vPC to one of the ESXi hosts; all ESXi hosts have the same configuration:
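To see which MACs and which port pairs are involved, the notifications can be tallied from a syslog export. This is only a rough sketch: the regular expression is modelled on the single log line quoted in this post, and the function name `count_mac_moves` is made up for the illustration.

```python
import re
from collections import Counter

# Pattern modelled on the MAC move message quoted in this thread (assumption:
# other NX-OS releases may word the message differently).
MAC_MOVE_RE = re.compile(
    r"%FWM-6-MAC_MOVE_NOTIFICATION: Host (?P<mac>[0-9a-f.]+) in vlan (?P<vlan>\d+) "
    r"is flapping between port (?P<port_a>\S+) and port (?P<port_b>\S+)",
    re.IGNORECASE,
)

def count_mac_moves(lines):
    """Count notifications per (mac, vlan, unordered port pair)."""
    counts = Counter()
    for line in lines:
        m = MAC_MOVE_RE.search(line)
        if m is None:
            continue  # not a MAC move notification
        ports = tuple(sorted((m["port_a"], m["port_b"])))
        counts[(m["mac"], int(m["vlan"]), ports)] += 1
    return counts

sample = [
    "Nov 20 07:07:08 canal_auber_5548-6258 : 2011 Nov 20 07:07:08 CET: "
    "%FWM-6-MAC_MOVE_NOTIFICATION: Host 0050.5693.0416 in vlan 395 "
    "is flapping between port Po100 and port Po31",
]
for key, n in count_mac_moves(sample).items():
    print(key, n)
```

Sorting the two port names means "Po100 and Po31" and "Po31 and Po100" land in the same bucket.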

interface Ethernet104/1/1

description Slot40-A1 ESX-vmnic

switchport mode trunk

switchport trunk allowed vlan 15,18,65,71,200,312-314,317-321,325-326,328,330,332-341,343,349-350,352-357,363,369,374,376-381,383-385,390-401,411-412,440,460,462,468-469,475,996-999,2024,2026,2701,2801

spanning-tree port type edge trunk

channel-group 40

interface port-channel40

description Slot40 ESX

switchport mode trunk

vpc 40

switchport trunk allowed vlan 15,18,65,71,200,312-314,317-321,325-326,328,330,332-341,343,349-350,352-357,363,369,374,376-381,383-385,390-401,411-412,440,460,462,468-469,475,996-999,2024,2026,2701,2801

spanning-tree port type edge trunk

speed 10000

interface port-channel100

description VPC Link

switchport mode trunk

vpc peer-link

spanning-tree port type network

speed 10000
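A quick way to confirm that VLAN 395 is actually carried on the host trunk is to expand the allowed-VLAN list. The sketch below is a generic helper, not anything NX-OS provides; `expand_vlans` is a hypothetical name, and the list is copied from the Po40 configuration taking the range after 383-385 to be 390-401 (consistent with the active VLANs the switch reports for the peer-link).

```python
def expand_vlans(spec):
    """Expand an NX-OS style VLAN list such as '15,18,312-314' into a set of ints."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            vlans.update(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            vlans.add(int(part))
    return vlans

# Allowed VLANs on Po40 / Ethernet104/1/1 as pasted in this post.
ALLOWED = (
    "15,18,65,71,200,312-314,317-321,325-326,328,330,332-341,343,349-350,"
    "352-357,363,369,374,376-381,383-385,390-401,411-412,440,460,462,"
    "468-469,475,996-999,2024,2026,2701,2801"
)

print(395 in expand_vlans(ALLOWED))
```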

And some log output :

N5K# sho vpc brief

vPC Peer-link status

---------------------------------------------------------------------

id Port Status Active vlans

-- ---- ------ --------------------------------------------------

1 Po100 up 1,13,15,18,65,71,200,312-314,317-321,325-326,328,3

30,332-341,343,349-350,352-357,363,369,374,376-386

,390-401,411-412,440,460,462,468-469,475,996-999,1

002-1005,2024,2026,2701,2801

vPC status

----------------------------------------------------------------------------

id Port Status Consistency Reason Active vlans

------ ----------- ------ ----------- -------------------------- -----------

--- snip ---

40 Po40 up success success 15,18,65,71

,200,312-31

4,317-321,3

25-326,328,

330,332....

Any idea would be greatly appreciated.

Regards,

Vincent.

Prashanth Krishnappa · ‎01-16-2012

I suspect you are running into bug CSCts68887, which is duplicated by CSCto34674. The NX-OS release you are running, 5.0(3)N1(1a), is deferred, and you might want to consider upgrading to 5.0(3)N2(2b).
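For anyone checking a number of switches against the suggested release, the version strings can be compared programmatically. This is a naive sketch: `parse_n5k_version` is a hypothetical helper, the pattern only covers strings shaped like the two releases named in this post, and treating field-by-field tuple order as release order is an assumption, not something the bug notes state.

```python
import re

# Matches strings shaped like "5.0(3)N1(1a)"; other NX-OS trains use other formats.
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)\)N(\d+)\((\d+)([a-z]?)\)$")

def parse_n5k_version(text):
    """Turn '5.0(3)N1(1a)' into a tuple that sorts field by field."""
    m = VERSION_RE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    major, minor, maint, n_major, n_minor, letter = m.groups()
    return (int(major), int(minor), int(maint), int(n_major), int(n_minor), letter)

running = parse_n5k_version("5.0(3)N1(1a)")
suggested = parse_n5k_version("5.0(3)N2(2b)")
print(running < suggested)
```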


rsimoni · ‎01-10-2012

Hi Vincent,

First of all, could you tell us which host 0050.5693.0416 is and exactly how it is connected to your switches?

Also, what is Po31 and where is it connected?

Riccardo

mahbvh · ‎01-10-2012

Hi Riccardo, thanks for your reply,

Po31 is a vPC connected to another ESXi host (a.k.a. ESX28), configured the same way as the ESX26 connected to Po40. The MAC address mentioned belongs to one of the virtual machines running on this ESXi host. The error message is not specific to this MAC address; I get it for virtually all VMs running on this host.

Here's the configuration pertaining to ESX28 :

interface port-channel31

description Slot31 ESX28

switchport mode trunk

vpc 31

switchport trunk allowed vlan 15,18,65,71,200,312-314,317-321,325-326,328,330,332-341,343,349-350,352-357,363,369,374,376-381,383-385,390-401,411-412,440,460,462,468-469,475,996-999,2024,2026,2701,2801

spanning-tree port type edge trunk

speed 10000

interface Ethernet103/1/5

description Slot31-B1 ESX28-5

switchport mode trunk

switchport trunk allowed vlan 15,18,65,71,200,312-314,317-321,325-326,328,330,332-341,343,349-350,352-357,363,369,374,376-381,383-385,390-401,411-412,440,460,462,468-469,475,996-999,2024,2026,2701,2801

spanning-tree port type edge trunk

channel-group 31

Vincent.

rsimoni · ‎01-10-2012

Hi Vincent,

It seems that the ESX is sending frames on both links.

How did you configure the channel on the ESX side?

Are you sure that both ports belong to the same channel?

Also, do you use some kind of load balancing scheme? If you have an active/active configuration and the ESX uses a virtual MAC for both NICs, the message you see is expected: frames with the same source MAC will be learned by the N5K from 2 different ports, triggering the message and temporarily disabling MAC learning.
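The mechanism described here can be pictured with a toy MAC table. This is a simplified model for illustration only, not how the N5K forwarding manager is implemented; the class `MacTable` and the port names `Eth1/1` and `Eth1/2` are made up.

```python
class MacTable:
    """Toy MAC address table for a single switch (illustration only)."""

    def __init__(self):
        self.entries = {}  # (vlan, mac) -> port where the MAC was last learned
        self.moves = []    # (vlan, mac, old_port, new_port) for every move seen

    def learn(self, vlan, mac, port):
        key = (vlan, mac)
        old = self.entries.get(key)
        if old is not None and old != port:
            self.moves.append((vlan, mac, old, port))  # same MAC, different port
        self.entries[key] = port

# Same source MAC arriving alternately on two independent (non-bundled) ports.
table = MacTable()
for port in ["Eth1/1", "Eth1/2", "Eth1/1"]:
    table.learn(395, "0050.5693.0416", port)
print(len(table.moves))

# With both links in one port-channel, learning happens on the logical port.
bundled = MacTable()
for _ in range(3):
    bundled.learn(395, "0050.5693.0416", "Po31")
print(len(bundled.moves))
```

In the first case every change of ingress port counts as a move; in the second the MAC always appears behind the same logical interface, so no move is recorded.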

Can you check and let me know?

Riccardo

mahbvh · ‎01-11-2012

Hello Riccardo,

Each ESXi host has 2 10Gb NICs, one connected to 5K1 and the other to 5K2 on the corresponding port. For example, NIC1 of ESX28 is connected to 5K1-eth103/1/5 and NIC2 to 5K2-eth103/1/5. Both NICs form VPC31 through Po31 using 802.3ad as per the configuration above.

On the host side, VMware vDS is configured to use the "IP Hash" algorithm, which is VMware's name for 802.3ad (cf. Cisco white paper). VMware vDS does not support LACP.

So yes, it's an active/active configuration and it's expected that a virtual machine MAC address is seen on both sides. It's a very basic setup that we have for ~30 ESXi hosts on the same pair of N5Ks and it usually works perfectly except for this glitch on VLAN395, as if for some reason the N5K would not consider VPC31 as a vPC although it claims it does...

NB: I triple-checked that the NICs were part of the same channel, both physically and through CDP.

Vincent.

rsimoni · ‎01-11-2012

Hi Vincent,

there is something which is puzzling me.

You wrote that ESX28 is dual-homed to 5K1 and 5K2. However, you said that its NICs are connected to eth103/1/5. According to the interface name they are connected to a FEX. What I don't understand is the way you connected the FEX (or FEXes). The so-called Dual-Homed FEX Topology (Active/Active FEX Topology) requires that both server NICs are connected to the same FEX, as in the picture below.

It implies that your server will be connected to 2 different ports, so it cannot be eth103/1/5 only, as this is one link only.

If you have this topology I would expect your server to be connected to 103/1/5 and, for instance, 103/1/6.

So I wonder if you have instead a FEX Straight-Through (Host vPC) topology with the server dual-homed to 2 different FEXes, as in the picture below.

But here too I don't get how the 2 ports have the same name, as if they were connected to the same FEX.

It would make more sense if your ports were connected to 103/1/5 and, for instance, 104/1/5.

Which topology do you have?

Can you print 'show fex' from both N5k please?

The topology is important to determine the type of LB you can have.

With FEX (2148) in straight through mode you can have up to two ports in a vPC per server with each port terminating on a different FEX. So FEX straight-through = active/active on server.

With FEX in Active/Active topology you cannot have host vPC. You can, however, run the servers in Active/Standby or TLB (Transmit Load Balancing) configurations.

If your FEX is Active/Active but you configure vPC mode on the server you might have MAC flapping between VPCx and VPC peer-link on the Nexus.

Riccardo

mahbvh · ‎01-11-2012

Riccardo,

I'm sorry I was not clear enough: my 2232PP FEXes are attached in a straight-through topology, and the port numbers are the same on both sides because the FEX numbers are the same on both sides:

N5K1# show fex

  FEX       FEX            FEX       FEX
Number    Description    State     Model               Serial
------------------------------------------------------------------------
101       FEX0101        Online    N2K-C2232PP-10GE    JAF1443AFNB
102       FEX0102        Online    N2K-C2232PP-10GE    JAF1444BTBA
103       FEX0103        Online    N2K-C2232PP-10GE    JAF1444BSSR
104       FEX0104        Online    N2K-C2232PP-10GE    JAF1444BTAD
105       FEX0105        Online    N2K-C2248TP-1GE     JAF1523DEPG
106       FEX0106        Online    N2K-C2248TP-1GE     JAF1523DECR

N5K2# show FEX

  FEX       FEX            FEX       FEX
Number    Description    State     Model               Serial
------------------------------------------------------------------------
101       FEX0101        Online    N2K-C2232PP-10GE    JAF1443AFHR
102       FEX0102        Online    N2K-C2232PP-10GE    JAF1444BSKR
103       FEX0103        Online    N2K-C2232PP-10GE    JAF1443AFNP
104       FEX0104        Online    N2K-C2232PP-10GE    JAF1443AFKD
105       FEX0105        Online    N2K-C2248TP-1GE     JAF1523CPBN
106       FEX0106        Online    N2K-C2248TP-1GE     JAF1523DETE

AFAIK, it's best practice to have the same numbers on both sides. This makes the N5K configuration much simpler as it's the same on both N5K.
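If the FEX numbering is meant to mirror across the two N5Ks, that can be checked from the two `show fex` outputs. A minimal sketch, assuming each row is five whitespace-separated fields as in the output pasted in this post (a description containing spaces would not parse); `parse_show_fex` and `fex_mismatches` are hypothetical names.

```python
import re

# One data row: number, description, state, model, serial (whitespace separated).
FEX_ROW_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$")

def parse_show_fex(text):
    """Return {fex_number: (description, state, model, serial)}."""
    rows = {}
    for line in text.splitlines():
        m = FEX_ROW_RE.match(line)
        if m:  # header and separator lines do not start with a number
            number, desc, state, model, serial = m.groups()
            rows[int(number)] = (desc, state, model, serial)
    return rows

def fex_mismatches(a, b):
    """FEX numbers missing on one side or with a different model."""
    return sorted(
        n for n in set(a) | set(b)
        if n not in a or n not in b or a[n][2] != b[n][2]
    )

n5k1 = parse_show_fex(
    "Number Description State Model Serial\n"
    "103 FEX0103 Online N2K-C2232PP-10GE JAF1444BSSR\n"
    "105 FEX0105 Online N2K-C2248TP-1GE JAF1523DEPG\n"
)
n5k2 = parse_show_fex(
    "Number Description State Model Serial\n"
    "103 FEX0103 Online N2K-C2232PP-10GE JAF1443AFNP\n"
    "105 FEX0105 Online N2K-C2248TP-1GE JAF1523CPBN\n"
)
print(fex_mismatches(n5k1, n5k2))
```

Serial numbers are deliberately ignored, since each switch has its own physical FEX behind the same number.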

NB: As of a recent NX-OS release, active/active dual-homed servers are now supported on the Dual-Homed FEX topology, even when plugged into 2 different FEXes.

Vincent.

rsimoni · ‎01-11-2012

Hi Vincent,

OK, I see, thanks for clarifying. Without a proper understanding of the topology it is quite difficult to determine why a given behavior occurs. You don't have vPCs between the N5Ks and the FEXes, right? I guess you really have the second topology I posted, correct?

Also, which release do you run on the N5Ks?

A few questions about the flapping messages:

1) Is the flapping MAC always flapping between host vPC and vPC link or do you see it flapping between other ports too?

2) How often do you see a given MAC flapping during the day?

3) Do you see the flapping on one N5k only or on both of them?

4) Do you have orphan ports on any N5k in the flapping vlan? If yes where? Also is the orphan port connected to the same server by any chance?

I was just trying to think of the possible cases where the flapping is between the vPC peer link and the host vPC and, considering the way MACs are learned in the vPC implementation, I can only imagine it happening if that MAC comes from some orphan port in the same VLAN.

Then, just to be sure we are on the same page, can you print, FROM both switches, the configuration of the host vPC (31), the vPC peer link (I have it already but I would like everything in one shot) and the link (or channel) between the N5Ks and the FEXes.

thanks,

Riccardo

mahbvh · ‎01-12-2012

Hello Riccardo,

Riccardo Simoni wrote:

You don't have VPCs between n5k and FEXs right? I guess you really have the second topology I posted, correct?

Also, which release you run on N5k ?

That's right, no vPC for attaching the FEX, second topology.

We are running version 5.0(3)N1(1a)

Riccardo Simoni wrote:

1) Is the flapping MAC always flapping between host vPC and vPC link or do you see it flapping between other ports too?

2) How often do you see a given MAC flapping during the day?

3) Do you see the flapping on one N5k only or on both of them?

4) Do you have orphan ports on any N5k in the flapping vlan? If yes where? Also is the orphan port connected to the same server by any chance?

1) Always between Host vPC and vPC link
2) Up to 3000/hour
3) The message only shows on the 5K1 which is the Primary vPC, STP root for VLAN 395 and the one seeing the VLAN HSRP default gateway (so all packets on 5K2 trying to reach the router on this VLAN must go through the vPC peer link). Usually intra-VLAN packets since the only orphaned ports are the one connected to the Cat4500 routers.
4) Each 5K has its Po1000 orphaned to its Cat4500 to reach the core. NB: VLAN395 is blocked between both Cat4500. Po1000 is the only orphaned port on each N5K.
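A figure like "up to 3000/hour" can be reproduced from the syslog by bucketing the notifications per hour. A rough sketch, assuming every line carries a timestamp in the `2011 Nov 20 07:07:08` form seen in this thread and that month names are in English; `moves_per_hour` is a made-up name.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp in the "YYYY Mon DD HH:MM:SS" form, followed later by the message tag.
STAMP_RE = re.compile(
    r"(\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}).*%FWM-6-MAC_MOVE_NOTIFICATION"
)

def moves_per_hour(lines):
    """Count MAC move notifications per clock hour."""
    per_hour = Counter()
    for line in lines:
        m = STAMP_RE.search(line)
        if m is None:
            continue  # other message types are ignored
        normalised = " ".join(m.group(1).split())  # collapse padded day numbers
        stamp = datetime.strptime(normalised, "%Y %b %d %H:%M:%S")
        per_hour[stamp.replace(minute=0, second=0)] += 1
    return per_hour

log = [
    "2011 Nov 20 07:07:08 CET: %FWM-6-MAC_MOVE_NOTIFICATION: Host 0050.5693.0416 "
    "in vlan 395 is flapping between port Po100 and port Po31",
    "2011 Nov 20 07:59:59 CET: %FWM-6-MAC_MOVE_NOTIFICATION: Host 0050.5693.0416 "
    "in vlan 395 is flapping between port Po100 and port Po31",
    "2011 Nov 20 08:00:01 CET: %FWM-6-MAC_MOVE_NOTIFICATION: Host 0050.5693.0416 "
    "in vlan 395 is flapping between port Po100 and port Po31",
]
for hour, n in sorted(moves_per_hour(log).items()):
    print(hour, n)
```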

can you print, FROM both switches, the configuration of the host vPC (31), the vpc peer link (I have it already but I would like evrything in one shot) and the link (or channel) between n5k and FEX's.

You'll find stripped down configurations attached with everything relevant to the host (Po31), corresponding FEX (Po103), vPC peer link (Po100) and uplink to the core (Po1000)

Thanks for your help,

Vincent.

rsimoni · ‎01-12-2012

Vincent,

I'm too busy today.

Will try to continue with your issue tomorrow.

meanwhile let's see if somebody else can add some other idea.

Riccardo

rsimoni · ‎01-13-2012

Vincent,

The config looks ok to me.

Can you confirm that your ESXi host does not have another NIC connected to some other switch in VLAN 395, sending traffic from the same MAC through that link?

I am asking because, in my opinion, you see this type of flapping for one of the following causes (I cannot think of others).

1- Traffic in VLAN 395 from that MAC reaches the core through a non-vPC link and then gets flooded back from N5k2 onto its peer link. In this case N5k1 will re-learn the MAC from the vPC peer-link port-channel.

2- The vPC peer-link has a faulty link causing issues with CFS (the protocol which takes care of MAC learning and synchronization via the vPC peer-link).

3- A software issue on 5k1, which erroneously tries to learn the MAC on the vPC peer-link.

4- Another host in VLAN 395 using the same MAC address.

I guess that after verifying that the ESXi host is not connected to anything other than the FEXes we should stop here and continue in a TAC case. There are lots of other things to check (l2fm logs to see the reason for MAC learning on N5k1, for instance) in multiple internal logs which can help identify the root cause.

I think we are already far beyond the depth a CSC thread should go to.

Riccardo

mahbvh · ‎01-13-2012

Hello Riccardo,

The ESXi host has only 2 NICs on VLAN395, and they are both attached to this vPC. Besides, the problem occurred on 2 other ESXi hosts running the same VDI payload.

Regarding your 4 possible causes :

1- That's the first thing I could think of when I noticed the problem. The LAN core has been checked and re-checked. Besides, the error message would also appear on the other N5K and most likely on the core switches. Furthermore, this would happen for all ESXi hosts with VMs attached to VLAN 395, which is not the case. This is why I started to suspect some bug in the vPC area.
2- I checked this too. The 4 links have not flapped a single time in the last 43 weeks, from the standpoint of both N5Ks.
3- That's the only thing I can think of.
4- Very unlikely. First, because MAC addresses are uniquely assigned to the VMs by vSphere from a private MAC range (and I already checked in vCenter that all MACs seen on virtual switches were unique). Second, many MAC addresses exhibit the same problem.

Thanks for the time you dedicated in this.

So you think we should turn this thread into a TAC case?

Cheers,

Vincent.

rsimoni · ‎01-13-2012

So you think we should convert this thread in a TAC case ?

I do. If you let me know the SR number I will have a look at how it ends up, as now I am curious.

Maybe it is a trivial thing that we are completely overlooking (it would not be the first time); however, a TAC case where an engineer can spend some time carefully checking show tech and other logs is the way to go.

Take care

Riccardo

mahbvh · ‎01-16-2012

Hi Riccardo,

FYI TAC case 620366529 was opened but not by myself so I can't follow it directly.

I'll post there when I know the happy ending !

Thanks again for your help on this.

Cheers,

Vincent.


%FWM-6-MAC_MOVE_NOTIFICATION: MAC flapping between vPC host port and vPC peer-link