Nexus 5000 log - FWM-2-STM_LOOP_DETECT

darkid123
Level 1

Hi everyone,

I'm having trouble finding any information on what these logs could mean:

2011 Oct 27 16:17:48 XALBCVMNX01 %FWM-2-STM_LOOP_DETECT: Loops detected in the network among ports Po3 and Po4 vlan 100 - Disabling dynamic learn notifications for 180 seconds

2011 Oct 27 16:18:11 XALBCVMNX01 %KERN-3-SYSTEM_MSG: SSE call for cmd = 3 failed. rc = -1076428946[bfd6ff6eH] - kernel

2011 Oct 27 16:20:48 XALBCVMNX01 last message repeated 5 times

2011 Oct 27 16:20:48 XALBCVMNX01 %FWM-2-STM_LEARNING_RE_ENABLE: Re enabling dynamic learning on all interfaces

2011 Oct 27 16:28:11 XALBCVMNX01 %KERN-3-SYSTEM_MSG: SSE call for cmd = 3 failed. rc = -1076428946[bfd6ff6eH] - kernel

Would this cause any issues, such as downtime or performance problems?

I have two Nexus 5000s configured in vPC, and Po3 and Po4 connect to an ESX host (VMware). These logs are showing up on pretty much all my switches that have the same configuration. Po3 and Po4 are configured as edge trunk ports; see the config below for Po3 (Po4 is configured the same way).

Ethernet1/3 on Switch 1 is configured as a port channel (VPC) with E1/3 on Switch 2. Same thing for E1/4 on Switch 1 in a VPC with E1/4 on Switch 2.

All these ports go to 1 ESX host with 4 CNA ports.

interface port-channel3

  switchport mode trunk

  vpc 3

  spanning-tree port type edge trunk

interface Ethernet1/3

  switchport mode trunk

  channel-group 3

XALBCVMNX01# sh int po 3

port-channel3 is up

vPC Status: Up, vPC number: 3

  Hardware: Port-Channel, address: 0005.9b73.41ca (bia 0005.9b73.41ca)

  MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

     reliability 255/255, txload 1/255, rxload 1/255

  Encapsulation ARPA

  Port mode is trunk

  full-duplex, 10 Gb/s

  Beacon is turned off

  Input flow-control is off, output flow-control is off

  Switchport monitor is off

  Members in this channel: Eth1/3

  Last clearing of "show interface" counters never

  30 seconds input rate 8725464 bits/sec, 792 packets/sec

  30 seconds output rate 8766648 bits/sec, 1089 packets/sec

  Load-Interval #2: 5 minute (300 seconds)

    input rate 6.97 Mbps, 594 pps; output rate 3.38 Mbps, 594 pps

  RX

    17265037196 unicast packets  826703 multicast packets  7425410 broadcast packets

    17273289309 input packets  22486323485032 bytes

    11186880177 jumbo packets  0 storm suppression packets

    0 runts  0 giants  0 CRC  0 no buffer

    0 input error  0 short frame  0 overrun   0 underrun  0 ignored

    0 watchdog  0 bad etype drop  0 bad proto drop  0 if down drop

    0 input with dribble  0 input discard

    0 Rx pause

  TX

    16836762780 unicast packets  181757250 multicast packets  12933897 broadcast packets

    17031453927 output packets  10978157509678 bytes

    4582723828 jumbo packets

    0 output errors  0 collision  0 deferred  0 late collision

    0 lost carrier  0 no carrier  0 babble

    97689001 Tx pause

  2 interface resets

XALBCVMNX01# sh vpc 3

vPC status

----------------------------------------------------------------------------

id     Port        Status Consistency Reason                     Active vlans

------ ----------- ------ ----------- -------------------------- -----------

3      Po3         up     success     success                    1,99-100,500

XALBCVMNX01# sh int vfc 3

vfc3 is up

    Bound interface is port-channel3

    FCF priority is 128

    Hardware is Virtual Fibre Channel

    Port WWN is 20:02:00:05:9b:73:41:ff

    Admin port mode is F, trunk mode is on

    snmp link state traps are enabled

    Port mode is F, FCID is 0xbd0004

    Port vsan is 500

    5 minute input rate 3479256 bits/sec, 434907 bytes/sec, 99 frames/sec

    5 minute output rate 65736 bits/sec, 8217 bytes/sec, 29 frames/sec

      3275825238 frames input, 5289641626032 bytes

        0 discards, 0 errors

      5554313000 frames output, 9557291820036 bytes

        0 discards, 0 errors

    Interface last changed at Sun Feb  6 08:08:43 2011

Any ideas what could be causing these logs?

Thanks for the help.

12 Replies

rtjensen4
Level 4

I had a similar issue on my N5Ks... This does in fact impact performance. After you get the all-clear message "Re enabling dynamic learning on all interfaces", according to some docs I read on Cisco's site, the switch does a MAC table FLUSH, meaning it has to relearn all MAC addresses. While it's doing this, traffic is flooded as if it were broadcast.

Check the config on the ESX host and make 100% certain you have the correct ports in the EtherChannels. If they are correct, check your load-balancing scheme; the sketch below shows how to verify both on the switch side.
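
A quick way to check this on the Nexus side (a minimal sketch; the interface and channel-group numbers are the ones in this thread, so adjust for your own setup):

! confirm which physical ports are bundled in each channel group
show port-channel summary
! display the hash fields the switch uses to pick a member link
show port-channel load-balance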

I had this problem with an IBM AIX Virtualization box trying to do Load-balancing. The Hypervisor was configured for redundancy, but the VM was configured for Active/Active etherchannel... that setup caused all sorts of problems on my network and I had the same behavior until I disabled one of the switchports going to the Hypervisor.

HTH

mahbvh
Level 1

Hi,

We have sort of the same problem here, except that in my case the MACs are flip-flopping between one vPC member and the vPC peer-link, which is strange because the 5000 should not complain about seeing the same MAC on both sides of the vPC...

Have you had any luck in solving your issue? In your case it sounds like a mismatch between the ESX physical ports and their mappings to the virtual switch.

Cheers,

Vincent.

Hi,

Sorry for taking forever to reply, but in the end it turns out this looks like normal behavior.

I had TAC on a webex and they couldn't see why this was showing up in the logs. Combined with the fact that I had no issues whatsoever with this environment, it doesn't seem to be affecting anything so I left it at that for now.

Thanks again for replying.

Hi,

Same here, after a peer keepalive link failure (peer timeout) of one N5K lasting 5 seconds.

The impact was really huge.

The loop came and went until we reloaded the N5K on the other end of the peer link.

It's still unclear whether it was a bug or a hardware issue.

According to Cisco, the peer keepalive link shouldn't affect the peer link in that way!

system:    version 5.1(3)N1(1)

If you have gathered any new information, please share it.

Update:

Had a totally different problem, sorry.

One N5K lost connection to every peer for a few seconds (including the peer keepalive link, peer link, vPC member ports, non-vPC member ports...).

The peer link and peer keepalive link never came up again properly!

So the N5K believed it was the only active one and produced a nice loop!

Message edited by Manuel Muetsch

Hi,

FYI, this problem disappeared for me after upgrading to 5.0(3)N1(1c).

More details at this post : https://supportforums.cisco.com/message/3659814#3659814

You're not in the same release train as I am, but you may very well be affected by the same bug, CSCto34674, although the bug report doesn't state whether your version is affected.

Hope this helps,

Vincent.

Hi Mahbvh,

I am on the version you upgraded to, but the problem persists. It is currently not service-affecting.

Oleksandr Nesterov
Cisco Employee

Hi

These log messages mean that some MAC address is flapping between ports Po3 and Po4 in VLAN 100 very quickly, so the switch considers this a network loop and stops learning addresses for some time to protect its control plane.

There can be many reasons for that.

First, try to reconfigure both your port-channels from static to LACP mode, as in the sketch below.
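
A minimal sketch of that change on the Nexus side (assuming the member interface and channel-group numbers from this thread; note the ESX side has to speak LACP as well, which standard vSwitches do not support, so check your vSwitch/dvSwitch capabilities first):

feature lacp
! convert the member port from a static ("mode on") bundle to LACP active
interface Ethernet1/3
  no channel-group 3
  channel-group 3 mode active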

Then check your ESX load-balancing algorithm: if both vPC port-channels are connected to the same ESX host and ESX sends traffic for different destinations through different links, the same source MAC may appear on both port-channels.

Check the following command to see how many MAC moves occur between interfaces:

sh mac address-table notification mac-move
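
If that command returns nothing, MAC move logging may not be turned on yet; a minimal sketch to enable it (assuming an otherwise default config):

configure terminal
  ! record and log MAC addresses that move between interfaces
  mac address-table notification mac-move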

If none of the above helps, you may also need to open a service request with the TAC.

HTH,

Alex

Thanks Alex! ESX load balancing was the solution for us.

Environment:

Multiple Dell M1000e blade chassis with multiple Dell MIO aggregator modules in each chassis, each aggregator module having two 10 Gb interfaces in a port-channel running to a pair of Nexus 3Ks (6.0.2.A6.5).

 

Issue:

Repeatedly logging the two errors below every few minutes.

%FWM-2-STM_LOOP_DETECT:  

Disabling dynamic learning notifications for a period between 120 and 240 seconds on vlan

%FWM-2-STM_LEARNING_RE_ENABLE_VLAN

 

Solution:

Only one of the Dell chassis had the issue: the Nexus 3Ks only logged errors on the two port-channels connected to that specific Dell blade chassis. After looking at the VMware vSwitch configurations, the chassis where the issue originated had a VMware host with different hashing configured on its vSwitch than the other VMware hosts.

After changing that host's vSwitch load balancing from "Route based on IP hash" to "Route based on the originating virtual port ID", the issue has not returned. No more logs or disabled learning. A sketch of the change is below.
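
For reference, a minimal sketch of that change from the ESXi shell (assuming ESXi 5.x or later with a standard vSwitch named vSwitch0; on a distributed switch the same setting lives in vCenter under the port group's teaming policy):

# show the current NIC teaming policy for the vSwitch
esxcli network vswitch standard policy failover get -v vSwitch0
# change the load-balancing hash from IP hash to originating virtual port ID
esxcli network vswitch standard policy failover set -v vSwitch0 -l portid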

dae1
Level 1

ESX VLAN Beacon Probing can cause up-port flooding behavior if the vSwitch loses beacons. This is called 'shotgunning' in VMware's terminology.

When we hooked up our HP blade centers to Nexus, we had occasional events where DRS would vMotion a VM: it would land on a new blade, cause a Nexus LOOP_DETECT, and the VM would go off-net for 180 seconds.

Disabling Beacon Probing on our vSwitch and vDS up-ports seems to have resolved the problem; a sketch of the equivalent CLI change follows.
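
A minimal sketch of that change from the ESXi shell (assuming ESXi 5.x or later and a standard vSwitch named vSwitch0; for a vDS this is set in vCenter on the port group):

# switch network failure detection from beacon probing back to link status only
esxcli network vswitch standard policy failover set -v vSwitch0 -f link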

While this is passing through HP Virtual Connect, the real issue seems to be an interoperability issue between Nexus loop detection and ESX Beacon Probing.

Oleksandr Nesterov
Cisco Employee
Cisco Employee

Dean is right

This is recent behavior of ESX.

You can get more info on the link below:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1005577

HTH,

Alex

Your mileage with this problem will vary depending on your network topology. If you're connecting an ESX server to Nexus on a single port-channel, you probably won't ever see a problem even with Beacon Probes enabled. If you have dual port-channels like we do from the top of the Virtual Connect switch, then Beacon Probing is likely to cause LOOP_DETECT events (see Scenario 3 in http://bizsupport2.austin.hp.com/bc/docs/support/SupportManual/c02656171/c02656171.pdf).

Looking back, we now believe we had been getting this periodic flooding behavior on our old switch plant; we do not think this is new to ESX. We would see sudden jumps in discard events, and we now suspect beaconing was briefly flooding all along. Hooking our blade centers to Nexus introduced new loop-prevention logic and made the flooding more noticeable.

We have a lot of VLANs in our ESX infrastructure (~80). Originally we used Beacon Probing in our old switch plant to make sure higher-level switches were functioning on all VLANs all the way to the router. Nexus changes the nature of that problem, and the probes are no longer as valuable.

What type of loop-prevention config can we apply on the switch to prevent this kind of issue with the VM switch? We are having the same issue with our VMs.
