
Catalyst switches not learning MAC addresses

derek.small
Level 5

I've got a head-scratcher. I have a pretty straightforward customer network. It includes two pairs of Cat 4500Xs in VSS clusters as redundant cores (four switches), and about 10 IDFs using stacks of 2960X or 2960XR switches. All switches have dual 10Gig uplinks to the pair of 4500Xs called Core01, and a single 10Gig uplink to the pair of 4500Xs called Core02. I started noticing excessive drops on user connections when those connections are 100Mbps. I enabled interface history on all ports, and I can see the exact same profile of drops across all ports on a given switch stack. The spikes in drops do not match up to spikes in broadcast/multicast traffic, and don't even correspond to spikes in overall traffic on any of the ports. In fact, the maximum bytes per second when these drops hit is often not even 1Mbps.

 

This seems like microbursts, but since the pattern is the same across all switch ports running at 100Mbps, it also seems like it's flooded traffic. I did some packet captures and, sure enough, I see unicast traffic destined for several systems on ports where I should not see that traffic. A quick check of the MAC table on the switch proved the switch did not have the destination MAC in its MAC table, so the switch was flooding the traffic.

 

The question is, WHY? In captures done on the switch that doesn't have the MAC in its MAC table, I found traffic that has the flooded MAC address as the source address... Why would the switch not add that MAC address to its dynamically learned MAC address table? I checked, and MAC address learning is not disabled on the VLANs or switches in question. The aging time is still the default, 300 seconds. The switches have 200-500 MAC addresses in their MAC tables, and all are showing at least 15000 available (to be learned), so we aren't maxing out the MAC tables.
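
For reference, these are the kinds of checks I mean (the VLAN number and MAC below are just placeholders, not the real ones):

! is learning enabled on the VLAN?
show mac address-table learning vlan 10
! is the aging timer still at the 300-second default?
show mac address-table aging-time vlan 10
! how full is the MAC table?
show mac address-table count
! does the switch have an entry for a specific host?
show mac address-table address aaaa.bbbb.cccc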

Are MAC addresses added using some kind of sampling algorithm for incoming traffic? If so, is there some way to change the timing or parameters for it? From past investigations I know that when switch CPUs get over about 70-80% load they will not flood packets to ALL ports in a VLAN, but the CPU load on all the switches in question is hovering around 50% or lower.
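
For what it's worth, the CPU figures above are just taken from the standard outputs, sampled across the stacks:

show processes cpu sorted 5sec
show processes cpu history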

 

Not forgetting the high drops that got me digging into this: based on the similarity of the drop patterns from port to port, I believe the problem is from broadcast/multicast/flooded packets. And since we aren't seeing high broadcast/multicast levels, and the spikes we do see in broadcast/multicast don't match up to the drops at all, I believe the drops are due to floods. But I don't know why the switches are flooding.

 

Anyone have any ideas?


12 Replies

Reza Sharifi
Hall of Fame

Hi Derek,

 

I've got a head-scratcher. I have a pretty straightforward customer network. It includes two pairs of Cat 4500Xs in VSS clusters as redundant cores (four switches), and about 10 IDFs using stacks of 2960X or 2960XR switches. All switches have dual 10Gig uplinks to the pair of 4500Xs called Core01, and a single 10Gig uplink to the pair of 4500Xs called Core02.

Let me see if I understand this correctly: you are connecting a stack of 2960X or 2960XR switches, using two 10Gig ports, to two different VSS domains?

2960x1----10g----4500x-vss-domain-1

2960x1----10g----4500x-vss-domain-2

and not

2960x1----10g----4500x-vss-domain-1

2960x1----10g----4500x-vss-domain-1

 

So the topology has been in place for several years.  There are two core switches.  Each IDF has a connection to each core switch.  It's a star topology with dual core switches at the center.  Each core switch is actually a pair of 4500X using VSS, so each core is logically a single switch.  So there are two logical core switches, but four physical core switches.  All IDFs have a direct connection to each of the logical core switches.  Like this, only there are about 10 IDFs.

 

                  ------- Core01 (VSS domain 1) -------
(IDF1) 2960X -----|                                   |----- 2960XR (IDF2)
                  ------- Core02 (VSS domain 2) -------

 

The problem is the IDF switches are not learning the MACs of all directly attached nodes, even though I can capture traffic on the IDF switch that clearly shows the node's MAC address as the source in the Ethernet header. It's like the switch just doesn't want to add a MAC entry for some systems. The MAC aging time is the default 300 seconds (5 minutes), so I would expect the switch not to have a MAC entry if the node just doesn't transmit anything for 5 minutes. However, you can ping the node and get a response, and the node's MAC address still does not appear in the MAC table on the IDF switch.

It's hard to say for sure, but it doesn't seem to be more than about 3-6 MACs that the switch is flooding for any length of time. There are other MACs, but usually the flooded packets are just the first few packets in a stream. I'm seeing this across most of the IDF switch stacks. The problem seems to scale with the size of the switch stack and the number of active connections. The two IDFs seeing this problem the most are each stacks of six switches. The smaller stacks are seeing it as well, but it's not impacting users enough for them to report a problem. You can see it in the interface stats of all IDFs, however.

Does anyone know any debug commands that might help show why the switch is not learning some MAC addresses?
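
For reference, MAC-move notifications can at least make the switch log when it sees a MAC bouncing between ports; a sketch, assuming logging is otherwise left at defaults:

configure terminal
 ! log a syslog message whenever a MAC address moves between ports
 mac address-table notification mac-move
 end
! check the notification status and move counters
show mac address-table notification mac-move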

 

 

So no one has anything? Shame....

Hello

Do you have storm control active on the switches for BUM traffic? Are all access ports set to STP PortFast to negate STP learning? If you have multiple STP transitions (faulty/flapping port(s)), then depending on the STP mode, the switches will decrease the aging of their CAM tables and initiate re-learning.

Any utilization issues? 
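
Some quick ways to check those (the VLAN number below is just an example):

! is storm control configured anywhere, and is it suppressing anything?
show storm-control
! are edge ports actually running PortFast, and is STP stable?
show spanning-tree summary
! when and where did the last topology change come from on a given VLAN?
show spanning-tree vlan 10 detail | include ieee|occurr|from|is exec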



Kind Regards
Paul

derek.small
Level 5

All good suggestions, Paul. I am not using storm control. Since the traffic is unicast, storm control wouldn't do anything.

 

We are not seeing any unusual STP activity or port flapping. All edge ports have PortFast enabled, and none are in blocking or learning state, but that question did cause me to carefully review spanning tree on all VLANs and all ports. All VLANs have the same (correct) Port-Ch1 as the path to root, and all edge ports are in the Desg/FWD role/state.

Hello 

“Since the traffic is Unicast, storm control wouldn't do anything.”

 

If you have excess unknown unicast traffic hitting your network, then storm control can negate it, so I'm not sure why you say it wouldn't?
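
As a sketch of what I mean on an access port (the interface and thresholds are purely illustrative and would need tuning to your traffic levels):

interface GigabitEthernet1/0/1
 ! suppress unicast traffic (which includes unknown-unicast floods) above 1% of port
 ! bandwidth, and resume once it falls below 0.5%
 storm-control unicast level 1.00 0.50
 ! send a trap/log message instead of err-disabling the port
 storm-control action trap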

 

 



Kind Regards
Paul

My apologies, Paul. I wasn't aware that unicast was an option for storm-control; I thought it only blocked broadcast and multicast. I will give it a try.

Are you sure it isn't learning them? Perhaps it is aging them out. It depends on your topology whether the 4500 would see them or not. What I mean by that is, if the host physically connected to a 2960 is talking (unicast) to something else on the same 2960, the 4500 wouldn't see those frames. Thus that MAC address would age out of the CAM table in the 4500. That is normal behavior for the CAM table. I have never had to do it, but I would imagine there is a way to change the CAM table aging values.
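
If aging did turn out to be the issue, the timer can be changed per VLAN; roughly like this (the values are just for illustration):

configure terminal
 ! raise MAC aging on VLAN 10 from the 300-second default to 4 hours
 mac address-table aging-time 14400 vlan 10
 end
show mac address-table aging-time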

There is no evidence of the CAM table being overrun. No errors in the log. The current MAC address table is only about 2000-3000 entries during the highest loads. It's also not just one address that is being flooded; it's two or three. If you sort the flooded traffic by number of packets, the top address accounts for about 50% of the flooded traffic, with the second accounting for about 20-25%. Beyond that, the amounts get small enough that they could be attributed to MAC age-out or connection startup.

The problem is I can't identify the source of most of the traffic, because the switch never learns the MAC, even though the MAC appears as a source in thousands of packets.

I'm about to start doing captures of inbound traffic only, on each port on the switch, to try to find it, but it will take a while on a stack of six 48-port switches. I've also given some thought to splitting the stack.
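
The per-port captures are just local SPAN sessions restricted to ingress traffic, along these lines (interface names are placeholders):

! mirror only received (rx) traffic from the suspect access port
monitor session 1 source interface GigabitEthernet1/0/5 rx
! send the copy to the port where the capture machine is plugged in
monitor session 1 destination interface GigabitEthernet1/0/48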

derek.small
Level 5

Actually, the problem is worst on the 2960X switches at the edge. I could see it being the MACs just aging out, except that I see packets in the captures where the source MAC address is the address that is being flooded. I wouldn't be surprised to see a couple of packets continue to be flooded right after that, but it's persistent for several minutes after I see a packet with the flooded MAC as the source. If I check the MAC table while this is happening, I do not see the MAC being learned for some reason. I'm guessing it's some kind of FIB collision or something, but I can't find a way to confirm that, or to address it if that is the case.

 

I've seen this with some server load-balancing techniques.  The server responds to ARPs with a MAC address that the servers never use as a source MAC address.  So the switch never learns the MAC that systems are sending traffic to.  The traffic gets flooded to all the servers (and every freaking other thing on that VLAN) so the servers can pick the connections each one wants to handle. (Yes it's a stupid technique, and is still used by Microsoft today).  

I was suspicious this was the same problem, but the systems being flooded are just PCs.  They aren't running any weird apps or secondary IP addresses or anything. Just Windows 10 in a basic config.

Have you seen any messages about CAM table overflows? If you look at the CAM table, do you see any non-trunk ports with a suspiciously high number of MAC addresses? I ask because one way for a malicious host to gather information is to flood the CAM table so that unknown unicasts are flooded and the malicious host gets to see traffic it would not normally receive. It is also possible that something is just malfunctioning, not being malicious. I saw the behavior you are describing with Windows load balancing long ago. Distilling it down, I think the key question now is whether the switches aren't learning the MAC addresses, or whether the addresses are getting pushed out by a CAM table overflow.
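
Concretely, the sort of checks I mean (the interface is just an example):

! total dynamic entries vs. remaining space in the MAC table
show mac address-table count
! per-port view - an access port with hundreds of MACs on it would be suspicious
show mac address-table interface GigabitEthernet1/0/5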

Accepted Solution

derek.small
Level 5

In case anyone else ever runs into this issue: I finally found the source of the problem. The MAC(s) in question are soft-configured on an IoT device from a certain vendor who should know better (Siemens). This one particular device is configured via a serial connection and a custom app. When you configure it, you set the MAC and IP address statically. It provides a default value which, unless you know any better, the machine tech never changes. So if you ever end up with two or more of these on the network, you have a problem. If you have two or more of them on different VLANs you wouldn't think you'd have a problem, but the Cisco switches still don't like seeing the same MAC learned on different ports on different VLANs and still report a loop, but now the loop is on the uplink port. At least the switch doesn't shut down the uplink port, but it does stop learning the "looped" MAC address.

 

The result is that a switch somewhere on the network will begin reporting a network loop detected for MAC address xxxx.xxxx.xxxx on ports GigA/B/C and GigX/Y/Z. Once that happens, the switch reporting the loop stops trying to learn the MAC address it sees as looping, and it floods the traffic for that device to all ports. The other switches never see that MAC as a source, so they don't learn it either, and they all flood the packets for that device (or those devices) as well.

 

After we found and re-addressed all the devices causing that problem, the issue disappeared.
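
For anyone searching for this later: the tell-tale is the MAC flap notification in the switch logs, which looks roughly like this (the MAC, VLAN, and ports here are anonymized placeholders):

show logging | include MACFLAP
%SW_MATM-4-MACFLAP_NOTIF: Host xxxx.xxxx.xxxx in vlan 20 is flapping between port Gi1/0/5 and port Po1

Once you have the offending MAC, checking show mac address-table address xxxx.xxxx.xxxx on each switch shows which ones have stopped learning it.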