Unicast Flooding on Catalyst 6509s

jmspiers2006 · 11-11-2011

Hi all,

Yesterday I discovered quite a bit of unicast flooding going on in one of our data centers. This particular data center is host to over 1,000 physical test servers with several thousand VMs, plus F5s, SAN, and several hundred switches, so I expect it to be pretty chatty. However, the issue seems to be that both of our core 6509s are flooding data out of some, but not all, of their ports. Running a packet sniff on a server attached to one of the cores picked up about 100 MB (capital B) of data per minute, most of which should have been limited to the devices talking to each other. Instead, it was being flooded out.
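For a sense of scale, the capture volume above can be converted to a line rate. This is a minimal sketch that assumes 1 MB means 10^6 bytes and that the traffic was spread evenly over the minute; the function name is only illustrative.

```python
def mb_per_minute_to_mbps(megabytes_per_minute: float) -> float:
    """Convert megabytes per minute to megabits per second.

    Assumes decimal megabytes (1 MB = 1,000,000 bytes).
    """
    bits_per_minute = megabytes_per_minute * 1_000_000 * 8
    return bits_per_minute / 60 / 1_000_000


rate = mb_per_minute_to_mbps(100)
print(f"{rate:.1f} Mbit/s")  # 13.3 Mbit/s
```

That is a steady average of roughly 13 Mbit/s of traffic arriving at a host that should not be receiving it.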

The topology is:

Edge 6509-1 (FWSM)           Edge 6509-2 (FWSM)
        |                            |
   Core 6509-1                  Core 6509-2
        |                            |
3750 Stack (Top TOR)         3750 Stack (Bottom TOR)
        |                            |
     Servers                      Servers

All 4 of the 6509s are running HSRP and Edge 6509-1 is currently the spanning-tree root for all MST regions.

That's the basic physical topology. For layer 3, we're using VRFs and a couple of 3750s for some minor routing.

I noticed the unicast flooding when I came across one of the 3750s that had an almost full CAM table. I've spent several hours on it, and so far I haven't found a reason.

Here's where I'm at so far:

1. Asymmetrical routing - Not an issue. ARP timeout is 4 hours and MAC aging-time is 4 hours 10 seconds.

2. TCNs - Not an issue. Using portfast on most servers. I'm sure a couple have been missed, but I still only see one TCN every several hours.

3. CAM table full - Checked all switches several times; none of the tables are overflowing. Some of the 3750s are close due to this issue, but the 6509s are holding fewer than 6,500 MACs.
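The reasoning behind check 1 is that flooding from asymmetric routing becomes possible when a MAC entry can age out before the matching ARP entry does. A minimal sketch of that comparison, using only the two timer values quoted in the list (the function name is hypothetical, and it assumes both timers are on the same device):

```python
def flood_window_seconds(arp_timeout_s: int, mac_aging_s: int) -> int:
    """Seconds during which an ARP entry can outlive its MAC entry.

    A result of 0 means the MAC entry lasts at least as long as the ARP entry.
    """
    return max(0, arp_timeout_s - mac_aging_s)


ARP_TIMEOUT_S = 4 * 3600        # 4 hours, from the list above
MAC_AGING_S = 4 * 3600 + 10     # 4 hours 10 seconds, from the list above

print(flood_window_seconds(ARP_TIMEOUT_S, MAC_AGING_S))  # 0
```

With these values the window is zero, which is consistent with ruling out asymmetric routing on the switches where those timers apply.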

So I'm curious: Does anyone have any advice on other things to check? On Monday I'm going to open a TAC case with Cisco if it's still unresolved, but I figured I'd check here first. Any advice is appreciated.

IOS on 6509s: 12.2(33)SXI3

core1#show module

Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
  1   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX
  2   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX
  3   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX
  4   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX
  5    2  Supervisor Engine 720 (Active)         WS-SUP720-3B
  6   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX
  7   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX
  9    8  CEF720 8 port 10GE with DFC            WS-X6708-10GE

The Edge 6509s are the same except module 6 is a FWSM.

Thanks,

js

glen.grant · 11-12-2011

You can see if this doc sheds any light, but that is going to be tough to figure out with that much gear attached.

http://www.cisco.com/en/US/products/hw/switches/ps700/products_tech_note09186a00801d0808.shtml

jmspiers2006 · 11-13-2011

Thanks for the reply. I went through that document last week and none of the suggestions in it worked. However, after I wrote my original post I took a different tack in troubleshooting. I was assuming that there was flooding going on because of the number of MACs that the TORs know about AND the rate at which they're learning them. For example, if you clear the dynamic MACs on one of the TORs, it will learn about 1,000 MACs within 2-3 seconds. After that it learns them at a rate of about 50-100 per second until it has learned 5,000-6,000. The TORs have a 300-second MAC aging-time, so that means within 300 seconds they are learning almost as many MACs as the 6509s know about. I know we have a lot of chatter, but it shouldn't be that much, so I was assuming unicast flooding.
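The learning figures above can be turned into a rough time-to-fill estimate. This is a back-of-the-envelope sketch that assumes a constant learning rate after the initial burst; the function and variable names are only illustrative.

```python
def seconds_to_learn(target_macs: int, burst_macs: int,
                     burst_seconds: float, rate_per_second: float) -> float:
    """Estimate seconds to reach target_macs after clearing the table.

    Models an initial burst followed by a constant learning rate.
    """
    remaining = max(0, target_macs - burst_macs)
    return burst_seconds + remaining / rate_per_second


AGING_TIME_S = 300  # MAC aging-time on the TORs, from the post

# Best case: smaller target, shorter burst, faster rate.
fastest = seconds_to_learn(5000, 1000, 2, 100)
# Worst case: larger target, longer burst, slower rate.
slowest = seconds_to_learn(6000, 1000, 3, 50)

print(fastest, slowest)        # 42.0 103.0
print(slowest < AGING_TIME_S)  # True
```

Even in the slow case the table refills in under two minutes, well inside one 300-second aging interval, so the TORs are seeing frames from most of those source MACs continuously rather than occasionally.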

What I did Friday, though, was start running brief 10-second packet sniffs and analyzing the data. What I discovered was that there was flooding going on, but it's not the fault of the Ciscos (at least from what I detected in my limited sniffs). A significant portion of the flooded traffic was coming from frames destined for Microsoft Network Load Balancers. I just took this job a few months ago, so I have no idea how many clusters are out there and what they're doing, but since this test environment is primarily Windows-based I'm guessing there are a lot. I'm going to talk to some folks on Monday and see what we need to do to segment the clusters, and then see if that fixes the problem.
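One way to size the NLB share of a capture is to bucket frames by destination MAC. The sketch below relies on the default NLB cluster MAC prefixes (02-BF for unicast mode, 03-BF for multicast mode); clusters configured with a custom MAC would not match, and the sample addresses are made up for illustration. In unicast mode NLB masks the cluster MAC as a source address, so switches never learn it and flood every frame sent to it.

```python
from collections import Counter


def classify_destination(mac: str) -> str:
    """Bucket a destination MAC given as colon, dash, or Cisco dotted notation."""
    octets = bytes.fromhex(mac.replace(":", "").replace("-", "").replace(".", ""))
    if len(octets) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    if octets == b"\xff" * 6:
        return "broadcast"
    if octets[:2] == b"\x02\xbf":
        return "nlb-unicast"      # default NLB unicast-mode cluster MAC
    if octets[:2] == b"\x03\xbf":
        return "nlb-multicast"    # default NLB multicast-mode cluster MAC
    if octets[0] & 0x01:
        return "multicast"        # group bit set in the first octet
    return "unicast"


# Hypothetical destination MACs, as they might be exported from a capture.
sample = ["02:bf:0a:01:01:64", "03-BF-0A-01-01-65",
          "ff:ff:ff:ff:ff:ff", "0011.2233.4455"]
counts = Counter(classify_destination(m) for m in sample)
print(counts["nlb-unicast"], counts["nlb-multicast"])  # 1 1
```

Running the same classifier over the destination column of a real capture gives a quick split between NLB traffic and other unknown-unicast frames.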

The one thing that I still don't understand is why some TORs know more MACs than others. Some are holding 100 MACs in their CAM table; others are holding 1,000, 3,000, or 6,000. They're all connected to the same cores and we're not restricting VLANs between them and the 6509s, so you would think they would all know roughly the same number.

Looks like I have a lot of work left to do on this one...

Thanks,

- js