Optimal mls fast aging settings for Netflow TCAM cache

cmorledge
Level 1

I have a Cisco 7606 and several Cisco 6509s with Sup 720 3BXLs (along with the compatible 3BXL distributed forwarding cards on my other line cards). I am running into some resource problems whereby the TCAMs will get overrun at various times. I'll see stuff like this:

%EARL_NETFLOW-DFC1-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [95%]

I am graphing L3 learned flow failures via SNMP (cseL3FlowLearnFailures), and I am seeing that a lot of flows are getting kicked over to the 6509/7606 CPU when the TCAM resources get exhausted.
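For anyone graphing the same counter, here is a minimal sketch of turning raw polls into a failures-per-second rate. It assumes cseL3FlowLearnFailures behaves as a 32-bit wrapping counter; the function name and the sample values are made up for illustration.

```python
from typing import List, Tuple

# Assumption: the polled object is a 32-bit SNMP counter that wraps at 2**32.
COUNTER32_MODULUS = 2 ** 32


def failure_rates(samples: List[Tuple[float, int]]) -> List[float]:
    """Turn (timestamp_seconds, counter_value) polls into failures per second."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order polls
        # The modulo absorbs a single counter wrap between two polls.
        delta = (c1 - c0) % COUNTER32_MODULUS
        rates.append(delta / elapsed)
    return rates


# Hypothetical polls taken 60 seconds apart; the counter wraps before the last one.
polls = [(0.0, 4294967000), (60.0, 4294967200), (120.0, 304)]
print(failure_rates(polls))
```

Note that a router reload also resets the counter, and this sketch would misread that as a wrap, so a real poller should check sysUpTime as well.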

When this happens, the DNS infrastructure on our campus gets really mad. I am assuming that punting new flows to the Cisco CPU is causing some performance issues, since our beefy DNS infrastructure will start pounding the network with more and more DNS requests as there are more and more outstanding recursive queries going off-campus. So it looks like our DNS infrastructure is only making matters worse for our Cisco routers by demanding more flow resources just when the routers need to start purging more flow entries in the TCAMs to make room for new entries!

Our 7606 sits on our campus perimeter, so it bears the brunt of the load. Sometimes the router will even reboot as the NDE process sucks up the CPU:

%SYS-DFC6-3-CPUHOG: Task is running for (2000)msecs, more than (2000)msecs (15/8),process = NDE - IPV4.

So, I am trying to figure out a way to tune the ability of the router to be more efficient when handling TCAM resources. For now, I have left the mls normal and long aging timers at their defaults of 300 and 1920 seconds respectively, with no packet threshold.

I am focusing on the fast aging timer. When I first changed the default setting to 128 seconds with a 100 packet threshold, things did get better. However, it still isn't good enough (several router crashes).

My requirements are that I do full flow collection, with NO sampling, and no aggregation.

Given that, are there any recommendations for setting the mls fast aging timer to help me better deal with my DNS issues without unnecessarily overloading the export process and the downstream collectors with too many new flow records per second?

Here is what I am trying as of today:

sh mls netflow aging

             enable timeout  packet threshold
             ------ -------  ----------------
normal aging true       300        N/A
fast aging   true        30         16
long aging   true      1920        N/A
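If you want to track these settings across several chassis, the output above is easy to parse. The following is only a sketch: the row layout is taken from the output shown in this post, while the function name, the AgingRow type, and the assumption that a disabled timer prints "false" are mine.

```python
import re
from typing import Dict, NamedTuple, Optional


class AgingRow(NamedTuple):
    enabled: bool
    timeout_s: int
    packet_threshold: Optional[int]  # None when the column shows N/A


# Matches rows such as "fast aging   true   30   16".
ROW_RE = re.compile(
    r"^\s*(normal|fast|long)\s+aging\s+(true|false)\s+(\d+)\s+(\S+)\s*$"
)


def parse_mls_aging(text: str) -> Dict[str, AgingRow]:
    """Parse 'show mls netflow aging' text into a dict keyed by timer name."""
    rows = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if not match:
            continue  # header, separator, or blank line
        name, enabled, timeout, threshold = match.groups()
        rows[name] = AgingRow(
            enabled=(enabled == "true"),
            timeout_s=int(timeout),
            packet_threshold=int(threshold) if threshold.isdigit() else None,
        )
    return rows


SAMPLE = """\
             enable timeout  packet threshold
             ------ -------  ----------------
normal aging true   300      N/A
fast aging   true   30       16
long aging   true   1920     N/A
"""

for name, row in parse_mls_aging(SAMPLE).items():
    print(name, row)
```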

I am running SXH3 on the 6509s and SRC1 on the 7606.

Thanks.

Clarke Morledge

College of William and Mary


4 Replies

Jan Nejman
Level 3

Hello,

I tried tuning the aging timers, with partial success. We have several routers with 10GE interfaces, and it is NOT possible to tune them fully. You can only optimize aging to get the maximum out of the TCAM, and several flows can still be missed. My results are in this table:

http://netflow.cesnet.cz/mls_aging.xls

I recommend you set fast aging to a value between 16 and 20 with a threshold of 100, normal aging to 60, and long aging to 300. But be careful when modifying the aging timers (watch your switching processor performance; the router/MSFC is usually OK).

Please, let me know if you want to know more details about tests and/or tuning.

Kind regards,

Jan Nejman

Caligare, Co.

http://www.caligare.com/

Jan,

I actually gleaned a lot of information already from your Caligare website. Very helpful.

I am currently setting the fast aging timer to 30 seconds with a 16 packet threshold. It made a significant difference, and I haven't had a problem with my router going belly up due to TCAM overflows. Most of my short-lived UDP flows are DNS, so 16 seemed like a good threshold to me.
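To illustrate why the threshold matters, here is a simplified model of the fast aging check. It assumes an entry is purged when it has seen fewer packets than the threshold by the time the fast aging timer expires; the exact comparison the hardware uses is not stated in this thread, and the flow labels and packet counts are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flow:
    label: str
    packets_in_window: int  # packets seen before the fast aging timer expired


def fast_aged(flows: List[Flow], packet_threshold: int) -> List[str]:
    """Return labels of flows the simplified fast aging model would purge."""
    # Assumption: strictly fewer packets than the threshold means purge.
    return [f.label for f in flows if f.packets_in_window < packet_threshold]


# Hypothetical traffic mix: one DNS lookup, a short web fetch, a bulk transfer.
flows = [Flow("dns-query", 1), Flow("short-http", 40), Flow("bulk-transfer", 5000)]

for threshold in (16, 100):
    print(threshold, fast_aged(flows, threshold))
```

In this model a higher threshold purges more of the mid-sized flows early, which frees TCAM space sooner but also produces more export records.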

I do wonder if I am a bit too aggressive, though. I do not have any hard evidence yet, but I am concerned that either the router is not exporting all of the flows or my collector is not keeping up. It just looks like I am not getting all of my flow information.

Furthermore, I do not really understand how these TCAMs are getting managed. In some cases, when the TCAM gets full (or near full -- at about 95% or more), within a few seconds the TCAM may drop down to less than 20% full. I find it hard to believe that this is simply because lots of flows are just timing out. I wonder whether or not something gets whacked in the TCAMs in some corner cases where the resource gets full.

Also, I have noticed that the size of the Netflow cache will vary when I change ACLs associated with route maps for Policy-Based Routing. I was never told that ACLs used with PBR share the same TCAM space. So this is puzzling.

Do you have any more insight here?

Thanks.

Clarke Morledge

College of William and Mary

Clarke,

I recommend you increase the threshold to 50 or 100. A higher threshold (or a lower aging time) makes fast aging more aggressive. Watch your switching processor performance (a configuration that is too aggressive may cause problems with overall stability). You can use the following commands to get switch CPU stats:

attach (active PFC)

show proc cpu

exit

Regarding the full TCAM:

I think Cisco uses a poor method for handling a full TCAM. I suspect that when the TCAM is full, the PFC clears all (or most) of the flows from it ;-( That would explain why utilization is only 20% after a TCAM overflow. I have seen this many times (maybe a Cisco engineer can explain it?)

In any case, if you have many connections you will not be able to export all of the flow information. A good command for getting the number of TCAM creation failures is:

show mls netflow table-contention aggregate

And sorry for the delay, I was on a business trip.

Have a nice day,

Jan

Jan,

I'll keep tweaking the threshold as you suggest, but fortunately the CPU appears to be keeping up with the short threshold.

But I do wonder if the flow exporter will still be able to work well if the aging method is configured to be more aggressive. The process may not be CPU bound, but perhaps it won't be able to export all of the flows at the higher flow refresh rate, perhaps due to packet buffer constraints or some other non-CPU resource. Have you had any experience with that, and what would you look for on the router to determine where the problem might be?
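As a rough way to size that export load, here is a back-of-the-envelope sketch. It assumes NetFlow version 5 export (24-byte header, 48-byte records, at most 30 records per datagram) with every datagram completely filled; the thread does not say which export version is in use, and the flow rate below is purely hypothetical.

```python
from typing import Tuple

# NetFlow v5 export format constants (assumption: v5 is the export version).
V5_HEADER_BYTES = 24
V5_RECORD_BYTES = 48
V5_MAX_RECORDS_PER_DATAGRAM = 30


def export_load(flows_expired_per_sec: int) -> Tuple[int, int]:
    """Return (datagrams_per_sec, payload_bytes_per_sec) for a given expiry rate."""
    # Ceiling division: a partly filled last datagram still has to be sent.
    datagrams = -(-flows_expired_per_sec // V5_MAX_RECORDS_PER_DATAGRAM)
    payload = datagrams * V5_HEADER_BYTES + flows_expired_per_sec * V5_RECORD_BYTES
    return datagrams, payload


# Hypothetical rate; substitute what the router actually ages out per second.
datagrams, payload = export_load(20000)
print(datagrams, "datagrams/s,", payload * 8 / 1e6, "Mbit/s of NetFlow payload")
```

The figure excludes UDP/IP and Ethernet overhead, so it is a lower bound on what the collector link has to carry.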

I have noticed that my collector is not always reporting all of the flow information being exported by the router, but it appears that my collector is keeping up (though I could be wrong on that). I just want to figure out if I can safely rule out anything else on the router.

As to the optimal aging timer, I wish there was a way to compare how many flows are getting aged out by the normal aging timer, the fast aging timer, the long aging timer, TCP RST/FIN termination, and the TCAM being full. Do you have any clue as to whether you can get a statistical breakdown of the different flow aging mechanisms?

Regarding the TCAM flushing mechanism when the TCAM gets full, I am glad to know that I'm not the only one to have seen that type of behavior.

Thanks.

Clarke
