
bhushit17 · ‎02-24-2016

Hi,

I have a 6509-E chassis deployed as a WAN termination router, with only the following links connected to it:

3 port etherchannel connecting to ISP for 3gig uplink (ports Gi1/1-3)
Downlink-1 1gig (Gi2/1)
Downlink-2 1gig (Gi2/3)
Downlink-3 1gig (Gi6/1)

I am now seeing high CPU (touching 100%) during working hours, 10 A.M. to 6 P.M. The following are the outputs from the switch:

#sh mod
Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
1   24 CEF720 24 port 1000mb SFP              WS-X6824-SFP       SAL1816QMG9
2   48 CEF720 48 port 10/100/1000mb Ethernet WS-X6848-GE-TX     SAL1922FUFC
5    5 Supervisor Engine 2T 10GE w/ CTS (Acti VS-SUP2T-10G       SAL1817R0YH
6    5 Supervisor Engine 2T 10GE w/ CTS (Hot) VS-SUP2T-10G       SAL1815PZPK

Mod MAC addresses                       Hw    Fw           Sw           Status
--- ---------------------------------- ------ ------------ ------------ -------
1 18e7.2820.7918 to 18e7.2820.792f   1.0   12.2(18r)S1 15.1(1)SY2   Ok
2 64f6.9df1.a950 to 64f6.9df1.a97f   1.4   12.2(18r)S1 15.1(1)SY2   Ok
5 6c41.6a0c.1fa2 to 6c41.6a0c.1fa9   1.7   12.2(50r)SYS 15.1(1)SY2   Ok
6 503d.e513.df71 to 503d.e513.df78   1.7   12.2(50r)SYS 15.1(1)SY2   Ok

Mod Sub-Module                  Model              Serial       Hw     Status
---- --------------------------- ------------------ ----------- ------- -------
1 Distributed Forwarding Card WS-F6K-DFC4-A      SAL1815QD6L 2.0    Ok
2 Distributed Forwarding Card WS-F6K-DFC4-A      SAL1922FUFC 1.4    Ok
5 Policy Feature Card 4       VS-F6K-PFC4        SAL1815QA90 2.1    Ok
5 CPU Daughterboard           VS-F6K-MSFC5       SAL1816QYZV 2.1    Ok
6 Policy Feature Card 4       VS-F6K-PFC4        SAL1816QJE0 2.1    Ok
6 CPU Daughterboard           VS-F6K-MSFC5       SAL1814PNPC 2.1    Ok

#sh proc cpu sort 5s | ex 0.00
CPU utilization for five seconds: 92%/56%; one minute: 69%; five minutes: 72%
PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
442    12748176   4456882       2860 33.43% 4.84% 3.27%   0 IP NAT Ager
483     1917980   6322018        303 0.47% 0.29% 0.30%   0 XDR receive
783    29055636 10320646       2815 0.39% 0.11% 0.13%   0 NF SE Intr Task
481      145972 18540449          7 0.39% 0.31% 0.34%   0 XDR mcast
189    85928236 167659848        512 0.31% 0.85% 0.93%   0 slcp process
680    10455672   2555992       4090 0.23% 0.37% 0.39%   0 Env Poll
713      445108   6783233         65 0.15% 0.03% 0.01%   0 Port manager per
788     1050192    547783       1917 0.07% 0.05% 0.06%   0 OBFL INTR obfl0
438      804628 12303857         65 0.07% 0.04% 0.05%   0 IP Input
78     8163632   1000189       8162 0.07% 0.13% 0.13%   0 SEA write CF pro
832      346904    786932        440 0.07% 0.06% 0.07%   0 FNF Cache Ager P
353      407600 47383871          8 0.07% 0.05% 0.06%   0 EARL Intr Thrtl

#sh proc cpu his

    6666666666777777777777777888886666666666666666666666666666
    6333322222000003333300000000003333388888999996666699999000
100
90
80                          *****
70 *         ********************     ********************
60 **********************************************************
50 **********************************************************
40 **********************************************************
30 **********************************************************
20 **********************************************************
10 **********************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5
               CPU% per second (last 60 seconds)
                                                     1
    9789797788977889899999999999999899988899999998999088999879
    2826588481865069229999599905988609969799729964799042999186
100    * *    *    * *#*#**** **** **   *** *** **** *** *
90 * * * * *   ** **#####***#******#****####** #**# *** *
80 **##*** ******#*##############*###################*#####**
70 ##########################################################
60 ##########################################################
50 ##########################################################
40 ##########################################################
30 ##########################################################
20 ##########################################################
10 ##########################################################
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5
               CPU% per minute (last 60 minutes)
              * = maximum CPU%   # = average CPU%
    1                       1                 1111111               1 11 1
    0431122122211233699998990331222121122439890000000523321223422335090090
    0687802818879641899427990649462848615405590000000445088500067285090090
100 *                **   ***              * ********               ******
90 *                ********              **********               ******
80 *                ******#*              ***#***##*               *#****
70 #               **#***##*              **#######*               *###*#
60 #               *########              **#######*              **#####
50 #*              *########              **#######**             *######
40 #**             *########*           * **########* *      *   **######
30 #**      ** ***#########** *      ****#########* *** * *******######
20 ##**************##########**************##########*************#######
10 ###*****##*################********###################################
   0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
             0    5    0    5    0    5    0    5    0    5    0    5    0
                   CPU% per hour (last 72 hours)
                  * = maximum CPU%   # = average CPU%

Kindly let me know if I can do something to lower the CPU usage, especially the interrupt percentage.

Regards,

Bhushit

Giuseppe Larosa · ‎02-25-2016

Hello Bhushit,

>> CPU utilization for five seconds: 92%/56%; one minute: 69%; five minutes: 72%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
442 12748176 4456882 2860 33.43% 4.84% 3.27% 0 IP NAT Ager

The high CPU usage at interrupt level is a sign that many traffic flows are being punted to the CPU and switched in software on your C6509 instead of being forwarded in hardware by CEF.
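To make the reading of that header line concrete: in "92%/56%" the first figure is total CPU and the second is the part spent at interrupt level, so the difference is what the listed processes consume. A minimal Python sketch of that arithmetic follows; the function name is my own and not part of any Cisco tool.

```python
import re

def parse_cpu_line(line):
    """Split the 'show processes cpu' header into total, interrupt and process percent."""
    m = re.search(r"five seconds:\s*(\d+)%/(\d+)%", line)
    if m is None:
        raise ValueError("not a 'show processes cpu' header line")
    total, interrupt = int(m.group(1)), int(m.group(2))
    # Process-level load is whatever is left after interrupt-level load.
    return {"total": total, "interrupt": interrupt, "process": total - interrupt}

line = "CPU utilization for five seconds: 92%/56%; one minute: 69%; five minutes: 72%"
print(parse_cpu_line(line))
```

For the output posted above this gives 56% at interrupt level and 36% at process level, and most of that 36% is the IP NAT Ager at 33.43%.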

Refer to the following documents to review your current configuration:

Best practices for IOS C6500

http://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/24330-185.html

see for high cpu troubleshooting

https://supportforums.cisco.com/document/59926/troubleshooting-high-cpu-6500-sup720

(I know you have a Sup2T, but it should be a valid starting point.)

I see that the IP NAT Ager process is taking about 33% of the 5-second CPU, so I wonder whether NAT is being performed in software on your system.

Hope to help

Giuseppe

bhushit17 · ‎02-25-2016

Thanks Giuseppe,

All of my traffic is L3 traffic; that's why the interrupt percentage is this high.

Also, how can I make sure that NAT is performed entirely in hardware?

Regards,

Giuseppe Larosa · ‎02-25-2016

Hello Bhushit17,

>> All of my traffic is L3 traffic that's why this high interrupt percentage.

This is not true: the C6500 is a multilayer switch, so L3 routed traffic should also be forwarded in hardware.

You need to investigate the reasons for so many traffic flows punted to the main cpu using the methods explained in the second link I have provided in my first post on this thread.

I'm afraid that NAT is currently being performed in software, for a reason still to be found, whether in the current configuration or elsewhere.

Hope to help

Giuseppe

bhushit17 · ‎02-25-2016

From what I understand after investigating my config and chassis, I have the following observations:

Almost all of the punted packets are NetFlow packets
Switch is sending ICMP TTL expired messages "171850 time exceeded"
I don't know why the packet below is punted to the CPU:

l2idb NULL, l3idb Gi2/1, routine inband_process_rx_packet, timestamp 15:11:45.716
dbus info: src_vlan 0x3F4(1012), src_indx 0x40(64), len 0x4C(76)
bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x7FA3(32675)
cap1 0, cap2 0
C8020900 03F48400 00400000 4C000000 1E000414 22000004 00000000 7FA31383
destmac 18.9C.5D.5E.C8.C0, srcmac 00.03.B2.56.94.D3, shim ethertype CCF0
earl 8 shim header IS present:
version 0, control 64(0x40), lif 16420(0x4024), mark_enable 1,
feature_index 0, group_id 0(0x0), acos 0(0x0),
ttl 14, dti 4, dti_value 0(0x0)
ethertype 0800
protocol ip: version 0x04, hlen 0x05, tos 0x00, totlen 40, identifier 2450
df 1, mf 0, fo 0, ttl 126, src 10.10.3.93, dst 103.234.162.1
tcp src 62813, dst 22, seq 2472090721, ack 284071500, win 63552 off 5 checksum 0x53D8 ack
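As a rough aid for reading captures like this one, the sketch below pulls the IP and TCP fields out of the last two lines of the dump. The regular expressions and the function name are my own assumptions based on the format shown here, not a documented Cisco format.

```python
import ipaddress
import re

# Two lines copied from the punted-packet capture above.
SAMPLE = """\
df 1, mf 0, fo 0, ttl 126, src 10.10.3.93, dst 103.234.162.1
tcp src 62813, dst 22, seq 2472090721, ack 284071500, win 63552 off 5 checksum 0x53D8 ack
"""

# "ttl N, src ..." only matches the IP header line, not the shim "ttl 14, dti 4" line.
IP_RE = re.compile(r"ttl (\d+), src ([\d.]+), dst ([\d.]+)")
TCP_RE = re.compile(r"tcp src (\d+), dst (\d+)")

def summarize(dump):
    """Return TTL, addresses and ports from an inband capture excerpt."""
    ip = IP_RE.search(dump)
    tcp = TCP_RE.search(dump)
    if ip is None or tcp is None:
        raise ValueError("expected an IPv4 header line and a TCP line")
    src = ipaddress.ip_address(ip.group(2))
    return {
        "ttl": int(ip.group(1)),
        "src": str(src),
        "dst": ip.group(3),
        "sport": int(tcp.group(1)),
        "dport": int(tcp.group(2)),
        "src_is_private": src.is_private,
    }

print(summarize(SAMPLE))
```

For this packet the TTL is 126, so it is not one of the TTL-expired cases, and the source is a private 10.x address heading to a public destination on TCP port 22, which makes it a flow that would need NAT on the way out. That alone does not prove why it was punted.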

Thanks for help !!

Giuseppe Larosa · ‎02-26-2016

Hello Bhushit17,

netflow packets are generated locally so I would expect them to be process switched.

Do you have NetFlow enabled on the links to the ISP?

You should also check NetFlow activity, but Leo is probably right in his post: the NAT Ager process running so high can be a sign that you need an IOS upgrade.

A risk with NetFlow on the C6500 platform is running out of space in the NetFlow cache, but you have a Sup2T.

However, from the show module output I see that you have WS-F6K-DFC4-A and VS-F6K-PFC4. Those are the base (non-XL) versions.

With Sup720 the only good components for connectivity to the internet with BGP full tables were the XL versions (BXL and later CXL) for their extended memory able to allocate CEF entries for all the IP prefixes.

http://www.cisco.com/c/en/us/products/collateral/switches/catalyst-6500-virtual-switching-system-1440/product_data_sheet0900aecd806ed759.html

see also

http://www.cisco.com/c/dam/en/us/products/collateral/switches/catalyst-6500-series-switches/C45_652087_00_catalyst_aag.pdf

It is strange I cannot find a public datasheet for sup2T.

However, from the second link I see a limit of 256K IPv4 routes for SUP2T and PFC4, whereas SUP2T XL and PFC4 XL can support 1M IPv4 routes.

If you are receiving a full BGP table from your upstream provider, you have run out of memory for CEF with your current components and some traffic is process switched for this reason.

A full BGP table is on the order of 512,000 routes nowadays.

The exact number depends on the part of the world where you are, but this is the order.
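The comparison above is simple arithmetic, sketched here in Python. The capacities are the figures I quoted from the link, with 256K taken as 256,000 for simplicity, and the table size is the rough 512,000 estimate; the names are only for illustration.

```python
# Hardware FIB capacity in IPv4 routes, as quoted above (256K treated as 256,000).
FIB_CAPACITY = {"PFC4": 256_000, "PFC4 XL": 1_000_000}

FULL_TABLE = 512_000  # rough size of a full BGP table, per the estimate above

def fits_in_hardware(route_count, capacity):
    """True if every route can get its own hardware CEF entry."""
    return route_count <= capacity

for model, capacity in FIB_CAPACITY.items():
    print(model, "full table fits:", fits_in_hardware(FULL_TABLE, capacity))

# A single default route is far below either limit.
print("default route only fits:", fits_in_hardware(1, FIB_CAPACITY["PFC4"]))
```

So a full table would overflow the non-XL PFC4 roughly twice over, while a default route alone is not a concern.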

Hope to help

Giuseppe

Edit:

I see you have only one upstream provider, connected with a 3GE bundle. However, if it is sending you the full BGP table, you are in trouble as I have explained above.

bhushit17 · ‎02-26-2016

Yes I have netflow enabled on link to ISP

I am only receiving a single route (the default route) from the ISP through BGP, so that isn't a problem. I also haven't fully utilized the CEF route cache.

High utilization seems to be due to NetFlow traffic and the NAT issue with this IOS. I also have an access list matching the traffic to be NATted on the same interface where I have enabled NetFlow, and this may also lead to packets being process switched.

Can a route map instead of an ACL help me here?

Giuseppe Larosa · ‎02-26-2016

Hello Bhushit17,

OK if you are receiving only a default route from ISP you are fine with your HW components.

NetFlow traffic is heavy because of the variety of traffic flows to and from the internet.

If the NetFlow cache is full, the device tries to send more NetFlow accounting packets.

You should have 512K entries in the NetFlow cache.

>> also I have an access list to match traffic to be natted on that same interface I have enabled netflow

For NAT you usually reference the ACL in a global configuration statement like

ip nat inside source list 50 interface po1 overload

where po1 is your port-channel to the ISP and ACL 50 specifies which addresses should be NAT translated.

However, again this type of configuration does not lead to process switching unless you use a log option in the ACL statements.
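If you want a quick way to check a saved configuration for that case, here is a minimal Python sketch that flags ACL entries carrying a log keyword. The two ACL lines are hypothetical examples, not taken from your configuration, and the helper name is my own.

```python
def entries_with_log(acl_lines):
    """Return ACL entries that contain the 'log' or 'log-input' keyword."""
    flagged = []
    for line in acl_lines:
        tokens = line.split()
        # Skip blank lines and remarks, whose free text may mention "log".
        if not tokens or "remark" in tokens[:3]:
            continue
        if "log" in tokens or "log-input" in tokens:
            flagged.append(line.strip())
    return flagged

# Hypothetical ACL 50 entries for illustration only.
acl = [
    "access-list 50 remark inside hosts, see change log",
    "access-list 50 permit 10.10.0.0 0.0.255.255",
    "access-list 50 permit 10.20.0.0 0.0.255.255 log",
]
print(entries_with_log(acl))
```

Any entry it reports is a candidate to review; an empty result suggests the ACL itself is not the reason for the punts.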

A route-map for NAT plays the same role as the ACL, but provides added flexibility in the match conditions, so that you can use, for example, other match criteria rather than only ACLs.

Hope to help

Giuseppe

bhushit17 · ‎03-02-2016

Hi,

If I remove NetFlow from my WAN interface, without changing any other parameter, switch utilization goes down to 10-15%.

The most resource-heavy process now is "slcp process".

Regards,

Leo Laohoo · ‎02-25-2016

Hey Giuseppe,

How are you doing, mate? Haven't seen you (and Paolo) for a long time. Where have you two been hiding out?

Giuseppe Larosa · ‎02-26-2016

Hello Leo,

I have been away from the forums for a long time indeed. I cannot answer for Paolo.

I needed a break so I took it.

In any case, you and the other guys are doing a nice job in the forums, and congratulations again on entering the Hall of Fame. You put a lot of effort into the forums and this is well deserved.

You add a human touch to the conversations, and that is important too.

Best Regards

Giuseppe

Leo Laohoo · ‎02-25-2016

442 12748176 4456882 2860 33.43% 4.84% 3.27% 0 IP NAT Ager

NAT Ager on a Sup2T. Hmmmm ... Sounds very familiar.

Please try upgrading the IOS. I've had similar issues which only resolved after an IOS upgrade.

bhushit17 · ‎02-25-2016

Oh! You encountered the same issue? Which IOS version resolved it?

Thanks,

High CPU utilization on C6509 with VS-SUP2T-10G