
High CPU on 6500 caused by NetBackup???

asaykao73
Level 1

Hi There,

We are experiencing ongoing high CPU on our 6500 (generally 80%+ most of the time).

#show proc cpu sorted | exc 0.00

CPU utilization for five seconds: 75%/67%; one minute: 69%; five minutes: 68%

PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process

273  2830496584 994142083       2847  2.95%  1.16%  1.10%   0 Port manager per

  77        1364       820       1663  1.59%  0.33%  0.10%   1 SSH Process

118  1954622792 1584654679       1233  1.35%  1.96%  1.49%   0 IP Input

  9   875221268 1261616535        693  0.55%  0.79%  0.83%   0 ARP Input

170   134354752  30707951       4375  0.31%  0.17%  0.16%   0 Adj Manager

206   563913644 3371313851          0  0.23%  0.50%  0.23%   0 Standby (HSRP)

171   145656920 222502762        654  0.23%  0.19%  0.19%   0 CEF process

  3   751208932 1019079255        737  0.15%  0.22%  0.57%   0 IP-EIGRP(4): PDM

299    84202084  83653622       1006  0.15%  0.07%  0.08%   0 IPC LC Message H

124    47784372  96042068        497  0.15%  0.06%  0.04%   0 ARP HA

#sh ver

Cisco Internetwork Operating System Software

IOS (tm) s72033_rp Software (s72033_rp-ADVIPSERVICESK9_WAN-M), Version 12.2(18)SXF10, RELEASE SOFTWARE (fc1)

.

.

cisco WS-C6513 (R7000) processor (revision 1.0) with 458720K/65536K bytes of memory.

Processor board ID TSC072300KA

SR71000 CPU at 600Mhz, Implementation 0x504, Rev 1.2, 512KB L2 Cache

I have performed a "debug netdr capture rx" and a "debug netdr capture tx" and can see that the majority of packets captured are NetBackup (TCP 13724). Can anyone please tell me why this sort of traffic is being punted to the CPU (captures below) and how we might stop it from being punted to the CPU???

------- dump of incoming inband packet -------

interface Te9/8, routine draco2_process_rx_packet_inline

dbus info: src_vlan 0x3FB(1019), src_indx 0x207(519), len 0x59E(1438)

  bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x380(896)

  28020401 03FB0400 02070005 9E000000 00060520 09000040 00000000 03800000

mistral hdr: req_token 0x0(0), src_index 0x207(519), rx_offset 0x76(118)

  requeue 0, obl_pkt 0, vlan 0x3FB(1019)

destmac 00.07.B3.0B.B7.40, srcmac 00.0E.D6.0B.9D.C0, protocol 0800

protocol ip: version 0x04, hlen 0x05, tos 0x00, totlen 1420, identifier 2675

  df 1, mf 0, fo 0, ttl 127, src 172.17.3.51, dst 10.61.78.59

    tcp src 13724, dst 50486, seq 3079014292, ack 4193993640, win 65524 off 5 checksum 0x9D45 ack

------- dump of outgoing inband packet -------

interface Te9/2, routine send_one_bufhdr_pkt

dbus info: src_vlan 0x5DC(1500), src_indx 0x207(519), len 0x59E(1438)

  bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x380(896)

  00020000 05DC2C00 02070005 9E000000 00060520 00000040 00000000 03800000

mistral hdr: req_token 0x0(0), src_index 0x207(519), rx_offset 0x76(118)

  requeue 0, obl_pkt 0, vlan 0x3FB(1019)

destmac 00.1A.30.2A.0A.40, srcmac 00.07.B3.0B.B7.40, protocol 0800

protocol ip: version 0x04, hlen 0x05, tos 0x00, totlen 1420, identifier 12067

  df 1, mf 0, fo 0, ttl 126, src 172.17.3.51, dst 10.61.78.59

    tcp src 13724, dst 50486, seq 1264642152, ack 4193993640, win 65524 off 5 checksum 0x98C ack
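For reference, the dumps above were collected roughly like this (a minimal sketch of the netdr workflow on this platform; as far as I know the capture stops by itself once its buffer fills, and exact behaviour may differ slightly between 12.2SX releases):

debug netdr capture rx          <-- capture packets punted to the RP CPU
debug netdr capture tx          <-- capture packets sent by the RP CPU
show netdr captured-packets     <-- displays the dumps shown above
undebug all                     <-- stop the capture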

Thanks.

Andy

15 Replies

nkarpysh
Cisco Employee

Hi Andy,

Te9/2 and Te9/8 - are they part of an EtherChannel, or is there routing load-balancing one hop before this router? I see that the same IP packets (same source and destination) are coming in on both of these interfaces.

Can you first of all check the routing for these packets - which interface should they be sent out of?

show ip cef 10.61.78.59

and

show ip route 10.61.78.59

If they are sent out of the same interface they were received on, the packets are punted to the CPU so an ICMP redirect can be generated. If there is no route at all, they are punted to the CPU so an ICMP unreachable can be generated.

Please also check the MTU on the outgoing interface - if it is lower than the packet size (1438 bytes here), the packets will be sent to the CPU and dropped, since the Don't Fragment bit is set.
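A quick way to check both points from the CLI (just a sketch, using the interface names from your capture):

show ip interface TenGigabitEthernet9/8 | include redirects
show ip interface TenGigabitEthernet9/2 | include redirects
show interface TenGigabitEthernet9/2 | include MTU
show ip traffic | include redirect     <-- counters for ICMP redirects sent/received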

So please check the above first and we can agree on the next step.

Nik

HTH,
Niko

Hi Nik,

Thank you for the reply...

1/ Te9/8 and Te9/2 are not part of any EtherChannel.

The traffic flow looks like this:

172.17.3.51 -> 17rsw01 (Vlan120) -> Te9/8 [14rsw02] Te9/2 -> core17lsr01 -> [cloud] -> 10.61.78.59

The 6500 has the hostname of 14rsw02.

So the packet is received (rx) on Te9/8 and then sent out (tx) Te9/2.

2/ cef and routing table below.

#show ip cef 10.61.78.59

10.61.64.0/19, version 1845943, epoch 2, cached adjacency 10.42.255.197

0 packets, 0 bytes

  Flow: AS 0, mask 19

  via 10.42.255.197, 0 dependencies, recursive

    next hop 10.42.255.197, TenGigabitEthernet9/2.1500 via 10.42.255.197/32 (Default)

    valid cached adjacency

#show ip route 10.61.78.59

Routing entry for 10.61.64.0/19

  Known via "bgp 64610", distance 20, metric 0

  Tag 64600, type external

  Redistributing via eigrp 10

  Advertised by eigrp 10 metric 10000000 10 255 1 1500

  Last update from 10.42.255.197 2w0d ago

  Routing Descriptor Blocks:

  * 10.42.255.197, from 10.42.255.197, 2w0d ago

      Route metric is 0, traffic share count is 1

      AS Hops 2

      Route tag 64600

3/ MTU on outgoing interface

#show int TenGigabitEthernet9/2

TenGigabitEthernet9/2 is up, line protocol is up (connected)

  Hardware is C6k 10000Mb 802.3, address is 0007.b30b.b740 (bia 0007.b30b.b740)

  Description: Trunk interface to core17lsr01-TenGig1/1

  MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

interface TenGigabitEthernet9/2.1500

description Internal Subinterface

encapsulation dot1Q 1500

ip address 10.42.255.198 255.255.255.252

ip access-group test in

ip access-group test out

#sh access-lists test

Extended IP access list test

    10 permit ip host 172.17.2.22 host 10.91.32.10 log-input

    20 permit ip host 172.17.2.22 host 10.91.32.11 (615426 matches)

    30 permit ip host 10.91.32.10 host 172.17.2.22 log-input

    40 permit ip host 10.91.32.11 host 172.17.2.22 (7 matches)

    50 permit ip any any (2000903900 matches)

What might be the next step???

thanks.

Andy

OK,

So the problem seems to be, first of all, these packets:

------- dump of outgoing inband packet -------

interface Te9/2, routine send_one_bufhdr_pkt       <=============== Incoming int Te9/2

dbus info: src_vlan 0x5DC(1500), src_indx 0x207(519), len 0x59E(1438)

bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x380(896)

00020000 05DC2C00 02070005 9E000000 00060520 00000040 00000000 03800000

mistral hdr: req_token 0x0(0), src_index 0x207(519), rx_offset 0x76(118)

requeue 0, obl_pkt 0, vlan 0x3FB(1019)

destmac 00.1A.30.2A.0A.40, srcmac 00.07.B3.0B.B7.40, protocol 0800

protocol ip: version 0x04, hlen 0x05, tos 0x00, totlen 1420, identifier 12067

df 1, mf 0, fo 0, ttl 126, src 172.17.3.51, dst 10.61.78.59

tcp src 13724, dst 50486, seq 1264642152, ack 4193993640, win 65524 off 5 checksum 0x98C ack

See, they arrive at Te9/2 while they should arrive at Te9/8.

Then, as per the routing, they are again sent out Te9/2:

#show ip cef 10.61.78.59

10.61.64.0/19, version 1845943, epoch 2, cached adjacency 10.42.255.197

0 packets, 0 bytes

Flow: AS 0, mask 19

via 10.42.255.197, 0 dependencies, recursive

next hop 10.42.255.197, TenGigabitEthernet9/2.1500 via 10.42.255.197/32 (Default)

- which will indeed punt the packets to the CPU to generate an ICMP redirect.

So first of all you need to check why the core sends these packets back instead of forwarding them to the final destination through the cloud. As an interim measure you can configure "no ip redirects" on Te9/8 and Te9/2 to stop the ICMP redirects and possibly stop the packets from being punted to the CPU. That does not, however, clear up the main problem, which is the core routing.
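As a sketch, it would look something like this (on whichever interfaces actually carry the L3 config - from your output that appears to be Te9/8 and the Te9/2.1500 subinterface):

interface TenGigabitEthernet9/8
 no ip redirects
!
interface TenGigabitEthernet9/2.1500
 no ip redirects

There is also a hardware rate-limiter on this platform (mls rate-limit unicast ip icmp redirect <pps> <burst>) that can cap how many of these packets reach the CPU, but that only masks the symptom - the core routing still needs to be fixed.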

Nik

HTH,
Niko

Hi Nik,

Thanks again for the reply.

I'm not sure why the core would send the packets back to 14rsw02:Te9/2?

When I do a traceroute from the source VLAN, it routes fine and I don't see it looping from the core back to 14rsw02:Te9/2.

\\  17rsw01

17rsw01#traceroute 10.61.78.59 source Vlan120

Type escape sequence to abort.

Tracing the route to p01540.internal.vic.gov.au (10.61.78.59)

  1 10.42.255.133 4 msec 0 msec 4 msec <-- 14rsw02

  2 10.42.255.197 4 msec 4 msec 4 msec <-- core17lsr01

  3 10.60.0.34 [MPLS: Labels 1027/19503 Exp 0] 4 msec 4 msec 4 msec

  4 10.61.106.5 [MPLS: Label 19503 Exp 0] 8 msec 4 msec 4 msec

  5 10.61.106.6 4 msec 4 msec 4 msec

  6 10.61.105.107 4 msec 4 msec 4 msec

  7 10.61.78.59 4 msec 4 msec 4 msec

interface Vlan120

description Production Vlan

ip address 172.17.3.1 255.255.255.0

ip helper-address 152.147.128.60

ip helper-address 152.147.225.10

no ip redirects

ip directed-broadcast

ip flow ingress

17rsw01#sh ip cef 10.61.78.59

10.61.64.0/19

  nexthop 10.42.255.133 TenGigabitEthernet9/8 <-- sends it to 14rsw02

\\ 14rsw02

interface TenGigabitEthernet9/8

description nh17rsw01:Te9/8

ip address 10.42.255.133 255.255.255.252

no ip redirects

ip directed-broadcast

ip route-cache flow

ip summary-address eigrp 10 192.168.110.0 255.255.254.0 5

ip summary-address eigrp 10 152.147.176.0 255.255.248.0 5

ip summary-address eigrp 10 152.147.160.0 255.255.248.0 5

ip summary-address eigrp 10 152.147.128.0 255.255.224.0 5

ip summary-address eigrp 10 10.42.0.0 255.255.192.0 5

ip policy route-map CSE105-DTFDPC

Any further ideas, as the routing seems to be fine?

Thanks.

Andy

You need to check the routing on the core to see if it is flapping. It is the core that is sending those packets back, so you need to understand what is happening with the routing there. As a workaround for the high CPU you can configure "no ip redirects" on Te9/2.1500 - did you try it?

As for the core - really look closely at the routing and see whether this particular route is fresh in your routing protocol database.
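The quickest indicators are the age timers - something like this on the core, run a few times, should show whether the route keeps being relearned (just a sketch):

show ip route 10.61.78.59          <-- the "Last update ... ago" timer should keep growing
show ip bgp 10.61.78.59            <-- confirm the best path and next hop stay the same
show logging | include BGP         <-- any recent neighbour/adjacency change messages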

These two packets below:

------- dump of incoming inband packet -------

interface Te9/8, routine draco2_process_rx_packet_inline

dbus info: src_vlan 0x3FB(1019), src_indx 0x207(519), len 0x59E(1438)

bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x380(896)

28020401 03FB0400 02070005 9E000000 00060520 09000040 00000000 03800000

mistral hdr: req_token 0x0(0), src_index 0x207(519), rx_offset 0x76(118)

requeue 0, obl_pkt 0, vlan 0x3FB(1019)

destmac 00.07.B3.0B.B7.40, srcmac 00.0E.D6.0B.9D.C0, protocol 0800

protocol ip: version 0x04, hlen 0x05, tos 0x00, totlen 1420, identifier 2675

df 1, mf 0, fo 0, ttl 127, src 172.17.3.51, dst 10.61.78.59

tcp src 13724, dst 50486, seq 3079014292, ack 4193993640, win 65524 off 5 checksum 0x9D45 ack

------- dump of outgoing inband packet -------

interface Te9/2, routine send_one_bufhdr_pkt

dbus info: src_vlan 0x5DC(1500), src_indx 0x207(519), len 0x59E(1438)

bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x380(896)

00020000 05DC2C00 02070005 9E000000 00060520 00000040 00000000 03800000

mistral hdr: req_token 0x0(0), src_index 0x207(519), rx_offset 0x76(118)

requeue 0, obl_pkt 0, vlan 0x3FB(1019)

destmac 00.1A.30.2A.0A.40, srcmac 00.07.B3.0B.B7.40, protocol 0800

protocol ip: version 0x04, hlen 0x05, tos 0x00, totlen 1420, identifier 12067

df 1, mf 0, fo 0, ttl 126, src 172.17.3.51, dst 10.61.78.59

tcp src 13724, dst 50486, seq 1264642152, ack 4193993640, win 65524 off 5 checksum 0x98C ack

These are the same in terms of IP addresses, but they arrive through opposite interfaces. The second one even has its TTL decremented - so it seems to be reaching the core and being sent back. Possibly you have some policy routing enabled, or something else that still allows the traceroute to pass successfully while this traffic is routed back. Or the route may simply be flapping - so you need to inspect it on the core.
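On the policy-routing point: the Te9/8 config you posted has "ip policy route-map CSE105-DTFDPC" applied, so it is worth checking what that route-map matches and where it sets the next hop (a sketch):

show ip policy
show route-map CSE105-DTFDPC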

You can share the core configuration plus "show ip route" and "show ip cef" for a start if you want me to help with it.

Nik

HTH,
Niko

Hi Nik,

I'll update the network diagram, because once it hits the core it's leaked into the INTERNAL VRF and then carried across the MPLS core to the destination IP.

172.17.3.51  -> Gi13/30 [17rsw01] Te9/8 -> Te9/8 [14rsw02] Te9/2.1500 -> Te1/1.1500 [core17lsr01] Te2/2 -> [core17lsr02] --[MPLS Core]--> 10.61.78.59

\\ core17lsr01

interface TenGigabitEthernet1/1.1500

encapsulation dot1Q 1500

ip vrf forwarding INTERNAL

ip address 10.42.255.197 255.255.255.252

service-policy input IP-MPLS

service-policy output IP-IP

The service policy on this interface just places traffic into its traffic class (e.g. Gold, Silver, Bronze).

Routing Tables on Core Router:

mel80cs17lsr01#sh ip route vrf INTERNAL 10.61.78.59

Routing Table: INTERNAL

Routing entry for 10.61.64.0/19

   Known via "bgp 64600", distance 200, metric 0

   Tag 64613, type internal

   Redistributing via eigrp 15

   Advertised by eigrp 15

   Last update from 10.60.3.2 2w1d ago

   Routing Descriptor Blocks:

   * 10.60.3.2 (default), from 10.60.3.2, 2w1d ago

       Route metric is 0, traffic share count is 1

       AS Hops 1

       Route tag 64613

       MPLS Required

mel80cs17lsr01#show ip cef vrf INTERNAL 10.61.78.59

10.61.64.0/19

  nexthop 10.60.0.34 TenGigabitEthernet2/2 label 1027 19503

interface TenGigabitEthernet2/2

description MPLS Network to core17lsr02

mtu 9216

ip address 10.60.0.33 255.255.255.252

ip mtu 9000

mpls mtu 9100

mpls ip

Not sure what more I can look at to see if the route is flapping.

And on another issue (maybe for later): it doesn't explain why the captured netdr RX packets are being punted to the CPU even though they are coming in on the right interface (Te9/8).

Thanks.

Andy

A few questions:

- do you still see those packets hitting the CPU, or has that stopped?

- can you do a new netdr capture and see if they are still coming from Te9/2?

- did you apply "no ip redirects" to Te9/2?

Just want to make sure the issue is still there.

Nik

HTH,
Niko

Hi Nik,

I took a fresh netdr capture yesterday and those packets are still coming in from Te9/2.

I have yet to re-apply "no ip redirects" to Te9/2. Is this just needed on Te9/2.1500, or is it also required on the physical Te9/2 as well???

\\ nh14rsw02

interface TenGigabitEthernet9/2

description Trunk interface to core17lsr01-TenGig1/1

no ip address

mls qos trust dscp

service-policy output EGRESS

!

interface TenGigabitEthernet9/2.1500

description Internal Subinterface

encapsulation dot1Q 1500

ip address 10.42.255.198 255.255.255.252

ip access-group test in

ip access-group test out

Thanks.

Andy

Hi Andy,

That should be applied to the L3 interface, so Te9/2.1500 in your case. Regarding why the packets are still being sent back - you may need more detailed debugging and a careful examination of the config. Why not open a TAC case and have a WebEx session with a Cisco engineer to check this in more depth?
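Something like this, then (a sketch):

interface TenGigabitEthernet9/2.1500
 no ip redirects

and verify it took effect with:

show ip interface TenGigabitEthernet9/2.1500 | include redirects     <-- should now report that ICMP redirects are never sent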

I guess some ELAM packet captures could be helpful here to understand the core's internal logic for sending those packets back.

Nik

HTH,
Niko

Hi Nik,

You've been a great help in narrowing down this issue.

I'll apply "no ip redirects" on the L3 interface Te9/2.1500 and see if that reduces the High CPU.

Will also look at lodging a Cisco TAC case.

Not sure how to rate all your replies, but I'll go back and make sure I rate them all.

Cheers.

Andy

No worries Andy,

Please just update this thread with your later findings and the root cause. It will be very interesting to see what the root cause is, and good for the sake of documenting the thread.

Nik

HTH,
Niko

Hi Nik,

I've had a chance to apply "no ip redirects" on the layer 3 interface Te9/2.1500 and this didn't have much of an impact. CPU is still very high even after applying that command.

interface TenGigabitEthernet9/2.1500

description Internal Subinterface

encapsulation dot1Q 1500

ip address 10.42.255.198 255.255.255.252

ip access-group test in

ip access-group test out

no ip redirects
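To see whether the change moved the needle at all, the quickest checks are the CPU history and the per-process breakdown (a sketch of what to watch):

show processes cpu history
show processes cpu sorted | exclude 0.00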

Any more ideas to try or do we hand this off to the TAC?

Thanks.

Andy

Hi Andy,

I was away for a few days and just noticed your reply. Can you please do a new netdr capture and attach it in full to this thread?

Please also attach the output of the following commands:

show ver

show proc cpu sort

show proc cpu hist

show int ------- taken 3 times

show span det | i exec|of top|from

I will check it once again in more detail.

Nik

HTH,
Niko

Hi Nik,

Sorry, I've also been away so I lost visibility of this issue as well.

Can I email you the details instead of attaching them on a public forum?

Thanks.

Andy
