
High CPU due to ARP Input

Dear all,

My 3750-E core stack is connected to the provider router and is the default gateway for the internal LAN. I noticed that the CPU stays very high, even at night, but I could not find the cause.

I use an SVI to connect to the provider for HA reasons.

I sniffed the network but saw no excessive broadcast storms. A PBR policy was configured, but deleting it brought no improvement.

Here is some information; any suggestion or help would be much appreciated. The default route points to the provider's gateway address, not to an interface (I already saw that troubleshooting hint).

Switch version

15.0(1)SE1

10#sh proc cpu so

PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process

  12  2318098713   431897532       5367 61.10% 61.65% 61.97%   0 ARP Input

10#sh platform tcam uti

CAM Utilization for ASIC# 0                      Max            Used

                                             Masks/Values    Masks/values

Unicast mac addresses:                       6364/6364       1056/1056

IPv4 IGMP groups + multicast routes:         1120/1120          1/1

IPv4 unicast directly-connected routes:      6144/6144        504/504

IPv4 unicast indirectly-connected routes:    2048/2048         88/88

IPv4 policy based routing aces:               452/452          12/12

IPv4 qos aces:                                512/512          21/21

IPv4 security aces:                           964/964          37/37

Note: Allocation of TCAM entries per feature uses

a complex algorithm. The above information is meant

to provide an abstract view of the current TCAM utilization

10#sh ip arp sum

519 IP ARP entries, with 1 of them incomplete

10#sh sdm pref

The current template is "desktop default" template.

The selected template optimizes the resources in

the switch to support this level of features for

8 routed interfaces and 1024 VLANs.

  number of unicast mac addresses:                  6K

  number of IPv4 IGMP groups + multicast routes:    1K

  number of IPv4 unicast routes:                    8K

    number of directly-connected IPv4 hosts:        6K

    number of indirect IPv4 routes:                 2K

  number of IPv4 policy based routing aces:         0

  number of IPv4/MAC qos aces:                      0.5K

  number of IPv4/MAC security aces:                 1K

6 Replies

Pavol Golis
Cisco Employee

Move it to 15.0(1)SE3, which is the last (and quite good) maintenance release of that train.

I assume this is caused by lots of ARP traffic hitting the CPU (debug if you can: logging console off, a big logging buffer, and pray). If not, then CPU profiling needs to be done, which is out of scope for this forum, so you should involve TAC if you have support on those boxes.

Run sh ip traffic | b ARP six times, once every 10 seconds, and paste the output here.

Thanks,

OK, I will move it to the new code as soon as I get a maintenance window. Until then, maybe we can find the issue here. I need to travel there; I guess that would be better for debugging. I have support, so I may open a TAC case if no solution comes from here.

Here is the output from the show command.

10#sh ip traffic | b ARP

ARP statistics:

  Rcvd: 1610376306 requests, 2132760356 replies, 60131 reverse, 0 other

  Sent: 45612351 requests, 2136208385 replies (32271832 proxy), 0 reverse

  Drop due to input queue full: 24655

10#sh ip traffic | b ARP

ARP statistics:

  Rcvd: 1610377353 requests, 2132779412 replies, 60131 reverse, 0 other

  Sent: 45612351 requests, 2136227471 replies (32271832 proxy), 0 reverse

  Drop due to input queue full: 24655

10#sh ip traffic | b ARP

ARP statistics:

  Rcvd: 1610378525 requests, 2132799094 replies, 60131 reverse, 0 other

  Sent: 45612351 requests, 2136247202 replies (32271832 proxy), 0 reverse

  Drop due to input queue full: 24655

10#sh ip traffic | b ARP

ARP statistics:

  Rcvd: 1610379735 requests, 2132819566 replies, 60131 reverse, 0 other

  Sent: 45612352 requests, 2136267698 replies (32271832 proxy), 0 reverse

  Drop due to input queue full: 24655

10#sh ip traffic | b ARP

ARP statistics:

  Rcvd: 1610380963 requests, 2132840172 replies, 60131 reverse, 0 other

  Sent: 45612352 requests, 2136288347 replies (32271832 proxy), 0 reverse

  Drop due to input queue full: 24655

10#sh ip traffic | b ARP

ARP statistics:

  Rcvd: 1610382145 requests, 2132861528 replies, 60131 reverse, 0 other

  Sent: 45612352 requests, 2136309732 replies (32271832 proxy), 0 reverse

  Drop due to input queue full: 24655

10#sh ip traffic | b ARP

Hi Sebastian,

As far as I can see, please log a TAC case for this. I believe some server is sending continuous ARP traffic, and we have to find it and shut that link down. This requires running some debug commands, so it would be good for TAC to work with you on it.

Recommendations:

1) Check whether the ARP requests are being generated by a defective server NIC.

2) Look for an incorrectly configured host in the network that is generating a high volume of ARP requests.

Regards

Inayath

10#sh ip traffic | b ARP

ARP statistics:

  Rcvd: 1610376306 requests, 2132760356 replies, 60131 reverse, 0 other

  Sent: 45612351 requests, 2136208385 replies (32271832 proxy), 0 reverse

  Drop due to input queue full: 24655

10#sh ip traffic | b ARP

ARP statistics:

  Rcvd: 1610377353 (delta 1047) requests, 2132779412 (delta 19056) replies, 60131 reverse, 0 other

  Sent: 45612351 requests, 2136227471 (delta 19086) replies (32271832 proxy), 0 reverse

  Drop due to input queue full: 24655

So this makes:

+ ~105 pps RX ARP requests

+ ~1905 pps RX ARP replies

+ ~1908 pps TX ARP replies

=> This explains the CPU load.

=> The switch itself sends roughly one ARP request per 60 s, yet it receives ~1905 pps of ARP replies.

=> The switch receives only ~105 pps of ARP requests, yet it sends ~1908 pps of ARP replies.

I cannot think of a scenario in which this combination can happen.
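The per-second rates above can be reproduced from any two counter samples. A minimal sketch in Python, using the values from the first two pasted outputs and the 10-second sampling interval that was requested:

```python
# Compute per-second ARP rates from two "show ip traffic" counter samples
# taken INTERVAL seconds apart. Values copied from the pasted outputs.
INTERVAL = 10  # seconds between samples

sample1 = {"rcvd_req": 1610376306, "rcvd_rep": 2132760356,
           "sent_req": 45612351, "sent_rep": 2136208385}
sample2 = {"rcvd_req": 1610377353, "rcvd_rep": 2132779412,
           "sent_req": 45612351, "sent_rep": 2136227471}

# Per-second rate = counter delta divided by the sampling interval.
rates = {k: (sample2[k] - sample1[k]) / INTERVAL for k in sample1}

print(rates)
# rcvd_req ~104.7 pps, rcvd_rep ~1905.6 pps, sent_rep ~1908.6 pps,
# sent_req 0 pps -- the switch itself is sending almost no requests.
```

The same arithmetic applied to later sample pairs should give comparable rates if the flood is steady.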

Since this is the core stack, did you check the routing of the lower-layer nodes (to be sure they don't route to an "interface")?

To really understand what those ARP packets are, do the following within a safe maintenance window:

1) no logging console

2) logging buffered 10000000

3) debug arp (1 second is enough)

4) undebug all

5) show log.

From an analysis of the log you should get some idea of what this is all about at the network level.
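Once the log is collected, the debug output can be summarized offline to find the top ARP talkers. A minimal sketch, assuming log lines in the usual debug arp shape ("IP ARP: rcvd req src <ip> <mac>, dst <ip> <interface>"); the addresses below are hypothetical placeholders, not from this network:

```python
import re
from collections import Counter

# Hypothetical sample in the shape "debug arp" typically logs; in practice,
# paste the real "show log" output collected during the debug window.
log = """\
IP ARP: rcvd req src 192.168.1.50 0011.2233.4455, dst 192.168.1.1 Vlan10
IP ARP: rcvd rep src 192.168.1.50 0011.2233.4455, dst 192.168.1.1 Vlan10
IP ARP: rcvd rep src 192.168.1.50 0011.2233.4455, dst 192.168.1.1 Vlan10
"""

# Capture the packet type (req/rep) and the source IP of each received ARP.
pat = re.compile(r"IP ARP: rcvd (req|rep) src (\S+) ([0-9a-f.]+),")

# Count received ARP packets per source IP; the top entry is the
# likely culprit host (or a defective NIC) to chase down.
talkers = Counter(m.group(2) for m in pat.finditer(log))
print(talkers.most_common(5))  # -> [('192.168.1.50', 3)]
```

With the real log, the busiest source IP (and its MAC) points at the switch port to shut down.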

Looks like a nice nut to crack.

Thanks, I already requested a maintenance window and a date for traveling on-site.

I will let you know whether the debug helps or what TAC says.

After troubleshooting I will also move to the new code, but not blindly, without knowing more details about the problem.

Sorry for the very late reply. The problem is fixed. It was Conficker once again... unbelievable. A worm that drives the CPU up to 68% with ARP requests.

The guys on site found it one day before I was due to travel to the location.

Thanks for all your replies.

Sebastian