
overruns on a gig interface

paul amaral
Level 4

Hi, I have a Gig interface on a Cisco 7301 router, Software (C7301-JK9S-M), Version 12.4(25), RELEASE SOFTWARE (fc2), that is having an overrun issue. I know overruns are caused by the interface's inability to keep up with incoming packets (not enough free buffers), but this is a gig interface that sees at most 160 Mbps of traffic, and looking at the buffers nothing jumps out at me indicating a buffer issue.

Any idea why I would be seeing these overruns? I ask because I notice some of my OSPF neighbors are bouncing at times, and I'm assuming it's because of the overruns, since aside from that I see nothing else wrong on that interface.

tia, Paul

GigabitEthernet0/1 is up, line protocol is up
  Hardware is BCM1250 Internal MAC, address is 000e.d64f.b01a (bia 000e.d64f.b01a)
 
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 2/255, rxload 36/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, media type is RJ45
  output flow-control is XON, input flow-control is XON
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:00, output 00:00:00, output hang never
  Last clearing of "show interface" counters 3d19h
  Input queue: 0/75/438/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: Class-based queueing
  Output queue: 0/1000/64/0 (size/max total/threshold/drops)
     Conversations  0/2/256 (active/max active/max total)
     Reserved Conversations 0/0 (allocated/max allocated)
     Available Bandwidth 700000 kilobits/sec
  30 second input rate 144151000 bits/sec, 14045 packets/sec
  30 second output rate 9280000 bits/sec, 7500 packets/sec
     1957559959 packets input, 18446744073514908623 bytes, 0 no buffer
     Received 198086 broadcasts, 0 runts, 0 giants, 0 throttles
     1380 input errors, 0 CRC, 0 frame, 1380 overrun, 0 ignored
     0 watchdog, 2041413 multicast, 0 pause input
     0 input packets with dribble condition detected
     1059660812 packets output, 192443807816 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets

Buffer elements:
     1117 in free list (1119 max allowed)
     202791674 hits, 0 misses, 619 created

Public buffer pools:
Small buffers, 104 bytes (total 50, permanent 50, peak 133 @ 2d15h):
     39 in free list (20 min, 150 max allowed)
     7789035 hits, 251 misses, 332 trims, 332 created
     10 failures (0 no memory)
Middle buffers, 600 bytes (total 50, permanent 50, peak 68 @ 1w2d):
     47 in free list (25 min, 150 max allowed)
     9695224 hits, 45 misses, 78 trims, 78 created
     0 failures (0 no memory)
Big buffers, 1536 bytes (total 50, permanent 50):
     50 in free list (5 min, 150 max allowed)
     526420 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
VeryBig buffers, 4520 bytes (total 10, permanent 10):
     9 in free list (0 min, 100 max allowed)
     363027 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Large buffers, 5024 bytes (total 0, permanent 0):
     0 in free list (0 min, 10 max allowed)
     0 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Huge buffers, 18024 bytes (total 0, permanent 0):
     0 in free list (0 min, 4 max allowed)
     0 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)

Interface buffer pools:
Syslog ED Pool buffers, 600 bytes (total 150, permanent 150):
     118 in free list (150 min, 150 max allowed)
     2676 hits, 0 misses
IPC buffers, 4096 bytes (total 2, permanent 2):
     2 in free list (1 min, 8 max allowed)
     0 hits, 0 fallbacks, 0 trims, 0 created
     0 failures (0 no memory)

Header pools:
Header buffers, 0 bytes (total 511, permanent 256, peak 511 @ 1w2d):
     255 in free list (256 min, 1024 max allowed)
     171 hits, 85 misses, 0 trims, 255 created
     0 failures (0 no memory)
     256 max cache size, 256 in cache
     99630361 hits in cache, 0 misses in cache

Particle Clones:
     1024 clones, 1034 hits, 0 misses

Public particle pools:
F/S buffers, 128 bytes (total 512, permanent 512):
     0 in free list (0 min, 512 max allowed)
     512 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
     512 max cache size, 512 in cache
     1034 hits in cache, 0 misses in cache
Normal buffers, 512 bytes (total 2048, permanent 2048):
     2048 in free list (1024 min, 4096 max allowed)
     0 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)

Private particle pools:
GigabitEthernet0/0 buffers, 512 bytes (total 1000, permanent 1000):
     0 in free list (0 min, 1000 max allowed)
     1000 hits, 0 fallbacks
     1000 max cache size, 872 in cache
     3240439856 hits in cache, 0 misses in cache
     14 buffer threshold, 0 threshold transitions
GigabitEthernet0/1 buffers, 512 bytes (total 1000, permanent 1000):
     0 in free list (0 min, 1000 max allowed)
     1000 hits, 0 fallbacks
     1000 max cache size, 872 in cache
     2820147516 hits in cache, 0 misses in cache
     14 buffer threshold, 0 threshold transitions
GigabitEthernet0/2 buffers, 512 bytes (total 1000, permanent 1000):
     0 in free list (0 min, 1000 max allowed)
     1000 hits, 0 fallbacks
     1000 max cache size, 872 in cache
     190489 hits in cache, 0 misses in cache
     14 buffer threshold, 0 threshold transitions
VAM2+ buffers, 544 bytes (total 768, permanent 768):
     0 in free list (0 min, 768 max allowed)
     768 hits, 0 fallbacks
     768 max cache size, 256 in cache
     1628582928 hits in cache, 0 misses in cache


11 Replies

Palani Mohan
Cisco Employee

Paul

Odds of OSPF bouncing because of overruns are close to zero. The easiest way to verify this is to compare the overrun count before and after an OSPF adj flap. I say easiest on the assumption that you have access to some network management/monitoring tool.
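For example, a quick way to make that comparison from the CLI (the interface name is taken from your output, and the logging check assumes the router keeps OSPF adjacency changes in its local syslog buffer) would be something along these lines:

  show clock
  show interfaces GigabitEthernet0/1 | include overrun
  show logging | include OSPF-5-ADJCHG
  show ip ospf neighbor

If the timestamps on the %OSPF-5-ADJCHG messages line up with jumps in the overrun counter, you have your correlation; if the counter barely moves across a flap, the overruns are probably a red herring.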

The overrun count as a percentage of the total input packets is minuscule. Such a low number will not affect production traffic or control-plane packets. For an OSPF adjacency to go down, 4 hello packets over 40 seconds need to be missed. Overruns, on the other hand, occur over a duration of 1/125th of a second. This extremely short duration is also the reason why we don't have visibility into them.

You mentioned that the rate never crossed 160 Mbps. I presume you are looking at show interface or its SNMP/MIB equivalent on whatever tool monitors the router. The CLI uses moving-average sampling, and the default sampling interval is 5 minutes (300 seconds). If the overruns happened in one or a few bursts (each lasting 1/125th of a second), they are not going to be significant enough to skew the moving average.
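As a side note, your interface output above already shows a 30 second input/output rate, so the load interval appears to have been shortened from the 300-second default already. For completeness, the per-interface command that controls it is:

  interface GigabitEthernet0/1
   load-interval 30

Even a 30-second average, however, will not expose a burst lasting 1/125th of a second.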

The only sure way to track overruns is to place a sniffer on the wire and watch the show interface output for increments. If the CLI shows the overrun count incrementing, then use Wireshark (Statistics > I/O Graph, to be precise) and look for a spike at around the same time the overruns occurred. Going by the extremely low count, my recommendation is not to pursue this path.

The 7301 is almost 10 years old, and I am guessing it has served you well. It is time to consider upgrading this platform, as its End of Life is less than a year from now.

Sincerely ... Palani

Palani,

you present a very good case and I completely agree with your points. However, couldn't a microburst within that 40-second span cause overruns that would take down an OSPF adjacency? Although you seem to believe this is highly unlikely, and I must say I do as well.

I know for a fact the OSPF issue is on the 7300 side, because the OSPF neighbor(s) go down on the 7300 while the remote side just goes back from LOADING to FULL. In other words, the remote side never missed an OSPF hello from the 7300, but the 7300 missed hellos from the remote neighbor.

I was just trying to figure out whether there was some correlation between the overruns and the OSPF issue, but again, you present a good argument against that.

This leads me to my other suspicion, which is probably more likely: this gig interface is terminating a 200 Mb metro-E line on the 7300. Could this line indeed be seeing microbursts that exceed the 200 Mb of available bandwidth, causing packets to be dropped?

FYI, we have about 16 OSPF neighbors terminating on the 7300, and usually 3 or 4 will go down at the same time; again, the remote side just goes back from LOADING to FULL.

I guess, from reading your comments and thinking about this, that the issue might be one of bandwidth. Would you agree?

Also, what is causing those overruns: a lack of buffers or an RX ring limitation? The buffer stats look good to me.

thanks for all your info, very much appreciated.

paul

Hi Paul

You have a service interruption (the OSPF adj flap) which is real, and you want to get to the bottom of it. Right now, the non-zero overruns seem to be the likely cause. I also understand that hellos not reaching the 7300, or not being processed by it in time, is what causes the adj flap. I understand this part thoroughly. My request is that you validate/correlate the time/date of the adj flap occurrences against the overrun counter incrementing.
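If the IOS feature set on the 7301 includes Embedded Event Manager, one way to capture that correlation automatically is to snapshot the overrun counter every time an adjacency change is logged. This is only a sketch (the applet name is made up; the interface name is taken from the earlier output):

  event manager applet OSPF-FLAP-SNAPSHOT
   event syslog pattern "OSPF-5-ADJCHG"
   action 1.0 cli command "enable"
   action 2.0 cli command "show interfaces GigabitEthernet0/1 | include overrun"
   action 3.0 syslog msg "Gi0/1 counters at flap: $_cli_result"

Each flap then leaves a timestamped syslog entry containing the overrun count at that moment, which you can compare against the previous entry.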

While 40 seconds does not appear to be a long time, it translates to 5,000 (125 * 40) time slices, each lasting 1/125th of a second. If the microburst condition lasted for a significant portion of that 40-second duration, you can expect to see three things:

  • sustained traffic, well in excess of 160 Mbps, for that duration
  • higher CPU utilization
  • a much, much higher count of overruns

The overrun occurrence is a limitation of the interface controller, hit well before IOS even sees the packet. These limits are hard-coded and are not visible in show buffers output; you can't tune or adjust them. I don't believe Cisco will ever publish this level of information in any documentation. Since this is typically seen on routers of the pre-iPhone generation, at some point, as you move to newer platforms (ASR 1000 class and such), the overruns will likely go away. The 7301 was launched when high-speed WAN usually meant a DS3 and an occasional OC-3. Once smartphones came about, WAN traffic exploded and high-speed Ethernet-based services became the norm. That development started exposing the limitations of routers such as the ISRs (2800s/3800s), the 7200s (the NPE-G2 barely managed), and so on. The 7301 falls into this category, unfortunately. The most common manifestation was "my router CPU is high".

What you describe seems to be a side effect of policing by the provider. Indiscriminate policing will drop packets regardless of whether or not they are control-plane related. If your subscribed rate is 200 Mbps and you observe a 160 Mbps average, chances are that you are bursting in excess of the contracted rate, which leads to policing by the provider.
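One common countermeasure, if the provider does police hard at the contracted rate, is to shape your own egress to just under that rate so that you, rather than the provider, decide what gets queued or dropped. A minimal sketch (the policy name and the 195 Mbps figure are illustrative; your interface output shows class-based queueing already applied, so in practice any existing policy would be nested as a child under the shaper):

  policy-map SHAPE-TO-CIR
   class class-default
    shape average 195000000
  !
  interface GigabitEthernet0/1
   service-policy output SHAPE-TO-CIR

Keep in mind that shaping only helps in the outbound direction; bursts arriving from the remote sites toward the 7300 can only be controlled at the far ends.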

Did the adj flap happen during the business day? In some cases, working with tier-3 provider support may yield definitive answers to:

1. What is their policy when they see traffic coming in excess of the contracted rate?

2. Did they police at the time/date when the adj flap occurred?

This may help you get closer to finding out why the OSPF adj flaps happen.

Kind regards ... Palani

Palani, great post, it makes a lot of sense. I believe the issue is indiscriminate policing by the SP as well, due to oversubscription of the line from time to time. Unfortunately, the SNMP monitor that was set up in Orion is not working at this time, so I can't investigate this easily without capturing packets. I just wanted to make sure the overruns were not the issue. You pose some great questions, and the simple answer is that they do spike the line at various times, as the router is at one of the main sites.

FYI, I use the 7300 because I have a 20-site DMVPN multipoint tunnel configured with IPsec, and this router will do over 80 Mbps of encryption with no additional licenses or problems. I know that with Gen 2 routers running Cisco IOS 15 there is a limit on encryption, and there are additional steps involved to get your equipment over that 80 Mbps limit. :(

thanks

paul

Hi Paul

I do understand your predicament! The 7301 is meeting your needs perfectly, and no product from the current product line provides equivalent price/performance.

I am sure you know that the 7301 will be End of Life by Sep 2017. The other thing is that you are very likely to increase the WAN bandwidth in the near future. That may be a better time to build a case for upgrading the 7301 to an ASR 1000 class of device.

One request:

If you have any open questions, let us know. If not, may I request that you tag this thread as resolved? I think you need to choose "correct answer", which may be visible only to you.

Kind regards ... Palani

This leads me to my other suspicion, which is probably more likely: this gig interface is terminating a 200 Mb metro-E line on the 7300. Could this line indeed be seeing microbursts that exceed the 200 Mb of available bandwidth, causing packets to be dropped?

Oh, that could very likely be your issue. No wonder you only see 160 Mbps and your ingress drops/overruns are so few.

With MetroE, your SP is probably unable to provide any QoS support.

If this is a multi-point, you cannot optimally manage the bandwidth to your 7300.

You could shape each site so that their aggregate doesn't overrun any one site's bandwidth (or at least not by much), but then you cannot "share" bandwidth between sites.
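As a rough sketch of that idea (the numbers and names are only illustrative: with roughly 20 remote sites and a 200 Mbps hub, a static split works out to about 10 Mbps per site), each remote router would shape its traffic toward the hub, for example:

  policy-map SHAPE-TOWARD-HUB
   class class-default
    shape average 10000000
  !
  interface GigabitEthernet0/0
   service-policy output SHAPE-TOWARD-HUB

The trade-off is exactly the one just mentioned: a site that is idle cannot lend its 10 Mbps share to a busy one.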

The latest DMVPN does support dynamic shaping. I haven't used it, I don't know how well it works, and it is likely not supported on the 7300, but in theory it is about the most optimal bandwidth management.

Joseph, it is a multipoint DMVPN phase 2 setup with about 20 sites, with IPsec enabled. The sites are all set to 20 Mbps max and the main site, the 7300, to 200 Mbps. It's Comcast fiber, so I don't think there is any QoS at all on the SP side, and thus anything over the CIR is getting dropped regardless.

Shaping is a good idea; I might look into that or increase the bandwidth at some point. I know dynamic shaping isn't supported on the 10-year-old 7300, lol; DMVPN phase 3 isn't even supported. But these routers are a great, cheap workhorse, and the fact that the IPsec image will do encryption over the newly imposed 80 Mbps limit is great.

thanks for your reply.

paul

Joseph W. Doherty
Hall of Fame

Your ingress drops and overruns are a very, very low percentage. However, what's happening might be due to a microburst, and if so, and if your OSPF hello timers are "tight", it could drop an OSPF neighbor.

A couple of things you might try (a configuration sketch follows the list):

  • Increase the ingress queue depth. Other than the RAM used for queue space, I recall there are no other negatives even if you set it to the maximum.
  • Enable buffer auto-tuning, if supported, or manually tune your buffers to avoid the misses, trims, and creates.
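In IOS terms, the two suggestions look roughly like this (the input queue line in your output shows the default maximum of 75 with 438 drops; the exact hold-queue maximum and the availability of automatic buffer tuning depend on the platform and release):

  interface GigabitEthernet0/1
   hold-queue 1500 in
  !
  buffers tune automatic

As Palani noted, true overruns happen in the controller before IOS dequeues the packet, so these knobs mainly address the input-queue drops/flushes rather than the overrun counter itself.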

Joseph, thanks. See my response to Palani; I think you are correct, but more specifically that it is a bandwidth issue.

paul

Is there a command to increase the ingress queue depth on a Cisco 3945 router? Is it configured per interface?

bcoverstone
Level 1

Ahh, the good old BCM1250.  I used to have thousands of buffer overruns on my router gig ports.

Then I noticed one day that the input and output flow control were set to XON.

I turned flow control on for that switchport, and the buffer overruns disappeared.
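For reference, on many Catalyst switch platforms it is the receive side of flow control that is configurable, so the switchport change described above might look something like this (the interface name is made up, and command availability varies by platform):

  interface GigabitEthernet1/0/1
   flowcontrol receive on

With the router already showing XON for both input and output flow control, having the attached switchport honor pause frames lets the router signal its neighbor to back off during bursts instead of overrunning its controller.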