Expert advice on Catalyst input and output buffers

leonvd79 · 08-24-2006

Dear NetPro user,

I have a question regarding underrun and overrun on ethernet interfaces.

I am currently looking into application performance at a customer site.

A single Catalyst 6509 is connecting a dozen Catalyst 3550's and 3750's through 2 port etherchannels.

Due to complaints regarding application performance I am investigating interfaces that report input and output errors.

The nature of the complaints is that application connections are reset during busy periods, especially at month-end when financial reports are generated.

Observation: The underrun counters increment on a daily basis. This occurs exclusively on FastEthernet ports of Catalyst 3550 switches that connect up to 48 server interfaces operating at 100 Mbit/s.

Definition: An underrun occurs when the transmitter runs faster than packets are delivered to the hardware buffer. In other words, the transmitter drains the buffer faster than it can be refilled, which is the opposite of an overrun.

Observation: A few servers are attached to a WS-X6548-GE-TX module in a Catalyst 6509 switch. These interfaces, with speed hard-set to 1000 Mbit/s, report overruns. The counter increments daily on all RJ-45 ports that operate at 1000 Mbit/s. The utilization history of these ports reports a peak usage of 12 Mbit/s.

Definition: An overrun is when a receiver receives packets faster than it can transfer them to the hardware buffer. This means that the packets are coming in faster than the hardware buffer can take them.

The path between the access switch and the distribution/core switch consists of four MMF 62.5 µm fiber strands that are terminated by SX GBIC transceivers bundled in an etherchannel. Both GigabitEthernet interfaces are free of errors on both sides of the link.

CEF is not running on the Catalyst 3550 and 3750 platforms.

The buffers on the Catalyst 3550 report misses, but the switch has an uptime of 1 year and 48 weeks, which does not help identify when these buffer misses occurred.

#sh buffer (see attachment)

The buffers on the Catalyst 6509 report misses; the uptime for this switch is approx. 4 weeks.

#sh buffer (see attachment)

Interface statistics:

Catalyst 6509:

#sh int gi7/38 | inc overrun

0 input errors, 0 CRC, 0 frame, 724222 overrun, 0 ignored

#sh int gi7/39 | inc overrun

0 input errors, 0 CRC, 0 frame, 6618226 overrun, 0 ignored

Some interfaces increment by 200,000 overruns in a 24-hour period.

Catalyst 3550:

#sh int fa0/36 | i underrun

44168809 packets output, 808277301 bytes, 17158 underruns
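To track how fast these counters grow between samples, a small script can pull them out of the command output. The following is a minimal sketch in Python, assuming the output lines look like the ones quoted above; the function names `parse_counters` and `rate_per_second` are made up for illustration and are not part of any Cisco tool.

```python
import re

# Matches counters such as "724222 overrun" or "17158 underruns".
# The counter names are taken from the output lines quoted in this post.
COUNTER_RE = re.compile(r"(\d+)\s+(overruns?|underruns?)\b")


def parse_counters(line):
    """Return the overrun/underrun counters found in one output line."""
    counters = {}
    for value, name in COUNTER_RE.findall(line):
        # Normalize "underruns" to "underrun" so both spellings share a key.
        counters[name.rstrip("s")] = int(value)
    return counters


def rate_per_second(earlier, later, interval_seconds):
    """Average counter increase per second between two samples."""
    return (later - earlier) / interval_seconds


if __name__ == "__main__":
    print(parse_counters("0 input errors, 0 CRC, 0 frame, 724222 overrun, 0 ignored"))
    print(parse_counters("44168809 packets output, 808277301 bytes, 17158 underruns"))
    # 200,000 overruns in a 24-hour period, as reported for some interfaces.
    print(round(rate_per_second(0, 200_000, 24 * 3600), 2))
```

Sampling the counters at a fixed interval and comparing the rate against the busy periods would show whether the errors line up with the month-end load.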

The data rate mismatch between the remote and local server (transferring data to and from an MS-SQL database) is 1000 Mbit/s to 100 Mbit/s. The statistical data from nGenius performance manager indicates that the window size adapts to the capacity of the receiver interface. However, the frames sent to the receiver commonly exceed 1500 bytes in size, which can cause the big buffers to fill to maximum capacity, causing the switch to drop frames.
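To get a feel for how quickly a 1000 to 100 Mbit/s mismatch can exhaust a buffer, here is a rough back-of-the-envelope sketch in Python. The buffer size used in the example is a purely hypothetical figure, not the actual buffer size of a Catalyst 3550 port, and the model assumes sustained line-rate ingress with no flow control or TCP back-off.

```python
def seconds_until_full(buffer_bytes, ingress_mbps, egress_mbps):
    """Time for a buffer to fill when ingress is sustained above egress.

    Returns None when the egress link keeps up and the buffer never fills.
    """
    net_bits_per_second = (ingress_mbps - egress_mbps) * 1_000_000
    if net_bits_per_second <= 0:
        return None
    return buffer_bytes * 8 / net_bits_per_second


if __name__ == "__main__":
    # HYPOTHETICAL buffer size chosen only for illustration.
    example_buffer_bytes = 512 * 1024
    fill_time = seconds_until_full(example_buffer_bytes, 1000, 100)
    print(f"{fill_time * 1000:.2f} ms until the buffer is full")
```

Even with a generous buffer, a sustained burst at gigabit speed fills it in milliseconds, so the outcome depends mostly on how bursty the sender is.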

I would like to share thoughts with those who have experienced similar problems and could point me in the right direction. Thank you.

--Leon

jackyoung · 08-25-2006

I think there are two issues in this case.

1) There are jumbo frames (>1500 MTU); I believe they come from the GE interface of the server. Could you please confirm whether the server is using jumbo frames while the client is using normal frames (1500 bytes)? That would create the problem. If yes, try to tune the server to 1500 bytes to match the client.

2) Buffer size may be an issue at the 3550 but not at the 6509. You may need to tune the big buffers at the 3550. I would like to wait for the result of the MTU tuning first. Or could the user NIC be the factor causing this problem? What user NICs are you using? New or old models?

Hope this helps.

leonvd79 · 08-25-2006

Hello Jack,

First of all thank you for your prompt reply.

1.) The frames originating at the server (residing on the Catalyst 6509 operating at 1000 Mbit/s) often exceed the size of 1500 bytes (mostly 1518 bytes in size, equals max. frame size).

2.) The buffer problems are greater from the 3550 perspective than for the 6509. The 6509 seems to have problems handing the frames to the 3550 and therefore overrun the input buffer.

The NICs used are various HP Gigabit Server Adapters (models NC 7780, 7781 with driver version 8.39.1.0); there are no known problems with the driver software.

The traffic is between front-end and back-end servers. As the data rate mismatch is 1000 > 100, what options do I have with regard to fine-tuning the switch buffers?

Thanks again for elaborating.

--Leon

jackyoung · 08-27-2006

Hi Leon, you're welcome.

Sorry, you are correct that the maximum frame size is 1518 bytes (a 1500-byte MTU plus Ethernet header and FCS). If there are no errors at the server NIC, it means the switch cannot handle such traffic. I still suspect the server is sending packets larger than 1518 bytes, which causes the problem. Try to reduce the packet size to 1518 at the server.
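For reference, the 1518-byte figure discussed in this thread can be broken down with some simple arithmetic. This is a small Python sketch using standard Ethernet field sizes; the helper names `max_frame_size` and `exceeds_standard` are only illustrative.

```python
ETH_HEADER = 14      # destination MAC (6) + source MAC (6) + EtherType (2)
ETH_FCS = 4          # frame check sequence
DOT1Q_TAG = 4        # optional 802.1Q VLAN tag
STANDARD_MTU = 1500  # payload size of a standard Ethernet frame


def max_frame_size(mtu=STANDARD_MTU, tagged=False):
    """Largest frame on the wire for a given MTU, excluding preamble."""
    return mtu + ETH_HEADER + ETH_FCS + (DOT1Q_TAG if tagged else 0)


def exceeds_standard(frame_len, tagged=False):
    """True when a frame is larger than a standard-MTU frame allows."""
    return frame_len > max_frame_size(tagged=tagged)


if __name__ == "__main__":
    print(max_frame_size())               # untagged frame
    print(max_frame_size(tagged=True))    # 802.1Q-tagged frame
    print(exceeds_standard(1518))
```

So a capture showing 1518-byte frames is consistent with a normal 1500-byte MTU; only frames above that size (or above 1522 bytes when tagged) would point to an MTU mismatch.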

If the problem still exists after aligning the packet size, you can follow the link below to tune the buffer size. I recommend tuning the buffer size at the 3550 first, then at the 6509 (if there is a need).

http://www.cisco.com/en/US/products/hw/routers/ps133/products_tech_note09186a00800a7b80.shtml

Moreover, there is always a problem when traffic comes from a high-speed connection to a low-speed connection. If the problem still exists after all these efforts, you may need to upgrade the 100 Mbps NIC to a 1 Gbps NIC to match the speed. The 100 Mbps link cannot handle such high throughput, so packets queue up in the switch buffer and are then dropped and have to be resent by the remote end.
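The high-speed to low-speed effect can be put into numbers with a simple model. The sketch below, in Python, ignores preamble, inter-frame gap and any flow control, and the burst size of 100 frames is just an assumed example value.

```python
def serialization_delay_us(frame_bytes, link_mbps):
    """Microseconds needed to put one frame on the wire."""
    # bits divided by Mbit/s gives microseconds directly.
    return frame_bytes * 8 / link_mbps


def frames_queued_after_burst(burst_frames, ingress_mbps, egress_mbps):
    """Frames still waiting when a back-to-back burst has fully arrived."""
    if egress_mbps >= ingress_mbps:
        return 0.0
    # While the burst arrives, egress drains only egress/ingress of it.
    return burst_frames * (ingress_mbps - egress_mbps) / ingress_mbps


if __name__ == "__main__":
    print(serialization_delay_us(1518, 1000))   # gigabit side
    print(serialization_delay_us(1518, 100))    # FastEthernet side
    # ASSUMED burst of 100 full-size frames arriving at line rate.
    print(frames_queued_after_burst(100, 1000, 100))
```

Each full-size frame takes ten times longer to leave on the 100 Mbps port than it took to arrive, so in this model nine out of every ten frames in a line-rate burst have to sit in the switch buffer.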

Hope this helps.

leonvd79 · 08-27-2006

Hello Jack,

I am quite sure it's the data rate mismatch that causes the buffers of the 3550 to reach maximum capacity, especially the big buffers.

I will look into tuning the maximum segment size on the server side. The receiving server has gigabit capabilities, but is connected to a 10/100 port on a 3550.

I know for a fact that moving the server to a GigabitEthernet capable 3750 will solve the problem. For the remaining servers on the 3550 I will either tune the application or modify the buffers.

Again, thank you for collaborating on this issue.

Best regards,

--Leon

jackyoung · 08-28-2006

You're welcome. Hope you can fix it very soon. Do remember to update us with the result. Thx.

glen.grant · 08-28-2006

You also have to remember that even with a 4-port etherchannel, if you have high traffic between 2 devices across this channel, all of that traffic is going to go down just 1 of those pipes because of the way the etherchannel hashing algorithm works. It is not load balanced across all 4, so this could also be a source of your problem. If you have any kind of traffic analysis program like vitalnet or something like that, you could see if any of those links is getting saturated at times and backing up traffic. Also verify that if you have hardcoded the switchports, the server ports are hardcoded to a specific speed/duplex as well.
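The single-pipe behaviour can be illustrated with a toy hash. The Python sketch below is a simplified stand-in, not the actual Catalyst algorithm: the real hash depends on the configured load-balance method and on the platform. The function name `pick_link` and the addresses (taken from the documentation range 192.0.2.0/24) are hypothetical.

```python
import ipaddress


def pick_link(src_ip, dst_ip, n_links):
    """Pick a member link from an XOR of source and destination address.

    Simplified illustration only; real switches hash on whatever fields
    the configured load-balance method selects.
    """
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % n_links


if __name__ == "__main__":
    # HYPOTHETICAL front-end and back-end server addresses.
    front_end, back_end = "192.0.2.10", "192.0.2.20"
    # Every frame of this conversation hashes to the same member link.
    links = {pick_link(front_end, back_end, 4) for _ in range(1000)}
    print(links)
```

Because the inputs to the hash never change for a given pair of hosts, one heavy conversation cannot use more than one member link, no matter how many links are in the bundle.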