Catalyst 3120 Stack + iSCSI + Blade Servers = Output drops?

Mike Hendriks
Level 1

We have an HP c7000 blade system chassis with several servers.  These servers connect to a pair of 3120 blade switches.  The external ports on these switches connect to our HP P4300 SAN nodes.  The two NICs on the blade servers are aggregated with LACP.  We're seeing sporadic packet loss as well as poor iSCSI performance out of our SAN.  Some investigation has revealed the following:

tritc3120a#show int gi1/0/1
GigabitEthernet1/0/1 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is 9c4e.208d.a381 (bia 9c4e.208d.a381)
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 2/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, link type is auto, media type is 1000BaseX
  input flow-control is off, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:11, output 00:00:00, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 905047
  Queueing strategy: fifo
  Output queue: 0/0 (size/max)
  5 minute input rate 8188000 bits/sec, 875 packets/sec
  5 minute output rate 2420000 bits/sec, 1195 packets/sec
     7965015587 packets input, 8919151697021 bytes, 0 no buffer
     Received 567914 broadcasts (563894 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 563894 multicast, 0 pause input
     0 input packets with dribble condition detected
     9033618415 packets output, 6447693918530 bytes, 0 underruns
     0 output errors, 0 collisions, 1 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out
tritc3120a#

What concerns me here is the Total output drops counter.  Strange, as the other queues appear okay.
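
To put that counter in perspective, here is a rough Python sketch (not part of the original post) that pulls the two relevant counters out of the output above and relates drops to transmitted packets. The regular expressions assume the exact field wording shown above, and the function name is made up for illustration. The counters on this port have never been cleared, so the result is a lifetime ratio, not a current rate.

```python
import re

# Two lines copied from the "show int gi1/0/1" output above.
SHOW_INT_EXCERPT = """\
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 905047
     9033618415 packets output, 6447693918530 bytes, 0 underruns
"""

def output_drop_ratio(show_int_text):
    """Return (drops, packets_output, drops per transmitted packet)."""
    drops = int(re.search(r"Total output drops: (\d+)", show_int_text).group(1))
    sent = int(re.search(r"(\d+) packets output", show_int_text).group(1))
    return drops, sent, drops / sent

drops, sent, ratio = output_drop_ratio(SHOW_INT_EXCERPT)
print(f"{drops} drops against {sent} packets output ({ratio:.4%})")
```

A small lifetime ratio can still hide short periods of heavy loss, so comparing two snapshots taken a few minutes apart is usually more telling than a single reading.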

Also, I'm seeing a large number of jumbo frame issues:

tritc3120a#show controllers ethernet-controller gi1/0/1

     Transmit GigabitEthernet1/0/1            Receive
   1000066809 Bytes                       2916853617 Bytes
    440824305 Unicast frames              3669591425 Unicast frames
      2558846 Multicast frames                563900 Multicast frames
       472985 Broadcast frames                  4020 Broadcast frames
            0 Too old frames              2925499950 Unicast bytes
            0 Deferred frames               60954034 Multicast bytes
            0 MTU exceeded frames             408762 Broadcast bytes
            0 1 collision frames                   0 Alignment errors
            0 2 collision frames                   0 FCS errors
            0 3 collision frames                   0 Oversize frames
            0 4 collision frames                   0 Undersize frames
            0 5 collision frames                   0 Collision fragments
            0 6 collision frames
            0 7 collision frames          1539539677 Minimum size frames
            0 8 collision frames           322853270 65 to 127 byte frames
            0 9 collision frames           102469750 128 to 255 byte frames
            0 10 collision frames           86998930 256 to 511 byte frames
            0 11 collision frames          227645008 512 to 1023 byte frames
            0 12 collision frames          295283003 1024 to 1518 byte frames
            0 13 collision frames                  0 Overrun frames
            0 14 collision frames                  0 Pause frames
            0 15 collision frames
            0 Excessive collisions                 0 Symbol error frames
            0 Late collisions                      0 Invalid frames, too large
            0 VLAN discard frames         1095369707 Valid frames, too large
            0 Excess defer frames                  0 Invalid frames, too small
      1244463 64 byte frames                       0 Valid frames, too small
   3944360208 127 byte frames
    411818841 255 byte frames                      0 Too old frames
    629842568 511 byte frames                      0 Valid oversize frames
    186172867 1023 byte frames                     0 System FCS error frames
    443082972 1518 byte frames                     0 RxPortFifoFull drop frame
   3417268809 Too large frames
            0 Good (1 coll) frames
            0 Good (>1 coll) frames

tritc3120a#

The concerning items here are the Too large frames and the Valid frames, too large counters.

These counters are only incrementing on the ports which have the blade servers attached to them.  The external ports connecting to the SAN appliances are not showing the same counter increments.

The servers are running Windows Server 2008 SP2 with the HP NIC teaming software.

The MTU on both the NIC software and the blade switches has been left at the default.

Any advice would be appreciated.

Thanks,

8 Replies

Jayakrishna Mada
Cisco Employee

Hi Mike,

Too Large frames are frames that are larger than 1518 bytes. These are usually packets with a dot1q header on them, which are seen only on trunk links. I am not sure whether your SAN links are also trunk links.
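
As a quick illustration of where the 1518-byte boundary comes from, here is a small Python sketch of the standard Ethernet framing arithmetic. The constant and function names are made up for this example; the byte counts are the usual Ethernet II header, frame check sequence and 802.1Q tag sizes.

```python
ETH_HEADER_BYTES = 14  # destination MAC + source MAC + EtherType
FCS_BYTES = 4          # frame check sequence
DOT1Q_TAG_BYTES = 4    # 802.1Q VLAN tag carried on trunk links

def frame_size(payload_bytes, vlan_tagged=False):
    """On-wire frame size, excluding preamble and inter-frame gap."""
    tag = DOT1Q_TAG_BYTES if vlan_tagged else 0
    return ETH_HEADER_BYTES + tag + payload_bytes + FCS_BYTES

print(frame_size(1500))                    # untagged frame at the default MTU
print(frame_size(1500, vlan_tagged=True))  # same payload with a dot1q tag
```

A full-size tagged frame lands 4 bytes over 1518, which is why it would show up in a "too large" bucket without necessarily being an error.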

The main concern here is the output drops on the interface.

These drops indicate that the interface is oversubscribed and that bursty traffic is exhausting the buffers on the interface. You can verify whether the port is running out of buffers using the following command:

show platform port-asic stats drops int gi1/0/1

Do you see drops on the other interface that is part of the port-channel as well?

Do you have QoS enabled on this switch?

JayaKrishna

Thanks, JayaKrishna,

The too large frames item turned out to be a problem with the HP NIC teaming software on the servers.  I am addressing it with HP.

I have split the interfaces on my servers so that one connects to the production VLAN, and one connects to the iSCSI VLAN.  I am primarily seeing the output drops on the interfaces configured to connect to the iSCSI VLANs.  The 3120 switches in this chassis are stacked, and the iSCSI interfaces are on switch 2 (the non-master).  Unfortunately, when I try to check the drop statistics for an interface on that switch, I receive the following output:

tritc3120a#show platform port-a stats drop gi2/0/5
%Command Rejected: interface 'GigabitEthernet2/0/5' is not local port
tritc3120a#

I am able to get the following output for an interface on unit 1, though I am unsure of how to interpret it:

tritc3120a#show platform port-a stats drop gi1/0/5

  Interface Gi1/0/5 TxQueue Drop Statistics
    Queue 0
      Weight 0 Frames 0
      Weight 1 Frames 0
      Weight 2 Frames 0
    Queue 1
      Weight 0 Frames 7582674
      Weight 1 Frames 50
      Weight 2 Frames 0
    Queue 2
      Weight 0 Frames 0
      Weight 1 Frames 0
      Weight 2 Frames 0
    Queue 3
      Weight 0 Frames 0
      Weight 1 Frames 0
      Weight 2 Frames 0
    Queue 4
      Weight 0 Frames 0
      Weight 1 Frames 0
      Weight 2 Frames 0
    Queue 5
      Weight 0 Frames 0
      Weight 1 Frames 0
      Weight 2 Frames 0
    Queue 6
      Weight 0 Frames 0
      Weight 1 Frames 0
      Weight 2 Frames 0
    Queue 7
      Weight 0 Frames 0
      Weight 1 Frames 0
      Weight 2 Frames 0
tritc3120a#
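
For anyone comparing this output across many ports, a hedged Python sketch of how it could be summarised per queue. The parser assumes the "Queue N" / "Weight N Frames N" layout shown above, the function name is hypothetical, and only the first three queues are reproduced in the sample to keep it short.

```python
import re
from collections import defaultdict

# First three queues copied from the Gi1/0/5 output above.
TXQUEUE_EXCERPT = """\
  Interface Gi1/0/5 TxQueue Drop Statistics
    Queue 0
      Weight 0 Frames 0
      Weight 1 Frames 0
      Weight 2 Frames 0
    Queue 1
      Weight 0 Frames 7582674
      Weight 1 Frames 50
      Weight 2 Frames 0
    Queue 2
      Weight 0 Frames 0
      Weight 1 Frames 0
      Weight 2 Frames 0
"""

def drops_per_queue(text):
    """Sum the dropped-frame counters of every weight, per TX queue."""
    totals = defaultdict(int)
    queue = None
    for line in text.splitlines():
        if m := re.match(r"\s*Queue (\d+)\s*$", line):
            queue = int(m.group(1))
        elif queue is not None and (m := re.match(r"\s*Weight \d+ Frames (\d+)\s*$", line)):
            totals[queue] += int(m.group(1))
    return dict(totals)

totals = drops_per_queue(TXQUEUE_EXCERPT)
worst = max(totals, key=totals.get)
print(f"queue {worst} dropped {totals[worst]} frames")
```

On this sample every drop lands in queue 1, almost all of them at weight 0.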

There is no QoS configured on this stack.  The ports showing output drops are configured as access ports:

tritc3120a#show int gi2/0/5
GigabitEthernet2/0/5 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is 9c4e.208d.a305 (bia 9c4e.208d.a305)
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, link type is auto, media type is 1000BaseX
  input flow-control is off, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 1w1d, output 00:00:10, output hang never
  Last clearing of "show interface" counters 1w3d
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 4295486
  Queueing strategy: fifo
  Output queue: 0/0 (size/max)
  5 minute input rate 788000 bits/sec, 81 packets/sec
  5 minute output rate 846000 bits/sec, 87 packets/sec
     1894620263 packets input, 2283485348557 bytes, 0 no buffer
     Received 2732 broadcasts (1351 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 1351 multicast, 0 pause input
     0 input packets with dribble condition detected
     1959776457 packets output, 2687683551615 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out

tritc3120a#show run int gi2/0/5
Building configuration...

Current configuration : 113 bytes
!
interface GigabitEthernet2/0/5
 switchport access vlan 34
 switchport mode access
 spanning-tree portfast
end

tritc3120a#

Hi Mike,

Try "remote command 2 show platform port-a stats drop gi2/0/5"

or

session 2 (this will take you to switch 2 in the stack)

and run show platform port-a stats drop gi2/0/5

exit (to exit from member to the stack master)

You will probably see the drops on the same queue 1, and mostly at weight 0. This is where all the traffic goes when QoS is disabled on the switch.

The bottom line is that the traffic going to the iSCSI devices is bursty in nature or is oversubscribing the link. You have to move the devices to a switch interface that has more buffers, or you need to take some load off one iSCSI device and move it to a different one where you are not seeing drops.
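
To illustrate how a port can drop frames while its average load looks low, here is a toy Python model of a single egress queue. Everything in it is an assumption made for the example: the buffer depth, the drain rate of one frame per tick and the traffic patterns are invented and are not 3120 figures.

```python
def simulate_egress(arrivals_per_tick, drain_per_tick, buffer_frames):
    """Tail-drop FIFO: frames that do not fit in the buffer are dropped."""
    queued = sent = dropped = 0
    for arriving in arrivals_per_tick:
        accepted = min(arriving, buffer_frames - queued)
        dropped += arriving - accepted
        queued += accepted
        out = min(queued, drain_per_tick)
        queued -= out
        sent += out
    return sent, dropped

TICKS = 100
# Same 60 frames offered over 100 ticks (60% average load), shaped two ways.
bursty = [60] + [0] * (TICKS - 1)                  # everything arrives at once
smooth = [1 if i % 5 < 3 else 0 for i in range(TICKS)]

print("bursty:", simulate_egress(bursty, drain_per_tick=1, buffer_frames=20))
print("smooth:", simulate_egress(smooth, drain_per_tick=1, buffer_frames=20))
```

Both patterns offer the same average load, but only the bursty one overflows the 20-frame buffer. That is the situation described above: a 5-minute rate far below line rate alongside a steadily growing output drop counter.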

JayaKrishna

Thank you, JayaKrishna,

It is my understanding, then, that the switch port buffers to SEND data to the server are overflowing, causing the switch to drop those frames.  I wonder, could the problem be alleviated somewhat by enabling jumbo frames, reducing the frame rate for the same amount of data?

I've also done a comparison with a 3750 switch, and noticed something strange.  Output from the show interfaces command indicates that the 3120 has a maximum output queue size of zero, while the 3750 has 40.  Is this normal?  As far as I can tell, the switches are configured largely identically:

3750

tritc3750a#show int gi1/0/19
GigabitEthernet1/0/19 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is 0021.d7f9.3f93 (bia 0021.d7f9.3f93)
  Description: ks00011-port1 edge port
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
  input flow-control is on, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input never, output 00:00:01, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)

3120

tritc3120a#show int gi2/0/5
GigabitEthernet2/0/5 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is 9c4e.208d.a305 (bia 9c4e.208d.a305)
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 2/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, link type is auto, media type is 1000BaseX
  input flow-control is off, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 4d01h, output 00:00:17, output hang never
  Last clearing of "show interface" counters 2w2d
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 9099289
  Queueing strategy: fifo
  Output queue: 0/0 (size/max)

Mike,

What you are seeing is a cosmetic issue documented under the following defect, which also applies to the 3120. Also, the output queue that you are looking at is not related to the port-ASIC buffers.

"CSCsz61947  3750 plat shows output queue 0/0(max) under show int after an iosupgrade "

What code are you running on the 3120 switch?

I don't think enabling jumbo frames will resolve it. It might or might not decrease the frame rate, but the frame size is going to increase. So instead of buffering, for example, 10 packets (1500*10), the port will be able to buffer only about 2 packets (9000*2), as the buffer size on the port is fixed.
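
The trade-off can be put into numbers with a short Python sketch. The 18,000-byte buffer is a hypothetical round figure chosen so that both frame sizes divide evenly; it is not a 3120 specification, and the sketch ignores protocol headers and the way real hardware carves up its buffer memory.

```python
def frames_that_fit(buffer_bytes, frame_bytes):
    """Whole frames of one size that fit in a fixed-size buffer."""
    return buffer_bytes // frame_bytes

def frames_needed(payload_bytes, mtu_bytes):
    """Frames needed to carry a payload, ignoring protocol headers."""
    return -(-payload_bytes // mtu_bytes)  # ceiling division

BUFFER_BYTES = 18_000  # hypothetical per-port buffer, for illustration only

for mtu in (1500, 9000):
    print(
        f"MTU {mtu}: {frames_that_fit(BUFFER_BYTES, mtu)} frames fit in the buffer, "
        f"{frames_needed(1_000_000, mtu)} frames carry 1 MB"
    )
```

Jumbo frames cut the number of frames roughly sixfold, but the buffer also holds roughly six times fewer of them, so the number of bytes a burst can park in the buffer stays the same.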

JayaKrishna

Unfortunately, the link you posted does not come up for me.  The switch is running 12.2(50)SE3 - filename cbs31x0-universal-mz.122-50.SE3.bin.  I am planning on installing the latest at my next opportunity.

I ran the show buffers command through the output interpreter, and it returned the following:

Buffer elements:
     1067 in free list (500 max allowed)
     229528197 hits, 0 misses, 1024 created

Public buffer pools:
Small buffers, 104 bytes (total 50, permanent 50, peak 110 @ 7w0d):
     49 in free list (20 min, 150 max allowed)
     260400968 hits, 239 misses, 414 trims, 414 created
     0 failures (0 no memory)
Middle buffers, 600 bytes (total 25, permanent 25, peak 124 @ 7w0d):
     23 in free list (10 min, 150 max allowed)
     23019020 hits, 821 misses, 2459 trims, 2459 created
     0 failures (0 no memory)
Big buffers, 1536 bytes (total 50, permanent 50, peak 110 @ 7w0d):
     50 in free list (5 min, 150 max allowed)
     151262300 hits, 320 misses, 960 trims, 960 created
     0 failures (0 no memory)
VeryBig buffers, 4520 bytes (total 16, permanent 10, peak 16 @ 7w0d):
     0 in free list (0 min, 100 max allowed)
     59 hits, 3 misses, 5 trims, 11 created
     0 failures (0 no memory)
Large buffers, 5024 bytes (total 0, permanent 0):
     0 in free list (0 min, 10 max allowed)
     0 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Huge buffers, 18024 bytes (total 0, permanent 0):
     0 in free list (0 min, 4 max allowed)
     0 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)

Interface buffer pools:
FastEthernet0-Physical buffers, 1524 bytes (total 32, permanent 32):
     8 in free list (0 min, 32 max allowed)
     24 hits, 0 fallbacks
     8 max cache size, 8 in cache
     62507582 hits in cache, 0 misses in cache
RxQFB buffers, 2040 bytes (total 904, permanent 904):
     896 in free list (0 min, 904 max allowed)
     2196944 hits, 0 misses
RxQ0 buffers, 2040 bytes (total 1200, permanent 1200):
     700 in free list (0 min, 1200 max allowed)
     291275731 hits, 0 misses
RxQ1 buffers, 2040 bytes (total 128, permanent 128):
     1 in free list (0 min, 128 max allowed)
     7535128 hits, 64114 fallbacks
RxQ2 buffers, 2040 bytes (total 128, permanent 128):
     1 in free list (0 min, 128 max allowed)
     11365639 hits, 88794 fallbacks, 0 trims, 0 created
     88794 failures (0 no memory)
RxQ3 buffers, 2040 bytes (total 128, permanent 128):
     3 in free list (0 min, 128 max allowed)
     33222121 hits, 398862 fallbacks
RxQ4 buffers, 2040 bytes (total 128, permanent 128):
     2 in free list (0 min, 128 max allowed)
     4575702 hits, 42533 fallbacks
RxQ5 buffers, 2040 bytes (total 128, permanent 128):
     64 in free list (0 min, 128 max allowed)
     64 hits, 0 misses
RxQ6 buffers, 2040 bytes (total 128, permanent 128):
     0 in free list (0 min, 128 max allowed)
     128 hits, 0 misses
RxQ7 buffers, 2040 bytes (total 192, permanent 192):
     63 in free list (0 min, 192 max allowed)
     5893193 hits, 0 misses
RxQ8 buffers, 2040 bytes (total 64, permanent 64):
     0 in free list (0 min, 64 max allowed)
     99503670 hits, 97318104 misses
RxQ9 buffers, 2040 bytes (total 1, permanent 1):
     0 in free list (0 min, 1 max allowed)
     1 hits, 0 misses
RxQ10 buffers, 2040 bytes (total 64, permanent 64):
     1 in free list (0 min, 64 max allowed)
     7560595 hits, 1579604 fallbacks
RxQ11 buffers, 2040 bytes (total 16, permanent 16):
     0 in free list (0 min, 16 max allowed)
     16 hits, 0 misses
RxQ12 buffers, 2040 bytes (total 96, permanent 96):
     0 in free list (0 min, 96 max allowed)
     96 hits, 0 misses
RxQ13 buffers, 2040 bytes (total 16, permanent 16):
     0 in free list (0 min, 16 max allowed)
     16 hits, 0 misses
RxQ15 buffers, 2040 bytes (total 4, permanent 4):
     0 in free list (0 min, 4 max allowed)
     213114988 hits, 213114984 misses
IPC buffers, 2048 bytes (total 300, permanent 300):
     287 in free list (150 min, 500 max allowed)
     12971306 hits, 0 fallbacks, 0 trims, 0 created
     0 failures (0 no memory)
Jumbo buffers, 9240 bytes (total 200, permanent 200):
     200 in free list (0 min, 200 max allowed)
     0 hits, 0 misses

SHOW BUFFERS ANALYSIS

ERROR: Since its last reload, this router has created or maintained a relatively
large number of 'VeryBig buffers' yet still has very few free buffers.

ERROR: Since its last reload, this router has created or maintained a relatively
large number of 'FastEthernet0-Physical buffers' yet still has very few free buffers.

The above symptoms suggest that a buffer leak has occurred.

BUFFER LEAK: When a process is finished with a buffer, the process should free the
buffer. A buffer leak occurs when the code forgets to process a buffer, or forgets
to free it after it is done with the packet. As a result, the buffer pool continues
to grow as more and more packets are stuck in the buffers. Some routers (for example,
2600, 3600, and 4000 Series) require a minimum amount of I/O memory to support certain
interface processors.
Not Enough Shared Memory for the Interfaces.
NOTE:
(1)Some of the Public Buffer pools should be abnormally large with few free buffers.
After a reload, you may see that the number of free buffers never gets close to
the number of total buffers.
(2)You should check the buffers on a regular basis. Some leaks are slow but others
are very fast.
(3)If you configure or access the router through telnet, you need to check the buffers
on a regular basis via remote access (telnet) before the router hangs to see which
pool is leaking. Once you see that for one pool the total number is increasing
and the free number is low (the faulty pool), you need to capture a 'show buffer
pool  dump'. But if you don't have any memory available on the box, it's too late
to collect the information. You have to collect the information before the hang.
TRY THIS:
Router is running low on shared memory, even after a reload, physically removing
interfaces solves the problem.
This could be a Cisco IOS software bug. Upgrade to the latest version in your release
train to fix known buffer leak bugs. For example, if you are running Cisco IOS
Software Release 11.2(14), upgrade to the latest 11.2(x).
If you need assistance with the IOS upgrade and software download, please check
the URL below: Software Download Center
Commands to check the additional information about the content of the buffers:
show buffer pool (small - middle - big - verybig - large - huge): shows a summary
of the buffers for the specified pool.
show buffer pool (small - middle - big - verybig - large - huge) dump: shows a hex/ASCII
dump of all the buffers of a given pool.
show tech-support of the router.
How to identify that a pool has a problem:
(a) If the number of misses and creates increases at a high rate (as a % of hits)
(b) If the number of buffers in the free list is consistently low
(c) If the number of failures or the 'no memory' count increases
REFERENCE: For more information see Troubleshooting Buffer Leaks
REFERENCE: For more information see Troubleshooting Memory Problems


INFO: The buffer counters can be cleared only by reloading the router.

INFO: Interfaces use the 'interface buffer' pools for input and output (I/O). When
there are no more buffers in the interface buffer free list, the router goes to
the public buffer pools as a fallback. Performance is not affected in case of a
fallback. Interface buffers should not be tuned.

Here is the output field terminology for the 'show buffers' command:
  - HITS: The number of buffers that have been requested from the buffer pool.
    This counter provides a mechanism to determine which pool must meet the
    highest demand for buffers.
  - MISSES: The number of times buffers have been requested, but the processor
    has detected a demand for additional buffers, and has been forced to create
    them. Thus this counter represents the number of times the router has been
    forced to create additional buffers.
  - MAX-ALLOWED: The maximum number of buffers in the free-list. If the number of
    buffers 'in free list' is greater than the 'max-allowed' value, the router will
    attempt to trim buffers from the pool. The 'max-allowed' parameter is used to
    prevent a pool from monopolizing buffers that it does not need anymore and free
    this memory back to the system for further use.
  - FREE-LIST: The number of buffers in the pool, ready for use.
  - MIN: The minimum number of buffers from the pool at any given time.
  - TRIMS: When the value 'in free list' exceeds that of 'max allowed' the processor
    trims the buffers.
  - CREATED: The number of buffers that are created when the free-list is less
    than the minimum buffers allowed, or is of zero value.
  - FAILURES: The number of failures met by the packets when there was a failure
    in an attempt to create buffers even after additional buffers were created.
    This counter represents the number of packets that have been dropped due to
    buffer shortage.
  - TOTAL: The total number of used and unused buffers.
  - PERMANENT: Identifies the permanent number of allocated buffers in the pool,
    that cannot be trimmed away.
  - NO MEMORY: The number of failures caused by insufficient memory to create
    additional buffers.
  - INITIAL: The temporary buffers allotted during system reload and for session
    establishments.
  - MAX-FREE & MIN-FREE: The maximum and minimum number of free buffers.
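
Criterion (a) from the analysis above, misses as a percentage of hits, can be checked mechanically. The Python sketch below is only an illustration: the 10% threshold and the function names are invented, the sample holds three pools copied from the show buffers output, and whether these pools have anything to do with the egress drops discussed in this thread is a separate question.

```python
import re

# Three interface pools copied from the show buffers output above.
SHOW_BUFFERS_EXCERPT = """\
RxQ0 buffers, 2040 bytes (total 1200, permanent 1200):
     700 in free list (0 min, 1200 max allowed)
     291275731 hits, 0 misses
RxQ8 buffers, 2040 bytes (total 64, permanent 64):
     0 in free list (0 min, 64 max allowed)
     99503670 hits, 97318104 misses
RxQ15 buffers, 2040 bytes (total 4, permanent 4):
     0 in free list (0 min, 4 max allowed)
     213114988 hits, 213114984 misses
"""

POOL_RE = re.compile(r"^(?P<name>\S.*?) buffers, \d+ bytes \(total (?P<total>\d+)")
FREE_RE = re.compile(r"^(?P<free>\d+) in free list")
HITS_RE = re.compile(r"^(?P<hits>\d+) hits, (?P<misses>\d+) misses")

def parse_pools(text):
    """Map pool name -> {'total', 'free', 'hits', 'misses'} where present."""
    pools, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if m := POOL_RE.match(line):
            current = pools.setdefault(m["name"], {"total": int(m["total"])})
        elif current is not None and (m := FREE_RE.match(line)):
            current["free"] = int(m["free"])
        elif current is not None and (m := HITS_RE.match(line)):
            current["hits"] = int(m["hits"])
            current["misses"] = int(m["misses"])
    return pools

def miss_ratio(pool):
    return pool["misses"] / pool["hits"] if pool.get("hits") else 0.0

def suspicious(pools, threshold=0.10):  # threshold is an arbitrary example value
    return sorted(n for n, p in pools.items() if miss_ratio(p) > threshold)

pools = parse_pools(SHOW_BUFFERS_EXCERPT)
for name in suspicious(pools):
    print(f"{name}: {miss_ratio(pools[name]):.1%} misses, {pools[name]['free']} free")
```

Pools that report "fallbacks" instead of "misses" are ignored by this parser, so it under-reports on the full output.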

Hi Mike,

Since I am seeing similar behaviour (on blades with ESXi), did you find a solution to this?

Br

Benny
