01-13-2011 04:08 PM - edited 03-06-2019 02:59 PM
We have an HP c7000 blade system chassis with several servers. These servers connect to a pair of 3120 blade switches. The external ports on these switches connect to our HP P4300 SAN nodes. The two NICs on the blade servers are aggregated with LACP. We're seeing sporadic packet loss as well as poor iSCSI performance out of our SAN. Some investigation has revealed the following:
tritc3120a#show int gi1/0/1
GigabitEthernet1/0/1 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is 9c4e.208d.a381 (bia 9c4e.208d.a381)
MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 2/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, link type is auto, media type is 1000BaseX
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:11, output 00:00:00, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 905047
Queueing strategy: fifo
Output queue: 0/0 (size/max)
5 minute input rate 8188000 bits/sec, 875 packets/sec
5 minute output rate 2420000 bits/sec, 1195 packets/sec
7965015587 packets input, 8919151697021 bytes, 0 no buffer
Received 567914 broadcasts (563894 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 563894 multicast, 0 pause input
0 input packets with dribble condition detected
9033618415 packets output, 6447693918530 bytes, 0 underruns
0 output errors, 0 collisions, 1 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out
tritc3120a#
What concerns me here is the Total output drops counter. Strange, as the other counters appear okay.
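To put the drop counter in context, here is a minimal Python sketch that pulls a few fields out of the `show interface` text above and compares drops against packets output and against the 5 minute average load. The `summarize` helper and the trimmed `SAMPLE` text are illustrative only, and since the counters were never cleared, the ratio is a lifetime figure rather than a current rate.

```python
import re

# A few lines copied from the gi1/0/1 output above.
SAMPLE = """\
MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 905047
5 minute output rate 2420000 bits/sec, 1195 packets/sec
9033618415 packets output, 6447693918530 bytes, 0 underruns
"""

def summarize(text):
    """Extract drop and load figures from `show interface` text."""
    drops = int(re.search(r"Total output drops: (\d+)", text).group(1))
    sent = int(re.search(r"(\d+) packets output", text).group(1))
    bw_kbit = int(re.search(r"BW (\d+) Kbit", text).group(1))
    out_bps = int(re.search(r"5 minute output rate (\d+) bits/sec", text).group(1))
    return {
        "drops": drops,
        # Drops expressed as a percentage of packets output.
        "drop_pct": 100.0 * drops / sent,
        # Average output load as a percentage of interface bandwidth.
        "avg_util_pct": 100.0 * out_bps / (bw_kbit * 1000),
    }

stats = summarize(SAMPLE)
print(stats)
```

The average utilization comes out well under 1%, so whatever is causing the drops is not visible in the 5 minute averages.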
Also, I'm seeing a large number of jumbo frame issues:
tritc3120a#show controllers ethernet-controller gi1/0/1
Transmit GigabitEthernet1/0/1 Receive
1000066809 Bytes 2916853617 Bytes
440824305 Unicast frames 3669591425 Unicast frames
2558846 Multicast frames 563900 Multicast frames
472985 Broadcast frames 4020 Broadcast frames
0 Too old frames 2925499950 Unicast bytes
0 Deferred frames 60954034 Multicast bytes
0 MTU exceeded frames 408762 Broadcast bytes
0 1 collision frames 0 Alignment errors
0 2 collision frames 0 FCS errors
0 3 collision frames 0 Oversize frames
0 4 collision frames 0 Undersize frames
0 5 collision frames 0 Collision fragments
0 6 collision frames
0 7 collision frames 1539539677 Minimum size frames
0 8 collision frames 322853270 65 to 127 byte frames
0 9 collision frames 102469750 128 to 255 byte frames
0 10 collision frames 86998930 256 to 511 byte frames
0 11 collision frames 227645008 512 to 1023 byte frames
0 12 collision frames 295283003 1024 to 1518 byte frames
0 13 collision frames 0 Overrun frames
0 14 collision frames 0 Pause frames
0 15 collision frames
0 Excessive collisions 0 Symbol error frames
0 Late collisions 0 Invalid frames, too large
0 VLAN discard frames 1095369707 Valid frames, too large
0 Excess defer frames 0 Invalid frames, too small
1244463 64 byte frames 0 Valid frames, too small
3944360208 127 byte frames
411818841 255 byte frames 0 Too old frames
629842568 511 byte frames 0 Valid oversize frames
186172867 1023 byte frames 0 System FCS error frames
443082972 1518 byte frames 0 RxPortFifoFull drop frame
3417268809 Too large frames
0 Good (1 coll) frames
0 Good (>1 coll) frames
tritc3120a#
The concerning items here are the Too large frames and the Valid frames, too large counters.
These counters are only incrementing on the ports which have the blade servers attached to them. The external ports connecting to the SAN appliances are not showing the same counter increments.
The servers are running Windows Server 2008 SP2 with the HP NIC teaming software.
MTU on both the NIC software and the blade switches has been left at the default.
Any advice would be appreciated.
Thanks,
01-20-2011 05:17 PM
Hi Mike,
Too Large frames are frames larger than 1518 bytes. These are usually packets with a dot1q header on them, which are seen only on trunk links. I'm not sure whether your SAN links are also trunk links.
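The arithmetic behind the 1518-byte threshold can be sketched as follows, assuming the default 1500-byte MTU and standard Ethernet framing. The `frame_size` helper is just an illustration, not anything the switch exposes.

```python
ETH_HEADER = 14  # destination MAC (6) + source MAC (6) + EtherType (2)
FCS = 4          # frame check sequence
DOT1Q_TAG = 4    # 802.1Q VLAN tag inserted on tagged (trunk) frames

def frame_size(payload, tagged=False):
    """On-wire frame size in bytes, excluding preamble and inter-frame gap."""
    return ETH_HEADER + payload + FCS + (DOT1Q_TAG if tagged else 0)

print(frame_size(1500))               # full-size untagged frame
print(frame_size(1500, tagged=True))  # same payload with a dot1q tag
```

A full-size untagged frame lands exactly on 1518 bytes, while the same payload with a dot1q tag is 4 bytes over, which is why tagged traffic tends to show up in this counter.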
The main concern here is the output drops on the interface.
These drops indicate that the interface is oversubscribed, or that bursty traffic is exhausting the buffers on the interface. You can check whether the port is running out of buffers with the following command:
show platform port-asic stats drops int gi1/0/1
Do you see drops on the other interface that is part of the port-channel as well?
Do you have QoS enabled on this switch?
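As a rough illustration of why bursty traffic can cause output drops even when the average load is low, here is a toy tail-drop queue in Python. The buffer size of 8 frames and the traffic patterns are made-up numbers for the sketch and do not reflect the actual 3120 buffer architecture.

```python
def simulate(arrivals, buffer_frames, drain_per_tick=1):
    """Tail-drop FIFO. arrivals[i] is the number of frames arriving in tick i.
    Returns (sent, dropped)."""
    queue = 0
    sent = dropped = 0
    for n in arrivals:
        # Enqueue what fits in the remaining buffer; the rest is dropped.
        accepted = min(n, buffer_frames - queue)
        dropped += n - accepted
        queue += accepted
        # Transmit at line rate.
        out = min(queue, drain_per_tick)
        queue -= out
        sent += out
    return sent, dropped

# Same average load (20 frames over 100 ticks), different shape.
smooth = ([1] + [0] * 4) * 20  # one frame every 5 ticks
bursty = [20] + [0] * 99       # 20 frames at once, then idle

print(simulate(smooth, buffer_frames=8))
print(simulate(bursty, buffer_frames=8))
```

Both patterns offer the same 20% average load, but only the bursty one overflows the buffer.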
JayaKrishna
01-24-2011 01:24 PM
Thanks, JayaKrishna,
The too large frames item turned out to be a problem with the HP NIC teaming software on the servers. I am addressing it with HP.
I have split the interfaces on my servers so that one connects to the production VLAN, and one connects to the iSCSI VLAN. I am primarily seeing the output drops on the interfaces configured to connect to the iSCSI VLANs. The 3120 switches in this chassis are stacked, and the iSCSI interfaces are on switch 2 (the non-master). Unfortunately, when I try to check the drop statistics for an interface on that switch, I receive the following output:
tritc3120a#show platform port-a stats drop gi2/0/5
%Command Rejected: interface 'GigabitEthernet2/0/5' is not local port
tritc3120a#
I am able to get the following output for an interface on unit 1, though I am unsure of how to interpret it:
tritc3120a#show platform port-a stats drop gi1/0/5
Interface Gi1/0/5 TxQueue Drop Statistics
Queue 0
Weight 0 Frames 0
Weight 1 Frames 0
Weight 2 Frames 0
Queue 1
Weight 0 Frames 7582674
Weight 1 Frames 50
Weight 2 Frames 0
Queue 2
Weight 0 Frames 0
Weight 1 Frames 0
Weight 2 Frames 0
Queue 3
Weight 0 Frames 0
Weight 1 Frames 0
Weight 2 Frames 0
Queue 4
Weight 0 Frames 0
Weight 1 Frames 0
Weight 2 Frames 0
Queue 5
Weight 0 Frames 0
Weight 1 Frames 0
Weight 2 Frames 0
Queue 6
Weight 0 Frames 0
Weight 1 Frames 0
Weight 2 Frames 0
Queue 7
Weight 0 Frames 0
Weight 1 Frames 0
Weight 2 Frames 0
tritc3120a#
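One way to make output like this easier to scan is to tabulate it and keep only the non-zero entries. The Python sketch below does that for a shortened copy of the text above; `parse_txqueue_drops` and `nonzero` are hypothetical helper names, and the parser assumes the exact "Queue N" / "Weight N Frames N" layout shown here.

```python
import re

# Shortened copy of the Gi1/0/5 output above (queues 3-7 omitted, all zero).
SAMPLE = """\
Interface Gi1/0/5 TxQueue Drop Statistics
Queue 0
Weight 0 Frames 0
Weight 1 Frames 0
Weight 2 Frames 0
Queue 1
Weight 0 Frames 7582674
Weight 1 Frames 50
Weight 2 Frames 0
Queue 2
Weight 0 Frames 0
Weight 1 Frames 0
Weight 2 Frames 0
"""

def parse_txqueue_drops(text):
    """Return {(queue, weight): dropped_frames} from the drop statistics text."""
    drops = {}
    queue = None
    for line in text.splitlines():
        m = re.match(r"\s*Queue (\d+)", line)
        if m:
            queue = int(m.group(1))
            continue
        m = re.match(r"\s*Weight (\d+) Frames (\d+)", line)
        if m and queue is not None:
            drops[(queue, int(m.group(1)))] = int(m.group(2))
    return drops

def nonzero(drops):
    """Keep only the queue/weight pairs that actually dropped frames."""
    return {key: frames for key, frames in drops.items() if frames}

print(nonzero(parse_txqueue_drops(SAMPLE)))
```

For this port, every drop is in queue 1, almost all of it at weight 0.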
There is no QOS configured on this stack. The ports showing output drops are configured as access ports:
tritc3120a#show int gi2/0/5
GigabitEthernet2/0/5 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is 9c4e.208d.a305 (bia 9c4e.208d.a305)
MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, link type is auto, media type is 1000BaseX
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 1w1d, output 00:00:10, output hang never
Last clearing of "show interface" counters 1w3d
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 4295486
Queueing strategy: fifo
Output queue: 0/0 (size/max)
5 minute input rate 788000 bits/sec, 81 packets/sec
5 minute output rate 846000 bits/sec, 87 packets/sec
1894620263 packets input, 2283485348557 bytes, 0 no buffer
Received 2732 broadcasts (1351 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 1351 multicast, 0 pause input
0 input packets with dribble condition detected
1959776457 packets output, 2687683551615 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out
tritc3120a#show run int gi2/0/5
Building configuration...
Current configuration : 113 bytes
!
interface GigabitEthernet2/0/5
switchport access vlan 34
switchport mode access
spanning-tree portfast
end
tritc3120a#
01-28-2011 12:15 PM
Hi Mike,
try "remote command 2 show platform port-a stats drop gi2/0/5"
or
session 2 (this will take you to switch 2 in the stack)
and run show platform port-a stats drop gi2/0/5
exit (to exit from member to the stack master)
You will probably see the drops on the same queue 1, mostly at weight 0. This is where all the traffic goes when QoS is disabled on the switch.
The bottom line is that the traffic going to the iSCSI devices is bursty in nature or is oversubscribing the link. You have to either move the devices to a switch interface that has more buffers, or take some load off one iSCSI device and move it to a different one where you are not seeing drops.
JayaKrishna
01-31-2011 08:30 AM
Thank you, JayaKrishna,
It is my understanding, then, that the switch port buffers to SEND data to the server are overflowing, causing the switch to drop those frames. I wonder, could the problem be alleviated somewhat by enabling jumbo frames, reducing the frame rate for the same amount of data?
01-31-2011 01:23 PM
I've also done a comparison with a 3750 switch, and noticed something strange. Output from the show interfaces command indicates that the 3120 has a maximum output queue size of zero, while the 3750 has 40. Is this normal? As far as I can tell, the switches are configured largely identically:
3750
tritc3750a#show int gi1/0/19
GigabitEthernet1/0/19 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is 0021.d7f9.3f93 (bia 0021.d7f9.3f93)
Description: ks00011-port1 edge port
MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
input flow-control is on, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output 00:00:01, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
3120
tritc3120a#show int gi2/0/5
GigabitEthernet2/0/5 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is 9c4e.208d.a305 (bia 9c4e.208d.a305)
MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
reliability 255/255, txload 2/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, link type is auto, media type is 1000BaseX
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 4d01h, output 00:00:17, output hang never
Last clearing of "show interface" counters 2w2d
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 9099289
Queueing strategy: fifo
Output queue: 0/0 (size/max)
02-01-2011 10:55 PM
Mike,
What you are seeing is a cosmetic issue documented under the following defect, which also applies to the 3120. Also, the output queue that you are looking at is not related to the port-asic buffers.
"CSCsz61947 3750 plat shows output queue 0/0(max) under show int after an iosupgrade "
What code are you running on the 3120 switch?
I don't think enabling jumbo frames will resolve it. It may or may not decrease the frame rate, but the packet size is going to increase. Since the buffer size on the port is fixed, instead of buffering, say, 10 packets (1500*10), the port will be able to buffer only one or two 9000-byte packets.
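The trade-off can be worked through with a few lines of Python. This is only a back-of-the-envelope sketch: the 15000-byte buffer is a hypothetical figure taken from the 1500*10 example, not the real per-port buffer size on the 3120, and frame overhead (headers, preamble, inter-frame gap) is ignored.

```python
def frames_per_second(bits_per_sec, frame_bytes):
    """Frame rate needed to carry a given bit rate at a given frame size."""
    return bits_per_sec / (frame_bytes * 8)

def frames_that_fit(buffer_bytes, frame_bytes):
    """Whole frames that fit in a fixed-size buffer."""
    return buffer_bytes // frame_bytes

LINE_RATE = 1_000_000_000  # 1 Gbit/s
BUFFER_BYTES = 15000       # hypothetical: the 1500*10 figure from the example

for size in (1500, 9000):
    print(size,
          round(frames_per_second(LINE_RATE, size)),
          frames_that_fit(BUFFER_BYTES, size))
```

Jumbo frames cut the frame rate by a factor of six for the same throughput, but the same fixed buffer then holds only one whole 9000-byte frame instead of ten 1500-byte frames, so a burst of the same byte volume still overflows it.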
JayaKrishna
02-02-2011 11:13 AM
Unfortunately, the link you posted does not come up for me. The switch is running 12.2(50)SE3 - filename cbs31x0-universal-mz.122-50.SE3.bin. I am planning on installing the latest at my next opportunity.
I ran the show buffers command through the output interpreter, and it returned the following:
Buffer elements:
1067 in free list (500 max allowed)
229528197 hits, 0 misses, 1024 created
Public buffer pools:
Small buffers, 104 bytes (total 50, permanent 50, peak 110 @ 7w0d):
49 in free list (20 min, 150 max allowed)
260400968 hits, 239 misses, 414 trims, 414 created
0 failures (0 no memory)
Middle buffers, 600 bytes (total 25, permanent 25, peak 124 @ 7w0d):
23 in free list (10 min, 150 max allowed)
23019020 hits, 821 misses, 2459 trims, 2459 created
0 failures (0 no memory)
Big buffers, 1536 bytes (total 50, permanent 50, peak 110 @ 7w0d):
50 in free list (5 min, 150 max allowed)
151262300 hits, 320 misses, 960 trims, 960 created
0 failures (0 no memory)
VeryBig buffers, 4520 bytes (total 16, permanent 10, peak 16 @ 7w0d):
0 in free list (0 min, 100 max allowed)
59 hits, 3 misses, 5 trims, 11 created
0 failures (0 no memory)
Large buffers, 5024 bytes (total 0, permanent 0):
0 in free list (0 min, 10 max allowed)
0 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
Huge buffers, 18024 bytes (total 0, permanent 0):
0 in free list (0 min, 4 max allowed)
0 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
Interface buffer pools:
FastEthernet0-Physical buffers, 1524 bytes (total 32, permanent 32):
8 in free list (0 min, 32 max allowed)
24 hits, 0 fallbacks
8 max cache size, 8 in cache
62507582 hits in cache, 0 misses in cache
RxQFB buffers, 2040 bytes (total 904, permanent 904):
896 in free list (0 min, 904 max allowed)
2196944 hits, 0 misses
RxQ0 buffers, 2040 bytes (total 1200, permanent 1200):
700 in free list (0 min, 1200 max allowed)
291275731 hits, 0 misses
RxQ1 buffers, 2040 bytes (total 128, permanent 128):
1 in free list (0 min, 128 max allowed)
7535128 hits, 64114 fallbacks
RxQ2 buffers, 2040 bytes (total 128, permanent 128):
1 in free list (0 min, 128 max allowed)
11365639 hits, 88794 fallbacks, 0 trims, 0 created
88794 failures (0 no memory)
RxQ3 buffers, 2040 bytes (total 128, permanent 128):
3 in free list (0 min, 128 max allowed)
33222121 hits, 398862 fallbacks
RxQ4 buffers, 2040 bytes (total 128, permanent 128):
2 in free list (0 min, 128 max allowed)
4575702 hits, 42533 fallbacks
RxQ5 buffers, 2040 bytes (total 128, permanent 128):
64 in free list (0 min, 128 max allowed)
64 hits, 0 misses
RxQ6 buffers, 2040 bytes (total 128, permanent 128):
0 in free list (0 min, 128 max allowed)
128 hits, 0 misses
RxQ7 buffers, 2040 bytes (total 192, permanent 192):
63 in free list (0 min, 192 max allowed)
5893193 hits, 0 misses
RxQ8 buffers, 2040 bytes (total 64, permanent 64):
0 in free list (0 min, 64 max allowed)
99503670 hits, 97318104 misses
RxQ9 buffers, 2040 bytes (total 1, permanent 1):
0 in free list (0 min, 1 max allowed)
1 hits, 0 misses
RxQ10 buffers, 2040 bytes (total 64, permanent 64):
1 in free list (0 min, 64 max allowed)
7560595 hits, 1579604 fallbacks
RxQ11 buffers, 2040 bytes (total 16, permanent 16):
0 in free list (0 min, 16 max allowed)
16 hits, 0 misses
RxQ12 buffers, 2040 bytes (total 96, permanent 96):
0 in free list (0 min, 96 max allowed)
96 hits, 0 misses
RxQ13 buffers, 2040 bytes (total 16, permanent 16):
0 in free list (0 min, 16 max allowed)
16 hits, 0 misses
RxQ15 buffers, 2040 bytes (total 4, permanent 4):
0 in free list (0 min, 4 max allowed)
213114988 hits, 213114984 misses
IPC buffers, 2048 bytes (total 300, permanent 300):
287 in free list (150 min, 500 max allowed)
12971306 hits, 0 fallbacks, 0 trims, 0 created
0 failures (0 no memory)
Jumbo buffers, 9240 bytes (total 200, permanent 200):
200 in free list (0 min, 200 max allowed)
0 hits, 0 misses
SHOW BUFFERS ANALYSIS
ERROR: Since its last reload, this router has created or maintained a relatively
large number of 'VeryBig buffers' yet still has very few free buffers.
ERROR: Since its last reload, this router has created or maintained a relatively
large number of 'FastEthernet0-Physical buffers' yet still has very few free buffers.
The above symptoms suggest that a buffer leak has occurred.
BUFFER LEAK: When a process is finished with a buffer, the process should free the
buffer. A buffer leak occurs when the code forgets to process a buffer, or forgets
to free it after it is done with the packet. As a result, the buffer pool continues
to grow as more
and more packets are stuck in the buffers. Some routers (for example, 2600, 3600,
and 4000 Series) require a minimum amount of I/O memory to support certain interface
processors.
Not Enough Shared Memory for the Interfaces.
NOTE:
(1)Some of the Public Buffer pools should be abnormally large with few free buffers.
After a reload, you may see that the number of free buffers never gets close to
the number of total buffers.
(2)You should check the buffers on a regular basis. Some leaks are slow but others
are very fast.
(3)If you configure or access the router through telnet,you need to check the buffers
on a regular basis via remote access (telnet) before the router hang to see in
which pool is the leak. Once you see that for one pool the total number is increasing
and the free number is low (the faulty pool), you need to capture a 'show buffer
pool dump'. But if you don't have any memory available on the box, it's too late
to collect the information . You have to collect the information before the hang.
TRY THIS:
Router is running low on shared memory, even after a reload, physically removing
interfaces solves the problem.
This could be a Cisco IOS software bug. Upgrade to the latest version in your release
train to fix known buffer leak bugs. For example, if you are running Cisco IOS
Software Release 11.2(14), upgrade to the latest 11.2(x).
If you need assistance with the IOS upgrade and software download, please check
the below URL: Software Download Center
Commands to check the additional information about the content of the buffers:
show buffer pool (small - middle - big - verybig - large - huge): shows a summary
of the buffers for the specified pool.
show buffer pool (small - middle - big - verybig - large - huge) dump: shows a hex/ASCII
dump of all the buffers of a given pool.
show tech-support of the router.
How can we identify the pool encounters a problem:
(a) If number of misses & creates increases at high rate (as a % of hits)
(b) If consistently low number of buffers in free list
(c) If number of failure or number of memory increases
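Criterion (a) above can be applied mechanically to the pasted pool statistics. The Python sketch below parses a few of the pools from the `show buffers` output and flags those whose misses are a large percentage of hits. The 5% threshold and the helper names are assumptions made for illustration, only "hits, misses" lines are handled (not "fallbacks"), and whether a flagged pool matters on this switch platform is a separate question.

```python
import re

# A few pools copied from the show buffers output above.
SAMPLE = """\
RxQ0 buffers, 2040 bytes (total 1200, permanent 1200):
700 in free list (0 min, 1200 max allowed)
291275731 hits, 0 misses
RxQ8 buffers, 2040 bytes (total 64, permanent 64):
0 in free list (0 min, 64 max allowed)
99503670 hits, 97318104 misses
RxQ15 buffers, 2040 bytes (total 4, permanent 4):
0 in free list (0 min, 4 max allowed)
213114988 hits, 213114984 misses
Jumbo buffers, 9240 bytes (total 200, permanent 200):
200 in free list (0 min, 200 max allowed)
0 hits, 0 misses
"""

POOL = re.compile(r"^(\S.*?) buffers, (\d+) bytes \(total (\d+)")
FREE = re.compile(r"^\s*(\d+) in free list")
HITS = re.compile(r"^\s*(\d+) hits, (\d+) misses")

def parse_pools(text):
    """Return a list of dicts with name, total, free, hits and misses per pool."""
    pools, cur = [], None
    for line in text.splitlines():
        if m := POOL.match(line):
            cur = {"name": m.group(1), "total": int(m.group(3)),
                   "free": 0, "hits": 0, "misses": 0}
            pools.append(cur)
        elif cur and (m := FREE.match(line)):
            cur["free"] = int(m.group(1))
        elif cur and (m := HITS.match(line)):
            cur["hits"], cur["misses"] = int(m.group(1)), int(m.group(2))
    return pools

def suspects(pools, miss_pct=5.0):
    """Names of pools whose misses exceed miss_pct percent of hits.
    The threshold is an assumption, not a Cisco-documented value."""
    return [p["name"] for p in pools
            if p["hits"] and 100.0 * p["misses"] / p["hits"] > miss_pct]

print(suspects(parse_pools(SAMPLE)))
```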
REFERENCE: For more information see Troubleshooting Buffer Leaks
REFERENCE: For more information see Troubleshooting Memory Problems
INFO: The buffer counters can be cleared only by reloading the router.
INFO: Interfaces use the 'interface buffer' pools for input and output (I/O). When
there are no more buffers in the interface buffer free list, the router goes to
the public buffer pools as a fallback. Performance is not affected in case of a
fallback. Interface buffers should not be tuned.
Here is the output field terminology for the 'show buffers' command:
- HITS: The number of buffers that have been requested from the buffer pool.
This counter provides a mechanism to determine which pool must meet the
highest demand for buffers.
- MISSES: The number of times buffers have been requested, but the processor
has detected a demand for additional buffers, and has been forced to create
them. Thus this counter represents the number of times the router has been
forced to create additional buffers.
- MAX-ALLOWED: The maximum number of buffers in the free-list. If the number of
buffers 'in free list' is greater than the 'max-allowed' value, the router will
attempt to trim buffers from the pool. The 'max-allowed' parameter is used to
prevent a pool from monopolizing buffers that it does not need anymore and free
this memory back to the system for further use.
- FREE-LIST: The number of buffers in the pool, ready for use.
- MIN: The minimum number of buffers from the pool at any given time.
- TRIMS: When the value 'in free list' exceeds that of 'max allowed' the processor
trims the buffers.
- CREATED: The number of buffers that are created when the free-list is less
than the minimum buffers allowed, or is of zero value.
- FAILURES: The number of failures met by the packets when there was a failure
in an attempt to create buffers even after additional buffers were created.
This counter represents the number of packets that have been dropped due to
buffer shortage.
- TOTAL: The total number of used and unused buffers.
- PERMANENT: Identifies the permanent number of allocated buffers in the pool,
that cannot be trimmed away.
- NO MEMORY: The number of failures caused by insufficient memory to create
additional buffers.
- INITIAL: The temporary buffers allotted during system reload and for session
establishments.
- MAX-FREE & MIN-FREE: The maximum and minimum number of free buffers.
07-13-2011 07:23 AM
Hi Mike,
I'm seeing similar behaviour (on blades with ESXi). Did you find a solution to this?
Br
Benny