6500 packet loss on blade WS-X6748-GE-TX

Sylvain Deschenes · ‎10-03-2011

Hi,

we have a 6509 with ios 12.2.33SXJ

we have 2 WS-X6516-GE-TX, WS-X6516A-GBIC, and a WS-X6748-GE-TX with a WS-F6700-CFC daughtercard

our sup is a WS-SUP720-3B

We are experiencing packet loss for everything connected to the WS-X6748-GE-TX blade. Right now we don't have any production devices on that blade because of it.

Has anyone encountered the same problem?

This switch was running hybrid before and is now running native IOS; however, I can't recall whether we had this packet loss before.

Do I need to update the firmware of the card or daughtercard? (If that is possible; I can't say I've done it before.)

thank you

Sylvain Deschenes · ‎10-04-2011

I read the release notes for 12.2SX.

It seems the ROMMON on the WS-F6700-CFC daughtercard was not up to date. I updated it to 12.2(18r)S1 as the release notes suggested. However, it did not resolve my problem; I'm still experiencing packet loss for devices connected to this blade.

Right now the blade is in slot 9 of our 6509. I could put it in slot 1, 2 or 3. Would that change anything?

thank you

Jon Marshall · ‎10-04-2011

Sylvain

The 6748 module has 2 x 20Gbps connections to the switch fabric. It has 48 10/100/1000 Mbps ports. So in theory you can oversubscribe this module, but you would need over 40 ports, or more specifically more than 20 ports per port group, transmitting 1Gbps simultaneously, which is unlikely.

Just to clarify the port group thing. The 6748 has 2 port groups -

group1 = ports 1 - 24

group2 = ports 25 - 48

each port group has access to a 20Gbps connection to the switch fabric.

So if you have more than 20 connected devices per port group transmitting 1Gbps each simultaneously then you do have oversubscription. But as i say this is highly unlikely.
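As a rough sanity check of the numbers above, the oversubscription ratio of a port group is simply the sum of its port speeds divided by the fabric bandwidth behind it. A minimal Python sketch, using only the figures quoted in this post (24 ports of 1 Gbps per group, 20 Gbps per group); the function name is made up for illustration:

```python
def oversubscription_ratio(ports, port_gbps, fabric_gbps):
    """Aggregate front-panel bandwidth divided by the fabric bandwidth behind it."""
    return ports * port_gbps / fabric_gbps

# Figures from the post: 24 x 1 Gbps ports per group, 20 Gbps to the fabric per group.
ratio = oversubscription_ratio(ports=24, port_gbps=1.0, fabric_gbps=20.0)
print(f"{ratio:.1f}:1")  # 1.2:1

# Number of ports that can run at full rate before the group is oversubscribed.
print(int(20.0 // 1.0))  # 20
```

A ratio above 1 only matters if the ports really do transmit at line rate at the same time, which is why it is unlikely to explain loss on a lightly loaded blade.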

Moving the module to a different slot in the 6509 should make no difference, as each slot provides a maximum of 40Gbps.

Is there any possibility you have enabled QoS but not tuned the buffers accordingly? Where is the packet loss, i.e. ingress to the ports or egress from the ports?

Jon

Leo Laohoo · ‎10-04-2011

1. Post the "sh interface " and what is the uptime of the chassis?

2. Can you also post "sh interface count error" please?

Sylvain Deschenes · ‎10-05-2011

QoS could be the problem, I guess. Before, the switch had this command: mls qos

While this command was on the switch we experienced packet loss and delay on our pings.

Then we disabled the command; we still had packet loss but no longer had the delay.

Is there a document that could help us configure QoS for this blade?

Here's a show interface.

We have the problem on all the ports of the 6748.

As for the uptime, the 6500 was updated this weekend, so about 4 days.

thank you

GigabitEthernet9/1 is up, line protocol is up (connected)
  Hardware is C6k 1000Mb 802.3, address is 0016.c810.75c0 (bia 0016.c810.75c0)
  Description:
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, media type is 10/100/1000BaseT
  input flow-control is off, output flow-control is on
  Clock mode is auto
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input never, output 00:00:38, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 5
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 0 bits/sec, 0 packets/sec
  5 minute output rate 51000 bits/sec, 14 packets/sec
     185714 packets input, 64272078 bytes, 0 no buffer
     Received 2873 broadcasts (0 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     4769209 packets output, 2225784348 bytes, 0 underruns
     0 output errors, 0 collisions, 4 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out

show interfaces gigabitEthernet 9/1 counters errors

Port     Align-Err     FCS-Err    Xmit-Err     Rcv-Err   UnderSize OutDiscards
Gi9/1            0           0           0           0           0           2

Port    Single-Col   Multi-Col    Late-Col  Excess-Col   Carri-Sen       Runts      Giants
Gi9/1            0           0           0           0           0           0           0

Port    SQETest-Err  Deferred-Tx IntMacTx-Err IntMacRx-Err   Symbol-Err
Gi9/1             0            0            0            0            0

johnnylingo · ‎02-10-2012

I'm also seeing output drops, but mine are toward a Linux server on a WS-X6748-GE-TX blade. The drops occur when the server reads from a NAS, which has a 10 Gbps connection. Unfortunately, the application does a poor job of handling the drops and does not support rate limiting. Also, both the server and NAS are on the same subnet, so implementing Layer 3 QoS is not an option.

Is there a good workaround for this scenario? Would flow control help? Or should I look into increasing the buffer size of queue #1?

#show int gig7/27

Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 261639

# show queueing int gig7/27

Packets dropped on Transmit:
BPDU packets: 0

    queue              dropped [cos-map]
    ---------------------------------------------
    1                   261639 [0 1 ]
    2                        0 [2 3 4 ]
    3                        0 [6 7 ]
    4                        0 [5 ]
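When comparing several ports, it can help to pull the per-queue drop counts out of this table programmatically. The following is only a sketch in Python: the regular expression is tuned to the output exactly as pasted above, and other IOS releases or line cards may format it differently. The function name `parse_queue_drops` is hypothetical.

```python
import re

# The transmit-drop table as pasted above.
SAMPLE = """\
    queue              dropped [cos-map]
    ---------------------------------------------
    1                   261639 [0 1 ]
    2                        0 [2 3 4 ]
    3                        0 [6 7 ]
    4                        0 [5 ]
"""

# queue number, drop count, then the CoS values inside square brackets
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+\[([\d\s]*)\]\s*$")

def parse_queue_drops(text):
    """Map queue number -> (dropped packets, list of CoS values)."""
    drops = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            queue, dropped, cos = match.groups()
            drops[int(queue)] = (int(dropped), [int(c) for c in cos.split()])
    return drops

drops = parse_queue_drops(SAMPLE)
worst = max(drops, key=lambda q: drops[q][0])
print(worst, drops[worst])  # 1 (261639, [0, 1])
```

Here every drop lands in queue 1, which carries CoS 0 and 1, so that is the queue whose buffer and thresholds are worth looking at first.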

alexkosykh · ‎11-22-2012

We have the same problem on a 6509. IOS s72033-advipservicesk9_wan-mz.122-33.SXH2

Sylvain, did you solve the problem?

nkarpysh · ‎11-22-2012

Hello Gents,

There are a few possible reasons for this kind of problem:

- Pure oversubscription: several ports, or one higher-speed port, send traffic out of a single lower-speed port. The link can't transmit it all and starts to drop

- QoS tuning is not efficient

- The remote side is sending flow-control pause frames because it can't handle traffic that fast

- HW problem

- etc.

I would recommend starting with the first one. If you suspect drops, first understand what traffic is leaving that port and where it enters the switch. Check whether oversubscription is happening. Keep in mind the module architecture and its internal oversubscription limits. Check output drops on the interface with the "show int" command.

For the second point: if you suspect QoS, try disabling QoS globally during a maintenance window and see if that improves the situation. If it does, you can troubleshoot QoS further:

http://www.cisco.com/en/US/partner/products/hw/switches/ps708/products_tech_note09186a008074d6b1.shtml

3rd - check "show int" and see whether the pause counters are incrementing. If they are, look for the problem on the remote side.

4th - try moving the link to another port on the same LC, to a different ASIC on the same LC, and to a different LC, and note how the drops behave. You can make good decisions based on that.

Please don't hesitate to open a TAC case for this kind of problem so it can be verified in more detail. Each situation can be very different, so a common approach does not work well for all of them.
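To see whether the drop and pause counters are actually incrementing, capture "show int" twice and compare. A rough Python sketch of that comparison follows; the patterns match the counter lines as they appear in the output posted earlier in this thread, the two captures below are invented values for illustration, and the helper names are hypothetical.

```python
import re

# Counter lines as they appear in the "show interface" output earlier in this thread.
PATTERNS = {
    "output_drops": re.compile(r"Total output drops:\s*(\d+)"),
    "pause_input": re.compile(r"(\d+)\s+pause input"),
    "pause_output": re.compile(r"(\d+)\s+PAUSE output"),
}

def read_counters(show_int_text):
    """Pull the drop and pause counters out of one 'show interface' capture."""
    counters = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(show_int_text)
        if match:
            counters[name] = int(match.group(1))
    return counters

def counter_deltas(before, after):
    """Return how much each counter grew between two captures."""
    return {name: after[name] - before[name] for name in before if name in after}

# Hypothetical captures taken a few minutes apart (values invented for the example).
FIRST = """\
  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 5
     0 watchdog, 0 multicast, 0 pause input
     0 lost carrier, 0 no carrier, 0 PAUSE output
"""
SECOND = """\
  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 42
     0 watchdog, 0 multicast, 17 pause input
     0 lost carrier, 0 no carrier, 0 PAUSE output
"""

deltas = counter_deltas(read_counters(FIRST), read_counters(SECOND))
print(deltas)  # {'output_drops': 37, 'pause_input': 17, 'pause_output': 0}
```

In this made-up case the pause input counter grows together with the output drops, which would point at the remote side rather than at the line card.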

Nik

HTH,
Niko

Sylvain Deschenes · ‎11-23-2012

We solved our problem.

For us it seems to have been a hardware problem: we contacted TAC and they replaced it, no problem.

We have not experienced the problem since.

alexkosykh · ‎11-23-2012

Today we moved our links from ports 1-20 to ports 25-44. It works!!!

We tested the ports with only one link connected at a time, from a PC to the blade: we moved the cable from port to port and pinged from the PC to the Cisco and vice versa. On ports 1 to 24 we saw packet loss.

#ping 192.168.15.150

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 192.168.15.150, timeout is 2 seconds:

!.!!.

Success rate is 60 percent (3/5), round-trip min/avg/max = 1/2/4 ms

Ports 25 to 48 work fine, without loss.

I saw in tcpdump that all pings from the Cisco arrive at the PC, but the Cisco did not see the replies from the PC.

13:42:09.919744 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 0, length 80
13:42:09.919758 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 0, length 80
13:42:09.921342 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 1, length 80
13:42:09.921349 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 1, length 80
13:42:11.920571 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 2, length 80
13:42:11.920582 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 2, length 80
13:42:11.921051 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 3, length 80
13:42:11.921058 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 3, length 80
13:42:11.921456 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 4, length 80
13:42:11.921462 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 4, length 80
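For longer captures, pairing requests with replies by hand gets tedious. A small Python sketch that does the pairing is shown here; it assumes tcpdump's default one-line ICMP format as in the capture above, and the function name `unanswered` is made up for the example.

```python
import re

# Matches tcpdump's default one-line format for ICMP echo request/reply.
LINE = re.compile(
    r"IP (\S+) > (\S+): ICMP echo (request|reply), id (\d+), seq (\d+)"
)

def unanswered(capture):
    """Return echo requests (src, dst, id, seq) that have no matching reply."""
    requests, replies = set(), set()
    for line in capture.splitlines():
        match = LINE.search(line)
        if not match:
            continue
        src, dst, kind, icmp_id, seq = match.groups()
        if kind == "request":
            requests.add((src, dst, icmp_id, int(seq)))
        else:
            # Store the reply under the request's direction so the sets line up.
            replies.add((dst, src, icmp_id, int(seq)))
    return sorted(requests - replies)

# First two packets of the capture above, plus one request left unanswered
# on purpose (the seq 9 line is invented for the example).
CAPTURE = """\
13:42:09.919744 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 0, length 80
13:42:09.919758 IP 192.168.15.150 > 192.168.15.149: ICMP echo reply, id 90, seq 0, length 80
13:42:12.000000 IP 192.168.15.149 > 192.168.15.150: ICMP echo request, id 90, seq 9, length 80
"""

print(unanswered(CAPTURE))  # [('192.168.15.149', '192.168.15.150', '90', 9)]
```

Run against the real capture above it returns an empty list: the PC answered all five requests, so the lost packets are replies that went missing on the way back through ports 1 to 24.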

This is my test config

interface Vlan7
 description 6748 test
 ip address 192.168.15.149 255.255.255.252
end
interface GigabitEthernet9/23
 description test
 switchport
 switchport access vlan 7
 switchport mode access
 spanning-tree portfast
end
Mod Ports Card Type                              Model 
--- ----- -------------------------------------- ------------------
 1   16  SFM-capable 16 port 1000mb GBIC        WS-X6516-GBIC 
 2   16  SFM-capable 16 port 1000mb GBIC        WS-X6516-GBIC 
 3   16  SFM-capable 16 port 1000mb GBIC        WS-X6516-GBIC 
 4   16  16 port 1000mb MTRJ ethernet           WS-X6416-GE-MT 
 5    8  CEF720 8 port 10GE with DFC            WS-X6708-10GE 
 6    2  Supervisor Engine 720 (Active)         WS-SUP720-3B 
 7   24  24 port 100FX Multi mode               WS-X6324-100FX-MM 
 8   48  SFM-capable 48 port 10/100/1000mb RJ45 WS-X6548-GE-TX 
 9   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX

Why do ports 1 to 24 show packet loss?

nkarpysh · ‎11-26-2012

Hi Alexander,

The problem might be related to load on the ASICs serving ports 1-24. Some of the other links may already be carrying traffic at line speed. Oversubscription on this module is 1.2:1, meaning that 12 ports share a 10G ASIC. So if all of them send traffic at line rate, you will have drops.

Also, nothing rules out a bad port or NIC, so you can check whether moving the link to another port within the first 24 also solves the problem. That would then point to a HW problem on a single port or group of ports and their Rohini ASIC, or on the Janus ASIC for the group of 24 ports.

Nik

HTH,
Niko