07-19-2011 07:26 AM - edited 03-07-2019 01:16 AM
I've seen a number of documents on troubleshooting errors, but this one has me stumped.
I've got a 6509 with 6748 blades in slots 1 and 2. They are connected to a blade chassis (non-Cisco). Gi1/33 and Gi1/35, and Gi2/33 and Gi2/35, are both EtherChanneled using LACP (Po3 and Po4, respectively).
I am taking transmit errors (output errors, overruns, etc.) on Gi1/35 and Gi2/35. No other ports on the switch are showing errors. We tried replacing cables today without success, and the administrator for the chassis rebooted the chassis switches, also without success.
Po3 and Po4 are set on my end for 802.1q trunking and to allow VLANs 500 and 546. Native VLAN is 1 and the SVI is shut down (essentially not allowing untagged traffic to go anywhere). After replacing the cables and seeing that the cable labels didn't match the port programming, my question is: is it possible that a VLAN tag is not being accepted by the blade chassis and it's somehow signalling an error? It looks like Po4 was originally supposed to be VLAN 546 only. I'm guessing that Po3 was supposed to be VLAN 500 only; I don't know. The blade chassis is set up for 802.1q trunking and is passing VLAN tags. If it wasn't, we'd be having a bunch of servers down right now. My theory is that 546 traffic is being allowed on Po3 and the blade chassis is somehow rejecting it (and on the Po4 side, 500 traffic is being sent and the blade chassis is rejecting it).
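For reference, the trunk side of Po3 and Po4 is configured roughly like this (a sketch from memory, not a paste of the running config; the Po3 description and exact line order are illustrative):
interface Port-channel3
 description LACP w/Gi1/33, Gi1/35
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 1
 switchport trunk allowed vlan 500,546
 switchport mode trunk
!
interface Port-channel4
 description LACP w/Gi2/33, Gi2/35
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 1
 switchport trunk allowed vlan 500,546
 switchport mode trunk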
Transmit errors have always been something of an enigma to me, as I've always understood that on a full-duplex connection a frame gets put on the wire and it just goes. If these connections were on the same blade, I'd wonder if it was the blade, but they're not. They're across two blades, and they are the corresponding ports on each blade, which makes me believe there's definitely some kind of interaction between the 6500 and the blade chassis, and not a 6748 problem. But I'm open to other points of view. In terms of utilization, this switch isn't even breaking a sweat. Aggregate traffic across all blades (three 6748s, two Sup720s, and two 6148As) is, at best, 3 Gb transmit and 2 Gb receive at peak. I do know about the effects of the 6148As on the backplane and we're working to get rid of them, but we're still taking these errors at non-peak hours, so I don't think that's related.
Thanks,
Matt
07-19-2011 09:07 AM
Hello Matt,
When the remote-end device rejects or drops a frame with the wrong dot1q encapsulation, there is no signalling mechanism to inform the local device. The only mechanism we have to keep VLANs in sync between devices is VTP.
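If you want to double-check what each side actually allows, "show interfaces trunk" on the 6500 lists the allowed and forwarding VLANs per trunk. Illustrative output only (your VLAN list, sample formatting):
6509# show interfaces trunk
Port        Mode         Encapsulation  Status        Native vlan
Po3         on           802.1q         trunking      1
Po4         on           802.1q         trunking      1
Port        Vlans allowed on trunk
Po3         500,546
Po4         500,546
The blade chassis side would need to be checked with that vendor's equivalent command.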
By "transmit errors" I assume you are referring to "output errors" reported under "show interface X" command.
Please confirm.
6509# show interface po100
0 output errors, 0 collisions, 0 interface resets <<===
Did you have a chance to capture "show counters interface x/y"? This command provides all the hardware counters and will help us find further details.
For example:
6509#show counters interface te1/1
32 bit counters:
5. txCollisions = 0
11. txDelayExceededDiscards = 0
12. txCRC = 0
13. linkChange = 4
All Port Counters
13. XmitErr = 0
16. SingleCol = 0
17. MultiCol = 0
18. LateCol = 0
19. ExcessiveCol = 0
20. CarrierSense = 0
24. OutDiscards = 0
26. OutErrors = 0
28. txCRC = 0
31. WrongEncap = 0
46. Jabbers = 0
47. Collisions = 0
48. DelayExceededDiscards = 0
Unless we know which type of error is being reported as "output errors", it will be challenging to find the root cause. I do not think a VLAN tag mismatch would cause Tx errors on the 6500 side; rather, it would be reported as "InErrors", "WrongEncap", or similar on the 3rd-party device.
Regards,
Yogesh
07-19-2011 09:45 AM
Yeah, I was afraid you were going to say that. I appreciate the input. I agree -- I know of no mechanism on Ethernet that would allow the blade chassis to tell the 6500 that the tag was not being allowed.
Here's what Po4 looks like (I've cut things that are just informational):
DCRzH1#sh int po4
Port-channel4 is up, line protocol is up (connected)
Description: LACP w/Gi2/33, Gi2/35
MTU 1500 bytes, BW 2000000 Kbit, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Full-duplex, 1000Mb/s
input flow-control is off, output flow-control is off
Members in this channel: Gi2/33 Gi2/35
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 74981250
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 3572000 bits/sec, 503 packets/sec
5 minute output rate 7277000 bits/sec, 1410 packets/sec
26851861583 packets input, 30907881065759 bytes, 0 no buffer
Received 34749726 broadcasts (32237850 multicasts)
0 runts, 24 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 0 multicast, 0 pause input
0 input packets with dribble condition detected
15114451769 packets output, 8782295634353 bytes, 0 underruns
497456 output errors, 0 collisions, 1 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out
Here is the "show counters" output (All Port Counters) for Po4. Anything that is zero I've deleted.
All Port Counters
1. InPackets = 17095751146
2. InOctets = 16570219490228
3. InUcastPkts = 16718006789
4. InMcastPkts = 374684861
5. InBcastPkts = 3059155
6. OutPackets = 4835748483
7. OutOctets = 1565461816054
8. OutUcastPkts = 4120133145
9. OutMcastPkts = 638348680
10. OutBcastPkts = 77266658
12. FCSErr = 1
13. XmitErr = 497477
14. RcvErr = 20
21. Runts = 6
22. Giants = 33
23. InDiscards = 282456706
24. OutDiscards = 451911727
25. InErrors = 20
26. OutErrors = 497477
28. txCRC = 248749
29. TrunkFramesTx = 4397110438
30. TrunkFramesRx = 16720518234
35. rxTxHCPkts64Octets = 5443938905
36. rxTxHCPkts65to127Octets = 4274742469
37. rxTxHCPkts128to255Octets = 670188140
38. rxTxHCPkts256to511Octets = 77236322
39. rxTxHCpkts512to1023Octets = 79033724
40. rxTxHCpkts1024to1518Octets = 953916423
42. CRCAlignErrors = 1
50. qos0Outlost = 450916773
88. qos0Inlost = 282456686
105. Overruns = 282456686
We don't have QoS turned on for this switch.
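For the record, here's how I checked that (just the command; I'm not pasting the whole output, but the first line reports whether QoS is enabled or disabled globally):
DCRzH1#show mls qos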
It's possible there's a hardware failure going on here. Across two blades, I doubt it's the 6500. Then again, the errors are occurring across two separate switches in the blade chassis, so it's hard to point to a double hardware failure there.
In Po3, Gi1/33 and Gi1/35 are members. In Po4, Gi2/33 and Gi2/35 are members. Our monitoring doesn't show any other errors on the switch that would be causing this (i.e. no receive errors on the uplinks). We are also using this as a layer 3 standalone device -- all VLANs on the 6500 are isolated to this chassis. No VLANs are spanned elsewhere.
Po4 is running just under 100 Mb transmit and around 50 Mb receive as I type this. Neither direction has broken 100 Mb this morning, so it's not a bandwidth issue.
Thanks,
Matt
07-19-2011 02:13 PM
I'm starting to think I'm looking at a broadcast issue. I can see how broadcast responses could be filling transmit buffers. And indeed, there's some odd traffic, so I'm investigating that with our server admin. But still, if anyone else sees anything, I'd be grateful for the input.
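In the meantime, what I'm doing to keep an eye on it is re-running something along these lines and watching whether the broadcast/multicast and drop counters climb (example interfaces, adjust as needed):
DCRzH1#show interfaces po4 | include broadcasts|output drops
DCRzH1#show interfaces counters | include Gi2/33|Gi2/35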
Thanks,
Matt
07-21-2011 07:05 AM
Matt,
Output drops under "show interfaces" and the qos0Inlost/qos0Outlost counters indicate excessive/bursty traffic. Overruns indicate over-subscription and fabric flow-control occurring in the switch.
DCRzH1#sh int po4
Port-channel4 is up, line protocol is up (connected)
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 74981250
497456 output errors, 0 collisions, 1 interface resets
50. qos0Outlost = 450916773
88. qos0Inlost = 282456686
105. Overruns = 282456686
You are on the right track. The first thing I would do is find the source of the broadcast storm. You can also enable a GOLD test on the switch to find which module is getting oversubscribed and initiating flow-control.
Run the following commands under "config t":
diagnostic monitor module 5 test TestFabricFlowControlStatus
diagnostic monitor interval module 5 test TestFabricFlowControlStatus 00:00:00 100 0
The first command enables the test, and the second one sets the interval to 100 msec (which is recommended only for troubleshooting).
Here, module 5 is the active supervisor engine.
Once enabled, you can do "show diagnostic events" to find the results.
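Independent of the GOLD test, the fabric counters can also be checked directly with read-only show commands (no configuration needed), for example:
show fabric utilization all
show fabric errors
show fabric status
These are broken down per module/fabric channel, which helps narrow down which slot is involved.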
Hope it helps.
Regards,
Yogesh
07-21-2011 02:08 PM
Forgive my paranoia, but these commands are nondisruptive, correct?
This is a switch in our data center, so before I start running diagnostics on it I want to make sure that it's not going to cause any traffic disruptions.
I already hooked up my laptop on a trunk link for these and saw a number of multicasts to an unknown MAC (HP blade chassis heartbeats), and also saw broadcast UDP traffic to port 16464 (source IP was the server, destination 255.255.255.255, both source and destination UDP ports were 16464). Not sure what that's for. There were a few servers doing the UDP broadcasts, so I already sent that to our server team for investigation.
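For anyone wanting to reproduce the capture, a local SPAN session mirroring one of the member ports to the port the laptop is on would look something like this (port numbers are illustrative, not my exact setup):
monitor session 1 source interface Gi2/33 both
monitor session 1 destination interface Gi1/48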
Thanks,
Matt
07-22-2011 11:54 AM
Matt,
These commands are non-disruptive. Having the test interval at 100 msec is aggressive, so I would recommend running it for a short time (5-10 minutes) and then disabling it.
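To disable it afterwards, negate the monitor command under "config t", e.g.:
no diagnostic monitor module 5 test TestFabricFlowControlStatus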
Yogesh