07-19-2011 07:26 AM - edited 03-07-2019 01:16 AM
I've seen a number of documents on troubleshooting errors, but this one has me stumped.
I've got a 6509 with 6748 blades in slots 1 and 2. They are connected to a blade chassis (non-Cisco). Gi1/33 and Gi1/35, and Gi2/33 and Gi2/35, are both EtherChanneled using LACP (Po3 and Po4, respectively).
I am taking transmit errors (output errors, overruns, etc.) on Gi1/35 and Gi2/35. No other ports on the switch are showing errors. We tried replacing cables today without success, and the administrator for the chassis rebooted the chassis switches, also without success.
Po3 and Po4 are set on my end for 802.1q trunking and to allow VLANs 500 and 546. Native VLAN is 1 and the SVI is shut down (essentially not allowing untagged traffic to go anywhere). After replacing the cables and seeing that the cable labels didn't match the port programming, my question is: is it possible that a VLAN tag is not being accepted by the blade chassis and it's somehow signalling an error? It looks like Po4 was originally supposed to be VLAN 546 only. I'm guessing that Po3 was supposed to be VLAN 500 only; I don't know. The blade chassis is set up for 802.1q trunking and is passing VLAN tags. If it wasn't, we'd be having a bunch of servers down right now. My theory is that 546 traffic is being allowed on Po3 and the blade chassis is somehow rejecting it (and on the Po4 side, 500 traffic is being sent and the blade chassis is rejecting it).
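For reference, the trunk side of Po3 and Po4 is configured roughly like this (a sketch from memory, not a paste of the running config; the Po3 description and exact line order are illustrative):
interface Port-channel3
 description LACP w/Gi1/33, Gi1/35
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 1
 switchport trunk allowed vlan 500,546
 switchport mode trunk
!
interface Port-channel4
 description LACP w/Gi2/33, Gi2/35
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 1
 switchport trunk allowed vlan 500,546
 switchport mode trunk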
Transmit errors have always been something of an enigma to me, as I've always understood that on a full-duplex connection a frame gets put on the wire and it just goes. If these connections were on the same blade, I'd wonder if it was the blade, but they're not. They're across two blades, and they are the corresponding ports on each blade, which makes me believe there's definitely some kind of interaction between the 6500 and the blade chassis, and not a 6748 problem. But I'm open to other points of view. In terms of utilization, this switch isn't even breaking a sweat. Aggregate traffic across all blades (three 6748s, two Sup720s, and two 6148As) is, at best, 3 Gb transmit and 2 Gb receive at peak. I do know about the effects of the 6148As on the backplane and we're working to get rid of them, but we're still taking these errors at non-peak hours, so I don't think that's related.
Thanks,
Matt
07-19-2011 09:07 AM
Hello Matt,
When the remote-end device rejects or drops a frame with the wrong dot1q encapsulation, there is no signalling mechanism to inform the local device. The only mechanism we have to keep VLANs in sync between devices is VTP.
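If you want to double-check what each side actually allows, "show interfaces trunk" on the 6500 lists the allowed and forwarding VLANs per trunk. Illustrative output only (your VLAN list, sample formatting):
6509# show interfaces trunk
Port        Mode         Encapsulation  Status        Native vlan
Po3         on           802.1q         trunking      1
Po4         on           802.1q         trunking      1
Port        Vlans allowed on trunk
Po3         500,546
Po4         500,546
The blade chassis side would need to be checked with that vendor's equivalent command.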
By "transmit errors" I assume you are referring to "output errors" reported under "show interface X" command.
Please confirm.
6509# show interface po100
0 output errors, 0 collisions, 0 interface resets <<===
Did you have a chance to capture "show counters interface x/y"? This command provides all the hardware counters and will help us find further details.
For example:
6509#show counters interface te1/1
32 bit counters:
5. txCollisions = 0
11. txDelayExceededDiscards = 0
12. txCRC = 0
13. linkChange = 4
All Port Counters
13. XmitErr = 0
16. SingleCol = 0
17. MultiCol = 0
18. LateCol = 0
19. ExcessiveCol = 0
20. CarrierSense = 0
24. OutDiscards = 0
26. OutErrors = 0
28. txCRC = 0
31. WrongEncap = 0
46. Jabbers = 0
47. Collisions = 0
48. DelayExceededDiscards = 0
Unless we know which type of error is being reported as "output errors", it will be challenging to find the root cause. I do not think a VLAN tag mismatch would cause Tx errors on the 6500 side; rather, it would be reported as "InErrors", "WrongEncap", or similar on the 3rd-party device.
Regards,
Yogesh
07-19-2011 09:45 AM
Yeah, I was afraid you were going to say that. I appreciate the input. I agree -- I know of no mechanism on Ethernet that would allow the blade chassis to tell the 6500 that the tag was not being allowed.
Here's what Po4 looks like (I've cut things that are just informational):
DCRzH1#sh int po4
Port-channel4 is up, line protocol is up (connected)
Description: LACP w/Gi2/33, Gi2/35
MTU 1500 bytes, BW 2000000 Kbit, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Full-duplex, 1000Mb/s
input flow-control is off, output flow-control is off
Members in this channel: Gi2/33 Gi2/35
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 74981250
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 3572000 bits/sec, 503 packets/sec
5 minute output rate 7277000 bits/sec, 1410 packets/sec
26851861583 packets input, 30907881065759 bytes, 0 no buffer
Received 34749726 broadcasts (32237850 multicasts)
0 runts, 24 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 0 multicast, 0 pause input
0 input packets with dribble condition detected
15114451769 packets output, 8782295634353 bytes, 0 underruns
497456 output errors, 0 collisions, 1 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out
Here is the "show counters" output (All Port Counters) for Po4. Anything that is zero I've deleted.
All Port Counters
1. InPackets = 17095751146
2. InOctets = 16570219490228
3. InUcastPkts = 16718006789
4. InMcastPkts = 374684861
5. InBcastPkts = 3059155
6. OutPackets = 4835748483
7. OutOctets = 1565461816054
8. OutUcastPkts = 4120133145
9. OutMcastPkts = 638348680
10. OutBcastPkts = 77266658
12. FCSErr = 1
13. XmitErr = 497477
14. RcvErr = 20
21. Runts = 6
22. Giants = 33
23. InDiscards = 282456706
24. OutDiscards = 451911727
25. InErrors = 20
26. OutErrors = 497477
28. txCRC = 248749
29. TrunkFramesTx = 4397110438
30. TrunkFramesRx = 16720518234
35. rxTxHCPkts64Octets = 5443938905
36. rxTxHCPkts65to127Octets = 4274742469
37. rxTxHCPkts128to255Octets = 670188140
38. rxTxHCPkts256to511Octets = 77236322
39. rxTxHCpkts512to1023Octets = 79033724
40. rxTxHCpkts1024to1518Octets = 953916423
42. CRCAlignErrors = 1
50. qos0Outlost = 450916773
88. qos0Inlost = 282456686
105. Overruns = 282456686
We don't have QoS turned on for this switch.
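For the record, here's how I checked that (just the command; I'm not pasting the whole output, but the first line reports whether QoS is enabled or disabled globally):
DCRzH1#show mls qos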
It's possible there's a hardware failure going on here. Across two blades, I doubt it's the 6500. Then again, the errors are occurring across two separate switches in the blade chassis, so it's hard to point to a double hardware failure there.
In Po3, Gi1/33 and Gi1/35 are members. In Po4, Gi2/33 and Gi2/35 are members. Our monitoring doesn't show any other errors on the switch that would be causing this (i.e. no receive errors on the uplinks). We are also using this as a layer 3 standalone device -- all VLANs on the 6500 are isolated to this chassis. No VLANs are spanned elsewhere.
Po4 is running just under 100 Mb transmit and around 50 Mb receive as I type this. Neither direction has broken 100 Mb this morning, so it's not a bandwidth issue.
Thanks,
Matt
07-19-2011 02:13 PM
I'm starting to think I'm looking at a broadcast issue. I can see how broadcast responses could be filling transmit buffers. And indeed, there's some odd traffic, so I'm investigating that with our server admin. But still, if anyone else sees anything, I'd be grateful for the input.
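In the meantime, what I'm doing to keep an eye on it is re-running something along these lines and watching whether the broadcast/multicast and drop counters climb (example interfaces, adjust as needed):
DCRzH1#show interfaces po4 | include broadcasts|output drops
DCRzH1#show interfaces counters | include Gi2/33|Gi2/35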
Thanks,
Matt
07-21-2011 07:05 AM
Matt,
Output drops under "show interfaces" and the qos0Inlost/qos0Outlost counters indicate excessive/bursty traffic. Overruns indicate over-subscription and fabric flow-control occurring in the switch.
DCRzH1#sh int po4
Port-channel4 is up, line protocol is up (connected)
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 74981250
497456 output errors, 0 collisions, 1 interface resets
50. qos0Outlost = 450916773
88. qos0Inlost = 282456686
105. Overruns = 282456686
You are on the right track. The first thing I would do is find the source of the broadcast storm. You can also enable a GOLD test on the switch to find which module is getting oversubscribed and initiating flow-control.
Run the following commands under "config t":
diagnostic monitor module 5 test TestFabricFlowControlStatus
diagnostic monitor interval module 5 test TestFabricFlowControlStatus 00:00:00 100 0
The first command enables the test, and the second one sets the interval to 100 msec (which is recommended only for troubleshooting).
Here, module 5 is the active supervisor engine.
Once enabled, you can do "show diagnostic events" to find the results.
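Independent of the GOLD test, the fabric counters can also be checked directly with read-only show commands (no configuration needed), for example:
show fabric utilization all
show fabric errors
show fabric status
These are broken down per module/fabric channel, which helps narrow down which slot is involved.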
Hope it helps.
Regards,
Yogesh
07-21-2011 02:08 PM
Forgive my paranoia, but these commands are nondisruptive, correct?
This is a switch in our data center, so before I start running diagnostics on it I want to make sure that it's not going to cause any traffic disruptions.
I already hooked up my laptop on a trunk link for these and saw a number of multicasts to an unknown MAC (HP blade chassis heartbeats), and also saw broadcast UDP traffic to port 16464 (source IP was the server, destination 255.255.255.255, both source and destination UDP ports were 16464). Not sure what that's for. There were a few servers doing the UDP broadcasts, so I already sent that to our server team for investigation.
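For anyone wanting to reproduce the capture, a local SPAN session mirroring one of the member ports to the port the laptop is on would look something like this (port numbers are illustrative, not my exact setup):
monitor session 1 source interface Gi2/33 both
monitor session 1 destination interface Gi1/48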
Thanks,
Matt
07-22-2011 11:54 AM
Matt,
These commands are non-disruptive. Having the test interval at 100 msec is aggressive, so I would recommend running it for a short time (5-10 minutes) and then disabling it.
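To disable it afterwards, negate the monitor command under "config t", e.g.:
no diagnostic monitor module 5 test TestFabricFlowControlStatus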
Yogesh