6500 switching delay problem

emailsbecker · ‎06-10-2011

Hi all,

I'm drawing a blank on this problem and can't even think of what keywords to use to narrow my search for a solution. If anyone can think of keywords that apply, please post them so other people can find this thread if they need it.

We have a 6509 that has two servers directly connected, we'll call them Servers A and B. Both servers are on Vlan 59. The 6509 uplinks to a 6513 which has an interface vlan 59 with a gateway configured. Since both servers are on the same vlan though, communication should stay at layer 2 and be done completely on the 6509. Each night a script runs and when complete it transfers a file from Server A to Server B. Some nights the script runs properly, some nights it doesn't. Someone on the server team found that pinging from Server A to B was failing, but when he ran an extended ping eventually the connection comes up, and then goes back down a few minutes after the script is finished. It seems to take 40-100 pings before the connection comes up.

Before I understood the topology, my first hunch was that traffic was being routed over a backup POTS line and the dropped pings were happening while the modem was dialing. Now that I've traced everything out, I see the servers are directly connected to the same switch and there is no dialer config on it. My next thought was that either the servers weren't communicating very much and the ARP tables were being cleared, or the server is going to sleep and is configured to wake on LAN (entries in the ARP table are dropped after 4 hours by default, and this 6509's ARP timeout has not been modified). The server team denies this, stating the device sees regular traffic and never goes to sleep.

I've confirmed via sh int that traffic is incrementing regularly during the day. There may still be a possibility that after 5pm traffic dies down enough for the server's ARP entry to time out, but my hunch is that this isn't the case, as this company runs a 24/7 operation. At 10am during normal daily traffic flow I was able to recreate the problem by configuring a vlan interface on the 6509 with an available address in Vlan59 and ran an extended ping, it dropped the first 51 packets and then the rest were replied to. An hour later all pings are still successful.
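To compare runs from different nights, it can help to reduce each extended ping to a few numbers. A minimal Python sketch, assuming the ping output has been saved as the usual IOS string of `!` (reply) and `.` (timeout) marks; the function names are illustrative and not part of any tool:

```python
def parse_ios_ping_marks(marks):
    """Turn an IOS-style ping string into booleans: '!' = reply, '.' = timeout.

    Any other character (spaces, line breaks) is ignored.
    """
    return [ch == "!" for ch in marks if ch in "!."]


def summarize_ping_run(results):
    """Summarize a run of ping results given in send order (True = reply)."""
    results = list(results)

    # Count the timeouts that occur before the first reply.
    leading_losses = 0
    for ok in results:
        if ok:
            break
        leading_losses += 1

    # Find the longest run of consecutive timeouts anywhere in the run.
    longest = current = 0
    for ok in results:
        current = 0 if ok else current + 1
        longest = max(longest, current)

    return {
        "sent": len(results),
        "received": sum(results),
        "leading_losses": leading_losses,
        "longest_loss_run": longest,
    }


# A run shaped like the one described above: 51 timeouts, then replies.
summary = summarize_ping_run(parse_ios_ping_marks("." * 51 + "!" * 49))
```

If the leading loss count is similar every night, that points toward a fixed timer somewhere rather than random loss.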

I'm stumped. Any ideas?

emailsbecker · ‎06-10-2011

Spoke with a friend who reminded me that if configured properly the arp table will never be used because the communication will stay on Layer 2 and never need to be resolved to an IP address. Although now that I think about it, I suppose I need to confirm how the script is written: does it try to connect to Server B by IP or by hostname? If it connects by hostname, then DNS may be part of the problem. I've confirmed that neither server has teamed NICs (we've had issues in the last few weeks with a different server that did). Anyone else have other ideas?
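One way to check the hostname question would be to time name resolution on Server A around the time the script runs. A rough Python sketch using only the standard library; `timed_resolve` is an illustrative name, and the hostname you pass in would be whatever the script actually uses:

```python
import socket
import time


def timed_resolve(name, port=None):
    """Resolve a name and report (addresses, seconds_taken, error_text)."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
        # Each entry's sockaddr starts with the address string.
        addrs = sorted({info[4][0] for info in infos})
        error = None
    except socket.gaierror as exc:
        addrs, error = [], str(exc)
    return addrs, time.monotonic() - start, error


# A literal IP needs no DNS lookup, so it serves as a baseline.
addrs, elapsed, error = timed_resolve("127.0.0.1")
```

A slow or failing lookup for the real hostname, alongside a fast result for the literal IP, would suggest looking at DNS before the switch.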

Peter Paluch · ‎06-11-2011

Hi,

You have mentioned that there is no NIC teaming configured on the servers. However, do the servers have multiple NICs connected to the (potentially same) network?

When you performed the extended ping, what was the device you sent the pings from?

What is the STP protocol version in use? If RSTP or MSTP, are all edge access and trunk ports duly configured as portfast-enabled ports? A topology change in RSTP or MSTP results in non-edge designated ports being blocked for up to 30 seconds.
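For reference, the 30-second figure is twice the default forward delay. A small Python sketch of the timer arithmetic, assuming default spanning-tree timers (max age 20 s, forward delay 15 s) and assuming the pings used a 2-second timeout, which is the IOS default but may not match the server's ping; the function names are illustrative:

```python
MAX_AGE_S = 20        # default max age
FORWARD_DELAY_S = 15  # default forward delay


def port_transition_s(forward_delay=FORWARD_DELAY_S):
    """Time a port spends in listening plus learning before forwarding."""
    return 2 * forward_delay


def indirect_failure_s(max_age=MAX_AGE_S, forward_delay=FORWARD_DELAY_S):
    """Worst case when stale information must first age out (max age)."""
    return max_age + 2 * forward_delay


def outage_from_pings(lost, timeout_s):
    """Approximate outage length from consecutive lost pings."""
    return lost * timeout_s


observed = outage_from_pings(51, 2)  # 51 lost pings at an assumed 2 s timeout
```

Under those assumptions, 51 lost pings is roughly 102 seconds, which is longer than either the 30-second or the 50-second figure, so a single reconvergence might not account for the whole outage.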

Best regards,

Peter

Joseph W. Doherty · ‎06-10-2011

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

You might find something helpful in:

http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a00807347ab.shtml

Other 6xxx troubleshooting technotes found here:

http://www.cisco.com/en/US/products/hw/switches/ps700/prod_tech_notes_list.html

You might also want to describe what's in the 6509 (i.e. show module), software being used (show version) and what ports these two servers are connected to.

rsimoni · ‎06-11-2011

I guess the answer lies in what you wrote here:

"At 10am during normal daily traffic flow I was able to recreate the problem by configuring a vlan interface on the 6509 with an available address in Vlan59 and ran an extended ping, it dropped the first 51 packets and then the rest were replied to. An hour later all pings are still successful."

This is an indication that the server is not answering. If you see the same symptom again, quickly check the MAC table to see whether the server's MAC address is correctly learned on the physical port it is connected to and, if the entry is present, whether its timer gets refreshed over time.

If there is no entry (or if the timer is not refreshed), the switch is not seeing any frames coming back from the server, which means that the server is not replying.
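If the MAC table output is being saved to a file during the outage, the check can be scripted. A rough Python sketch, assuming the output has one entry per line with the MAC in dotted Cisco notation and the port as the last column. The sample text, MAC addresses, and port names below are made up for illustration, and real output layout varies by platform and software version:

```python
import re

# Dotted Cisco MAC notation, e.g. 0011.2233.4455
MAC_RE = re.compile(r"\b[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}\b", re.IGNORECASE)


def find_mac_port(table_text, mac):
    """Return the last field of the line containing `mac`, or None if absent."""
    wanted = mac.lower()
    for line in table_text.splitlines():
        match = MAC_RE.search(line)
        if match and match.group(0).lower() == wanted:
            return line.split()[-1]
    return None


# Hand-written sample, NOT real switch output.
SAMPLE = """\
  vlan   mac address     type    learn     age              ports
*   59  0011.2233.4455   dynamic  Yes          5   Gi4/12
*   59  0011.2233.6677   dynamic  Yes         10   Gi9/7
"""

port = find_mac_port(SAMPLE, "0011.2233.4455")
missing = find_mac_port(SAMPLE, "aaaa.bbbb.cccc")
```

A result of `None`, or a port other than the one the server is patched into, is the condition worth capturing for management.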

I don't think you have an issue on your switch, or else you would see similar symptoms for other devices on the same vlan.

Next time be ready to check the MAC table, and also configure a SPAN session to monitor that port so you can be sure that traffic is sent and received on the port connecting the server.

Riccardo

emailsbecker · ‎06-13-2011

Peter - The servers only have 1 NIC each. The server team sourced the extended ping from Server A to Server B. STP version is IEEE.

Joseph - Thanks, I'll look at the links in a minute. Server A is on module 9, Server B is on module 4:

Mod  Ports  Card Type                               Model
  4     48  48 port 10/100/1000mb EtherModule       WS-X6148-GE-45AF
  9     48  SFM-capable 48 port 10/100/1000mb RJ45  WS-X6548-GE-TX

Riccardo - As a contractor I'm not here at 2am when the script runs. I'll see if we can make arrangements for someone on the network team to be here to look at the switch as it's happening. To keep the Wireshark capture log size down I restricted the capture by filtering on the IP of Server B. For some reason there was not a single packet in the log sourced from the IP of Server A (even though the script ran and the file transferred successfully). I'm going to set up another Wireshark session tonight without any filters and capture everything to see if I can get more info.

I also believe the issue is with the server, not the network, but would like to be able to hand our management something solid that shows this. Otherwise it's just the network team and server team pointing fingers at each other.

rsimoni · ‎06-14-2011

Good idea. First try a small capture without filters to be sure that you get traffic from Server A, then you can quickly test a capture filtering on the source address.

Also don't forget that in Wireshark you can use circular buffers; if you use them in conjunction with (working) filters you can leave the capture running overnight and then the next day check only the file captured at the time of the outage.
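As one possible way to set this up from the command line, dumpcap (the capture tool shipped with Wireshark) takes ring-buffer options via `-b`. A minimal Python sketch that only builds the command line; the interface name, output path, and IP address are hypothetical placeholders, and the file size and count are arbitrary starting values:

```python
def ring_buffer_capture_cmd(interface, out_file, host=None,
                            file_kb=102400, files=20):
    """Build a dumpcap command that rotates through a fixed set of files.

    -b filesize:N  start a new file after N kB
    -b files:N     keep at most N files, overwriting the oldest
    -f "host X"    capture filter matching X as source OR destination
    """
    cmd = [
        "dumpcap", "-i", interface,
        "-b", f"filesize:{file_kb}",
        "-b", f"files:{files}",
        "-w", out_file,
    ]
    if host:
        cmd += ["-f", f"host {host}"]
    return cmd


# Hypothetical values for illustration only.
cmd = ring_buffer_capture_cmd("eth0", "vlan59.pcapng", host="192.0.2.10")
```

The list can be passed to `subprocess.run(cmd)` on the capture machine. Leaving `host` unset gives the unfiltered capture suggested above as a first test.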

I know well the never-ending game between network and server teams.

Usually the network team is right though!

regards,

Riccardo