Is there any built-in function on the Nexus 5010 to detect intermittent link down?

DennisLee1 · ‎11-24-2011

Hi experts,

I found that intermittent link down events (20~40 seconds on average) occur about 1~10 times every month. SAP reports that many active connections get disconnected, and when I ran a batch ping script I saw "Request timed out" for about 30 seconds.

Windows, SQL Server, and the Nexus 5010 do not show any errors. We run a cluster, and the cluster does not fail over.

I don't know which cables or NICs cause this issue. When it happens, almost all servers are unreachable, for example SQL server 1 -> SQL server 2, or IBM HS22-1 -> SQL server 1. However, some connections are sometimes not dropped. It varies each time.

PS: I ran this topology last year without any problems, but the intermittent link down started on 2011/1/7. Because there are no errors on the Nexus 5010, it is difficult to troubleshoot. Yesterday Cisco TAC recommended that we implement virtual port channel (vPC).

Could I use "errdisable detect cause" to detect what caused the intermittent link down? Are there any error logs or switch parameters/status that I can use to troubleshoot?

andrew.prince · ‎11-25-2011

What other devices connect to the n5k?

Sent from Cisco Technical Support iPad App

DennisLee1 · ‎11-26-2011

Hi Andrew,

IBM HS22 with Nortel BNT 6-port switches, and HP DL980 with NC550SFP.

Alexander Maroukian · ‎11-26-2011

Hi Dennis,

What does the log of the NMS tool you use for this network show at the moment when the connections are down?

Best regards,

Alex

DennisLee1 · ‎11-27-2011

Hi Alexander,

Our network team used network monitoring tools (I am not sure if they are the NMS) and observed the same pattern. Many servers could not be reached via ping at that moment. However, no one knows why. MAC address flapping? Defective ports or links?

andrew.prince · ‎11-27-2011

Are you running vPC?


DennisLee1 · ‎11-27-2011

Hi Andrew,

Not yet. Could vPC solve this problem? We will change the port channel from PACP to LACP on the DB connection and use vPC in Dept 10. Any thoughts?

Alexander Maroukian · ‎11-28-2011

I did not understand: is ping to your network devices OK at the moment of the interruption? You said that pings time out from server to server and from your monitoring tools. Where do the monitoring tools reside? Which connections do they use? Are there any firewalls or multicast in the network? If you have link flapping, you should see it in the Nexus log. At the moment of the problem, are you able to ping the nearest switch from the server whose sessions were disconnected? Are the servers in one VLAN? These answers could really help in locating the problem.

Best regards,

Alex

DennisLee1 · ‎11-28-2011

Hi Alex,

I did not understand: is ping to your network devices OK at the moment of the interruption? You said that pings time out from server to server and from your monitoring tools. Where do the monitoring tools reside?

>The monitoring tool is located outside of the broadcast domain. The Catalyst 6513s are connected to many server-farm switches, and the monitoring tool runs on one of those server-farm switches (not shown in this graph).

Which connections do they use? Are there any firewalls or multicast in the network?

>No, no firewall or multicast.

If you have link flapping, you should see it in the Nexus log. At the moment of the problem, are you able to ping the nearest switch from the server whose sessions were disconnected?

>No, the whole broadcast domain is affected. However, sometimes 100% of the connections are disconnected, and sometimes only 80~90%.

Are the servers in one VLAN? These answers could really help in locating the problem.

>Yes, one VLAN.

DennisLee1 · ‎11-30-2011

Hi Alex/Andrew,

Here is my new strategy.

I am not sure how to troubleshoot using Wireshark/Microsoft Network Monitor, so I ran some tests in my test environment. Please check whether I did it correctly.

1. I installed Network Monitor on the DB and APP servers and had them ping each other continuously.

2. I filtered on IP and protocol ( IPv4.Address == 192.168.28.99 AND IPv4.Address == 192.168.28.109 ) (ICMP, port 7)

3. I recorded the times when the link was down.

4. I checked the .cap files on both sides and found that both sides only sent requests and neither received any replies. My conclusion is that the servers, services, NIC drivers, teaming drivers, and OS settings on both sides are OK, and that the source of the problem is a device in the middle (switches or links).
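The request/reply check in step 4 can be sketched in Python as below. This is only an illustration: the `IcmpRecord` format is hypothetical (it assumes the capture has been exported to rows holding a timestamp, a request/reply flag, and the ICMP identifier and sequence number), and the 2-second timeout is an arbitrary choice.

```python
from typing import NamedTuple


class IcmpRecord(NamedTuple):
    """One exported capture row (hypothetical format, not a Network Monitor API)."""
    time: float   # capture timestamp in seconds
    kind: str     # "request" or "reply"
    ident: int    # ICMP identifier
    seq: int      # ICMP sequence number


def unanswered_requests(records, timeout=2.0):
    """Return the echo requests that have no matching reply within `timeout` seconds."""
    # Index reply timestamps by (identifier, sequence number).
    replies = {}
    for rec in records:
        if rec.kind == "reply":
            replies.setdefault((rec.ident, rec.seq), []).append(rec.time)

    missing = []
    for rec in records:
        if rec.kind != "request":
            continue
        reply_times = replies.get((rec.ident, rec.seq), [])
        if not any(0 <= t - rec.time <= timeout for t in reply_times):
            missing.append(rec)
    return missing


# Made-up sample: requests 2 and 3 never get a reply.
sample = [
    IcmpRecord(0.0, "request", 1, 1),
    IcmpRecord(0.001, "reply", 1, 1),
    IcmpRecord(1.0, "request", 1, 2),
    IcmpRecord(2.0, "request", 1, 3),
    IcmpRecord(3.0, "request", 1, 4),
    IcmpRecord(3.001, "reply", 1, 4),
]
lost = unanswered_requests(sample)
print([rec.seq for rec in lost])  # [2, 3]
```

Running the same check on the capture from each side shows whether the requests left one host and never arrived at the other, or arrived but the replies were lost on the way back.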

Alexander Maroukian · ‎12-01-2011

I do not see the connection to the switch and how it reacts. For example, ping from SQL server 1 to Nexus 1, from SQL server 1 to Nexus 2, and from SQL server 1 to SQL server 2, all simultaneously. This is one way to see where the problem is. If ping is OK from SQL server 1 to Nexus 1 but fails to Nexus 2, then we can look for the problem there. If you can ping Nexus 2 but cannot ping SQL server 2, then the connection between Nexus 2 and SQL server 2 should be verified, and we can focus on Nexus 2 and SQL server 2 only.
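The reasoning above can be written down as a small Python sketch. It assumes a simple linear path from the source server and that all pings were taken at the same moment; the hop names and the `localize_fault` helper are illustrative only, not part of any tool.

```python
def localize_fault(source, path, reachable):
    """Return the (last_good, first_bad) pair of hops that brackets the fault.

    source:    name of the host the pings are sent from
    path:      hop names in order, nearest hop first
    reachable: dict mapping hop name -> True if ping succeeded
    Returns None when every hop answered.
    """
    previous = source
    for hop in path:
        if not reachable.get(hop, False):
            # The first hop that does not answer marks the suspect segment.
            return (previous, hop)
        previous = hop
    return None


path = ["nexus-1", "nexus-2", "sql-server-2"]

# Ping is OK to Nexus 1 but fails to Nexus 2: look between the two switches.
print(localize_fault("sql-server-1", path,
                     {"nexus-1": True, "nexus-2": False, "sql-server-2": False}))

# Nexus 2 answers but SQL server 2 does not: focus on that last link.
print(localize_fault("sql-server-1", path,
                     {"nexus-1": True, "nexus-2": True, "sql-server-2": False}))
```

The result only narrows the search to one segment. It does not say whether the switch, the cable, or the NIC on that segment is at fault.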

Best regards,

Alex

DennisLee1 · ‎12-01-2011

Hi Alex,

I revised my strategy after combining your recommendations with opinions from the Microsoft forum. If you have any other ideas, please let me know.

http://social.msdn.microsoft.com/Forums/en-US/sqldataaccess/thread/415ba445-c227-4bf2-9c00-25bd3ed114bf

1. I install Network Monitor on SQL server 1, SQL server 2, APP server 1, and APP server 2, and have them continuously ping each other together with the 6 middle devices (5010 * 2 plus Nortel BNT 6-port * 4).

2. I filter on protocol (ICMP) only, which minimizes overhead.

3. I record the times when the link is down (both in SAP and in my ping log file).

4. I check the .cap files on every host. If I find that a node only sends requests and never gets any replies, the conclusion is that the servers, services, NIC drivers, teaming drivers, and OS settings are OK, and the source of the problem is the middle devices (switches or links).
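The ping log in steps 1 and 3 could be produced by a script along these lines. It is a rough sketch, not a tested monitoring tool: the target names and addresses are placeholders, `ping_once` relies on the exit code of the system `ping` command (which may not be reliable in every case, for example when a gateway answers with "destination unreachable"), and the non-Windows branch assumes Linux-style `ping` options.

```python
import platform
import subprocess
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Send one echo request with the system ping command; True if it exits with 0."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]  # -w is in ms
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]  # assumes Linux ping
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def poll(targets, probe=ping_once, now=datetime.now):
    """Probe all targets at the same time; return one timestamped line per target."""
    names = list(targets)
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        results = list(pool.map(probe, [targets[name] for name in names]))
    stamp = now().isoformat(timespec="seconds")
    return [
        f"{stamp} {name} {targets[name]} {'ok' if up else 'TIMEOUT'}"
        for name, up in zip(names, results)
    ]


# Demo with a fake probe so that nothing is sent on the network here.
# Names and addresses below are placeholders, not the real devices.
demo_targets = {"nexus-1": "192.0.2.1", "nexus-2": "192.0.2.2"}
for line in poll(demo_targets, probe=lambda host: host != "192.0.2.2"):
    print(line)
```

For real use, `poll(targets)` would be called in a loop with a short sleep and the lines appended to a file, so the TIMEOUT entries can later be compared with the SAP disconnect times and the .cap files.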

Alexander Maroukian · ‎12-02-2011

OK, Dennis, let's see what the capture shows when the link is down. If you have a syslog server, the logs of the Nexus and the other network devices should be checked for the same moments.

Best regards,

Alex

DennisLee1 · ‎12-12-2011

Hi Alex,

Sorry, I am back. I found the intermittent link down again.

The intermittent link down issue has become worse. Please check netmoncap.zip on my FTP site to see if you can find out what is going on, thanks. I will check later, but I am an SAP system administrator and not a networking expert. I would appreciate your assistance.

ftp://ftp01.quantatw.com/

user: sapftp password: wju123

Times when the intermittent link down happened on 2011/12/13:

1:02pm

1:04

1:06

1:13

1:18

1:24

1:30

1:34

topology:

dl980-1 => nexus-1 => nexus-2 => dl980-2

tccap36 => nortel-1 or 2 => nexus-1 or 2 => dl980-1

ip list:

dl980-1(active DB): 192.168.28.11

dl980-2(passive DB): 192.168.28.12

tccap36(APP server 1): 192.168.28.110

tccap40(APP server 1): 192.168.28.115

Nexus 5010 ip: 192.168.28.251 192.168.28.252

Nortel ip: 192.168.28.25~28
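As a quick check for a regular pattern (for example a fixed timer or a scheduled job), the gaps between the outage times listed above can be computed with a few lines of Python. This assumes all eight times are in the afternoon of the same day, which is how the list reads.

```python
from datetime import datetime

# Outage times from the list above, written in 24-hour form (assumed all pm).
outages = ["13:02", "13:04", "13:06", "13:13", "13:18", "13:24", "13:30", "13:34"]


def gaps_in_minutes(times):
    """Return the minutes between each pair of consecutive times."""
    stamps = [datetime.strptime(t, "%H:%M") for t in times]
    return [int((later - earlier).total_seconds() // 60)
            for earlier, later in zip(stamps, stamps[1:])]


print(gaps_in_minutes(outages))  # [2, 2, 7, 5, 6, 6, 4]
```

The gaps range from 2 to 7 minutes, so this sample alone does not show a fixed interval.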

DennisLee1 · ‎12-13-2011

Hi Alex,

Not all connections were broken when the intermittent link down occurred, so I find it complicated to identify the source of the problem. Should I combine Wireshark with port mirroring? Because we use port aggregation, only Rx could be received, right? Are there any Cisco documents that describe in detail how to troubleshoot a problem like this? The official Wireshark guide? Or >show tech-support? Any information will be appreciated.