High latency observed while pinging servers

Nirmalbang · ‎10-05-2019

Hi All,

Yesterday evening I faced a strange issue with my Nutanix infrastructure. I am sharing the background of my setup and the issue I faced.

My setup

1) I have a 5-node Nutanix cluster

2) It is connected to two Cisco 3850 XS switches

3) The Nutanix nodes are directly connected to both 3850s via fiber on 10G interfaces

4) LACP active-active is configured on all nodes connected to the 3850s

Problem Faced

1) Logged a case with the Nutanix team as the cluster showed a degraded state.

2) While troubleshooting, the support team observed high latency when pinging from VM-A on host 1 to VM-B on host 2.

3) On further troubleshooting, I pinged VM-A and VM-B from the switch itself and found the latency was 1ms.

4) The Nutanix support team was very clear that this was a network-related problem, as the latency between Controller VMs across the cluster was very high.

5) To isolate the issue I pinged VM-A and VM-B from the 3850 switch itself and found the latency was 1ms. This made the scenario more confusing. I argued with the support team: how can it be a network issue when pings from the switches to the Controller VMs return in 1ms?

6) Convinced the Nutanix support team to restart the network services, which they did, but the problem remained.

7) Rebooted the active 3850; no improvement in the latency when pinging from VM-A on host 1 to VM-B on host 2.

8) Rebooted the second switch, which had become master. After rebooting switch 2, the ping latency from VM-A to VM-B dropped to less than 1ms and things started returning to normal. The Nutanix team confirmed that the cluster health was now OK, and I also observed that things were working fine.

9) The Cisco 3850s have been in production for the last 2 years and had not been rebooted for the last year. The OS on the 3850s is 16.6.4.

I would like to know what steps can be taken to isolate the issue if it recurs, as things only started working fine after both 3850s were rebooted.
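One way to make the comparison from steps 2 to 5 repeatable is to save the ping output from each vantage point and compare the round-trip times side by side. A ping sourced from the switch only covers the switch-to-host leg, while VM-A to VM-B crosses both host uplinks and the switch in between, so it helps to record each path separately. Below is a minimal Python sketch, assuming the ping output has been saved as text; the function names, path labels, and the 5 ms threshold are hypothetical and only for illustration.

```python
import re

# Linux-style summary line: "rtt min/avg/max/mdev = 0.045/0.062/0.081/0.012 ms"
LINUX_RTT = re.compile(
    r"min/avg/max/(?:mdev|stddev) = ([\d.]+)/([\d.]+)/([\d.]+)/[\d.]+ ms"
)
# Cisco IOS-style summary line: "round-trip min/avg/max = 1/1/4 ms"
IOS_RTT = re.compile(r"round-trip min/avg/max = (\d+)/(\d+)/(\d+) ms")


def parse_rtt(ping_output):
    """Return (min, avg, max) in milliseconds, or None if no summary is found."""
    for pattern in (LINUX_RTT, IOS_RTT):
        match = pattern.search(ping_output)
        if match:
            return tuple(float(value) for value in match.groups())
    return None


def slow_paths(results, threshold_ms=5.0):
    """Return the names of paths whose average RTT exceeds threshold_ms.

    `results` maps a free-form path label to the saved ping output.
    The default threshold is an arbitrary illustration, not a recommendation.
    """
    slow = []
    for name, output in results.items():
        rtt = parse_rtt(output)
        if rtt is not None and rtt[1] > threshold_ms:
            slow.append(name)
    return slow
```

Feeding it one entry per path (VM to VM, switch 1 to each VM, switch 2 to each VM) shows at a glance which leg is slow and which is not.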

Deepak Kumar · ‎10-05-2019

Hi,

Have you recorded any logs from the switch? There could be many reasons: some type of Layer 2 loop, high CPU or memory usage, interface issues, a QoS policy dropping ICMP or heartbeat packets, etc.

Without logs, it is very difficult to suggest a solution, but you can check the items mentioned above.
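For the interface part of that list, one option is to capture `show interfaces` for the host-facing ports while the problem is happening and look for error or drop counters that are not zero. The following is a rough Python sketch, assuming the output has been saved as text in the usual `show interfaces` layout; the function name and the sample text are made up for illustration and are not output from the switches in this thread.

```python
import re

# Interface header, e.g. "TenGigabitEthernet1/1/1 is up, line protocol is up (connected)"
HEADER = re.compile(r"^(\S+) is .*line protocol is ")
# Counters worth a look; labels follow the usual "show interfaces" wording.
COUNTERS = {
    "input errors": re.compile(r"(\d+) input errors"),
    "CRC": re.compile(r"(\d+) CRC"),
    "output errors": re.compile(r"(\d+) output errors"),
    "output drops": re.compile(r"Total output drops: (\d+)"),
}


def nonzero_counters(show_interfaces_text):
    """Map interface name -> {counter: value} for every nonzero counter found."""
    findings = {}
    current = None
    for line in show_interfaces_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        if current is None:
            continue
        for label, pattern in COUNTERS.items():
            match = pattern.search(line)
            if match and int(match.group(1)) > 0:
                findings.setdefault(current, {})[label] = int(match.group(1))
    return findings


# Made-up sample in the usual layout, NOT taken from the poster's switches.
SAMPLE = """\
TenGigabitEthernet1/1/1 is up, line protocol is up (connected)
  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 output errors, 0 collisions, 1 interface resets
TenGigabitEthernet2/1/1 is up, line protocol is up (connected)
  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 4521
     37 input errors, 37 CRC, 0 frame, 0 overrun, 0 ignored
     0 output errors, 0 collisions, 1 interface resets
"""

print(nonzero_counters(SAMPLE))
```

Comparing two captures taken a few minutes apart is more useful than a single one, since counters that keep increasing point to an active problem rather than an old one.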

Regards,
Deepak Kumar

Nirmalbang · ‎10-07-2019

Hi Deepak.

The problem occurred on the evening of an off day, so CPU and memory utilization were minimal. We had not made any cabling changes, so this rules out a Layer 2 loop. Could you let me know which logs need to be checked?

Georg Pauwen · ‎10-06-2019

Hello,

It is actually a very common problem that switches with a long uptime (> 1 year) start to slow down. I would suggest a scheduled reboot every year.

Nirmalbang · ‎10-07-2019

Hi Georg,

Yes, a restart is what I did and things got back to normal. I will observe it for some days. If it recurs, I will raise a case with the TAC team.