Packet loss within the same VLAN

AGomez12
Level 1

Hi everybody.

 

This is my first discussion here. I have an issue, and I would like your help.

 

My client is seeing packet loss between servers in the same VLAN. They have a cluster configuration with 6296UP Fabric Interconnects. I have already checked, and somebody told me the issue is that the vNICs are pinned to FI-A while my primary is FI-B. But I have read that the Primary and Subordinate roles are only for management, not for forwarding traffic, so I don't know what is happening here. I have little experience working with UCS.

 

I will really appreciate your help.

 

Best regards.

1 Accepted Solution

AGomez12
Level 1

Hi guys, sorry for not answering sooner.

 

Well, let me tell you what happened with this case:

 

I read all of your answers and tested everything; it was very useful. But a friend of mine told me I needed to check the roles of the Fabric Interconnects and where my servers' interfaces were connected. I checked, and my blades were connected to Fabric A while Fabric B was the primary. So I changed the Fabric roles, and my issue was solved.

 

I understand the Fabric roles are only for management, not for traffic, but this worked for me.

 

The commands I applied were:

connect local-mgmt b (Fabric B was my primary)

cluster lead a (changes cluster leadership from Fabric B to Fabric A)

 

This change doesn't affect traffic, so I didn't have any issues with my data.

 

Thank you all guys.


6 Replies

Wes Austin
Cisco Employee

First, let me say that you are correct in your assumption that FI-A vs. FI-B does not matter: both FIs forward traffic unless you specifically configure pinning to one or the other. You need to determine whether the packet loss occurs on the same blade, within the same chassis, or between Fabric Interconnects.

 

First, how do you know you are having packet loss? Are you running packet captures? Monitoring software? Ping loss?

 

If you are having intermittent ping loss, see if the issue occurs from VM to VM on the same host. If it does, you need to investigate the hypervisor accordingly. Then, test between VMs on different servers, but within the same chassis and pinned to the same Fabric Interconnect.

 

Lastly, you can attempt to isolate the issue by putting one VM on a blade whose vNIC is pinned to FI-A, and another VM on a blade whose vNIC is pinned to FI-B. If you only see packet loss during this test, you need to investigate your upstream switches.
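A minimal sketch of quantifying the ping loss these tests produce, assuming standard Linux `ping` summary output; the helper name and sample address are illustrative, not from this thread:

```python
# Hypothetical helper for quantifying intermittent ping loss between two
# VMs. Parses the summary line of standard Linux `ping` output; the
# sample transcript below is illustrative.
import re

def packet_loss_percent(ping_output: str) -> float:
    """Extract the packet-loss percentage from a `ping` summary."""
    match = re.search(r"([\d.]+)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))

sample = """\
--- 10.0.10.25 ping statistics ---
100 packets transmitted, 97 received, 3% packet loss, time 99123ms
"""

print(packet_loss_percent(sample))  # prints 3.0
```

Running the same measurement for each placement (same host, same chassis, opposite fabrics) makes it easy to compare loss rates and narrow down where the drop is introduced.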

Walter Dey
VIP Alumni

Could the cause of the problem be outside of UCS?

For example, assume VLAN 10 exists on both fabrics, and one server or VM connects its vNIC to fabric A while the destination server or VM connects its vNIC to fabric B.

This means the traffic has to exit UCS northbound, get L2-switched, and finally re-enter UCS on the other FI/fabric.

I've seen cases where the northbound path was limited to 1G, causing a severe bottleneck, possibly with frame loss.
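This 1G-bottleneck scenario can be sanity-checked with quick arithmetic. The figures below are illustrative assumptions (a 10G server link hairpinning over a 1G northbound path), not measurements from this thread:

```python
# Back-of-the-envelope check of a northbound bottleneck: if east-west
# traffic between fabrics must hairpin through an upstream L2 switch,
# the slowest link on that path caps the flow. All figures are
# illustrative assumptions.
server_nic_gbps = 10.0      # assumed server-facing link speed
northbound_link_gbps = 1.0  # the limited upstream path described above

oversubscription = server_nic_gbps / northbound_link_gbps
print(f"oversubscription ratio: {oversubscription:.0f}:1")  # prints 10:1
```

A 10:1 mismatch like this would drop frames under any sustained load, which matches the intermittent-loss symptom.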

Evan Mickel
Cisco Employee

One quick tip to validate the physical path: SSH to the UCS cluster VIP, then run:

1) connect nxos a

2) show interface counters errors

 

Run step two multiple times, allowing 10 seconds between runs. This is a good way to verify that errors are not incrementing. Complete the same steps beginning with 'connect nxos b'. There could of course be packet loss introduced upstream due to traffic volume, as Walter mentions, but it could also be as simple as CRC errors from a bad cable or SFP.

 

The steps listed above are a good jumping-off point for path verification; we should be able to direct you further depending on what you report. The tests Wes mentioned would be good to execute as well.
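As a rough sketch, this two-snapshot comparison could be automated as follows. The column layout is a simplified stand-in for real `show interface counters errors` output, and the function names are hypothetical:

```python
# Compare two snapshots of (simplified) `show interface counters errors`
# output taken ~10 seconds apart and flag interfaces whose CRC counter
# incremented. The snapshot format here is an illustrative stand-in.
def parse_crc_counters(snapshot: str) -> dict:
    """Map interface name -> CRC error count from a snapshot."""
    counters = {}
    for line in snapshot.strip().splitlines()[1:]:  # skip the header row
        interface, crc = line.split()
        counters[interface] = int(crc)
    return counters

def incrementing_crc(before: str, after: str) -> list:
    """Return interfaces whose CRC count rose between snapshots."""
    first, second = parse_crc_counters(before), parse_crc_counters(after)
    return [intf for intf in second if second[intf] > first.get(intf, 0)]

snap1 = """Port       CRC
Eth1/1     0
Eth1/2     15"""
snap2 = """Port       CRC
Eth1/1     0
Eth1/2     42"""

print(incrementing_crc(snap1, snap2))  # prints ['Eth1/2']
```

An interface that keeps accumulating CRC errors between snapshots points at a bad cable or SFP on that port, exactly the failure mode described above.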

 

 

 

 

Thanks!


Running that command will restart a lot of management processes on the Fabric Interconnects, which may be what actually resolved your issue.

Hi,

I can hardly believe that this problem is related to the lead role of the FI cluster.

Maybe it is a side effect, as Wes mentioned.

If you change the leadership, all dual-homed vNICs may fail over to the same fabric, so L2 switching is done inside UCS (on one FI).

That would make it a temporary fix, and the initial problem might show up again.
