
ACI ip flapping issue


I have an IP flapping issue in a Cisco ACI environment.

Topology:

I found that ICMP replies from 168.1.37.129 to 168.1.37.45 are sent to both SW13 and SW14. The replies sent to SW13 carry source IP 168.1.37.129 and source MAC d9bc, while the replies sent to SW14 carry source IP 168.1.37.129 and source MAC 66ec. In other words, this is an "IP flapping" issue.
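
To make the symptom concrete, here is a minimal toy model of "latest packet wins" learning. It is only a sketch and not how ACI is implemented; the `Endpoint` class, the `learn` function, and the alternating packet sequence are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    mac: str
    leaf: str


def learn(table, ip, mac, leaf):
    """Toy data-plane learning: the most recent packet overwrites the entry.

    Returns True when an existing entry for `ip` changed MAC or leaf (a "move").
    """
    prev = table.get(ip)
    table[ip] = Endpoint(mac, leaf)
    return prev is not None and (prev.mac != mac or prev.leaf != leaf)


table = {}
# Assumption: the teamed NICs send replies alternately through SW13 and SW14.
packets = [("168.1.37.129", "d9bc", "SW13"), ("168.1.37.129", "66ec", "SW14")] * 3
moves = sum(learn(table, ip, mac, leaf) for ip, mac, leaf in packets)
print(moves, table["168.1.37.129"])
```

Every packet after the first one moves the entry, which is the flapping described above.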

Problem:

While 168.1.37.45 pings 168.1.37.129 continuously, it receives ICMP replies without interruption at first. After more than 10 minutes the replies suddenly stop, and the show endpoints command on the APIC lists MAC d9bc for 168.1.37.129. A few minutes later 168.1.37.45 receives replies again, and the show endpoints command on the APIC lists MAC 66ec for 168.1.37.129.

Solution:

Enable ARP flooding.

I think this feature can resolve the problem, but it does not address the root cause.

Questions:

1. What is the root cause of 168.1.37.45 suddenly no longer receiving ICMP replies?

2. The server uses active/active NIC teaming without a vPC configuration, so it can cause IP flapping. Does this behavior have anything to do with "endpoint loop protection" or "rogue endpoint control"?

3. Are there other ways to resolve this problem without enabling ARP flooding?

4 Comments
Beginner

This is typical ACI learning behavior when active/active (A/A) teaming is configured without LACP. The recommended approach is to use LACP. To check whether the endpoint is still moving, run the following command on the leaf switch:

zcat /mnt/ifc/log/epmc-trace* | grep moved | grep "<YOUR MAC>"
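
If you copy the trace files off the switch, the same filter can be applied offline. Below is a rough Python equivalent of the zcat and grep pipeline above. It assumes the files are gzip-compressed text; the function names are my own, and it makes no assumption about the trace line format beyond "the line contains the word moved and the MAC".

```python
import glob
import gzip


def moved_lines(lines, mac):
    """Keep lines containing both 'moved' and the MAC, like `grep moved | grep MAC`."""
    return [ln for ln in lines if "moved" in ln and mac in ln]


def scan_traces(pattern, mac):
    """Apply moved_lines to every gzip file matching the glob pattern."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps going if a trace contains undecodable bytes
        with gzip.open(path, "rt", errors="replace") as fh:
            hits.extend(moved_lines(fh, mac))
    return hits
```

For example, `scan_traces("./epmc-trace*", "<YOUR MAC>")` returns the matching lines, and the length of that list gives a rough count of moves.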

 

If you have "rogue endpoint" detection enabled, then when severe flapping occurs ACI stops learning the endpoint for a period of time, so the endpoint is not reachable at all during that period. When endpoint flapping is severe enough it causes packet loss, and it can sometimes even cause EPM/EPMC to crash.
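
The hold behavior can be pictured with a small toy model. This is only a sketch of the idea (too many moves in a window freezes learning for a while); the class name and the threshold, window, and hold values are illustrative assumptions, not ACI's actual defaults or implementation.

```python
class RogueGuard:
    """Toy model: freeze learning after too many moves inside a time window."""

    def __init__(self, max_moves=4, window=60.0, hold=300.0):
        # All three values are illustrative assumptions, not ACI defaults.
        self.max_moves, self.window, self.hold = max_moves, window, hold
        self.move_times = []
        self.hold_until = None

    def on_move(self, now):
        """Record a move at time `now` (seconds). Return True if it was learned."""
        if self.hold_until is not None and now < self.hold_until:
            return False  # learning is frozen, so the move is ignored
        self.hold_until = None
        # Forget moves that fell out of the detection window
        self.move_times = [t for t in self.move_times if now - t < self.window]
        self.move_times.append(now)
        if len(self.move_times) >= self.max_moves:
            self.hold_until = now + self.hold  # start the hold period
            self.move_times.clear()
        return True


guard = RogueGuard()
# Assumption: the endpoint moves once every 10 seconds.
results = [guard.on_move(t) for t in range(0, 400, 10)]
```

With these made-up numbers, the first four moves are learned, the next several minutes of moves are ignored, and then the cycle repeats, which would look like an intermittent ping from the outside.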

 

If you can't reconfigure the server to use LACP or Active/Standby, there are two options that I am aware of:

Disable dataplane learning - this is a BD-wide setting that makes ACI stop learning endpoints via the data plane within that specific BD. In 4.0, there is also an option to disable dataplane learning at the VRF level.

Move the gateway outside of ACI, so that we are not learning any endpoints at all.

 

https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/4-x/L3-configuration/Cisco-APIC-Layer-3-Networking-Configuration-Guide-401/Cisco-APIC-Layer-3-Networking-Configuration-Guide-401_chapter_01111.html
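
As a rough illustration of the first option, the sketch below shows an endpoint table that ignores ordinary data packets when data-plane learning is turned off and only accepts ARP-based updates. The `update` function and its `source` values are hypothetical; this is a simplified picture, not the actual ACI logic.

```python
def update(table, ip, mac, leaf, source, dataplane_learning=True):
    """Toy endpoint update. `source` is 'arp' or 'data' (ordinary traffic)."""
    if source == "data" and not dataplane_learning:
        return False  # ignored: data packets no longer move the IP entry
    table[ip] = (mac, leaf)
    return True


ip = "168.1.37.129"

# Data-plane learning disabled: the ICMP reply arriving via SW14 changes nothing.
off = {}
update(off, ip, "d9bc", "SW13", "arp", dataplane_learning=False)
update(off, ip, "66ec", "SW14", "data", dataplane_learning=False)

# Data-plane learning enabled: the same reply moves the entry to SW14.
on = {}
update(on, ip, "d9bc", "SW13", "arp")
update(on, ip, "66ec", "SW14", "data")

print(off[ip], on[ip])
```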

 

I also recommend going through the endpoint learning white paper:

https://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/application-centric-infrastructure/white-paper-c11-739989.html

 

Beginner
Hi peterzhang, thanks for your reply. I have some further questions:

1. The ping succeeds for a few minutes, then suddenly fails, then succeeds again a few minutes later, and this pattern repeats all the time. Does "rogue endpoint control" cause this behavior?

2. If the answer to question 1 is yes, what about disabling "rogue endpoint control" in this case?

3. We found that ARP flooding can solve this problem without an LACP configuration. Did you mean that if we disable "arp flooding", we must "disable dataplane learning" and "move the gateway outside of the ACI" to solve this problem?

Many thanks.
Beginner
1. If you have "rogue endpoint control" enabled, it disables learning for a certain period of time (a few minutes), which can cause the issue you are seeing (intermittent ping).

2. Disabling rogue endpoint control allows the endpoint to flap, which means one of two things can happen:

a. If the flapping is too frequent, it starts causing packet loss, and you'll see faults in APIC telling you that it is seeing "duplicated IPs". In the extreme, it can cause EPM and EPMC (Endpoint Manager and Endpoint Manager Client) to crash, which is as bad as it gets.

b. If the flapping is not frequent, it might not affect anything. However, endpoint flapping is non-optimal regardless and should be resolved.

3. If ARP flooding is solving the issue, then two potential scenarios come to mind:

a. The flooding is keeping the endpoint alive (even if the endpoint may move from switch A to switch B). In this case we are refreshing the endpoint through ARP, which potentially means that it is not moving frequently (understanding whether this endpoint moves frequently is also important here).

b. The endpoint is rather silent, in which case enabling ARP flooding plus gleaning solves the issue completely.

So I'd recommend you confirm that, even after we learn the endpoint through ARP, the endpoint is not flapping. If it is, then the fundamental issue has not been completely resolved.

You asked: did you mean that if we disable "arp flooding", we must "disable dataplane learning" and "move the gateway outside of the ACI" to solve this problem?

Not necessarily. Disabling ARP flooding means we send ARP packets as unicast packets, and if the endpoint isn't in the COOP database in the Spine Proxy or in the leaf LST/GST, we'll drop them (silent hosts are a perfect example).

Disabling dataplane learning means we no longer learn endpoints from the data plane. It also stops remote leaf switches from updating their IP-to-VTEP information, which is why it helps stop the endpoint flapping. (When using L4-L7 service insertion, dataplane learning is always disabled on the BD facing the firewall, because we don't want to learn the firewall IP.)
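
The ARP handling described above can be written down as a tiny decision function. This is just a sketch that mirrors the wording of this reply; the function name and return strings are made up, the COOP and leaf tables are collapsed into a single set, and ARP gleaning is not modeled.

```python
def handle_arp_request(target_ip, known_endpoints, arp_flooding):
    """Toy decision for an ARP request, following the description in this reply."""
    if arp_flooding:
        return "flood"    # flooded in the BD, so even a silent host sees it
    if target_ip in known_endpoints:
        return "unicast"  # sent as unicast toward the known endpoint location
    return "drop"         # unknown (for example, silent) target


# Hypothetical table: only 168.1.37.45 has been learned so far.
known = {"168.1.37.45"}
print(handle_arp_request("168.1.37.129", known, arp_flooding=False))
print(handle_arp_request("168.1.37.129", known, arp_flooding=True))
```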