02-16-2017 03:59 PM - edited 03-01-2019 05:09 AM
Hello,
We have an ACI dual-fabric environment (version 12.0(2l)):
one ACI fabric per DC; both fabrics are interconnected by layer 2 vPCs (DCI) between dedicated leaves in each DC
all EPG VLANs are extended between both DCs
L3OUT VLANs to the outside world are not extended over the ACI DCI links; instead, they are extended between both DCs via the two Cat6Ks to which the L3OUT links are connected
On all BDs (one BD per EPG), one distinct physical MAC (pMAC) is configured in each DC and one shared virtual MAC (vMAC) in both DCs:
the same pMACs and vMAC are used for all BDs
for each BD, a distinct IP is associated with each of these three BD MACs: two physical IPs and one virtual IP
for example :
BD pMAC = 00:22:BD:F8:19:F1 on DC1 Fabric, and 00:22:BD:F8:19:F2 on DC2 Fabric
BD vMAC = 00:22:BD:F8:19:F0 on both DC Fabrics
On the L3OUT, one shared vMAC is configured in both DCs: the same vMAC is used on all L3OUTs connected to the outside active/standby firewalls via the Cat6Ks (one per DC)
L3OUT vMAC = 00:22:BD:F8:19:FF on both DC Fabrics
Five distinct IPs are defined per L3OUT: one per leaf used to connect the local Cat6K (two leaves in vPC in each DC) and one virtual IP shared between both DCs
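To make the MAC plan above explicit, here is a minimal Python sketch that encodes it and checks it for consistency. The `GatewayMacs` type and `check_dual_fabric_plan` function are hypothetical helpers (not an ACI or APIC API); the rules it checks are taken from the description above (distinct pMAC per DC, identical vMAC in both DCs), plus one assumption of our own: that the L3OUT vMAC should not reuse any BD MAC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewayMacs:
    # Hypothetical per-fabric view of one BD's gateway MACs
    pmac: str
    vmac: str


def normalize(mac: str) -> str:
    # Compare MACs case-insensitively and without separators
    return mac.replace(":", "").lower()


def check_dual_fabric_plan(dc1: GatewayMacs, dc2: GatewayMacs, l3out_vmac: str) -> list:
    """Return a list of rule violations (empty list means the plan is consistent)."""
    problems = []
    if normalize(dc1.pmac) == normalize(dc2.pmac):
        problems.append("pMAC must differ between DC1 and DC2")
    if normalize(dc1.vmac) != normalize(dc2.vmac):
        problems.append("vMAC must be identical in both DCs")
    pmacs = {normalize(dc1.pmac), normalize(dc2.pmac)}
    if normalize(dc1.vmac) in pmacs:
        problems.append("vMAC must not reuse a pMAC")
    # Assumption: the L3OUT vMAC is kept separate from every BD MAC
    if normalize(l3out_vmac) in pmacs | {normalize(dc1.vmac)}:
        problems.append("L3OUT vMAC must not reuse a BD MAC")
    return problems


# Values from the example above
bd_dc1 = GatewayMacs(pmac="00:22:BD:F8:19:F1", vmac="00:22:BD:F8:19:F0")
bd_dc2 = GatewayMacs(pmac="00:22:BD:F8:19:F2", vmac="00:22:BD:F8:19:F0")
print(check_dual_fabric_plan(bd_dc1, bd_dc2, "00:22:BD:F8:19:FF"))  # -> []
```

With the example values the plan is consistent, which matches the point of this thread: the configuration itself looks correct, yet the BD vMAC still shows up on the L3OUT segment.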
We are currently observing instability related to these MACs when, for example, a PC outside the fabrics communicates (via RDP, SSH, ping, ...) with VM servers connected to either fabric.
This problem is easily reproducible by simply pinging the outside firewall facing the ACI leaves via the associated L3OUT:
a) if an iping is sent from an ACI leaf towards this firewall (defined as the default gateway in the L3OUT config) without specifying a source IP, there is no problem
b) when sending the same iping to the firewall with one of the ACI EPG IPs as the source IP, pings stop working for about 20 seconds every 1 or 2 minutes
When taking a monitor session (SPAN) capture on both Cat6Ks on all concerned port-channels (the one to the outside FW, the one to the local ACI fabric, and the one to the Cat6K in the other DC), we see that both ACI fabrics regularly send GARPs and ARP replies towards their local Cat6K with one of two source MACs: either the L3OUT vMAC (which seems logical, since they flow over an L3OUT segment) or the BD vMAC 00:22:BD:F8:19:F0 (which seems much less logical, since the L3OUT is not configured with this vMAC at all).
The ACI IP presented in those GARPs and ARP replies is always the correct virtual IP defined on the L3OUT in both DCs.
And, according to the tests performed, the ipings clearly start to fail when the MAC presented by ACI in the GARP/ARP replies changes.
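One way to line up the capture with the iping failures is to extract, from each GARP/ARP reply, its timestamp, sender IP, and sender MAC, and then list every moment the MAC advertised for the L3OUT virtual IP changes. The Python sketch below assumes such records have already been exported from the SPAN capture; the `ArpRecord` format, the `find_mac_changes` helper, and the VIP 192.0.2.1 are all hypothetical placeholders, and the timestamps are made-up illustration values, not data from our tests.

```python
from typing import Iterable, List, NamedTuple, Optional


class ArpRecord(NamedTuple):
    # Hypothetical record extracted from a capture: one GARP or ARP reply
    time_s: float
    sender_ip: str
    sender_mac: str


class MacChange(NamedTuple):
    time_s: float
    ip: str
    old_mac: str
    new_mac: str


def find_mac_changes(records: Iterable[ArpRecord], watched_ip: str) -> List[MacChange]:
    """Return every point in time where the MAC advertised for watched_ip changes."""
    changes: List[MacChange] = []
    last_mac: Optional[str] = None
    for rec in sorted(records, key=lambda r: r.time_s):
        if rec.sender_ip != watched_ip:
            continue  # ignore ARP traffic for other IPs
        mac = rec.sender_mac.lower()
        if last_mac is not None and mac != last_mac:
            changes.append(MacChange(rec.time_s, watched_ip, last_mac, mac))
        last_mac = mac
    return changes


# Illustrative records only: VIP alternating between the L3OUT vMAC and the BD vMAC
capture = [
    ArpRecord(0.0, "192.0.2.1", "00:22:BD:F8:19:FF"),
    ArpRecord(30.0, "192.0.2.1", "00:22:BD:F8:19:FF"),
    ArpRecord(45.0, "192.0.2.254", "00:11:22:33:44:55"),
    ArpRecord(60.0, "192.0.2.1", "00:22:BD:F8:19:F0"),
    ArpRecord(80.0, "192.0.2.1", "00:22:BD:F8:19:FF"),
]

for c in find_mac_changes(capture, "192.0.2.1"):
    print(f"{c.time_s:.1f}s {c.ip}: {c.old_mac} -> {c.new_mac}")
```

Each reported change can then be compared with the start of an iping outage; a change to the BD vMAC followed by roughly 20 seconds of loss would match the behavior described above.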
Any idea that would help us understand why this annoying phenomenon occurs would be welcome.
02-16-2017 11:17 PM
Hi
Long story short: a dual fabric with a stretched firewall cluster is a pain. Have a look at the following documents, which describe how to implement this setup:
White Paper/Design Guide:
Video:
https://www.youtube.com/watch?v=Qn5Ki5SviEA
HTH
Marcel
02-17-2017 04:37 AM
Hello Marcel,
Thanks for your update.
We had already reviewed the design guide you mentioned and tried to understand it as best we could.
Well, in our case, we don't use a firewall cluster as described in the guide and in the video:
we simply use an active/standby FW config (one per DC) with static routing only: only one FW is active at a time;
and we don't attach the firewalls to an EPG extended between both DCs, as described in the video;
instead, we attach those firewalls to a specific L3OUT in each DC: the same VLAN is used for this L3OUT in both DCs, but the VLAN is not extended between the DCs through the DCI links between the two ACI fabrics, since this is not allowed; this VLAN is extended on the firewall side via the L2 switches located between the ACI leaves and the FW in each DC.
So it is less complex, and this design was validated by Cisco and worked properly for weeks in a pilot environment.
And now, for no apparent reason, we have this really annoying problem on only one tenant...
We compared all parameters (VLANs, IPs, flooding, contracts, ...) between the failing tenant and the good ones: no difference detected on ACI.
Quite strange.
02-22-2017 07:29 AM
Yes, it's strange. However, we noticed similar strange issues in a dual-fabric scenario. In the end we migrated to a Multi-Pod fabric, and all problems were solved.
02-22-2017 07:49 AM
To summarize, TAC and the BU are investigating. A bug was filed and is being looked at:
https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvd22682/?reffering_site=dumpcr
Joey