Re: Same Bridge Domain used in Application EPG and MicroSeg EPG?

bl80 · 08-20-2021

The primary issue is that after we set up an application EPG for a physical blade, connected via the same port channel used for our VMM connection, we have been having a lot of nuanced issues in the fabric.

My main question: is using the same Bridge Domain with both a standard EPG and a micro-segmented EPG supported? I am working to get this server moved to a new BD but won't be able to until next week. I am very curious to find out whether this is a known issue or whether anyone else has had similar buggy problems.

More Detail :

The VMM domain is VMware; uplinks are standard port channels. The AAEP contains 2 domains: 1 physical for the blade (static pool) and 1 for VMware (dynamic pool).

Application EPG for the physical blade is defined under the AAEP.

3 primary VRFs: 2 in Tenant 1, "Back-end-Systems" (BE) and "Front-end-Systems" (FE), and 1 in Tenant 2, "DMZ-Systems".

Tenant 1 has Application Profiles defined "FE-AP" and "BE-AP"

Tenant 2 has Application Profile "DMZ-AP"

FE-AP has 2 primary EPGs : FE-ORACLE-SERVER (this is the physical blade) and micro segmented FE-VM-SYSTEMS

Both use Bridge domain "FE-BD" 10.100.24.0/21

BE-AP has 1 EPG : micro segmented BE-VM-SYSTEMS

Bridge Domain "BE-BD" 10.100.32.0/21

DMZ-AP has 1 EPG : micro segmented DMZ-VM-SYSTEMS

Bridge Domain "DMZ-BD" 10.100.16.0/21

All of the "vm-systems" EPGs use the same VMM Domain

The "oracle-server" uses the Physical domain

There are hundreds of servers deployed throughout the 3 micro segmented EPGs

There are 10+ uEPGs under each micro segmented EPG.
Servers are assigned to specific uEPGs using the "VM Name" attribute with the "Contains" operator.
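
To make the "Contains" on "VM Name" matching concrete, here is a minimal Python sketch of the idea. It is not how ACI implements classification: the rule list, the uEPG names, the VM names, the first-match ordering and the case-sensitive comparison are all assumptions made up for illustration.

```python
from typing import List, Tuple


def classify_vm(vm_name: str, rules: List[Tuple[str, str]], base_epg: str) -> str:
    """Return the first uEPG whose substring appears in vm_name, else the base EPG."""
    for substring, uepg in rules:
        if substring in vm_name:  # "Contains" modeled as a plain substring test
            return uepg
    return base_epg


# Hypothetical rules: (substring the VM name must contain, uEPG name)
rules = [("web", "FE-uEPG-WEB"), ("app", "FE-uEPG-APP")]

print(classify_vm("fe-web-01", rules, "FE-VM-SYSTEMS"))  # matches the "web" rule
print(classify_vm("fe-db-01", rules, "FE-VM-SYSTEMS"))   # no match, stays in base EPG
```

A VM whose name matches no rule stays in the base EPG in this sketch, and a name that matches more than one rule goes to whichever rule is listed first; the real fabric applies its own precedence rules.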

About 10 hours after we deployed the physical blade, we started having issues.

One of the uEPGs in the same FE AP had a bad subnet mask, which caused the entire subnet to be known throughout the fabric as only a /24 and not a /21. This was found and corrected, but it was odd, and the bad mask turned out to have been there for months.
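
To show how much of the FE-BD range a /24 leaves out compared with the /21, here is a small Python sketch using the standard ipaddress module. The specific /24 (10.100.24.0/24) and the sample host address are assumptions for illustration only; the exact prefix the bad mask produced is not stated here.

```python
import ipaddress

bd_subnet = ipaddress.ip_network("10.100.24.0/21")   # FE-BD subnet
bad_subnet = ipaddress.ip_network("10.100.24.0/24")  # hypothetical /24 from the bad mask

# The /21 spans 10.100.24.0 through 10.100.31.255; the /24 is one eighth of that.
print(bd_subnet.network_address, bd_subnet.broadcast_address)
print(bd_subnet.num_addresses, bad_subnet.num_addresses)

host = ipaddress.ip_address("10.100.27.15")  # hypothetical FE host
print(host in bd_subnet)   # inside the intended /21
print(host in bad_subnet)  # outside the /24
```

Any host in 10.100.25.0 through 10.100.31.255 is inside the /21 but outside this /24.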

The physical blade EPG and every other uEPG in the entire fabric have 5 primary contracts. 3 of these are to L3outs via a single firewall for access to legacy networks. The other 2 are also to L3outs, for other external resources. There was also a contract created just for ping and used sparingly in the fabric. It had been created when we first brought everything online about a year ago and just wanted to get ping working. This contract had been added between the physical blade and those 3 L3outs to verify initial connectivity.

One evening, late after hours, all the L3outs stopped working for the FE-based uEPG systems. 100% failure. Routes still showed fine, but no traffic was passing.

Long story short, TAC found that all return traffic from the L3outs for the 10.100.24.0/21 subnet was being sent to the pcTag of the physical server EPG because of that PING contract. The contract was removed between those EPGs and traffic was restored. They say it's due to the following bug, but provided zero info on why this occurred or why performing rollbacks to the previous day did not fix it: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvi20535

Now, 2 weeks later, servers in the uEPGs in that FE AP cannot ping each other within the same uEPG. Everything outbound from the uEPGs is still working, but server-to-server traffic in the same uEPG fails *most* of the time. It works occasionally, then stops. Very sporadic.
It works fine in all other uEPGs. Cisco TAC has spent 10+ hours grabbing captures and trying to find a solution. They mentioned that the way it's set up is not supported but did not supply any specific documentation besides the standard micro-segmentation setup white paper. They are blaming the server-to-server failure on "likely a problem" with the HP Virtual Connect not forwarding the ARP requests (unable to get a SPAN from that as of yet).
Again, all the other VRFs use the exact same uplink/VMM domain/trunk ports, and only the VRF with the physical blade is having the problem.
All other uEPGs in both Tenants are working except for this one. I have 3 test VMs that I can move between the various uEPGs/VRFs, and they fail 100% when in the FE VRF and work 100% in all others. Moving them is just done by mapping the correct network in vCenter and changing the IP addresses on the VMs.

Let me know if I can supply any more info. ACI has been very unreliable and difficult to troubleshoot. The company invested millions in this for 3 datacenters and it's barely hanging on in one. Upper management is considering ditching it completely, as it has now been the root cause of 3 site-wide outages.

Robert Burns · 08-20-2021

Base & uSeg EPGs in the same BD are fine. I have seen many issues as a result of the non-standard ways HP blade switches (Virtual Connect) handle certain traffic. There are some whitepapers from both Cisco & HP on this that call out some of the best practices depending on whether you're using tunnel or trunking mode on HP VC. Depending on the uSeg config, Proxy ARP may be required (intra-EPG isolation), as well as matching up PVLAN pairs on any intermediate devices such as the HP switches.

This sounds like a very complex issue (or set of issues), and if TAC is already engaged I'd encourage you to work through that channel. It's going to be very hard for the community to help you without a full understanding of, and access to, your environment. If you're not making the progress needed with TAC, ask your account team to escalate the issue for you. If you're in need of general "how-to" ACI troubleshooting training, I'd suggest asking your Cisco account team to engage Cisco CX; they should be able to provide some knowledge transfer. I'd also suggest a general health check and best practice validation (CX can provide this also). If you're having multiple outages there could be lots at play here, and without knowing the details of your external networks, requirements, etc., it's hard for us to address those here.

Robert