Solved: Re: ACI Multi-site Design Question -ISN/ IPN suitable types and MSO location

mikey_p · ‎03-16-2021

Hi,

I have a hypothetical design question regarding ACI Multi-Site. Most articles and designs I see discuss using dark fibre between sites, but I also see that other solutions are possible. For example, if there were three data centres, one in the US, one in Asia and one in Europe, could you use an L3/L2 MPLS VPN network to connect the three sites together?

If it is possible, what would the requirements be? I see information about maximum MTU, and it appears that with the use of the MSO there are fewer requirements regarding maximum RTT between the sites.

As for the MSO, where could it be located? Could it sit anywhere in the network, or is it best to house it at one of the sites? In theory, with a connection from the MPLS network into a cloud provider (ExpressRoute etc.), it could reside in Azure / AWS, right?

Thanks

Mike

Robert Burns · ‎03-17-2021

Hey Mike,

I'll address your questions one by one. First, regarding MPLS in the IPN: as long as the spines can peer via OSPF with the ISN devices at each site, you can use any IP transport between the devices within the ISN. This includes dark fibre, MPLS, DWDM, etc. Ensure the transport supports an MTU of 1600B or larger (9K recommended) to account for VXLAN overhead. (FYI: the control-plane MTU can be tuned down from its 9K default.)
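To make the MTU point concrete, here is a minimal sketch of the overhead arithmetic. It assumes the generic VXLAN-over-IPv4 header sizes (inner Ethernet, VXLAN, UDP, outer IPv4), which are not taken from the post and are not ACI-specific; the function names are hypothetical.

```python
# Generic VXLAN-over-IPv4 header sizes in bytes (assumed, not ACI-specific).
INNER_ETHERNET = 14   # inner Ethernet header carried inside the tunnel
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20

VXLAN_OVERHEAD = INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4


def required_transport_mtu(inner_ip_mtu: int) -> int:
    """Smallest IP MTU the transport needs to carry an inner packet unfragmented."""
    return inner_ip_mtu + VXLAN_OVERHEAD


def transport_fits(transport_mtu: int, inner_ip_mtu: int = 1500) -> bool:
    """True if the transport MTU leaves room for the encapsulated packet."""
    return transport_mtu >= required_transport_mtu(inner_ip_mtu)


if __name__ == "__main__":
    for mtu in (1500, 1600, 9000):
        print(mtu, transport_fits(mtu))
```

Under these assumptions a standard 1500-byte transport is too small for 1500-byte endpoint traffic, while the 1600B minimum mentioned above leaves some headroom.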

In terms of latency with the later versions:
Between MSO nodes (within their cluster): 150ms
Between MSO nodes & APIC clusters: 1s
Between sites: no latency restrictions
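As a rough illustration, the limits above could be checked against measured round-trip times like this. The thresholds come from the list above; the measurement labels and values are made up for the example, and treating a value exactly at the limit as acceptable is an assumption.

```python
# Latency limits from the list above, in milliseconds.
MAX_RTT_MS = {
    "mso-mso": 150.0,    # between MSO nodes within the cluster
    "mso-apic": 1000.0,  # between MSO nodes and APIC clusters
}


def latency_violations(measurements):
    """Return the (kind, label, rtt_ms) entries that exceed their limit.

    Assumption: an RTT exactly equal to the limit is treated as acceptable.
    """
    return [
        (kind, label, rtt_ms)
        for kind, label, rtt_ms in measurements
        if rtt_ms > MAX_RTT_MS[kind]
    ]


if __name__ == "__main__":
    # Hypothetical measurements for a three-site design.
    sample = [
        ("mso-mso", "node1<->node2", 12.0),
        ("mso-mso", "node1<->node3", 180.0),
        ("mso-apic", "node1<->apic-asia", 260.0),
    ]
    for violation in latency_violations(sample):
        print("exceeds limit:", violation)
```

There is no entry for site-to-site traffic because no latency restriction applies there.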

When considering where to host your MSO nodes, you have options. First, you have to ensure that whichever location you choose can meet the latency requirement to each site (1s). Second, you might want to consider resiliency. A cloud provider like AWS/Azure is an option as well. You just need IP reachability from the MSO cluster to each APIC management subnet. All cloud providers can offer you a private connection (VPN tunnel) into your network, and you can tighten security so that MSO can only talk with your APIC clusters. Cloud providers have redundancy built in, so you really don't have to worry as much about host redundancy, power issues, etc. This option of course comes at a cost, as does any cloud-hosted resource. Luckily the MSO VMs (OVA) aren't too resource-intensive, and the traffic between MSO and the sites is light (async).

The other option is to self-host on-prem. Again, with latency considerations in mind, as long as you can meet these requirements you might have the luxury of spreading your MSO cluster across sites (two nodes in Site1, one node in Site2). With this option you have at least one replicated MSO node at each site. Let's examine the failure scenarios here. Should Site1 become unavailable, the remaining MSO node at Site2 goes into a R/O state. Your config is locked, but safe. At this point you can either fix the underlying connectivity issue or replace the unreachable nodes. By deploying at least one more MSO node at Site2, you can replace the failed MSO nodes, restore the cluster to a healthy state, and regain R/W access to MSO. For the reverse scenario (Site2 becomes unavailable) there is zero impact: losing one of the MSO nodes would degrade the cluster, but you'd retain R/W control.
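The two failure scenarios above can be modelled with a simple majority rule. This is only a sketch: the post describes the R/O and R/W outcomes but does not state the mechanism, so the majority-quorum rule here is an assumption that happens to match the described behaviour, and the function name is hypothetical.

```python
def cluster_state(total_nodes: int, reachable_nodes: int) -> str:
    """Return "read-write" if a strict majority of nodes is reachable.

    Assumption: the cluster needs more than half of its nodes to accept
    configuration changes; otherwise it is treated as read-only.
    """
    if not 0 <= reachable_nodes <= total_nodes:
        raise ValueError("reachable_nodes must be between 0 and total_nodes")
    if reachable_nodes > total_nodes // 2:
        return "read-write"
    return "read-only"


if __name__ == "__main__":
    # Two nodes in Site1, one node in Site2.
    print("Site1 lost:", cluster_state(total_nodes=3, reachable_nodes=1))
    print("Site2 lost:", cluster_state(total_nodes=3, reachable_nodes=2))
```

Under this model, adding replacement nodes at Site2 after losing Site1 restores a majority, which lines up with the recovery path described above.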

One last thing to take into consideration is the form factor. The OVA image you're likely referring to (the Docker-based MSO image) will not be developed beyond MSO 3.1(1). We are moving to Nexus Dashboard (previously known as the Service Engine) deployments. Nexus Dashboard is currently only available as a physical appliance (same HW as Service Engine) but will also be offered as a virtual appliance later this year. When the vND is released, it's going to require a much larger virtual footprint than the previous MSO OVA image, as it's a Kubernetes-based system (which will run on ESXi & KVM hypervisors) specifically designed to host our integrated suite of Day 2 applications like Nexus Insights, Network Assurance Engine and MSO. There is also a cloud-tuned image of vND in the works for AWS/Azure.
Hope this helps your decision. Let us know if you have any other questions.
Robert

mikey_p · ‎03-19-2021

Hi Robert,

I appreciate the detailed response. Now if only I could find the same input on my other questions dotted around the community.

Thanks

Mike