Extending Spine to IPN connections breaks inter-pod communication

Nik Noltenius
Spotlight

Hi folks,

First of all: we have already opened a TAC case, but I thought it can't hurt to have several irons in the fire.

This is our situation: We have a Multi-Pod environment with 100 Gbps links from the Spines to the IPN and a full mesh of 400 Gbps connections between the IPN nodes. We wanted to extend the Spine-to-IPN bandwidth by adding two 400 Gbps links from every Spine to the IPN. The 100 Gbps links were eventually to be disabled, but for the change we kept them running.
Everything was configured following the blueprint. After all, the fabric has been up and running for years without any issues. However, as soon as we enable a single 400 Gbps link towards the IPN, connectivity between the pods gets impaired. The strange thing is that not all connections are affected and, at least until now, we cannot find any common criterion that would tell us what goes down and what continues to work. The fabric itself is perfectly fine, there are no new faults, and the APIC cluster (which is distributed across two pods) also stays fully fit. All VTEPs remain reachable.

NikNoltenius_0-1760426607764.png

As soon as we remove the 400 Gbps connection, traffic flow stabilizes and everything is fine again.
Now, I really cannot make sense of this. I understand that the 400 Gbps link probably takes precedence over the two 100 Gbps connections serving the same pod, due to the higher bandwidth leading to a lower OSPF cost, but that should still work, shouldn't it?
The IPN L3Out configuration and the configuration on the NX-OS nodes have been quadruple-checked by three different people. That doesn't guarantee anything, but for now I'd say it's not a config issue. It might be a design problem, though, that I don't understand...
So if anyone has any ideas about this, your comments are highly appreciated.

Thanks and kind regards,
Nik


9 Replies

balaji.bandi
Hall of Fame

Do you have a sample configuration for the interface, OSPF, and BGP?

Could this be a Layer 2/Layer 3 issue or a routing loop where the networks are re-advertised somewhere?

By the way, what is the purpose of this 400 Gbps link?
BB


Hello Balaji,
thanks for the reply.

The diagram is highly simplified. In the end we want to replace all 100 Gbps connections with 400 Gbps links to extend the inter-pod bandwidth. For the change, we enabled the 400 Gbps links one by one and ran into connection issues almost immediately. For that reason, in later tests we only used one interface.

This is the configuration of the links on the IPN (one 100 Gbps link, which has been working for years, and the new, additional 400 Gbps link):

interface Ethernet1/3
mtu 9150
no shutdown

interface Ethernet1/3.4
description 100G working fine
mtu 9150
encapsulation dot1q 4
vrf member IPN
ip address 123.123.123.133/30
ip ospf network point-to-point
ip router ospf IPN area 0.0.0.0
ip pim sparse-mode
ip dhcp relay address 100.64.0.1
ip dhcp relay address 100.64.0.3
ip dhcp relay address 100.64.0.2
no shutdown


interface Ethernet1/5
mtu 9150

interface Ethernet1/5.4
description 400G breaking connectivity
mtu 9150
encapsulation dot1q 4
vrf member IPN
ip address 123.123.123.193/30
ip ospf network point-to-point
ip router ospf IPN area 0.0.0.0
ip pim sparse-mode
ip dhcp relay address 100.64.0.1
ip dhcp relay address 100.64.0.3
ip dhcp relay address 100.64.0.2
no shutdown

 

What about the configuration on the other side?

We need to look at your OSPF config as well. As you mentioned, the 400 Gbps link may be the preferred link due to its cost, and we also need to rule out OSPF loops and check how the networks are advertised.

The best advice here is: since you have a meshed network, why not shut down the 100 Gbps link and bring up the 400 Gbps link, and keep it clean?

Or try to increase the cost of the link so it is not preferred in the path. But I would still prefer taking the physical link down, bringing the 400 Gbps link up one leg at a time, and testing.
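A minimal sketch of what that could look like on the IPN side (the value 4 is only an assumption here, meant to match the cost of the existing 100 Gbps links; on the ACI side the equivalent knob would be the interface cost in the OSPF Interface Policy):

! assumption: pin the new 400 Gbps sub-interface to the same cost as the 100 Gbps links
! (use a higher value if it should not be preferred at all)
interface Ethernet1/5.4
  ip ospf cost 4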

BB


Thanks again

This is the (remaining) OSPF config on the IPN:

router ospf IPN
auto-cost reference-bandwidth 400 Gbps
vrf IPN
router-id 123.123.123.1
log-adjacency-changes
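For reference, the costs should then work out to reference bandwidth divided by interface bandwidth, so roughly (a sketch, assuming this reference bandwidth is in effect for the interfaces in the IPN VRF):

! 100 Gbps sub-interface: 400 / 100 = cost 4
! 400 Gbps sub-interface: 400 / 400 = cost 1
! cross-check per interface on the IPN with:
show ip ospf interface brief vrf IPN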

On the Spine side these screenshots should cover the configuration.
General L3Out settings for BGP:

NikNoltenius_0-1760435081144.png

One of the spine nodes:

NikNoltenius_1-1760435218353.png

Routed sub-interfaces on that node:

NikNoltenius_2-1760435345711.png

The OSPF Interface Policy is set to network type point-to-point; other than that, all values are default.

As for the fabric external connection policy, we left everything as it was and just added the new transfer networks:

NikNoltenius_4-1760435861220.png

If I'm not mistaken, that's the only place where we actively decide what to redistribute, in this case from OSPF into the fabric-internal IS-IS process.
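As a side note, if that redistribution works, the remote pod's TEP ranges should be visible in the infra VRF on the fabric nodes; a quick sanity check (a sketch, using the standard infra VRF name) would be:

show ip route vrf overlay-1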

I mean, loops would explain the behavior, but I cannot get my head around how adding another link to the picture could lead to looping behavior...

As to your suggestion: I agree, the 100 Gbps links should probably have been shut down, but that opportunity is gone now. And before we get another maintenance window, we need a better understanding of what might have happened.

Unfortunately, with this information alone we cannot determine what caused the issue. You really need to look at the routing table at the time the problem occurs: once the routes are learned and redistributed, the device's routing table output shows over which of the connected links each route is being learned.

As mentioned before, you have mesh connectivity, so removing one link should not cause an issue unless the remaining 100 Gbps links become oversubscribed.
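If it helps for the next window, the kind of output worth capturing on the IPN while the problem is present would be something like this (a sketch; the VRF name is taken from the config you posted):

show ip route vrf IPN
show ip ospf neighbors vrf IPN
show ip ospf database vrf IPN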

 

 

BB


Thanks again,
yes, it seems like we need to be in the problem situation to fully understand the issue. 
Right now we are checking whether a design with mismatched speeds on the Spine uplinks is even "officially supported". In the next maintenance window we'll definitely try to use the 400G links only, as we received multiple hints, including yours, that this would be the best-practice path.
I'll keep you updated.

julian.bendix
Spotlight

Hey there!

There isn't enough info for me to pin this down. But I have an idea of what might be wrong..

Could it be that BUM traffic (flooded L2 traffic) inside just a few specific Bridge Domains is breaking during the issue?

If yes ... here is my idea:
Could it be that the IPN Switch is not properly processing the IGMP join frames coming from the Spine?

If you have a look on the Spine and issue "show ip igmp gipo joins", you will find that the Spine tries to split the IGMP joins evenly across all active IPN links.

You mentioned that there are no new faults seen on the APIC, so I would assume that from the Spine's perspective the link is good and it is trying to shift some IGMP joins there.
If the IPN switch does not properly process those, it would break the BUM traffic (including ARP) inside exactly those BDs (and only those BDs!).
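If you want to cross-check that theory next time, a sketch of where to look (the VRF name is just the one from your IPN config above):

! on the Spine: which GIPo groups are joined over which IPN uplink
show ip igmp gipo joins
! on the IPN switch: whether those GIPo groups actually show up
show ip igmp groups vrf IPN
show ip mroute vrf IPN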

Would that at least describe the behavior you were seeing?

No clue yet on why the IPN switch would not process the IGMP joins, though... just an idea based on the description of the issues you were facing.

Gonna think about this a bit more..

BR Jules

Thank you for your input, Julian,
this could indeed lead to the behavior we are seeing, and it's a totally new angle to look at the issue. Highly appreciated!
However, I doubt that we have a problem processing IGMP joins. After all, you can see that the configuration of the 400G link on the IPN exactly matches the working ones; only the transfer network is different. No other configuration has been changed on the IPN devices. So if they really treated IGMP joins differently depending on which link they come in on, that would be a SW/HW bug, right? I mean, it's possible of course, and I will keep an eye on that in our next maintenance window, but I still believe a design issue as described by Balaji is more likely.

Nik Noltenius
Spotlight

I'm happy to announce that we successfully upgraded our Spine to IPN bandwidth.
We followed Balaji's suggestion to modify the OSPF link cost on all interfaces. To be more precise, we set the exact same OSPF cost value for all links, regardless of bandwidth, on both the ACI L3Out and the IPN side. Then we enabled the 400G links one after the other, checked the OSPF neighborships, and monitored traffic. All was fine, and after 10-15 minutes we disabled the 100G links.
What could (and should) have been that easy to begin with was now a smooth transition, thanks to treating all links equally from a routing point of view.
Also, I want to thank Jules again, as he pointed in the right direction of why this whole mess could have happened in the first place. I'm still confused about the details, but as he said, BUM traffic is flooded along loop-free trees (forwarding tag, or FTAG, trees) which are rooted at the spines. There are several FTAG trees, and the BD and VRF multicast groups are distributed across all of them.
Now, in our original situation with multiple 100G links at a default OSPF cost of 4 and a single 400G link with a cost of 1, there is only one valid (in the sense of cheapest) way out towards the IPN, and it is bound to one spine. For flooded traffic that lands there, all is fine. Other traffic, however, hitting a different spine is now trapped: it cannot go to another spine, as it would have to cross a leaf again, which would break the loop-freeness of the FTAG tree, and it cannot be forwarded over the more expensive links, as there is no unequal-cost multi-pathing in OSPF. I don't know what exactly happens, but the traffic is probably discarded.
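For anyone reading along later, the spine-side commands that should show this distribution are, as far as I can tell, these (a sketch based on Jules' hint and standard ACI tooling):

! FTAG trees and their outgoing interfaces as seen by IS-IS on a spine
show isis internal mcast routes ftag
! GIPo-to-uplink mapping for BUM traffic (the command Jules mentioned)
show ip igmp gipo joins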
So now the problem is fixed, and on a high level we even understand why we had an issue and why parts of the fabric still worked.
