Nexus 5548s to Cisco 3945 via L3 OSPF ECMP Issue?

Ignacio Rios · 12-05-2016

Hello,

This is my first time posting here, but I use the Support Community often when I'm looking for answers. I have searched for an answer to my problem for well over two weeks and have tried various solutions discussed in other posts, but I just can't get a handle on the exact problem I'm experiencing. I'm running two Nexus 5548UPs (both have L3 daughter cards) to host SVIs that run HSRP. These 5548s run FEX down to two B22HP-P modules (basically 2K fabric extenders for an HP chassis). To route the SVIs out, the 5548s connect to a Cisco 3945. Each 5548 has a 1Gb link to the 3945, and I'm running OSPF ECMP between them. I'm using a VLAN to peer the 5548s through OSPF via the vPC peer-link.

Initially, everything appears to work correctly. We get a nice consistent ping to the server (in the diagram) from our workstations (2 hops away and not pictured), averaging 1ms with a high of 4ms. After about 2-3 days the pings range from 1ms to 2000ms, and my co-worker is then unable to RDP to the server. Luckily this is still in the testing phase and not in production.

I have tried pinging just across the L3 links from the router to the 5548s, and the delays are just as bad when the problem arises. A shut/no shut of interface G0/0/0 or G0/1/0 on the router clears up the problem for another few days. The router is using EHWICs, and I've tried replacing SFPs, using different fibers, and going to copper. The problem persists. The routing table on the router shows the two entries for the SVIs on the 5548s, and both entries are also in the CEF table. Today, when the problem showed up again, I took the primary link out of OSPF and tested across the link directly, and the delay was still all over the place. When I look at the CPU utilization of the router and the 5548s, I don't see anything eating up cycles. I've read that the 5548s naturally sit at a higher average. Looking at mine now shows an average of 10% with spikes to 60% across the board for the last 72 hours, and I can't find anything causing those 60% spikes. I've checked the CoPP policy on both 5Ks and have never seen the ICMP-echo policy map show any "violated" bytes, so the CoPP policy isn't the culprit.

I originally thought it was a problem with HSRP because I don't see the Gateway bit for the virtual MAC when I run "show mac address-table", but that wouldn't explain my ping delays from G0/0/0 and G0/1/0 to the Eth1/32 (L3) interfaces. Despite the missing Gateway bit in the MAC address table, HSRP looks like it's working. The "show hsrp" command shows that the 5548s see each other. The hello packets are making it across the vPC link and the timers are refreshing. 5548-1 is showing as active for the VLANs and 5548-2 is on standby.

I currently don't have the peer-gateway command in the vPC domain, but I've had it on and the problem persisted in the same way. I'm leaving it off for now because, from what I've read, it was mostly EMC (and a few other vendors) that ignored the HSRP virtual MAC.

Lastly, I'm not getting any interface errors on the L3 links between the router and the 5548s, which makes this even harder to troubleshoot. I thought it was a code issue on the 5548s because they were still on older 5.x code, but the problem persists on 7.x code.

The IPs in the diagram are not what I'm using, but everything else is configured exactly the same way. Since it's 3 configurations, I figured it was easier to upload than copy/paste everything into here.

Here are the versions of code:

C3945: 15.2(2)T1

5548UP: 7.1(4)N1

I would really appreciate any help. I've pored over many Cisco documents and links provided in other threads, but I couldn't find anything.

TL;DR: Running L3 ECMP between a 3945 and two 5548s that are running HSRP, SVIs, vPC and FEX. Getting weird ping results after it sits for a few days. Shut/no shut on the L3 interface clears the problem, but it returns 2-3 days later. No clue what's causing it because it's time based. Show commands on protocols and CEF tables look good even when the issue is present. No interface errors.

Thanks

-Iggy

chrihussey · 12-05-2016

Looks like on the router interface g0/0/0 and loop 0 have the same IP address (10.0.0.0). This may be causing some issues.
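A quick way to rule this kind of clash in or out is to run the sanitized addresses through a small script. Below is a minimal Python sketch, assuming a hand-built interface table. The masks and the G0/1/0 address are made up for illustration; only the 10.0.0.0 overlap between g0/0/0 and loop 0 comes from the posted config.

```python
import ipaddress
from collections import defaultdict


def find_duplicate_addresses(interfaces):
    """Return {address: [interface names]} for addresses configured more than once."""
    seen = defaultdict(list)
    for name, cidr in interfaces.items():
        # ip_interface accepts "address/prefix" and keeps the host address.
        addr = ipaddress.ip_interface(cidr).ip
        seen[addr].append(name)
    return {str(addr): names for addr, names in seen.items() if len(names) > 1}


# Hypothetical, sanitized table. Only the 10.0.0.0 clash is from the thread.
router_interfaces = {
    "GigabitEthernet0/0/0": "10.0.0.0/31",
    "GigabitEthernet0/1/0": "10.0.0.2/31",
    "Loopback0": "10.0.0.0/32",
}

duplicates = find_duplicate_addresses(router_interfaces)
print(duplicates)
```

An empty result means no address appears on more than one interface in the table that was fed in.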

I have a similar setup, but don't exclude the peering VLAN from the peer link. Not sure if that will make much of a difference.

Also, using /31s may be OK, but I think /30s on the point-to-point links may produce cleaner and more stable results. You'll get varying views on that. Just my thought.

Regards

Ignacio Rios · 12-05-2016

Shoot, good catch, but that was just a typo on my part from cleaning up the config. I'll edit that now. It should be 10.1.1.0/32. I see I also messed up the Lo0 on the drawing. I guess the router's Lo0 on the drawing doesn't matter all that much. I'm 100% positive there are no overlapping IP addresses on the equipment. I made the typos while sanitizing the configuration to post here. Let's go with 10.1.1.0/32 on the diagram as well; it's simpler and easier to follow along with the Lo0 IP addressing of the 5Ks.

Reza Sharifi · 12-05-2016

Why aren't you assigning a host address to your vPC keepalive and mgmt0?

10.1.1.1 and 10.1.1.2

Also, there is no shortage of private IP space, so I'm wondering why you assigned a /31 and not a /30 or something larger (/25 or /24), especially since this is for management.

Also, the loopback should have a host address (/32).

HTH

Ignacio Rios · 12-05-2016

Reza,

Thanks for your input! The IP addresses here are private IP space, but the production IPs are not. I am limited on IP space on the real subnets, so I have transcribed them into private IP space to post here while maintaining the subnet masks.

As for not assigning a host address to the vPC KA? I couldn't produce an error with the KA link if I used a /31 instead of a /30. The KA came up just fine. Should I be using a /30? Isn't the whole point of the KA link to have OoB communication between the Nexus 5548s? They just need to be able to reach each other. I have the space to make it a /30, but the KA link is up and running fine on a /31...well at least according to the show commands.
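For reference, the address math behind the /31 versus /30 question can be sketched in a few lines of Python. This is only an illustration: the 192.0.2.0 prefix is a documentation range chosen for the example, not an address from my setup, and whether a given platform accepts a /31 on a particular interface still has to be checked on the device itself. RFC 3021 is what defines 31-bit prefixes for point-to-point links.

```python
import ipaddress


def usable_hosts(cidr):
    """List the addresses that can be assigned to interfaces in the subnet."""
    return [str(host) for host in ipaddress.ip_network(cidr).hosts()]


# Illustrative prefix from the documentation range, not the real KA subnet.
ka_31 = usable_hosts("192.0.2.0/31")
ka_30 = usable_hosts("192.0.2.0/30")

print(ka_31)  # ['192.0.2.0', '192.0.2.1']
print(ka_30)  # ['192.0.2.1', '192.0.2.2']
```

Both masks leave exactly two assignable addresses for the two ends of the link. The /30 spends two extra addresses on the network and broadcast addresses, which matters when the real subnets are tight.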

I'm not actually using the MGMT0 for "management." I'm using it only as the KA link which Cisco says is OK to do on the 5Ks. I use the loopbacks for actual device management.

Agreed on Loopbacks being a /32 and they are. Just didn't annotate the CIDR in the config files posted.

Ignacio Rios · 01-03-2017

Update:

There was a code bug on the 3900 router that left the interface TxLoad at 255/255. I upgraded to newer code and it's been solid for a few weeks now. Thanks for the help!
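For anyone chasing a similar symptom, here is a rough Python sketch for spotting a pegged TxLoad in saved "show interfaces" output. It assumes the load line has the usual "reliability 255/255, txload N/255, rxload N/255" form. The sample string and the 0.99 threshold are invented for illustration and were not captured from my router.

```python
import re

# Matches the load counters as they normally appear in "show interfaces" output.
LOAD_RE = re.compile(r"txload (\d+)/255, rxload (\d+)/255")


def parse_load(show_interface_text):
    """Return (txload, rxload) as fractions of 255, or None if no load line is found."""
    match = LOAD_RE.search(show_interface_text)
    if match is None:
        return None
    tx, rx = (int(group) for group in match.groups())
    return tx / 255, rx / 255


# Illustrative sample only, not real output from the affected 3945.
sample = "reliability 255/255, txload 255/255, rxload 1/255"

tx_frac, rx_frac = parse_load(sample)
saturated = tx_frac >= 0.99  # hypothetical threshold for "pegged"
print(tx_frac, saturated)
```

A TxLoad stuck near 255/255 on a link that is carrying almost no traffic is worth comparing against the actual output rate before suspecting the fiber or SFPs.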

chrihussey · 01-03-2017

Great. Thanks for the update.