10-29-2015 07:53 PM - edited 03-08-2019 02:30 AM
Hello all.
This is my first post, and I'll try to be as detailed as possible. I am upgrading the core of our network with two Nexus 6004s that connect north to two Catalyst 7606s. The 6004s also have connections going south to two Nexus 6001s. Everything is eBGP over P2P links, laid out like this (for clarity's sake, I'll refer to a single box of each type):
7606 -> 6004 (Port-Channel - two 10Gb links on both sides)
6004 -> 6001 (40Gb P2P)
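For reference, each P2P eBGP peering is configured roughly like this. The P2P addresses match the pair you'll see in the debugs below, but the AS numbers and port-channel number are placeholders, not our real ones, and the member links are omitted:

On the 7606 (IOS):
interface Port-channel10
 description P2P to 6004 (two 10Gb members, omitted here)
 ip address 10.251.177.1 255.255.255.252
!
router bgp 65001
 neighbor 10.251.177.2 remote-as 65002

On the 6004 (NX-OS):
feature bgp
!
interface port-channel10
  description P2P to 7606 (two 10Gb members, omitted here)
  no switchport
  ip address 10.251.177.2/30
!
router bgp 65002
  neighbor 10.251.177.1
    remote-as 65001
    address-family ipv4 unicast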
The eBGP peerings between the NX boxes come up just fine. The peerings between the 6004 and the 7606 do not come up at all. After digging around and debugging some BGP packets, I noticed that TCP never establishes. Setting BGP aside for the moment, I then noticed that when I run pings from the 7606 with the df-bit set, a size of 1500, and a count of, say, 100, every 15th packet is dropped, consistently. Changing the size up or down changes the interval at which packets drop. For example, with a packet size of 1100 it is every 25th packet; with a size of 8000 (while I was trying to set the MTU manually on the interface), every 3rd packet was dropped. Here is what I have done so far (the ping test and MTU settings are sketched right after this list):
Set MTU manually
Set P2P to a single link only
Ran Wireshark on the link (nothing useful beyond the missing TCP response)
Wiped the NX box clean and configured only the interface
ip tcp path-mtu-discovery was already enabled globally on the 7606; I added it to the 6004 as well
Configured static speed and duplex settings
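For reference, the ping test from the 7606 side and the MTU/PMTUD knobs I touched looked roughly like this. The MTU value and port-channel number are just examples of what I tried, not necessarily what we will keep:

ping 10.251.177.2 size 1500 df-bit repeat 100
!
interface Port-channel10
 mtu 9216
!
ip tcp path-mtu-discovery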
I'm certain I've done a lot more that I cannot think of at the moment (I have it documented at work). When I run debug ip tcp transactions, I notice that the SYN_SENT toward the neighbor (from the original BGP session attempt) is timing out. It almost appears as though this is some buffer or window issue on the NX box, but I am coming up short in my research on how to fix it. Before I call TAC, I figured I'd post this.
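If it helps anyone reproduce this, the quickest way I found to watch the half-open session from the TCP side is below. Exact command availability may differ by platform and release, so treat these as pointers rather than gospel:

show tcp brief all              (on the 7606 - the BGP connection sits in SYNSENT)
show sockets connection tcp     (on the 6004 - to see the state from that side)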
I'm 99% certain it's not a fiber or L1 issue, since both NX boxes, which have redundant P2P links to both 7606s, are seeing this exact same behavior. I'm also leaning toward a potential bug between IOS and NX-OS, but I'm not too sure.
Any help would be appreciated.
Thanks.
-Michael
10-30-2015 07:32 AM
Michael
Thanks.
Is there any chance of running the same on the 7600 to see what it thinks?
Jon
10-30-2015 07:52 AM
Here is the output of the "debug ip bgp 10.251.177.2" command:
ar01.taipei#debug ip bgp 10.251.177.2
BGP debugging is on for neighbor 10.251.177.2 for address family: IPv4 Unicast
ar01.taipei#term mon
ar01.taipei#clear ip bgp 10.251.177.2
ar01.taipei#
Oct 30 10:49:27.148 EDT: BGP: ses global 10.251.177.2 (0) act read request no-op
Oct 30 10:49:27.148 EDT: BGP: ses global 10.251.177.2 (0) act Reset (Active open failed).
Oct 30 10:49:27.148 EDT: BGP: 10.251.177.2 active went from Active to Idle
Oct 30 10:49:36.061 EDT: BGP: 10.251.177.2 active went from Idle to Active
Oct 30 10:49:36.061 EDT: BGP: 10.251.177.2 open active, local address 10.251.177.1
Oct 30 10:50:06.061 EDT: BGP: ses global 10.251.177.2 (0) act read request no-op
Oct 30 10:50:06.061 EDT: BGP: 10.251.177.2 open failed: Connection timed out; remote host not responding
Oct 30 10:50:06.061 EDT: BGP: 10.251.177.2 active open failed - tcb is not available, open active delayed 14336ms (35000ms max, 60% jitter)
Oct 30 10:50:06.061 EDT: BGP: ses global 10.251.177.2 (0) act Reset (Active open failed).
Oct 30 10:50:06.061 EDT: BGP: 10.251.177.2 active went from Active to Idle
Oct 30 10:50:20.093 EDT: BGP: 10.251.177.2 active went from Idle to Active
Oct 30 10:50:20.093 EDT: BGP: 10.251.177.2 open active, local address 10.251.177.1
Here is the output of "debug ip tcp transactions" on the 7606 as well:
ar01.taipei#debug ip tcp transactions
TCP special event debugging is on
ar01.taipei#
Oct 30 10:50:33.909 EDT: TCP0: Got ACK for our FIN
Oct 30 10:50:33.909 EDT: TCP0: state was LASTACK -> CLOSED [45424 -> 10.252.217.74(49)]
Oct 30 10:50:33.909 EDT: TPA: Released port 45424 in Transport Port Agent for TCP IP type 1 delay 240000
Oct 30 10:50:33.909 EDT: TCB 0x534EFCD4 destroyed
Oct 30 10:50:33.909 EDT: TCB534EFCD4 created
Oct 30 10:50:33.909 EDT: TCB534EFCD4 setting property TCP_NO_DELAY (0) 5237A42C
Oct 30 10:50:33.909 EDT: TCB534EFCD4 setting property TCP_VRFTABLEID (14) 5237A488
Oct 30 10:50:33.909 EDT: TCP: Random local port generated 60648, network 1
Oct 30 10:50:33.909 EDT: TCB534EFCD4 bound to 10.252.68.0.60648
Oct 30 10:50:33.909 EDT: TCB534EFCD4 setting property TCP_NONBLOCKING_WRITE (7) 5237A4B8
Oct 30 10:50:33.909 EDT: TCB534EFCD4 setting property TCP_NONBLOCKING_READ (8) 5237A4B8
Oct 30 10:50:33.909 EDT: TPA: Reserved port 60648 in Transport Port Agent for TCP IP type 1
Oct 30 10:50:33.909 EDT: TCP: sending SYN, seq 3470600175, ack 0
Oct 30 10:50:33.909 EDT: TCP0: Connection to 10.252.217.74:49, advertising MSS 9138
Oct 30 10:50:33.909 EDT: TCP0: state was CLOSED -> SYNSENT [60648 -> 10.252.217.74(49)]
Oct 30 10:50:33.913 EDT: TCP0: state was SYNSENT -> ESTAB [60648 -> 10.252.217.74(49)]
Oct 30 10:50:33.913 EDT: TCP0: tcb 534EFCD4 connection to 10.252.217.74:49, received MSS 1460, MSS is 1460
Oct 30 10:50:33.913 EDT: TCP0: FIN processed
Oct 30 10:50:33.913 EDT: TCP0: state was ESTAB -> CLOSEWAIT [60648 -> 10.252.217.74(49)]
Oct 30 10:50:33.913 EDT: TCP0: state was CLOSEWAIT -> LASTACK [60648 -> 10.252.217.74(49)]
Oct 30 10:50:33.913 EDT: TCP0: sending FIN
Oct 30 10:50:33.913 EDT: TCP0: Got ACK for our FIN
Oct 30 10:50:33.913 EDT: TCP0: state was LASTACK -> CLOSED [60648 -> 10.252.217.74(49)]
Oct 30 10:50:33.913 EDT: TPA: Released port 60648 in Transport Port Agent for TCP IP type 1 delay 240000
Oct 30 10:50:33.913 EDT: TCB 0x534EFCD4 destroyed
Oct 30 10:50:34.093 EDT: TCP0: timeout #3 - timeout is 16000 ms, seq 2056312724
Oct 30 10:50:34.093 EDT: TCP: (35292) -> 10.251.177.2(179)
ar01.taipei#un all
All possible debugging has been turned off
ar01.taipei#
Oct 30 10:50:46.761 EDT: TCB534EFCD4 created
Oct 30 10:50:46.761 EDT: TCB534EFCD4 setting property TCP_GIVEUP (11) 4605B59C
Oct 30 10:50:46.761 EDT: TCP: Random local port generated 48372, network 1
Oct 30 10:50:46.761 EDT: TCB534EFCD4 bound to 10.252.68.0.48372
Oct 30 10:50:46.761 EDT: TPA: Reserved port 48372 in Transport Port Agent for TCP IP type 1
Oct 30 10:50:46.761 EDT: TCP: sending SYN, seq 247654830, ack 0
Oct 30 10:50:46.761 EDT: TCP0: Connection to 10.252.217.74:49, advertising MSS 9138
Oct 30 10:50:46.761 EDT: TCP0: state was CLOSED -> SYNSENT [48372 -> 10.252.217.74(49)]
Oct 30 10:50:46.761 EDT: TCP0: state was SYNSENT -> ESTAB [48372 -> 10.252.217.74(49)]
Oct 30 10:50:46.761 EDT: TCP0: tcb 534EFCD4 connection to 10.252.217.74:49, received MSS 1460, MSS is 1460
Oct 30 10:50:46.765 EDT: TCB534EFCD4 connected to 10.252.217.74.49
Oct 30 10:50:46.977 EDT: TCP0: FIN processed
Oct 30 10:50:46.977 EDT: TCP0: state was ESTAB -> CLOSEWAIT [48372 -> 10.252.217.74(49)]
Oct 30 10:50:47.065 EDT: TCP0: state was CLOSEWAIT -> LASTACK [48372 -> 10.252.217.74(49)]
Oct 30 10:50:47.065 EDT: TCP0: sending FIN
Oct 30 10:50:47.065 EDT: TCP0: Got ACK for our FIN
Oct 30 10:50:47.065 EDT: TCP0: state was LASTACK -> CLOSED [48372 -> 10.252.217.74(49)]
Oct 30 10:50:47.065 EDT: TPA: Released port 48372 in Transport Port Agent for TCP IP type 1 delay 240000
Oct 30 10:50:47.065 EDT: TCB 0x534EFCD4 destroyed
Oct 30 10:50:47.069 EDT: TCB534EFCD4 created
Oct 30 10:50:47.069 EDT: TCB534EFCD4 setting property TCP_NO_DELAY (0) 5237A42C
Oct 30 10:50:47.069 EDT: TCB534EFCD4 setting property TCP_VRFTABLEID (14) 5237A488
Oct 30 10:50:47.069 EDT: TCP: Random local port generated 45022, network 1
Oct 30 10:50:47.069 EDT: TCB534EFCD4 bound to 10.252.68.0.45022
Oct 30 10:50:47.069 EDT: TCB534EFCD4 setting property TCP_NONBLOCKING_WRITE (7) 5237A4B8
Oct 30 10:50:47.069 EDT: TCB534EFCD4 setting property TCP_NONBLOCKING_READ (8) 5237A4B8
Oct 30 10:50:47.069 EDT: TPA: Reserved port 45022 in Transport Port Agent for TCP IP type 1
Oct 30 10:50:47.069 EDT: TCP: sending SYN, seq 2206818846, ack 0
Oct 30 10:50:47.069 EDT: TCP0: Connection to 10.252.217.74:49, advertising MSS 9138
Oct 30 10:50:47.069 EDT: TCP0: state was CLOSED -> SYNSENT [45022 -> 10.252.217.74(49)]
Oct 30 10:50:47.069 EDT: TCP0: state was SYNSENT -> ESTAB [45022 -> 10.252.217.74(49)]
Oct 30 10:50:47.069 EDT: TCP0: tcb 534EFCD4 connection to 10.252.217.74:49, received MSS 1460, MSS is 1460
Oct 30 10:50:50.093 EDT: BGP: 10.251.177.2 open failed: Connection timed out; remote host not responding
Oct 30 10:50:50.093 EDT: BGP: 10.251.177.2 active open failed - tcb is not available, open active delayed 14336ms (35000ms max, 60% jitter)
ar01.taipei#un all
All possible debugging has been turned off
ar01.taipei#
10-31-2015 02:29 AM
I was able to identify the issue. The 7600 being as old as it is, someone had configured a copp_management ACL on it at some point. I found this by pure luck, because a coworker suggested peering the 6004 up with our 4948, it being an IOS device as well. The 4948 was clean and the peering came right up.
For that test I just used random 192.168 space for the point-to-point between the 6004 and the 4948. I then left that address space on the 6004 interface and, since we were running single-mode LC to SC, swapped the run between the 6004 and the 7606 over to multimode and configured the 7606 interface with the same 192.168 space. Boom, BGP came up.
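The readdressing test itself was nothing fancier than dropping a throwaway /30 on both ends and updating the neighbor statements to match; the subnet and port-channel number here are just examples of the kind of thing I used:

On the 6004 (NX-OS):
interface port-channel10
  ip address 192.168.100.1/30

On the 7606 (IOS):
interface Port-channel10
 ip address 192.168.100.2 255.255.255.252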
My initial thought was that something was funky with all eight single-mode runs. It didn't make sense to me, but whatever. I then changed everything back to the 10 space and, what do you know, no BGP peering. Now I was intrigued. Next step: I changed the interfaces to 172.16 space. No peering. That was screaming some type of ACL to me. I went through the ACLs on the 7606 looking for one that matched 192.168 space but not the 10 or 172.16 space, and found one named copp_management that must have been created years ago. I noticed the 192.168 entry had hits on it, and (when I set the interfaces back to 192.168 space) the count kept increasing. Bingo, I knew this was it. I added a permit for the 10 space, readdressed my interfaces, and BGP came up.
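For anyone who hits something similar, the check and the fix looked roughly like this. The ACL contents below are illustrative only; I'm showing the kind of permit I added, not our exact entries:

show policy-map control-plane
show ip access-lists copp_management
!
ip access-list extended copp_management
 permit ip 10.0.0.0 0.255.255.255 any

The two show commands tell you whether CoPP is classifying the P2P traffic and which ACL entries are taking hits; the permit simply adds the 10.x space to whatever the management class already allows.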
Sheer luck, because if I hadn't addressed my test between the 6004 and the 4948 strictly with 192.168 space, I wouldn't have gone down this path. Good to know it was an isolated incident and not a bug between NX-OS and IOS, code, etc.
Thanks for the assistance anyways!
10-31-2015 05:29 AM
Michael
Thanks for posting back with the solution
Luck maybe, but a good bit of investigative work as well :-)
Glad you got it sorted.
Jon