Troubleshooting FCIP links with non-stop retransmit failure errors
12-18-2023 04:00 PM
I have two MDS9220I switches and recently added a third MDS9220I at another site. While configuring a pair of new FCIP links to the new switch, I am running into a strange situation.
When I initially create and bring the fcip interfaces up with just VSAN 1 trunking, the links come up fine and device-aliases distribute okay. As soon as I add the replication VSAN (we'll call it 100) to the trunk list (the exact change is sketched below, after the log excerpt), the new MDS switch side starts killing the link, complaining:
%PORT-5-IF_TRUNK_UP: %$VSAN 1%$ Interface fcip1, vsan 1 is up
%PORT-5-IF_TRUNK_UP: %$VSAN 100%$ Interface fcip1, vsan 100 is up
%PORT-5-IF_TRUNK_DOWN: %$VSAN 1%$ Interface fcip1, vsan 1 is down (TCP max retransmission reached)
%PORT-5-IF_TRUNK_DOWN: %$VSAN 100%$ Interface fcip1, vsan 100 is down (TCP max retransmission reached)
%PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 1%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
%PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 1%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
The remote side is similar, except the message is (TCP conn. closed by peer).
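For clarity, the change that triggers the drops is nothing more than the usual trunk VSAN addition on the fcip interface (switch name and VSAN sanitized, roughly):
switch(config)# interface fcip1
switch(config-if)# switchport trunk allowed vsan 1        <<< links are stable like this
switch(config-if)# switchport trunk allowed vsan add 100  <<< adding the replication VSAN starts the drops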
The only thing different from my other FCIP links is that I unfortunately need to route traffic for the switch1 IPStorage1/1 and switch2 IPStorage1/1 interfaces. To do so, I've set up static routes for the interfaces (a sketch of the FCIP config that sits on top of these routes follows below):
E.g.:
switch1: default gateway 10.0.0.1 (/24), IPStorage1/1 on 192.168.100.10/24
ip route 192.168.200.0 255.255.255.0 192.168.100.1 interface IPStorage1/1
switch2: default gateway 10.1.0.1 (/24), IPStorage1/1 on 192.168.200.10/24
ip route 192.168.100.0 255.255.255.0 192.168.200.1 interface IPStorage1/1
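For completeness, the FCIP pieces on top of those routes are just the standard profile + interface pairing. A simplified sketch of one of the two links (profile and interface numbers are examples):
switch1(config)# fcip profile 1
switch1(config-profile)# ip address 192.168.100.10
switch1(config)# interface fcip1
switch1(config-if)# use-profile 1
switch1(config-if)# peer-info ipaddr 192.168.200.10
switch1(config-if)# switchport trunk allowed vsan 1
switch1(config-if)# no shutdown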
I'm looking for any other basic troubleshooting suggestions before I go to support. To the best of my knowledge the networks are up and fully functional (the networking team assures me they are up and not firewalled), so I'm assuming I'm missing some part of the fcip config.
09-30-2024 10:17 AM - edited 09-30-2024 10:19 AM
Follow-up note for anyone else who sees this kind of behavior -- our problem was that we were using a 9000-byte MTU and the network was fragmenting our packets between the A-side and B-side sites. FCIP links are very sensitive to fragmentation and will drop and re-establish frequently as a result.
We had been assured by our networking team that jumbo frames were fully supported, but we were able to prove the issue to them by doing the following:
1. Setting the MTUs on both sides back down to 1500 alleviated the issue.
2. Restoring the MTU to 9000 and using the extended ping feature with the DF (Don't Fragment) bit set demonstrated the fragmentation directly:
switcha# ping
Target IP address: 192.168.200.10
Repeat count [5]:
Datagram size [100]: 8960 <<< Get close to 9000, there's a little overhead so don't set to 9000 exactly
Timeout in seconds [1]:
Extended commands [n]: y
Source address or interface: 192.168.100.10
Type of service [0]:
Set DF bit in IP header [n]: y <<< Don't Fragment
Data pattern [in hex (without leading 0x)]:
Sweep range of sizes [n]:
You'll get a pretty clear response showing fragmentation is occurring, which lets you go back to your networking team and ask them to either fix the path or revise their statement of jumbo frame support, confirming that jumbo frames are not actually supported end to end.
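If anyone needs the interim workaround, dropping the IPStorage interfaces back to a 1500 MTU on both ends (point 1 above) looks roughly like this; the syntax is from the IP services configuration guide as I remember it, so verify it against your NX-OS release:
switcha(config)# interface IPStorage1/1
switcha(config-if)# switchport mtu 1500    <<< repeat on the peer switch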
09-30-2024 12:59 PM
Besides the solution Nathan mentioned, you can perform a ping sweep to test a range of datagram sizes and see where the replies stop. This can confirm conclusively what the WAN MTU is set to.
Validate MTU Size:
switch# ping
Target IP address: 1.1.2.2
Repeat count [5]: 1
Datagram size [100]:
Timeout in seconds [1]:
Extended commands [n]: y
Source address or interface: IPStorage8/5
Type of service [0]:
Set DF bit in IP header [n]: y <<< set DF so oversized pings fail instead of quietly fragmenting
Data pattern [in hex (without leading 0x)]:
Sweep range of sizes [n]: y
Sweep min size [36]: 1392
Sweep max size [18024]: 3000
Sweep interval [1]: 10
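One note on reading the sweep: with the DF bit set as above, oversized pings fail rather than being fragmented in transit, and the datagram size prompt does not appear to include the roughly 28 bytes of IP + ICMP header (which is why Nathan used 8960 rather than 9000). So if replies stop after a datagram size of 1472, the path MTU is about 1472 + 28 = 1500 bytes; if something around 8972 still gets replies with DF set, a 9000-byte path really is clean end to end.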
I would use the following MDS commands to look at your FCIP retransmission rates:
switch# show ips stats tcp all
switch# show logging onboard error-stats
In the error-stats output, look for the TCP_RETRANS_RATE lines. Cisco supports up to a 0.05% retransmission rate on FCIP links; anything above that will cause link stability issues. The nice thing about show logging onboard error-stats is that it reports the retransmission rate at a fixed sampling interval (60 seconds for the FCIP counters in the sample below). This lets the MDS SAN administrator go back to the network administrators with specific timestamps of when the problem is occurring.
---------------------------------
Module: 1 error-stats
---------------------------------
Notes:
- Sampling period is 20 seconds
----------------------------------------------------------------------------------------------------------------
ERROR STATISTICS INFORMATION FOR DEVICE: FCIP MAC
----------------------------------------------------------------------------------------------------------------
Notes:
- Sampling period is 60 seconds
- TCP_RETRANS_RATE_XCD_THRESH logged only when Delta retransmit rate > retxmt threshold
- TCP_SRTT_XCD_CONF_RTT logged only when SRTT shown in count column > 1.3 * configured RTT
----------------------------------------------------------------------------------------------------------------
Interface<tcp_con> |Error Stat Counter Name |Delta |Delta Count |Time Stamp
| |ReTrans% |or |MM/DD/YY HH:MM:SS
| |or SRTT% |SRTT us |
----------------------------------------------------------------------------------------------------------------
fcip1<0> |TCP_RETRANS_RATE_XCD_THRESH |0.410 |10793 |09/06/24 23:34:50
fcip1<1> |TCP_RETRANS_RATE_XCD_THRESH |0.442 |11571 |09/06/24 23:34:50
fcip1<2> |TCP_RETRANS_RATE_XCD_THRESH |0.446 |11153 |09/06/24 23:34:50
fcip1<3> |TCP_RETRANS_RATE_XCD_THRESH |0.441 |11070 |09/06/24 23:34:50
fcip1<4> |TCP_RETRANS_RATE_XCD_THRESH |3.333 |1 |09/06/24 23:34:50
fcip1<0> |TCP_RETRANS_RATE_XCD_THRESH |1.195 |29919 |09/06/24 23:33:50
fcip1<1> |TCP_RETRANS_RATE_XCD_THRESH |1.212 |30292 |09/06/24 23:33:50
fcip1<2> |TCP_RETRANS_RATE_XCD_THRESH |1.450 |34713 |09/06/24 23:33:50
fcip1<3> |TCP_RETRANS_RATE_XCD_THRESH |1.306 |31285 |09/06/24 23:33:50
fcip1<4> |TCP_RETRANS_RATE_XCD_THRESH |6.667 |2 |09/06/24 23:33:50
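Reading the sample above: each fcip1<n> row is one TCP connection inside the tunnel, and the Delta ReTrans% column is the retransmit rate for that sample. At 23:34:50, connection 0 retransmitted 10793 segments for a 0.410% rate, roughly eight times the 0.05% guideline, and the 23:33:50 samples were over 1%, which is more than enough to explain the link drops.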
