PFC triggered before ECN

daewookhwang · 05-22-2024

Hello all,

I am using four ConnectX-6 Dx (CX6DX) NICs and one Cisco N5624Q switch (with two modules) to build a RoCEv2 cluster. It is a dumbbell topology in which each switch module is connected to two CX6DX nodes, and all links are 40 Gbps:

(2 nodes w/ CX6DX) ==== (N5624Q module 1) ---- (N5624Q module 2) ==== (2 nodes w/ CX6DX)

I am running DCQCN on the NICs. The problem is that, in most cases, congestion only triggers PFC on the interfaces connected to the NICs; the switch neither marks ECN nor triggers PFC on the link between the switch modules (which is the bottleneck link). For example:

# show interface priority-flow-control
===================================================================
Port            Mode    Oper(VL bmap)    RxPPP    TxPPP
===================================================================

Ethernet1/5     On      On (ff)          0        204      -> switch module 0 to node0
Ethernet1/7     On      On (ff)          0        114      -> switch module 0 to node1
Ethernet1/11    On      On (ff)          0        0        -> switch module 0 to 1
Ethernet2/1     On      On (ff)          0        0        -> switch module 1 to 0
Ethernet2/5     On      On (ff)          0        430      -> switch module 1 to node2
Ethernet2/7     On      On (ff)          0        28       -> switch module 1 to node3
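
For anyone who wants to compare counters across runs, here is a minimal Python sketch that tallies the output above. The row layout encoded in the regular expression (port, mode, oper state, VL bitmap, RxPPP, TxPPP) is an assumption read off this one sample, and the helper names are made up for illustration; other NX-OS releases may print the table differently.

```python
import re

# Sample rows copied from the `show interface priority-flow-control` output above.
SAMPLE = """\
Ethernet1/5     On      On (ff)          0        204
Ethernet1/7     On      On (ff)          0        114
Ethernet1/11    On      On (ff)          0        0
Ethernet2/1     On      On (ff)          0        0
Ethernet2/5     On      On (ff)          0        430
Ethernet2/7     On      On (ff)          0        28
"""

# Assumed row layout: port, mode, oper state, (VL bitmap), RxPPP, TxPPP.
# Anything after the TxPPP column (e.g. a "->" annotation) is ignored.
ROW = re.compile(r"^(Ethernet\S+)\s+\S+\s+\S+\s+\((\w+)\)\s+(\d+)\s+(\d+)")


def parse_pfc_counters(text):
    """Return {port: (rx_ppp, tx_ppp)} for every line that matches ROW."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counters[match.group(1)] = (int(match.group(3)), int(match.group(4)))
    return counters


def ports_sending_pause(counters):
    """Ports with a non-zero TxPPP counter, i.e. the switch sent pause frames there."""
    return sorted(port for port, (_, tx_ppp) in counters.items() if tx_ppp > 0)


if __name__ == "__main__":
    counters = parse_pfc_counters(SAMPLE)
    for port in ports_sending_pause(counters):
        print(port, counters[port][1])
```

On this sample, only the four NIC-facing ports are reported; the two ports of the inter-module link (Ethernet1/11 and Ethernet2/1) have zero counters in both directions.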

ECN is not marked in this case. As far as I understand, in a normal congestion control scenario ECN marking should kick in before PFC whenever there is congestion, and PFC acts as a temporary backstop when ECN-based control reacts too late to contain the congestion. Our cluster behaves differently: most of the time PFC is triggered, and ECN is rarely marked.

We did set the ECN thresholds low (min 1000, max 3000 for 40G) to make marking aggressive. Below is our switch configuration from `show running-config`:

ip domain-lookup
no system default switchport
logging event link-status default
service unsupported-transceiver
class-map type qos match-any CNP
  match dscp 0, 48
class-map type qos match-any RoCE_qos_class
  match dscp 24, 26
class-map type queuing CNP_queuing_class
  match qos-group 5
class-map type queuing RoCE_queuing_class
  match qos-group 3
policy-map type qos RoCE_qos_policy
  class RoCE_qos_class
    set qos-group 3
  class CNP
    set qos-group 5
policy-map type queuing RoCE_queuing_policy
  class type queuing RoCE_queuing_class
    bandwidth percent 100
  class type queuing CNP_queuing_class
    priority
  class type queuing class-default
    bandwidth percent 0
class-map type network-qos CNP_network_class
  match qos-group 5
class-map type network-qos RoCE_network_class
  match qos-group 3
policy-map type network-qos RoCE_network_policy
  class type network-qos RoCE_network_class
    set cos 3
    pause no-drop
  class type network-qos CNP_network_class
    set cos 6
  class type network-qos class-default
system qos
  service-policy type qos input RoCE_qos_policy
  service-policy type queuing output RoCE_queuing_policy
  service-policy type network-qos RoCE_network_policy
hardware unicast voq-limit
hardware profile tcam feature interface-qos limit 100
hardware random-detect min-thresh 10g 64000 40g 1000 max-thresh 10g 128000 40g 3000 ecn qos-group 3
hardware pq-drain 10g 9900 40g 39900

That is, we use RoCEv2 to transmit data, prioritize CNPs above the data traffic, and set ECN marking thresholds only for qos-group 3 (RoCE_network_class).
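
To make the intent of the `hardware random-detect` thresholds concrete, here is a minimal Python sketch of a generic WRED/ECN marking curve. It assumes a linear ramp between the two thresholds and a hypothetical `max_prob` parameter. The actual marking function of the N5624Q and the units of the threshold values are not stated in the configuration, so this only illustrates the expected behavior and is not a model of the switch.

```python
def ecn_mark_probability(queue_depth, min_thresh, max_thresh, max_prob=1.0):
    """Generic WRED-style ECN marking probability for a given queue depth.

    Assumed behavior (not taken from Cisco documentation):
      - below min_thresh: no packet is marked
      - at or above max_thresh: every packet is marked
      - in between: probability rises linearly from 0 up to max_prob
    queue_depth and the thresholds must be in the same (unspecified) unit.
    """
    if min_thresh >= max_thresh:
        raise ValueError("min_thresh must be smaller than max_thresh")
    if queue_depth < min_thresh:
        return 0.0
    if queue_depth >= max_thresh:
        return 1.0
    return max_prob * (queue_depth - min_thresh) / (max_thresh - min_thresh)


if __name__ == "__main__":
    # 40G thresholds from the configuration above: min 1000, max 3000.
    for depth in (500, 1000, 2000, 3000, 10000):
        print(depth, ecn_mark_probability(depth, 1000, 3000))
```

Under these assumptions, any queue for qos-group 3 that sits above the max threshold on a 40G port should have every packet marked, which is why the near-total absence of ECN marks is surprising.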

After several experiments, we found that ECN is marked whenever PFC is triggered on the link between the two switch modules (more precisely, the link that connects the two modules inside the one N5624Q). With `show hardware profile buffer monitor`, I saw the unicast egress buffer filling up to 50kb (5-second average). What might be causing this, and how can I fix it? Could it be related to the shared-buffer or VOQ settings?

+ Disabling PFC (with `no pause no-drop`) does produce plenty of ECN-marked packets, so it seems that enabling PFC hinders ECN marking.

Regards,

Daewook