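Background:
On Cisco ACI and Nexus series switches, the default link debounce timer on an interface is 100 ms. The debounce timer exists to suppress brief link flaps so that they do not disrupt traffic, for example when the peer changes its interface configuration or runs auto-tuning.
When the local port stops detecting a signal from the peer, it starts the debounce timer. If the signal returns within 100 ms, the link is treated as healthy; only if there is still no peer signal when the 100 ms expire is the local port brought down.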
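The decision rule described above can be sketched in a few lines. This is only an illustrative model, not switch code: the function name `link_goes_down` is hypothetical, and the behavior at exactly the timer value is an assumption, since the text only covers "within" and "still missing after" the timer.

```python
def link_goes_down(signal_loss_ms: float, debounce_ms: float = 100.0) -> bool:
    """Model of the debounce rule: the port goes down only if the peer
    signal is still missing when the debounce timer expires.

    Assumption: a loss lasting exactly debounce_ms counts as expired.
    """
    return signal_loss_ms >= debounce_ms


# A 30 ms glitch is absorbed; a 250 ms loss brings the port down
# with the default timer but is absorbed with a 1000 ms timer.
for loss_ms, timer_ms in ((30, 100), (250, 100), (250, 1000)):
    print(f"loss={loss_ms} ms timer={timer_ms} ms -> down={link_goes_down(loss_ms, timer_ms)}")
```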
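Test:
In some scenarios a customer may need to tune the debounce timer for a specific purpose. One example is interconnecting Cisco and non-Cisco devices: after an HA failover, the peer flaps the link at the ASIC level to refresh ARP and similar state. If that flap lasts longer than 100 ms, the Cisco side detects it and raises a real link-down event.
If the peer cannot be optimized, adjusting the debounce timer on the Cisco side is worth trying.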
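The original note does not show how the timer was changed. As a rough sketch, on ACI the debounce interval is normally carried by a Link Level interface policy. The snippet below only builds a JSON body for such a policy and sends nothing; the class name `fabricHIfPol`, the attribute `linkDebounce`, the 0 to 5000 ms bound, and the policy name `debounce-1000ms` are all assumptions that should be checked against the APIC object model for the running release.

```python
import json


def link_level_policy_payload(name: str, debounce_ms: int) -> dict:
    """Build a JSON body for a Link Level policy (assumed class: fabricHIfPol)."""
    # Assumed valid range; 5000 ms is the largest value exercised in the tests.
    if not 0 <= debounce_ms <= 5000:
        raise ValueError("debounce_ms is assumed to be limited to 0..5000")
    return {
        "fabricHIfPol": {
            "attributes": {
                "name": name,                      # hypothetical policy name
                "linkDebounce": str(debounce_ms),  # assumed attribute name
            }
        }
    }


payload = link_level_policy_payload("debounce-1000ms", 1000)
print(json.dumps(payload, indent=2))
```

The policy would still have to be attached to the relevant interface policy group before it affects a port.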
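NTP time synchronization is only accurate to within a few hundred milliseconds, so the test results below are for reference only: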
admin-F6_F# show lldp nei | grep 1/22
f6leaf101.aci.pub Eth3/22 120 BR Eth1/22
f6leaf102.aci.pub Eth4/22 120 BR Eth1/22
f6leaf101# show system internal ethpm info interface e1/22 | grep -i bounce
flowcontrol rx(off) tx(off), link debounce(100),    <- default debounce timer: 100 ms
Speed(0x9e), Duplex(0x1), Flowctrl(r:0x3,t:0x3), LinkDebounce(0x1)
admin-F6_F(config)# int e3/22
admin-F6_F(config-if)# show clock ; shut
Time source is NTP
10:54:40.997 GMT Mon Mar 06 2023
admin-F6_F(config-if)# no sh
f6leaf101# eventlog interface e1/22
ID            Timestamp                      Trigger  Affected Object          Description                        Cause
------------  -----------------------------  -------  -----------------------  ---------------------------------  ----------
433792231805  2023-03-06T10:54:41.460+08:00  oper     sys/phys-[eth1/22]/phys  Port is down. Reason: notconne...  port-down
==================================================================================
f6leaf101# show system internal ethpm info interface e1/22 | grep -i bounce
flowcontrol rx(off) tx(off), link debounce(1000),    <- debounce timer changed to 1000 ms
Speed(0x9e), Duplex(0x1), Flowctrl(r:0x3,t:0x3), LinkDebounce(0x1)
admin-F6_F(config-if)# show clock ; shut
Time source is NTP
11:06:13.995 GMT Mon Mar 06 2023
admin-F6_F(config-if)# no shut
f6leaf101# eventlog interface e1/22
ID            Timestamp                      Trigger  Affected Object          Description                        Cause
------------  -----------------------------  -------  -----------------------  ---------------------------------  ----------
433792231930  2023-03-06T11:05:37.312+08:00  auto     sys/phys-[eth1/22]       PhysIf eth1/22 modified            transition
433792231939  2023-03-06T11:06:15.361+08:00  oper     sys/phys-[eth1/22]/phys  Port is down. Reason: notconne...  port-down
==================================================================================
f6leaf101# show system internal ethpm info interface e1/22 | grep -i bounce
flowcontrol rx(off) tx(off), link debounce(5000),    <- debounce timer changed to 5000 ms
Speed(0x9e), Duplex(0x1), Flowctrl(r:0x3,t:0x3), LinkDebounce(0x1)
admin-F6_F(config-if)# show clock ; shut
Time source is NTP
11:07:28.463 GMT Mon Mar 06 2023
admin-F6_F(config-if)# no shut
f6leaf101# eventlog interface e1/22
ID            Timestamp                      Trigger  Affected Object          Description                        Cause
------------  -----------------------------  -------  -----------------------  ---------------------------------  ----------
433792231930  2023-03-06T11:05:37.312+08:00  auto     sys/phys-[eth1/22]       PhysIf eth1/22 modified            transition
433792231964  2023-03-06T11:07:33.868+08:00  oper     sys/phys-[eth1/22]/phys  Port is down. Reason: notconne...  port-down
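To make the three runs easier to compare, the short script below computes the gap between the peer's `show clock` reading (taken just before `shut`) and the leaf's port-down event. It assumes both devices show the same wall clock, even though one output is labeled GMT and the other +08:00, and it ignores CLI execution latency. With NTP accuracy in the hundreds of milliseconds, the "overhead" column is indicative at best.

```python
from datetime import datetime

# (configured debounce in ms, peer "show clock" time, leaf port-down event time)
RUNS = [
    (100, "10:54:40.997", "10:54:41.460"),
    (1000, "11:06:13.995", "11:06:15.361"),
    (5000, "11:07:28.463", "11:07:33.868"),
]


def delay_ms(shut: str, down: str) -> int:
    """Milliseconds between the shut timestamp and the port-down event."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(down, fmt) - datetime.strptime(shut, fmt)
    return round(delta.total_seconds() * 1000)


for debounce, shut, down in RUNS:
    observed = delay_ms(shut, down)
    print(f"debounce={debounce:>4} ms  observed={observed:>4} ms  overhead={observed - debounce} ms")
# debounce= 100 ms  observed= 463 ms  overhead=363 ms
# debounce=1000 ms  observed=1366 ms  overhead=366 ms
# debounce=5000 ms  observed=5405 ms  overhead=405 ms
```

In all three runs the port-down event trails the shut by roughly the configured timer plus about 0.4 s, which is consistent with the timer taking effect.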
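Considerations:
Raising the debounce timer introduces a new problem: traffic blackholing.
Suppose the debounce timer is set to 1000 ms and the peer really does go down: the Cisco device then needs 1000 ms to detect the failure. During that 1 second, traffic sent toward the peer is blackholed, which is long enough to affect latency-sensitive services, so make this change with caution.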
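A back-of-the-envelope estimate shows the scale of the exposure. The sketch below multiplies an assumed steady traffic rate by the debounce window; the 1 Gbit/s rate and the function name `blackholed_bytes` are hypothetical, and the result covers only the debounce window itself, not any additional detection or convergence delay.

```python
def blackholed_bytes(rate_bps: float, debounce_ms: float) -> float:
    """Bytes sent into a dead link while the debounce timer is still running."""
    return rate_bps / 8 * debounce_ms / 1000


# Hypothetical steady load of 1 Gbit/s on the affected port.
for timer_ms in (100, 1000, 5000):
    lost_mb = blackholed_bytes(1e9, timer_ms) / 1e6
    print(f"debounce={timer_ms} ms -> about {lost_mb:.1f} MB blackholed")
```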