03-28-2024 04:50 PM
Hello everyone!
I'm not a network engineer, but as a subsystem (kernel + filesystem) engineer I understand basic networking concepts.
A few months ago I designed a DMZ and configured my two-node vPC with six rack switches.
Everything was working smoothly, but a month ago I started to see the following problem and my life turned into hell:
2024 Mar 28 22:48:09 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group
buffer 90 percent threshold is exceeded!
2024 Mar 28 22:50:10 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group
buffer 90 percent threshold is exceeded! (message repeated 1 time)
2024 Mar 28 22:52:10 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group
buffer 90 percent threshold is exceeded! (message repeated 1 time)
I was running "nxos.9.3.9". I found a bug report whose fix was to upgrade, so I upgraded to "nxos.9.3.13", but that did not solve my problem.
I don't know what the issue is, and I can't dig any deeper because I don't know how to diagnose it.
All I know is that whenever I get the buffer error, packets are dropped.
VPC-SW-2# show interface counters errors non-zero
--------------------------------------------------------------------------------
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
--------------------------------------------------------------------------------
Eth1/1 0 0 0 0 0 369651
Eth1/8 0 0 0 0 0 1968446
Eth1/9 0 0 0 0 0 124332
Eth1/17 0 0 0 0 0 101073
Eth1/18 0 0 0 0 0 102809
Eth1/19 0 0 0 0 0 100208
Eth1/20 0 0 0 0 0 102725
Eth1/21 0 0 0 0 0 102590
Eth1/25 0 0 0 0 0 48752
Eth1/26 0 0 0 0 0 102281
Eth1/27 0 0 0 0 0 70208
Eth1/28 0 0 0 0 0 102652
Eth1/34 0 0 0 0 0 102646
Eth1/35 0 0 0 0 0 102849
Eth1/36 0 0 0 0 0 102430
Eth1/42 0 0 0 0 0 1968435
Eth1/43 0 0 0 0 0 1966384
Eth1/44 0 0 0 0 0 1968448
Eth1/46 0 0 0 0 0 32722
Eth1/47 0 0 0 0 0 45342
Eth1/48 0 0 0 0 0 24724
Eth1/49 0 0 0 0 0 102454
Eth1/50 0 0 0 0 0 99501
Eth1/51 0 0 0 0 0 100564
Eth1/52 0 0 0 0 0 102824
Eth1/53 0 0 0 0 0 102935
Eth1/54 0 0 0 0 0 103074
Po8 0 0 0 0 0 1968446
Po9 0 0 0 0 0 124332
Po17 0 0 0 0 0 101073
Po18 0 0 0 0 0 102809
Po19 0 0 0 0 0 100208
Po20 0 0 0 0 0 102725
Po21 0 0 0 0 0 102590
Po25 0 0 0 0 0 48752
Po26 0 0 0 0 0 102281
Po27 0 0 0 0 0 70208
Po28 0 0 0 0 0 102652
Po34 0 0 0 0 0 102646
Po35 0 0 0 0 0 102849
Po36 0 0 0 0 0 102430
Po42 0 0 0 0 0 1968435
Po43 0 0 0 0 0 1966384
Po44 0 0 0 0 0 1968448
Po49 0 0 0 0 0 102454
Po50 0 0 0 0 0 99501
Po51 0 0 0 0 0 100564
Po52 0 0 0 0 0 102824
Po53 0 0 0 0 0 102935
Po54 0 0 0 0 0 103074
Po100 0 0 0 0 0 102788
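To make a table like the one above easier to read, the rows can be sorted by OutDiscards with a few lines of Python. This is only a rough sketch: the names `parse_counters` and `top_discards` are made up for illustration, the sample contains just a handful of rows copied from the output above, and the parser assumes the seven-column layout shown there (port name followed by six counters).

```python
from typing import List, NamedTuple


class PortErrors(NamedTuple):
    port: str
    out_discards: int


def parse_counters(text: str) -> List[PortErrors]:
    """Parse rows such as 'Eth1/8 0 0 0 0 0 1968446' into (port, OutDiscards)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row is a port name followed by six integer counters;
        # the header, separator and prompt lines do not match and are skipped.
        if len(fields) != 7 or not all(f.isdigit() for f in fields[1:]):
            continue
        rows.append(PortErrors(fields[0], int(fields[6])))
    return rows


def top_discards(rows: List[PortErrors], n: int = 5) -> List[PortErrors]:
    """Return the n ports with the highest OutDiscards counter."""
    return sorted(rows, key=lambda r: r.out_discards, reverse=True)[:n]


# A few rows copied from the output above (not the full table).
SAMPLE = """\
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
Eth1/1 0 0 0 0 0 369651
Eth1/8 0 0 0 0 0 1968446
Eth1/9 0 0 0 0 0 124332
Eth1/44 0 0 0 0 0 1968448
Po8 0 0 0 0 0 1968446
"""

if __name__ == "__main__":
    for row in top_discards(parse_counters(SAMPLE), n=3):
        print(f"{row.port:<8} {row.out_discards}")
```

In the table above, each port-channel row carries the same counter as one Eth port with the matching number (Po8 and Eth1/8, for example), so when ranking ports it may be clearer to drop the `Po` rows first.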
What changed? My best guess is that the cabling went wrong over time.
I have some IPMI switches; I have shut their ports for now while I hunt for the root cause.
My switches are:
VPC-SW-1 : C93180YC-FX3 [ BIOS: version 01.09 | NXOS: version 9.3(13) ]
VPC-SW-2 : C93180YC-FX3 [ BIOS: version 01.09 | NXOS: version 9.3(13) ]
datasw-aa-03: C93180YC-FX [ BIOS: version 05.51 | NXOS: version 9.3(13) ]
datasw-aa-04: C93180YC-FX [ BIOS: version 05.51 | NXOS: version 9.3(13) ]
datasw-aa-06: C93180YC-FX [ BIOS: version 05.51 | NXOS: version 9.3(13) ]
datasw-aa-08: C93180YC-FX3 [ BIOS: version 01.09 | NXOS: version 9.3(13) ]
datasw-aa-10: C92160YC-X [ BIOS: version 07.41 | NXOS: version 7.0(3)I3(1) ]
datasw-aa-11: C92160YC-X [ BIOS: version 07.41 | NXOS: version 7.0(3)I3(1) ]
VPC-SW-1# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 28 22:48:09 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group
buffer 90 percent threshold is exceeded!
2024 Mar 28 22:50:10 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group
buffer 90 percent threshold is exceeded! (message repeated 1 time)
2024 Mar 28 22:52:10 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group
buffer 90 percent threshold is exceeded! (message repeated 1 time)
---------------------------------------------------------------------------------------
VPC-SW-2# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 28 22:48:18 NILE2 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group
buffer 90 percent threshold is exceeded!
2024 Mar 28 22:49:03 NILE2 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group
buffer 90 percent threshold is exceeded! (message repeated 2 times)
2024 Mar 28 22:51:58 NILE2 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group
buffer 90 percent threshold is exceeded! (message repeated 3 times)
---------------------------------------------------------------------------------------
datasw-aa-03# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 21 18:30:34 datasw-aa-03 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded!
2024 Mar 27 02:09:17 datasw-aa-03 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded!
2024 Mar 27 02:11:18 datasw-aa-03 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded!
---------------------------------------------------------------------------------------
datasw-aa-04# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 27 00:56:36 datasw-aa-04 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
2024 Mar 28 22:49:58 datasw-aa-04 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded!
2024 Mar 28 22:51:58 datasw-aa-04 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
---------------------------------------------------------------------------------------
datasw-aa-06# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 21 20:53:36 datasw-aa-06 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded!
2024 Mar 21 22:13:36 datasw-aa-06 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded!
2024 Mar 21 22:15:36 datasw-aa-06 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded!
---------------------------------------------------------------------------------------
datasw-aa-08# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 27 16:44:26 datasw-aa-08 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
2024 Mar 27 16:46:26 datasw-aa-08 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
2024 Mar 27 16:55:56 datasw-aa-08 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded! (message repeated 1 time)
---------------------------------------------------------------------------------------
datasw-aa-10# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
datasw-aa-10#
---------------------------------------------------------------------------------------
datasw-aa-11# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
datasw-aa-11#
Interestingly, the only switches where I do not see this issue are datasw-aa-10 and datasw-aa-11, both "C92160YC-X [ BIOS: version 07.41 | NXOS: version 7.0(3)I3(1) ]".
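For comparing switches, the log excerpts above could be collected into one text file and the events counted per host. The following is a sketch under assumptions: `count_events` is a hypothetical helper, the regular expression is modelled only on the message format seen in this post, and it counts logged lines, not the "(message repeated N times)" suffix, which sits on the wrapped continuation line.

```python
import re
from collections import Counter

# Matches the first line of each event as it appears in the logs above, e.g.
# "2024 Mar 28 22:48:09 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: ..."
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) %TAHUSD-SLOT\d+-\d-BUFFER_THRESHOLD_EXCEEDED"
)


def count_events(log_text: str) -> Counter:
    """Count BUFFER_THRESHOLD_EXCEEDED log lines per reporting host."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            counts[match.group("host")] += 1
    return counts


# Lines copied from the output above; wrapped continuation lines are ignored.
SAMPLE = """\
VPC-SW-1# sh logg |include BUFFER_THRESHOLD_EXCEEDED | last 3
2024 Mar 28 22:48:09 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group
buffer 90 percent threshold is exceeded!
2024 Mar 28 22:50:10 NILE1 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool-group
buffer 90 percent threshold is exceeded! (message repeated 1 time)
2024 Mar 21 18:30:34 datasw-aa-03 %TAHUSD-SLOT1-4-BUFFER_THRESHOLD_EXCEEDED: Module 1 Instance 0 Pool
-group buffer 90 percent threshold is exceeded!
"""

if __name__ == "__main__":
    for host, n in count_events(SAMPLE).most_common():
        print(f"{host}: {n}")
```

Note that the host field in the log is the name the switch reports (NILE1 and NILE2 in the VPC-SW-1 and VPC-SW-2 output above), not necessarily the name in the prompt.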
Dear experienced network engineers...
Before I found the command "show interface counters errors non-zero", I was struggling along with "sh int | include discard".
As you can see, I don't know how to check logs, monitor ports, and so on.
Please help me to find the root cause. What should I do?