Root cause of this log on Nexus 3064-PQ

Hello All,

See the log below, found on a Nexus 3064-PQ:

"%MTM-SLOT1-2-MTM_BUFFERS_FULL: MTM buffers are full for unit 0. MAC tables might be inconsistent. Pls use l2 consistency-check"

 

What are the consequences of this kind of log? Will the switch continue forwarding frames?

What can be the root cause of this kind of log, and how can the issue be solved?

Thanks for your replies.

4 Replies

Andrea Testino
Cisco Employee

Hi there,

 

Can you share the output of the following `show` commands (copy/paste exactly as shown)?

 

term width 511
term length 0

show version | i i image|chassis
show consistency-checker l2 module 1 
show mac address-table count
show logging log | i i mtm
show system internal mts buffers summary

Thank you,

- Andrea, CCIE #56739 R&S

Hi Andrea,

I have arrived here with the same problem as the now missing OP.

Here is the info from our boxes:

NXOS image file is: bootflash:///nxos.9.3.4.bin
cisco Nexus3000 C3064PQ Chassis

The consistency checker l2 bounces from failing to succeeding. When failing, it shows a few MACs in the "Extra and Discrepant entries in the HW MAC Table" section. I haven't seen more than 8 MACs when failing.

Total MAC Addresses in Use (DLAC + DRAC + SLAC + SRAC + SAC): 1815

I do see a ton of %MTM-SLOT1-2-MTM_BUFFERS_FULL messages in the logs, about twice an hour, each immediately followed by something to the tune of "last message repeated 52 times" over the following few minutes.

The `show system internal mts buffers summary` output yields different rows each time; this is the most complete capture I got:

* recv_q: not received yet (slow receiver)
* pers_q/npers_q/log_q: received not dropped (leak)
node  sapno  recv_q  pers_q  npers_q  log_q  app/sap_description
sup   284    0       3       0        0      netstack/TCPUDP process client MTS queue
sup   279    2       0       0        0      arp/ARP IPC SAP
sup   4336   1       0       0        0      arp/ARP MTS DATA SAP

We do run spanning tree and try to keep it to best practices, setting the priority of the bridges and such.

I wonder if our use of SLB with no LACP can contribute to this issue.

I found almost no trace of MTM buffers in the documentation. Is this something that can be tuned to make the problem go away, or is life harder than we'd like, as usual?

Thanks for your interest, and thanks in advance for any hint or pointer!

Hi Alfredo,

 

1. Have you noticed high CPU issues around the same time as the MTM logs appear? (would require some EEM or other script to capture when the log prints)

`show processes cpu history`
`show processes cpu sort | ex 0.0`

2. Is STP stable in the environment? `show spanning-tree detail | i i ieee|occur|Exec|from` -- You'd be looking for a high number of topology changes as well as a short time since the last change.

3. During the issue, have you noticed whether it's the same 8 MACs becoming inconsistent?

4. Have you tried doing a `clear mac address-table dynamic address xxxx.xxxx.xxxx` for the 8 MACs in question (this assumes it's always the same ones) and seeing whether the syslogs persist?

5. Any MAC move notifications around the MTM syslogs? `show logging log | i i l2fm|bcm` (you may not have this enabled, so I'd configure `logging level l2fm 5`).

Side note: it may be worth implementing `logging level spanning-tree 6` to see any root changes, etc., that you may not be catching.
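For point 1, one way to capture CPU state when the log prints is an EEM applet triggered on the syslog pattern. This is only a sketch: the applet name and output filename are made up, and the exact `event syslog` / `action cli` syntax can vary by NX-OS release, so check it against your platform's EEM documentation first.

```
event manager applet CAPTURE_MTM_CPU
  event syslog pattern "MTM_BUFFERS_FULL"
  action 1.0 cli show clock >> bootflash:mtm_cpu.txt
  action 2.0 cli show processes cpu sort >> bootflash:mtm_cpu.txt
```

That way each occurrence of the syslog appends a timestamped CPU snapshot to a file you can review later, instead of having to catch the event live.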

- Andrea, CCIE #56739 R&S

Hi again Andrea,

1. Regarding CPU usage, I think it's a bit on the high side. SNMP reports 25% to 40% around the clock, and `show processes cpu history` says about the same, but adds that maximums often peak at 100%; for example:

    888787897887878988889989888988788898978997888978777777787977989898878877
    719634718829382950240331380523693350591669228290576448647318406003781497
100                *           *      * *  **                     *
 90 * *   ** *     **   ** * * *   *  * *  **   **           *  * * * *
 80 ***** *********************************************  ***** *************
 70 ************************************************************************
 60 ************************************************************************
 50 ************************************************************************
 40 *************##################################*************#####****##*
 30 #**##*##################################################################
 20 ########################################################################
 10 ########################################################################
    0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
              0    5    0    5    0    5    0    5    0    5    0    5    0

                   CPU% per hour (last 72 hours)
                  * = maximum CPU%   # = average CPU%

Running `show processes cpu sort | ex 0.0` a few times often shows processes using 5% to 15% CPU, and I haven't seen a single one hogging it (I may need to sample a few hundred times or something, but I think if a process were hogging the CPU it would pop up easily). These are the ones with the longest overall runtime:

PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
16358    130805080  119318887      0  15.60%  t2usd
13563    258142230  68031115      0   0.00%  pfmclnt
15835     91782830  572272319      0   0.00%  stp
16678     87001080  191550083      0   0.00%  stats_client
16344     86520940  30837487      0   0.00%  crdclient
13795     85112860  55659382      0   0.00%  sysinfo
15871     84038390  481804420      0   3.09%  hsrp_engine

2. Spanning tree is pretty stable. `show spanning-tree detail | i i ieee|occur|Exec|from` shows no change at all in almost 2 days, with most VLANs having their last change months ago. The largest number of changes is 1292, for an especially busy VLAN; on a box with an uptime of 132 days that may sound like a lot at first, but most of those changes date back to a setup a few months ago. So I would say spanning tree is probably not the root cause, or even a contributor.

3. Which MACs are inconsistent? Different every time. Not a single one to nuke and see if it was doing anything weird, unfortunately.

4. Re clearing the offending MACs: there is a script that runs `clear mac address-table dynamic address <MAC>` every 5 minutes, whenever it finds an inconsistency reported by `show consistency-checker l2 module 1`. It's an ugly hack, but at least it seems to contain the issue a little.
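For anyone curious what that workaround looks like, here is a minimal sketch of the parsing half of such a script. It only assumes that failing consistency-checker output contains the "Extra and Discrepant entries in the HW MAC Table" section header quoted earlier and NX-OS-style `xxxx.xxxx.xxxx` MAC addresses below it; the sample output and the actual line layout are fabricated for illustration, and sending the resulting commands to the switch (e.g. over SSH) is left out.

```python
import re

# Section header quoted earlier in the thread; the exact layout of
# `show consistency-checker l2 module 1` output is assumed here.
SECTION = "Extra and Discrepant entries in the HW MAC Table"
MAC_RE = re.compile(r"\b[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}\b", re.IGNORECASE)

def clear_commands(checker_output: str) -> list[str]:
    """Return the `clear mac address-table` commands for MACs listed after
    the discrepant-entries header (empty list if the check passed)."""
    idx = checker_output.find(SECTION)
    if idx == -1:
        return []
    # Collect unique MACs appearing after the header, in sorted order.
    macs = sorted(set(MAC_RE.findall(checker_output[idx:])))
    return [f"clear mac address-table dynamic address {m}" for m in macs]

# Fabricated example output for a failing check:
sample = """\
Consistency check: FAILED
Extra and Discrepant entries in the HW MAC Table
VLAN 10  0050.56ab.0001
VLAN 10  0050.56ab.0002
"""
for cmd in clear_commands(sample):
    print(cmd)
```

Running the commands every 5 minutes from cron (or an EEM timer) gives the containment behaviour described above.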

5. MAC move notifications: we do get some %L2FM-2-L2FM_MAC_FLAP_DISABLE_LEARN_N3K, but these don't seem to correlate with %MTM-SLOT1-2-MTM_BUFFERS_FULL.

 

Thanks for your help! I hope this also helps others, as there doesn't seem to be much online about this problem. We did find something interesting about Open vSwitch behaviour that may explain at least the MAC moves.
