"Datacenter troubleshooting guide" – a blog by Gilles Dufour.
Day 6 - Understanding me-stats
This week I was asked to give some information regarding the me-stats.
This is a large topic, but it is indeed an important part of the troubleshooting process.
I could simply send you the meaning of each counter, but I think that would confuse you or create some panic.
Instead, I'm going to focus on the most important counters, and I will give you the steps I use to identify them.
First, we need to review the design of the ACE module: to know which counters to look at and when, we need to understand the path a packet follows inside the blade.
As I mentioned in a previous blog, the module is divided into two parts: the Control Plane (CP) and the Data Plane (DP).
The DP is itself divided into several Micro Engines, or MEs.
Each ME has a specific function:
|RX||Receives all incoming traffic and buffers data|
|FastPath||Processes all incoming packets, tries to match them to existing connections, and directs traffic to the other MEs|
|ICM||Inbound Connection Manager: receives all packets not matching an existing connection and applies the configured action|
|OCM||Outbound Connection Manager: processes outgoing packets, applies outbound ACLs and NAT|
|TCP||For connections terminated by ACE, handles the TCP 3-way handshake, TCP options, ...|
|HTTP||All HTTP functions: matching cookies, URLs, ...|
|Reassembly||Processes fragmented IP packets.|
So, when troubleshooting your ACE module, checking its status, or looking at performance, you typically follow the MEs in the same order as a packet's path through the blade.
The first ME is RX, which buffers all incoming traffic.
switch/Admin# show np 1 me-stats "-srx -v"
Receive Statistics: (Current)
Idle: 37253614 100861
Frames Received: 69012481 73
Control Frames Received: 25552296 39
Forward Buffered: 69012481 73
Post stalls: 0 0
Packet drops: 0 0
Error(bad rbuf): 0 0
Error(rbuf parity): 0 0
Error(rbuf skip): 0 0
Error(missing eop): 0 0
Error(missing sop): 0 0
Error(data buf alloc fail): 0 0
Error(control buf alloc fail): 0 0
Last bad RBUF control word: 0 0
From this first ME we can collect some very important information, like the traffic rate for this IXP.
In this example we can see 73 packets/sec.
We can also see that all "Frames Received" were successfully buffered ("Forward Buffered").
When your box is under heavy load, you start seeing "Post stalls" and "Packet drops".
If you do get post stalls and packet drops, it means the other MEs can't keep up with the level of traffic.
You then need to move some rules from L7 to L4, or reduce the amount of traffic - by going active-active, for example.
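As an illustration, this RX sanity check is easy to automate. The sketch below is my own illustration, not an official tool: it assumes the two-column "counter: total rate" layout shown in the output above, and the counter names are taken from that sample.

```python
# Sketch: parse an RX me-stats dump and flag signs of overload.
# Assumes the "counter: total rate" layout shown in the sample output above.

def parse_me_stats(output):
    """Return {counter_name: (total, rate)} from a me-stats dump."""
    stats = {}
    for line in output.splitlines():
        if ":" not in line:
            continue
        name, _, values = line.partition(":")
        parts = values.split()
        # Keep only lines whose value part is exactly two integers.
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            stats[name.strip()] = (int(parts[0]), int(parts[1]))
    return stats

def rx_health(stats):
    """Warn when RX shows stalls or drops (other MEs can't keep up)."""
    warnings = []
    for counter in ("Post stalls", "Packet drops"):
        total, rate = stats.get(counter, (0, 0))
        if total > 0:
            warnings.append(f"{counter}: {total} (current rate {rate}/s)")
    return warnings

# Excerpt of the sample output from this post:
sample = """\
Frames Received:            69012481     73
Forward Buffered:           69012481     73
Post stalls:                       0      0
Packet drops:                      0      0
"""
stats = parse_me_stats(sample)
print(stats["Frames Received"][1], "pkt/s")   # rightmost column is the rate
print(rx_health(stats) or "RX looks healthy")
```

The same parser works for the other me-stats dumps below, since they share the same layout.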
The next ME in the path is fastpath.
switch/Admin# show np 1 me-stats "-sfp -v"
Fastpath Statistics: (Current)
Errors: 4 0
FPTX Hi Priority receive: 25557857 39
Fastpath pkt received: 76857092 80
FPTX receive: 43469595 35
FastTX receive: 7575207 6
SlowTX receive: 254859 0
Packets transmit to hyperion: 12263591 9
Packets punt to CP: 13835061 13
Packets punt to Nitrox: 254800 0
Packets punt to Daughtercard: 0 0
Packets punt to other IXP: 17357 1
Packets transmitted (loopback): 0 0
Debug packet copy to CP: 0 0
Packets forward to ICM: 8617882 6
Packets forward to OCM: 0 0
Packets forward to TCP: 0 0
Packets forward to Fragmentation: 0 0
Packets IPCP forward: 102 0
Large buffer TX count: 0 0
WARN: TX Packet too small: 0 0
DROP: Packet too big error: 0 0
DROP: Connection Miss: 0 0
DROP: Bad connection route: 0 0
DROP: RX Interface miss: 12632166 11
DROP: Out of buffers: 0 0
DROP: Unknown Msg received: 24079409 37
DROP: Bandwidth rate policed: 2 0
Close request Sent: 1061629 0
Packets dropped (encap invalid): 0 0
Close request Sent: (encap mismatch): 0 0
Packets forward to SSL-ME: 0 0
Packets forward to SSL-XScale: 254800 0
Ack trigger msgs sent: 0 0
DROP: TO CP rate policed: 0 0
Wait for empty TFIFO: 306 0
FastQ Transmit Backpressure: 0 0
SlowQ Transmit Backpressure: 0 0
Hyperion Transmit Backpressure: 0 0
Drop: Transmit Backpressure: 0 0
Drop: Virtual MAC packets to standby: 1660 0
Drop: Shared MAC in non-shared interface 0 0
Drop: Next-Hop queue full: 0 0
Drop: Diag to SSL-ME: 0 0
Diag packets forwarded to SSL-ME: 0 0
Drop: Invalid IMPH Destination: 0 0
Drop: Invalid IMPH Next-Hop: 0 0
Drop: IP DF bit set: 0 0
Drop: No fragmentation of L3 Encap : 0 0
FastPath Jumbo pkt retransmit on BP : 0 0
Drop: exceed buffer threshold limit: 0 0
(Context ALL Statistics)
Packets forward to Reassembly: 0 0
Packets forward to XScale: 4900258 3
DROP: Connection Route: 1660 0
Packets forward, reproxy: 0 0
Packets forward, reproxy w/trigger: 0 0
Drop: Invalid connection hit: 0 0
Drop: Reproxy out of order: 0 0
All traffic has to go through RX and FastPath.
After FastPath, however, packets can go in different directions.
For example, they can be sent to the CP (i.e. probes, SSH, Telnet, ...) - this is counted under "Packets punt to CP".
Once again, the number in the rightmost column is the packets/second.
Another possible direction is "Packets transmit to hyperion": this is the traffic sent out of the module, back to the Cat6k.
Packets can also be forwarded to other MEs, like ICM ("Packets forward to ICM"), OCM ("Packets forward to OCM"), or TCP ("Packets forward to TCP").
Two interesting counters to monitor for indications of performance issues are:
Drop: Next-Hop queue full:
Drop: exceed buffer threshold limit:
The first one indicates that one of the downstream MEs (ICM, OCM, or TCP) is not draining its queue fast enough, so traffic is dropped.
The second one indicates that we're running out of buffers and traffic is dropped as a preventive measure to avoid a total collapse.
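Since these counters are cumulative totals, what matters is whether they are still growing. A simple way to check is to take two snapshots a few seconds apart and compare them. A minimal sketch (the counter names come from the fastpath output above; the snapshot dicts are hypothetical stand-ins for two parsed me-stats dumps):

```python
# Sketch: compare two fastpath snapshots and report still-growing drop counters.
WATCH = ("Drop: Next-Hop queue full", "Drop: exceed buffer threshold limit")

def growing_drops(before, after, watch=WATCH):
    """Return the counters from `watch` that increased between snapshots."""
    return {name: after.get(name, 0) - before.get(name, 0)
            for name in watch
            if after.get(name, 0) > before.get(name, 0)}

# Hypothetical totals parsed from two me-stats dumps taken a few seconds apart:
before = {"Drop: Next-Hop queue full": 0, "Drop: exceed buffer threshold limit": 0}
after  = {"Drop: Next-Hop queue full": 12, "Drop: exceed buffer threshold limit": 0}
print(growing_drops(before, after))  # → {'Drop: Next-Hop queue full': 12}
```

A non-zero total that is no longer increasing points to a past event, not a current one.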
You can see the level of buffer utilisation with the following me-stats command:
switch/Admin# show np 1 me-stats "-scommon"
Common Statistics: (Current)
Internal buffers allocated: 70643567 72
Internal buffers released: 70640804 71
External buffers allocated: 767167 0
External buffers released: 763567 0
Hash lock contention count: 50 0
X TO ME Pkt count: 4861866 3
To know the number of buffers currently in use, subtract the number of buffers released from the number of buffers allocated.
In this case: 70643567 - 70640804 = 2763.
We have 256k buffers and two thresholds: we drop new connections at 192k buffers in use, and we drop packets at 224k.
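That arithmetic is simple enough to script. A sketch using the sample numbers from this post (the 192k and 224k thresholds are the ones quoted above):

```python
# Sketch: compute buffers in use and compare against the ACE thresholds
# quoted in this post (256k total, 192k conn-drop, 224k pkt-drop).
TOTAL_BUFFERS       = 256 * 1024
CONN_DROP_THRESHOLD = 192 * 1024   # new connections dropped beyond this
PKT_DROP_THRESHOLD  = 224 * 1024   # packets dropped beyond this

def buffers_in_use(allocated, released):
    return allocated - released

# Numbers from the "-scommon" sample output above:
in_use = buffers_in_use(70643567, 70640804)
print(in_use)                                  # → 2763
print(f"{in_use / TOTAL_BUFFERS:.2%} of buffers in use")
if in_use >= PKT_DROP_THRESHOLD:
    print("packet-drop threshold exceeded")
elif in_use >= CONN_DROP_THRESHOLD:
    print("connection-drop threshold exceeded")
```

At 2763 buffers, this box is nowhere near either threshold.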
Next I will continue with ICM, OCM, TCP, ...