09-13-2013 09:48 AM - edited 03-01-2019 04:57 PM
GSR packet path is divided into three stages: input, crossing the fabric, and output. Each of these stages is fairly simple. All GSR linecards run CEF, and all share the same components for the majority of the packet path, which keeps things simple. To understand the packet path, you first need to understand the basic functional sections on a GSR linecard. They are:
P ----> toFab BMA ----->        <----- toFab BMA <---- P
L           |            \      /            |         L
I   switching engine     FIA---switch fabric---FIA     I
M           |            /      \            |         M
  <---- frFab BMA <-----        -----> frFab BMA ---->
LC = Line Card
PLIM = Physical Layer Interface Module. The PLIM is all the media-specific stuff for a given
interface - transceivers, SONET framers, SAR and MAC ASICs, and HDLC/PPP framers.
BMA = Buffer Management ASIC. The BMA handles allocation of buffers in the linecard.
Switching engine = depending on the linecard, one of five kinds.
The switching engine is composed of multiple parts, and will be taken up in more detail shortly.
FIA = Fabric Interface. The section commonly known as the FIA is actually composed of the Fabric Interface ASIC and several Serial Line Interface ASICs (FIA and SLIs, respectively) which deal with ciscoCell transmission to and from the fabric; more details on this later.
Switch fabric = made of 1-2 CSCs and 0-3 SFCs, this is the part that coordinates packet (well, cell, actually) transmission between linecards. More on this later.
The only things that really differentiate GSR linecards from one another are the PLIM and the switching engine. There are older-style PLIMs like 4xOC3 and 1xOC12 (in both POS and ATM flavors), and newer ones like 4xOC12, 1xOC48, and GigE. Other PLIMs are in the works (8xFE, 4xOC12 ATM, and more), but they all work about the same. PLIMs have application-specific pieces to them (for example, the ATM PLIM has a SAR, and the GigE PLIM has a MAC ASIC), but the theory of the packet path is the same across all PLIMs. This document concentrates on the POS PLIM, but useful differences are noted where necessary.

Please note that 'toFab' (towards the fabric) and 'Rx' (received by the box) are two different names for the same thing, as are 'frFab' and 'Tx'. For example, the toFab BMA is also referred to as the RxBMA. This document uses the toFab/frFab convention, but you may see the Rx/Tx nomenclature used elsewhere.

We will now follow a packet step-by-step through receipt, switching, and transmission.
Receive Path
============
PLIM
----------
First, a packet comes into the PLIM. Various things happen here: there's a transceiver to turn optical signals into electrical ones (most GSR linecards have fiber connectors), any SONET/SDH framing is removed, and ATM cells are reassembled. Packets that fail CRC are thrown away. As the packet is received and processed, it's DMA'd into a buffer called the "input FIFO". This input FIFO varies in size depending on the card and switching engine, but is between 128 KB and 1 MB.

Once a packet is fully in the input FIFO, an ASIC on the PLIM contacts the BMA and asks for a buffer to put the packet in. The BMA is told what size the packet is, and allocates a buffer appropriately. If the BMA cannot get a buffer of the right size, the packet is dropped and the 'ignore' counter is incremented on the incoming interface. Once the BMA has a buffer allocated for the packet, the packet is transferred from the input FIFO into said buffer. While this is going on, the PLIM could be receiving another packet into the input FIFO. If the input FIFO fills up, the card starts dropping packets and recording input ignores.
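The allocation step above can be sketched roughly as follows. This is a hypothetical simplification, not the real BMA microcode; the pool sizes come from the 'sh contr tofab queue' output shown later, and the "right-size pool or drop" behavior (no fallback to a larger pool) follows the description above.

```python
# Hypothetical sketch of the BMA's buffer-allocation decision.
POOL_SIZES = [80, 608, 1568, 4544]   # buffer data sizes carved on this LC

def allocate_buffer(packet_len, free_counts):
    """free_counts maps pool size -> buffers left on that pool's free queue."""
    for size in POOL_SIZES:
        if packet_len <= size:
            if free_counts.get(size, 0) > 0:
                free_counts[size] -= 1
                return size          # buffer granted from the right-size pool
            return None              # right-size pool empty: drop, 'ignore'++
    return None                      # packet bigger than any buffer

free = {80: 2, 608: 0, 1568: 1, 4544: 1}
print(allocate_buffer(64, free))     # -> 80
print(allocate_buffer(700, free))    # -> 1568
print(allocate_buffer(500, free))    # -> None ('ignore': 608-byte pool empty)
```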
RxBMA
--------------
There are two physical BMAs on the LC: one programmed to act in receive mode, and one to act in transmit mode. They essentially do the same thing, but in two different directions. Each BMA has "packet memory", which is SDRAM; this is where the buffers are carved from. These buffers can be displayed using 'sh contr tofab queue' on the LC
(output below):
LC-Slot2#sh contr tofab queue
Carve information for ToFab buffers
SDRAM size: 67108864 bytes, address: 30000000, carve base: 30029100
66940672 bytes carve size, number of SDRAM banks: 2
2 carve(s)
max buffer data size 4544 bytes, min buffer data size 80 bytes
65534/65534 buffers specified/carved
66443744/66443744 bytes sum buffer sizes specified/carved
Qnum Head Tail #Qelem LenThresh
---- ---- ---- ------ ---------
4 non-IPC free queues:
26828/26828 (buffers specified/carved), 40.93%, 80 byte data size
1 101 26928 26828 65535
18976/18976 (buffers specified/carved), 28.95%, 608 byte data size
2 26929 45904 18976 65535
13087/13087 (buffers specified/carved), 19.96%, 1568 byte data size
3 45905 58991 13087 65535
6543/6543 (buffers specified/carved), 9.98%, 4544 byte data size
4 58992 65534 6543 65535
IPC Queue:
100/100 (buffers specified/carved), 0.15%, 4112 byte data size
30 97 96 100 65535
Raw Queue:
31 0 0 0 65535
ToFab Queues:
Dest
Slot
0 0 0 0 65535
1 0 0 0 65535
2 0 0 0 65535
3 0 0 0 65535
4 0 0 0 65535
5 0 0 0 65535
6 0 0 0 65535
7 0 0 0 65535
8 0 0 0 65535
9 0 0 0 65535
10 0 0 0 65535
11 0 0 0 65535
12 0 0 0 65535
13 0 0 0 65535
14 0 0 0 65535
15 0 0 0 65535
Multicast 0 0 0 65535
This looks like a lot of output, but it's really pretty straightforward. There are 4 free queues used for packets (non-IPC free queues), in 4 different packet sizes.
Qnum Head Tail #Qelem LenThresh
---- ---- ---- ------ ---------
26828/26828 (buffers specified/carved), 40.93%, 80 byte data size
1 101 26928 26828 65535
This is from the first buffer pool. It shows that we carved 26,828 buffers of this size ('specified' means that we wanted to carve 26828, and 'carved' means we actually carved 26828). These buffers are 40.93% of the total buffers carved for this BMA.
There are two constraints when carving buffers:
1) We only have a certain number of MBytes available.
2) We only have a certain number of buffers (aka queue elements) available.
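A toy illustration of those two constraints follows. The per-pool fractions and the stop conditions are assumptions for illustration; the real carve algorithm also accounts for per-buffer overhead beyond the data size, so these counts will not match the show output exactly.

```python
# Toy carve: each pool grows until either the SDRAM bytes (constraint 1)
# or the queue elements (constraint 2) run out. Fractions are assumed.
CARVE_BYTES = 66_940_672      # carve size from 'sh contr tofab queue' above
MAX_ELEMENTS = 65_534         # queue elements available
PLAN = [(80, 0.41), (608, 0.29), (1568, 0.20), (4544, 0.10)]

bytes_left, elems_left = CARVE_BYTES, MAX_ELEMENTS
for size, frac in PLAN:
    want = int(MAX_ELEMENTS * frac)          # how many we'd like
    can_fit = bytes_left // size             # constraint 1: SDRAM bytes
    carved = min(want, can_fit, elems_left)  # constraint 2: queue elements
    bytes_left -= carved * size
    elems_left -= carved
    print(f"{size:>5}-byte pool: carved {carved} buffers")
```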
After a packet is put into its buffer in toFab SDRAM, that buffer is enqueued on the Raw Queue until a switching decision can be made. The role of "decision maker" here is the big differentiator between the five switching engines (see the 'switching engine' section, later on). After the switching decision has been made, the packet is enqueued on one of the ToFab queues.
The toFab BMA cuts the packet up into 48-byte pieces, which are the payload for what will eventually be known as "ciscoCells". These cells are given an 8-byte header by the toFab BMA (total data size so far = 56 bytes), and then enqueued into the proper ToFab queue.
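The segmentation step can be sketched like this. It is illustrative only: the contents of the 8-byte header aren't described here, so a zeroed header and zero-padding of the last cell are assumptions.

```python
# Illustrative sketch of toFab-BMA cell segmentation (not the real hardware):
# cut the packet into 48-byte payloads and prepend an 8-byte cell header.
CELL_PAYLOAD = 48
HEADER_LEN = 8

def segment(packet: bytes, header: bytes = b"\x00" * HEADER_LEN):
    cells = []
    for off in range(0, len(packet), CELL_PAYLOAD):
        payload = packet[off:off + CELL_PAYLOAD]
        payload = payload.ljust(CELL_PAYLOAD, b"\x00")  # pad the last cell
        cells.append(header + payload)                  # 56 bytes per cell
    return cells

cells = segment(b"A" * 100)          # a 100-byte packet
print(len(cells), len(cells[0]))     # -> 3 56  (three 56-byte cells)
```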
switching engine
---------------------------
The switching engine currently comes in one of five flavors. The only things that differ across these engines are how fast they can switch packets and which features they support.
The switching engine's job is to decide which output LC a packet will go to, as well as which port on that LC. A decision about which CEF rewrite string to use is also made, but the actual L2 encapsulation is done on the outbound card (this saves fabric bandwidth). A note on performance: given the aggregate port bandwidth of the PLIM and the pps rating of the switching engine, it is possible to determine the smallest packet size that can be switched at line rate.

So at this point, the switching decision has been made and the packet has been enqueued onto the proper ToFab output queue. What next?
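As a hypothetical worked example of that performance note (both the OC-48c line rate and the 4 Mpps engine rating below are assumed numbers for illustration, not specs from this document):

```python
# Back-of-envelope: smallest packet switchable at line rate for an assumed
# 2.5 Gbit/s (OC-48c) PLIM driven by an assumed 4 Mpps switching engine.
plim_bps = 2_488_320_000          # assumed aggregate PLIM bandwidth, bits/sec
engine_pps = 4_000_000            # assumed switching-engine rating, packets/sec
bits_per_packet = plim_bps / engine_pps
min_packet_bytes = bits_per_packet / 8
print(round(min_packet_bytes))    # -> 78; smaller packets exceed the pps budget
```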
FIA
-------
The toFab BMA DMAs the cells that it has made into small FIFO buffers in the FIA (fabric interface ASIC). There are 17 FIFO buffers (one per ToFab queue). When the FIA gets a cell from the toFab BMA, it adds an 8-byte CRC (total cell size 64 bytes: 8 bytes of cell header, 48 bytes of payload for all but the first cell, and 8 bytes of CRC). The FIA has SLIs (serial line interface ASICs) that then perform 8B/10B encoding on the cell (same idea as FDDI's 4B/5B) and prepare to transmit it over the fabric.

This may seem like a lot of overhead (48 bytes of data gets turned into 80 bytes across the fabric!), but it's not an issue; see 'the fabric', below.
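The overhead arithmetic works out like this:

```python
# Overhead per full ciscoCell, from the figures above.
payload = 48                 # bytes of packet data per cell
header = 8                   # cell header added by the toFab BMA
crc = 8                      # CRC added by the FIA
cell = payload + header + crc
print(cell)                  # -> 64 bytes per cell before line encoding
on_fabric = cell * 10 // 8   # 8B/10B: every 8 data bits become 10 line bits
print(on_fabric)             # -> 80 bytes' worth of symbols per 48-byte payload
```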
Transmit Path
=============
terminology
------------------
This can get a little confusing. A FIA or BMA that _receives_ data from the fabric is involved in the _transmit_ of a packet out the physical interface. This document follows the convention of 'to fabric' and 'from fabric' to distinguish between the two paths, until somebody points me to a better way to do it. Put another way:

PLIM-->toFab BMA-->toFab FIA--->fabric--->frFab FIA-->frFab BMA
the fabric
----------------
The fabric is composed of 1 or more CSCs (clock scheduler cards) and 0-3 SFCs (switch fabric cards). The job of an SFC is to pass a piece of a cell across the fabric. The CSC does the same thing as an SFC, but additionally provides clocking to the fabric and controls access to it.
Assuming for a second that you have a fully populated switch fabric (2 CSCs, 3 SFCs), the FIAs slice each ciscoCell up into four 16-byte pieces, and send one of these pieces across each of the last four switch cards (CSC1, SFC0, SFC1, SFC2). An XOR of these four 16-byte pieces is sent across CSC0, so that if one of the four "real" pieces of the cell is corrupted, it can be recovered with no loss of service.

You can of course have fewer than 2 CSCs/3 SFCs. With a full configuration like that, you have redundant bandwidth and clocking; you can lose the CSC that is your current clock source, and the other one will take over. You can lose one SFC, and you're still OK.
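The XOR protection scheme can be sketched like this (illustrative; the slice-to-card assignment follows the description above):

```python
# A 64-byte ciscoCell is split into four 16-byte slices for CSC1/SFC0/SFC1/SFC2;
# their XOR travels over CSC0. Any one lost slice is recoverable from the rest.
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

cell = bytes(range(64))                       # a dummy 64-byte ciscoCell
slices = [cell[i:i + 16] for i in range(0, 64, 16)]
parity = slices[0]
for s in slices[1:]:
    parity = xor_bytes(parity, s)             # the piece sent across CSC0

lost = 2                                      # pretend slice 2 was corrupted
rebuilt = parity
for i, s in enumerate(slices):
    if i != lost:
        rebuilt = xor_bytes(rebuilt, s)       # XOR of parity + survivors
print(rebuilt == slices[lost])                # -> True: no loss of service
```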
So now that a FIA is ready to transmit, how does it work? The FIA requests access to the fabric from the currently active CSC. The CSC arbitrates using a fairly complex, patented fairness algorithm called eSLIP. The idea is that no LC is allowed to monopolize the outgoing bandwidth of any other card. Note that even if an LC wants to transmit data out one of its own ports, it still has to go through the fabric. This is important; if this didn't happen, one port on an LC could monopolize all the bandwidth of another port on that same LC. It would also make the switching design more complicated.

The FIA sends cells across the switch fabric to their outgoing LC (specified by some info in the buffer header that the switching engine put there). It's important to understand that the FIA transmits one cell at a time; it does not make a fabric reservation for all the cells of a given packet. The fairness algorithm makes sure that everybody's happy with this arrangement. The fairness algorithm is also designed for optimal matching; if card 1 wants to transmit to card 2 and card 3 wants to transmit to card 4 at the same time, both transfers happen in parallel.
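eSLIP itself is patented and not reproduced here, but the *idea* of per-cell matching can be caricatured in a few lines. This is a heavily simplified, hypothetical sketch; the real algorithm uses iterative round-robin request/grant/accept pointers and is far more sophisticated.

```python
# Simplified per-cell-time matching: each output is granted to at most one
# input, and each input transmits to at most one output, so disjoint pairs
# (1->2 and 3->4) proceed in parallel and nobody monopolizes a port.
def match(requests):
    """requests: {input_slot: set of output_slots with queued cells}."""
    granted = {}                       # output_slot -> input_slot
    for inp, outs in sorted(requests.items()):
        for out in sorted(outs):
            if out not in granted:
                granted[out] = inp     # grant this pairing for one cell time
                break                  # one output per input per cell time
    return granted

# Card 1 -> card 2 and card 3 -> card 4 are matched in the same cell time:
print(match({1: {2}, 3: {4}}))         # -> {2: 1, 4: 3}
```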
That's the big difference between a switch fabric and a bus architecture. Think of it as analogous to an Ethernet switch vs. a hub; on a switch, if port A wants to send to port B and C wants to talk to D, those two flows happen independently of each other. On a hub, there's nasty half-duplex stuff like collisions and backoff/retry algorithms.

The ciscoCells that come out of the fabric are DMA'd into FIFOs on the frFab FIAs, and then into a buffer on the frFab BMA. The frFab BMA is the one that actually reassembles the cells into a packet. How does the frFab BMA know what buffer to put the cells in before it reassembles them? This is another decision made by the incoming linecard's switching engine; since all free queues on the entire box are the same size and in the same order, the switching engine just has the tx LC put the packet in the same-numbered queue that it entered the router on.
The frFab BMA SDRAM queues can be viewed with 'sh contr frfab queue' on the LC:
LC-Slot5>sh contr frfab queue
Carve information for FrFab buffers
SDRAM size: 67108864 bytes, address: 20000000, carve base: 2011D100
65941248 bytes carve size, number of SDRAM banks: 0
2 carve(s)
max buffer data size 4544 bytes, min buffer data size 80 bytes
65533/65533 buffers specified/carved
65815792/65815792 bytes sum buffer sizes specified/carved
Qnum Head Tail #Qelem LenThresh
---- ---- ---- ------ ---------
4 non-IPC free queues:
25519/25519 (buffers specified/carved), 38.94%, 80 byte data size
1 6550 6549 25519 65535
19630/19630 (buffers specified/carved), 29.95%, 608 byte data size
2 30951 30950 19630 65535
14395/14395 (buffers specified/carved), 21.96%, 1568 byte data size
3 45250 59644 14395 65535
5889/5889 (buffers specified/carved), 8.98%, 4544 byte data size
4 59645 65533 5889 65535
IPC Queue:
100/100 (buffers specified/carved), 0.15%, 4112 byte data size
30 67 66 100 65535
Raw Queue:
31 0 66 0 65535
Interface Queues:
0 0 0 0 65535
1 0 6549 0 65535
2 0 0 0 65535
3 0 0 0 65535
This is basically the same idea as the toFab BMA output. Packets come in and are placed in buffers that are dequeued from their respective free queues. These buffers are then enqueued on either the interface queue or the rawQ for output processing. Typically, you'll see one output queue per physical output port.
The frFab BMA waits until the tx portion of the PLIM is ready to send a packet, then does the actual MAC rewrite (based, remember, on info contained in the buffer header) and DMAs the packet over to a buffer in the PLIM circuitry (much like the input FIFO buffer, but on the output side). The PLIM does the ATM SAR and SONET framing where appropriate, and transmits the packet.