on 11-07-2011 07:02 AM
Understanding how to debug PPPoE in IOS-XR for the ASR9K
Detailed step through guide for debugging PPPoE sessions.
Before we start troubleshooting sessions it is important to understand the architecture of how things link together. This overview below shows the different components as it related to PPP(oE).
In the next modules the various debugs will be elaborated on and some common things that generally go wrong.
1) 1) A session is initiated in PPPoE by the reception of a PADI packet in PPPoE. This PADI is a broadcast packet hitting the control plane. If the session is terminating on a physical (sub)interface, the PPPoE is handled on the linecard, if the session terminates as part of a bundle (sub)interface, PPPoE is handled by the RSP.
2) 2) The reception of a PADI triggers a “session-start” event in the control policy, during the session-start event we need to apply the dynamic template that holds at least the LCP specific parameters for when the session continues to PPP phase. Also we could do “pre-authentication” here based on PPPoE tag information in the session-start event.
3) 3) After we have sent the PADO to the client, and the client selected us as BNG, a PADR will be received which triggers the session-activate event. In this event we need a template, either provided during session start with NCP parameters (like unnumbered info), or more specific template info can be provided during this session-activate event. At this time the subscriber interface is created when we transmit the PADS and are now commited for 3 mins to that subscriber session (in IOS, the ncp timeout).
4) 4) The session moves now to the PPP LCP phase and will try to complete the LCP based on the LCP parameters defined in the dynamic-template provided during the session-start event.
5) 5) During LCP we generally negotiate some sort of authentication protocol and when we do, we enter the authentication phase. While the control policy is still in session-activate state we are starting our CHAP or PAP exchange to retrieve user credentials.
6) 6) Even if CHAP/PAP is not negotiated we can still do an authorization request, but probably not on line username/password in the absence of identity retrieval by PPP/Auth. This authen action is defined in the session-start event of the control policy
7) 7) In XR, in the absence of local authentication capabilities, we need to use radius (or tacacs/less common) to initiate the access-request
8) 8) The radius interaction will give us a response, or not when it times out. Either case, an event is triggered in the policy-plane again that we can trigger on to provide further directives or we “can live” with the response radius has given us.
9) 9) Events triggered are authen-failure (access-reject) or authen-no-response (timeout from our radius-server/list). If the response is a success/accept, we are continuing the session and start NCP.
10) 10) IPCP is started and when completed the route is installed.
This document will focus on debugging and understanding each of these components in in-depth level.
This picture above shows the 4 important packets from PPPoE.
The PADI is a broadcast packet, the PADO is a packet with SMAC of the BNG and the DMAC of the subscriber.
The subscriber will send a unicast PADR packet with the DMAC of the BNG it wants to establish a session with, followed by a PADS from the BNG with the pppoe session id.
This session id is unique for the segment and is to be used for EVERY packet that is sent forward from this session.
Note: The BNG will verify the SMAC, arriving access interface and pppoe session ID to prevent spoofing
The session ID is part of the 8 byte pppoe header that is slapped on to the packet (hence the 1492 maximum MTU size when pppoe encap is used.
PPPoE header explained:
MAC header | PPPoE header | Data ::: |
MAC header: This is a standard ethernet II header with ethertype 0x8863 for PPPoE control (PADx messages) and 0x8864 for PPPoE data packets, including PPP/LCP.
PPPoE header:
00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
Data ::: |
Version. 4 bits.
Protocol version. Must be set to 1.
Type. 4 bits.
Must be set to 1.
Code. 8 bits.
Session ID. 16 bits, unsigned.
Length. 16 bits.
Size of the Data field in bytes.
Data. Variable length.
First step in debugging is to evaluate the PPPoE protocol and packets.
Provides Basic packet level debugging from a pppoe protocol point of view
RP/0/RSP0/CPU0:Sep 1 15:27:09.632 : pppoe_ma[346]: [PADI-Recv]: Bundle-Ether100.100 peer-mac 0019.2f43.9a38
0019.2f43.9a38: this is the subscriber mac-address
RP/0/RSP0/CPU0:Sep 1 15:27:09.632 : pppoe_ma[346]: [PADI-Recv]: vlan-id-outer 100
RP/0/RSP0/CPU0:Sep 1 15:27:09.632 : pppoe_ma[346]: [PADI-Recv]: Service-name:
RP/0/RSP0/CPU0:Sep 1 15:27:09.633 : pppoe_ma[346]: [PADO-Sent]: Bundle-Ether100.100 peer-mac 0019.2f43.9a38
RP/0/RSP0/CPU0:Sep 1 15:27:09.633 : pppoe_ma[346]: [PADO-Sent]: vlan-id-outer 100
RP/0/RSP0/CPU0:Sep 1 15:27:09.634 : pppoe_ma[346]: [PADR-Recv]: Bundle-Ether100.100 peer-mac 0019.2f43.9a38
RP/0/RSP0/CPU0:Sep 1 15:27:09.634 : pppoe_ma[346]: [PADR-Recv]: vlan-id-outer 100
RP/0/RSP0/CPU0:Sep 1 15:27:09.634 : pppoe_ma[346]: [PADR-Recv]: Service-name:
RP/0/RSP0/CPU0:Sep 1 15:27:09.853 : pppoe_ma[346]: [PADS-Sent]: Bundle-Ether100.100 peer-mac 0019.2f43.9a38
RP/0/RSP0/CPU0:Sep 1 15:27:09.853 : pppoe_ma[346]: [PADS-Sent]: vlan-id-outer 100
The format of the debug is:
NODE : Time : <process>[pid] : [PACKET-direction] Interface mac-addr
RP/0/RSP0/CPU0:Sep 1 15:27:09.632 : pppoe_ma[346]: [PADI-Recv]: Bundle-Ether100.100 peer-mac 0019.2f43.9a38
In this example the PPPoE session completed perfect, though little information on the packet contents are provided.
Is pppoe completing properly to a PADS
Is the interface shown expected
Is the mac address expected
Is the vlan (stack) what we’d expect?
Provides detailed packet contents from pppoe
PADI
RP/0/RSP0/CPU0:Sep 1 15:37:36.372 : pppoe_ma[346]: Bundle-Ether100.100: I dst ffff.ffff.ffff src 0019.2f43.9a38: len 46 0x1109
0000000401010000000000000000000000000000000000000000000000000000000000000000000000000000
Although the protocol is not provided here because debug pppoe protocol is off, we can tell that this is a PADI because the packet is a broadcast destination on L2 and also the packet starts with 1109 the “09” being the PADI packet
The added 0’s are expected here also because the PADx packets are very small and we need to add padding to them.
The LEN provided (46) is the Ether header length.
The PPPoE length is highlighted in ORANGE
PADO
RP/0/RSP0/CPU0:Sep 1 15:37:36.372 : pppoe_ma[346]: Bundle-Ether100.100: O dst 0019.2f43.9a38 src b4a4.e392.208b: len 21 0x11070000000f010100000102000
741394b2d424e47
PADR
RP/0/RSP0/CPU0:Sep 1 15:37:36.374 : pppoe_ma[346]: Bundle-Ether100.100: I dst b4a4.e392.208b src 0019.2f43.9a38: len 46 0x11190000000f0102000741394b2
d424e470101000000000000000000000000000000000000000000000000000000
PADS
RP/0/RSP0/CPU0:Sep 1 15:37:36.591 : pppoe_ma[346]: Bundle-Ether100.100: O dst 0019.2f43.9a38 src b4a4.e392.208b: len 10 0x11651ed0000401010000
In RED is the assigned pppoe session ID. When the subscriber interface is created that should match this number in decimal, in this case bundle-ether100.100.pppoe7888
Packet type list:
Type Code Direction
PADI 0x09 In only
PADO 0x07 Out only
PADR 0x19 In only
PADS 0x65 Out only
PADT 0xa7 In and Out
Are the packet types expected
Are the lengths matching up for the packet to the actual content
Are we expecting the packets in the direction they are meant to be send
Is the mac address expected
Is the vlan (stack) what we’d expect?
RP/0/RSP0/CPU0:Sep 1 15:51:22.917 : pppoe_ma[346]: Session: Bundle-Ether100.100: Initializing new session
RP/0/RSP0/CPU0:Sep 1 15:51:22.917 : pppoe_ma[346]: Session: Creating and inserting a new session idb on parent interface Bundle-Ether100.100, with se
ssion id 7889
RP/0/RSP0/CPU0:Sep 1 15:51:22.917 : pppoe_ma[346]: Session: Successfully added new session idb to database of parent Bundle-Ether100.100. Parent sess
ion IDB count = 3
RP/0/RSP0/CPU0:Sep 1 15:51:22.918 : pppoe_ma[346]: Session: Queuing interface create on parent interface Bundle-Ether100.100, with session id 7889
LC/0/0/CPU0:Sep 1 15:51:23.038 : pppoe_ea[288]: Interface: 0x0007b920: Creating and inserting a new intf idb
LC/0/1/CPU0:Sep 1 15:51:23.039 : pppoe_ea[288]: Interface: 0x0007b920: Creating and inserting a new intf idb
LC/0/0/CPU0:Sep 1 15:51:23.039 : pppoe_ea[288]: Interface: 0x0007b920: Successfully added new intf idb to database
LC/0/0/CPU0:Sep 1 15:51:23.039 : pppoe_ea[288]: Interface: 0x0007b920: INTF CREATE received
LC/0/0/CPU0:Sep 1 15:51:23.039 : pppoe_ea[288]: Interface: 0x0007b920: INTF CREATE(1), idb lookups/creates completed successfully
LC/0/1/CPU0:Sep 1 15:51:23.040 : pppoe_ea[288]: Interface: 0x0007b920: Successfully added new intf idb to database
LC/0/1/CPU0:Sep 1 15:51:23.040 : pppoe_ea[288]: Interface: 0x0007b920: INTF CREATE received
LC/0/1/CPU0:Sep 1 15:51:23.040 : pppoe_ea[288]: Interface: 0x0007b920: INTF CREATE(1), idb lookups/creates completed successfully
LC/0/0/CPU0:Sep 1 15:51:23.044 : pppoe_ea[288]: Interface: 0x0007b920: INTF CREATE(2), Hardware programming completed successfully
LC/0/1/CPU0:Sep 1 15:51:23.045 : pppoe_ea[288]: Interface: 0x0007b920: INTF CREATE(2), Hardware programming completed successfully
LC/0/0/CPU0:Sep 1 15:51:23.045 : pppoe_ea[288]: Interface: 0x0007b920: INTF CREATE(3), IMP interface create completed successfully
LC/0/1/CPU0:Sep 1 15:51:23.046 : pppoe_ea[288]: Interface: 0x0007b920: INTF CREATE(3), IMP interface create completed successfully
RP/0/RSP0/CPU0:Sep 1 15:51:23.050 : pppoe_ma[346]: Session: 0x00000000: Received interface create cb: No error
RP/0/RSP0/CPU0:Sep 1 15:51:23.052 : pppoe_ma[346]: Session: Bundle-Ether100.100.pppoe7889: Inserting a session idb into global database
RP/0/RSP0/CPU0:Sep 1 15:51:23.053 : pppoe_ma[346]: Session: Bundle-Ether100.100.pppoe7889: Successfully added new session idb to global database. Ses
sion IDB count = 3
RP/0/RSP0/CPU0:Sep 1 15:51:23.053 : pppoe_ma[346]: Session: Bundle-Ether100.100.pppoe7889: Received MTU notification, with MTU actual 1500
RP/0/RSP0/CPU0:Sep 1 15:51:23.057 : pppoe_ma[346]: Session: Received AAA batch start notication
RP/0/RSP0/CPU0:Sep 1 15:51:23.057 : pppoe_ma[346]: Session: Bundle-Ether100.100.pppoe7889: Received AAA Session Create cb: No error
RP/0/RSP0/CPU0:Sep 1 15:51:23.058 : pppoe_ma[346]: Session: Received AAA batch end notication
RP/0/RSP0/CPU0:Sep 1 15:51:23.075 : pppoe_ma[346]: Session: Received SubDB batch start notication
RP/0/RSP0/CPU0:Sep 1 15:51:23.075 : pppoe_ma[346]: Session: Bundle-Ether100.100.pppoe7889: Received Activate Config cb: No error
RP/0/RSP0/CPU0:Sep 1 15:51:23.076 : pppoe_ma[346]: Session: Received SubDB batch end notication
RP/0/RSP0/CPU0:Sep 1 15:51:23.159 : pppoe_ma[346]: Session: Bundle-Ether100.100.pppoe7889: Interface ready call succeeded
RP/0/RSP0/CPU0:Sep 1 15:51:23.188 : pppoe_ma[346]: Session: Received AAA batch start notication
RP/0/RSP0/CPU0:Sep 1 15:51:23.188 : pppoe_ma[346]: Session: Bundle-Ether100.100.pppoe7889: Received AAA Session Up Complete cb
RP/0/RSP0/CPU0:Sep 1 15:51:23.189 : pppoe_ma[346]: Session: Received AAA batch end notication
This command is not necessarily that useful for normal troubleshooting, but it does signify the MTU being set on the interface and the session creation from PPPoE internally.
PPPoE session generally won’t establish for various reasons, here are some gotcha’s that I have ran into many many times
PPP consists of 3 phases, LCP, Authentication and NCP in order to establish a session allowing the transport of L3 protocols over it.
During LCP initial link parameters are agreed on such as MRU (like MTU), authentication protocol. If an Authentication protocol is agreed on, then we enter the Auth phase in which credentials are exchanged that likely are being handed off to radius for verification.
After that the NCP phase start which allows us to establish an L3 protocol connection such as IPCP for ipv4 or OSICP for CLNS etc. A9K supports only IPCP and soon IPV6CP (with xr430).
During LCP and NCP the exchange of options is done via the following packets. Each of these packets contains options.
CONFIGURE-REQUEST
A packet sent with options that a side of the ppp connection likes to propose
CONFIGURE-ACK
A response to a request indicating that all options in the request are acceptable
CONFIGURE-REJECT
A response to a request indicating that an option can’t be honoured (eg one side does MLP the other side does not, we reject the option
CONFIGURE-NAK
A response to a request indicating that the options in the NAK packet can be fulfilled, but the option value is not desirable. Eg one side proposes CHAP authentication and the other side can only do PAP.
Requests and responses are linked together by a field in the packet called the “ID”.
After the sessions is completed, LCP echo requests, a keepalive mechanism, are exchanged and have to be acknowledged.
Keepalives from the peer are responded by the 9k in hardware. The 9k also originates keepalives which have to be responded by the peer.
Some helpful hints explaining the debugs from the component
RP/0/RSP0/CPU0:Sep 15 12:43:18.244 : PPP-MA[343]: LCP: Bundle-Ether100.100.pppoe1: [Initial]: Up Event
RP/0/RSP0/CPU0:Sep 15 12:43:18.248 : PPP-MA[343]: LCP: Bundle-Ether100.100.pppoe1: [Initial]: Change to state Closed
In this debug start, the location is RSP0 because PPP for bundle sessions as indicated by the interface are terminated by the RSP. Sessions terminating on a phy interface are terminated on the linecard.
I’ll remove the node and interface moving forward for space issues in the debug.
LCP debug
.244 LCP: [Initial]: Up Event
.248 LCP: [Initial]: Change to state Closed
.249 LCP: [Closed]: Down Event
.250 LCP: [Closed]: Change to state Initial
.250 LCP: [Initial]: Up Event
.251 LCP: [Initial]: Change to state Closed
.358 LCP: [Closed]: Open Event
.358 LCP: [Closed]: Initialize-Restart-Counter
.358 LCP: [Closed]: O CONFREQ id 1 len 19
.359 LCP: [Closed]: MRU 1492 (0x010405d4)
.359 LCP: [Closed]: AuthProto CHAP (0x0305c22305)
.359 LCP: [Closed]: MagicNumber 0x1e7824c4 (0x05061e7824c4)
Identify the protocol state LCP, the LCP state itself in brackets [ ], the direction of the packet O output, I input the packet type (confreq) and the ID. At this point we expect a response from the peer on our ID 1.
.360 LCP: [Closed]: Change to state Req-Sent
.360 LCP: [Req-Sent]: Open Event
.361 LCP: [Req-Sent]: I CONFREQ id 1 len 25
.361 LCP: [Req-Sent]: MRU 1492 (0x010405d4)
.362 LCP: [Req-Sent]: MagicNumber 0x659ff390 (0x0506659ff390)
.362 LCP: [Req-Sent]: MRRU 1524 (0x110405f4)
.362 LCP: [Req-Sent]: EndpointDisc 1 Local (0x13070174657374)
We have move to state request sent, and we are getting an incoming configure request, also with ID 1, we are expected to send a response to ID1 now as well. Note that I deliberately let the client request options for MRRU and EndpointDisc, used for multilink that we don’t support on the 9k.
.363 LCP: Peer's MRRU: Option received on non-MP interface (reject)
.363 LCP: Peer's ED: Option received on non-MP interface (reject)
PPP identifies already the unsupported option
.363 LCP: [Req-Sent]: O CONFREJ id 1 len 15
.364 LCP: [Req-Sent]: MRRU 1524 (0x110405f4)
.364 LCP: [Req-Sent]: EndpointDisc 1 Local (0x13070174657374)
Here we are sending a REJECT on ID1,which is a response to the peers request ID1 in BLUE
We are rejecting the options for MLP. We should expect a new request to come in from the peer with a new offer. Hopefully, the peer is not persistent in its multilink request. If it is, and requests it again, we reject it again. How long that can go on? That will be the ppp max-configure configuration knob, default is 3.
After 3 attempts to configure the link we will terminate the ppp session.
.373 LCP: [Req-Sent]: I CONFACK id 1 len 19
.373 LCP: [Req-Sent]: MRU 1492 (0x010405d4)
.373 LCP: [Req-Sent]: AuthProto CHAP (0x0305c22305)
.374 LCP: [Req-Sent]: MagicNumber 0x1e7824c4 (0x05061e7824c4)
The peer did like out proposal and acknowledged it. If the peer didn’t like our CHAP authentication protocol but can do PAP, it would have sent us a PPP CONFNAK packet with authproto option in there. This would indicate to us to switch auth protocols IF we are configured for that.
If not, and we are persistent on CHAP, we end up in a dead lock also and after default of 3 attempts to configure we drop the session
.374 LCP: [Req-Sent]: Initialize-Restart-Counter
.374 LCP: [Req-Sent]: Change to state Ack-Rcvd
.374 LCP: [Ack-Rcvd]: I CONFREQ id 2 len 14
.374 LCP: [Ack-Rcvd]: MRU 1492 (0x010405d4)
.375 LCP: [Ack-Rcvd]: MagicNumber 0x659ff390 (0x0506659ff390)
We are getting a new proposal in from the peer, it retracted the mlp options, note that the ID has changed to 2. The ID MUST increase on every cycle sent.
.375 LCP: [Ack-Rcvd]: O CONFACK id 2 len 14
.375 LCP: [Ack-Rcvd]: MRU 1492 (0x010405d4)
.375 LCP: [Ack-Rcvd]: MagicNumber 0x659ff390 (0x0506659ff390)
.375 LCP: [Ack-Rcvd]: Change to state Open
That we can honor, so we send an ack on that proposal id2. At this point in time we have sent an ACK and received an ACK.
.376 LCP: [Open]: Report This-Layer-Up
IPCP debug
This handshake is the same as LCP, with the same packet types, just for a different protocol. IPCP instead of LCP.
.397 IPCP: [Initial]: I CONFREQ id 1 len 22
.398 IPCP: [Initial]: Address 0.0.0.0 (0x030600000000)
.398 IPCP: [Initial]: PrimaryDNS 0.0.0.0 (0x810600000000)
.398 IPCP: [Initial]: SecondaryDNS 0.0.0.0 (0x830600000000)
.398 IPCP: [Initial]: Conf-Req packet stalled
.400 IPCP: [Initial]: Open Event
.400 IPCP: [Initial]: Change to state Starting
.402 IPCP: [Starting]: Report This-Layer-Started
.402 IPCP: [Starting]: Up Event
.402 IPCP: [Starting]: Initialize-Restart-Counter
.403 IPCP: [Starting]: O CONFREQ id 1 len 10
.403 IPCP: [Starting]: Address 101.101.1.1 (0x030665650101)
Our address is the interface we are unnumbered to from the session, derived from the dynamic template or radius.
.404 IPCP: [Starting]: Change to state Req-Sent
.404 IPCP: [Req-Sent]: Restarting stalled Conf-Req packet
.405 IPCP: [Req-Sent]: I CONFREQ id 1 len 22
.405 IPCP: [Req-Sent]: Address 0.0.0.0 (0x030600000000)
.405 IPCP: [Req-Sent]: PrimaryDNS 0.0.0.0 (0x810600000000)
.405 IPCP: [Req-Sent]: SecondaryDNS 0.0.0.0 (0x830600000000)
.406 IPCP: [Req-Sent]: I CONFACK id 1 len 10
.406 IPCP: [Req-Sent]: Address 101.101.1.1 (0x030665650101)
.406 IPCP: [Req-Sent]: Initialize-Restart-Counter
.406 IPCP: [Req-Sent]: Change to state Ack-Rcvd
.408 IPCP: Peer's Primary DNS address: 0.0.0.0 (reject)
.408 IPCP: Peer's Secondary DNS address: 0.0.0.0 (reject)
.408 IPCP: [Ack-Rcvd]: O CONFREJ id 1 len 16
.408 IPCP: [Ack-Rcvd]: PrimaryDNS 0.0.0.0 (0x810600000000)
.408 IPCP: [Ack-Rcvd]: SecondaryDNS 0.0.0.0 (0x830600000000)
Client requested dns servers and we don’t have any info for them.
.409 IPCP: [Ack-Rcvd]: I CONFREQ id 2 len 10
.409 IPCP: [Ack-Rcvd]: Address 0.0.0.0 (0x030600000000)
Client requests an ip address
.410 IPCP: [Ack-Rcvd]: O CONFNAK id 2 len 10
.410 IPCP: [Ack-Rcvd]: Address 199.1.1.1 (0x0306c7010101)
We provide a pool address.
The client can also request a real address, but that if we ack that be a semi static ip address, so many providers prefer to renegotiate the address and send a nak always on the address option and provide a new pool address.
.411 IPCP: [Ack-Rcvd]: I CONFREQ id 3 len 10
.411 IPCP: [Ack-Rcvd]: Address 199.1.1.1 (0x0306c7010101)
.411 IPCP: [Ack-Rcvd]: O CONFACK id 3 len 10
.411 IPCP: [Ack-Rcvd]: Address 199.1.1.1 (0x0306c7010101)
.411 IPCP: [Ack-Rcvd]: Change to state Open
.412 IPCP: [Open]: Report This-Layer-Up
LCP echos debug
RP/0/RSP0/CPU0:Sep 15 12:43:19.405 LCP: [Open]: I ECHOREQ id 1 len 12 magic 0x659ff390
RP/0/RSP0/CPU0:Sep 15 12:43:19.405 LCP: [Open]: O ECHOREP id 1 len 12 magic 0x1e7824c4
Incoming echo requests and replies, note that the ID is incrementing every time.
Also note the magic numbers in this packet which are the Magics negotiated during LCP.
A9K does NOT do magic number validation.
RP/0/RSP0/CPU0:Sep 15 12:43:29.646 LCP: [Open]: I ECHOREQ id 2 len 12 magic 0x659ff390
RP/0/RSP0/CPU0:Sep 15 12:43:29.646 LCP: [Open]: O ECHOREP id 2 len 12 magic 0x1e7824c4
Note: I requested the lcp echo debug to be moved over to a different debug command. At the time of writing this article it is part of the debug ppp negotiation
When the control policy determines that user authentication is in order, an access-request with several attributes is sent to the RADIUS server.
The radius server can check these attributes and reply back with either an ACCESS-ACCEPT or an ACCESS-REJECT.
The Reject can have a reply message stating the reason for reject, but that is optional.
An ACCEPT can come with reply attributes that will instruct the BNG to apply features or session parameters on the session
Some helpful hints explaining the debugs from the component
Some things that can go wrong with RADIUS are:
NOTE: The radius-profile for an IPoE session should NOT include any service-type. Having Framed-User as service type in an access accept for an IP subscriber will cause the session to fail.
The format of the debug is:
NODE : Time : <process>[pid] : [PACKET-direction] Interface mac-addr
RP/0/RSP0/CPU0:Sep 1 15:27:09.632 : pppoe_ma[346]: [PADI-Recv]: Bundle-Ether100.100 peer-mac 0019.2f43.9a38
Thanks. I found it also in BRKSPG-2904.
Can you please help me out to understand the output (it's attached)
We have around 26K PPP sessions.
Two MOD80 where only NP0 is in use (bay 1).
Under QoS-EA I see Total 51813 for loc0/1 and for loc0/2 it's 25861.
Total number of session on BE10 (loc0/1) is: 12965
On BE20 (loc0/2) it's 13021
We are using RP based session but BE has only one member.
The usual bottleneck in high scale deployments is the number of allocated L3 queues. We support 8k queues at Level 3.
SUMMARY per NP:
=========================
Policy Instances: Ingress 25913 Egress 25915 Total: 51828
Entities: (L4 level: Queues)
Level Chunk 0 Chunk 1 Chunk 2 Chunk 3
L4 20( 20/ 20) 4( 4/ 4) 4( 4/ 4) 4( 4/ 4)
L3(8Q) 5( 5/ 5) 1( 1/ 1) 1( 1/ 1) 1( 1/ 1)
L3(16Q) 0( 0/ 0) 0( 0/ 0) 0( 0/ 0) 0( 0/ 0)
L2 4( 4/ 4) 1( 1/ 1) 1( 1/ 1) 1( 1/ 1)
L1 7( 7/ 7) 0( 0/ 0) 0( 0/ 0) 0( 0/ 0)
You can explicitly bind a subscriber access interface to a chunk of your choice:
interface Bundle-Ether8.2 service-policy output svlan-parent subscriber-parent resource-id 1
where the "resource-id" is the chunk number.
Hello,
We tried to put cisco on production yesterday and have some issues, I will describe my scenario:
Licenses:
FeatureID: A9K-BNG-LIC-8K (Slot based, Permanent)
Total licenses 2
Available for use 2
Allocated to location 0
Active 0
Store name Permanent
Store index 1
Pool: Owner
Total licenses in pool: 2
Status: Available 2 Operational: 0
We enable it and started to receive connections, but we had two problems:
The connections didn't get over 8600 sessions (debugging I couldn't see any explicit errors, but It shows up in connecting status). The licenses never get in "allocated/active status".
The other issue I saw is accounting records being duplicated, I'm pasting my config below:
aaa accounting network default start-stop group radius
aaa group server radius RADIUSGROUP
server IP auth-port 1812 acct-port 1813
source-interface TenGigE0/0/2/0
!
dynamic-template
type ppp temp
ppp authentication pap
ppp ipcp dns x.x.x.x
ppp ipcp peer-address pool poolIPs
accounting aaa list default type session periodic-interval 60
ipv4 unnumbered TenGigE0/0/2/0
!
type service REDIRECIONAMENTO
service-policy type pbr polREDIRECIONAMENTO
!
!
interface TenGigE0/0/2/1.40
service-policy type control subscriber politicaPPP
pppoe enable bba-group provider
encapsulation dot1q 40
!
interface TenGigE0/0/2/1.41
service-policy type control subscriber politicaPPP
pppoe enable bba-group provider
encapsulation dot1q 41
!
interface TenGigE0/0/2/1.42
service-policy type control subscriber politicaPPP
pppoe enable bba-group provider
encapsulation dot1q 42
!
interface TenGigE0/0/2/1.43
service-policy type control subscriber politicaPPP
pppoe enable bba-group provider
encapsulation dot1q 43
!
interface TenGigE0/0/2/1.44
service-policy type control subscriber politicaPPP
pppoe enable bba-group provider
encapsulation dot1q 44
!
aaa attribute format CSID
format-string length 253 "%s.%s:%s" physical-port outer-vlan-id client-mac-address-ietf
!
aaa radius attribute calling-station-id format CSID
aaa accounting subscriber default group radius
aaa authorization subscriber default group radius
aaa authentication subscriber default group radius
pppoe bba-group provider
sessions mac limit 1
!
class-map type control subscriber match-any classePPP
match protocol ppp
end-class-map
!
!
policy-map type control subscriber politicaPPP
event session-start match-all
class type control subscriber classePPP do-until-failure
1 activate dynamic-template temp
!
!
event session-activate match-first
class type control subscriber classePPP do-until-failure
10 authenticate aaa list default
20 authorize aaa list default format FULL_AUTH password use-from-line
!
!
end-policy-map
!
it's hard to say from 10000 feet. Sessions can fail to connect at many points. Best is to run "sh subscriber manager disconnect-history last" to see the reason for disconnect.
If subscribers fail to authenticate, a quick look into "sh radius" will also show you whether the server is reachable, whether it's the radius server that sends a disconnect, whether there's a timeout reaching the radius, etc.
/Aleksandar
Alek, do you see any reason for duplicated acct records?
Do you licensing session limit a possible issue cause I can't see the licenses in activated mode / allocated.
What do you mean by duplicate?
Based on your AAA configuration you are sending interim accounting every 60 seconds.
BNG licensing is
Do you have anything under "sh subscriber manager disconnect-history last"? Aleks told you to check this out.
When the customer connects, I'm seeing two "open connections" on radacct / freeradius and I can see two auth requests for each customer on Radius Log, as example:
Tue Feb 7 00:48:24 2017 : Auth: Login OK: [mel2l] (from client asr1 port 0 cli 1.240:c8-3a-35-45-b1-98)
Tue Feb 7 00:48:24 2017 : Auth: Login OK: [mel2l] (from client asr1 port 0 cli 1.240:c8-3a-35-45-b1-98)
:85)
It seems its some sort of session throttle, I'm getting lots of this:
LC/0/0/CPU0:Feb 8 21:01:47.126 : pppoe_ma[309]: [PADR-Recv]: TenGigE0/0/2/1.136 peer-
LC/0/0/CPU0:Feb 8 21:01:47.126 : pppoe_ma[309]: [PADR-Recv]: DROPPED - session throttle or flow control
LC/0/0/CPU0:Feb 8 21:01:47.129 : pppoe_ma[309]: [PADR-Recv]: TenGigE0/0/2/1.101 peer-mac 1c7e.e5c5.ef4f
LC/0/0/CPU0:Feb 8 21:01:47.129 : pppoe_ma[309]: [PADR-Recv]: DROPPED - session throttle or flow control
LC/0/0/CPU0:Feb 8 21:01:47.132 : pppoe_ma[309]: [PADR-Recv]: TenGigE0/0/2/1.97 peer-mac 84c9.b2de.4f6c
LC/0/0/CPU0:Feb 8 21:01:47.132 : pppoe_ma[309]: [PADR-Recv]: DROPPED - session throttle or flow control
0/RSP0/CPU0
-----------
BBA-Group pppoegroup
-------------------
MAC throttles not configured.
MAC-Access-interface throttles not configured.
MAC-IWF-Access-interface throttles not configured.
Circuit-ID throttles not configured.
Remote-ID throttles not configured.
Inner-VLAN-ID throttles not configured.
Outer-VLAN-ID throttles not configured.
VLAN-ID throttles not configured.
Circuit-ID-and-Remote-ID throttles not configured.
0/0/CPU0
--------
BBA-Grouppppoegroup
-------------------
MAC throttles not configured.
MAC-Access-interface throttles not configured.
MAC-IWF-Access-interface throttles not configured.
Circuit-ID throttles not configured.
Remote-ID throttles not configured.
Inner-VLAN-ID throttles not configured.
Outer-VLAN-ID throttles not configured.
VLAN-ID throttles not configured.
Circuit-ID-and-Remote-ID throttles not configured.
RP/0/RSP0/CPU0:ios#show interfaces tenGigE 0/0/2/1
Wed Feb 8 21:06:08.080 UTC
TenGigE0/0/2/1 is up, line protocol is up
Interface state transitions: 1
Hardware is TenGigE, address is 00c1.6463.0501 (bia 00c1.6463.0501)
Layer 1 Transport Mode is LAN
Internet address is Unknown
MTU 1514 bytes, BW 10000000 Kbit (Max: 10000000 Kbit)
reliability 255/255, txload 0/255, rxload 0/255
Encapsulation ARPA,
Full-duplex, 10000Mb/s, SR, link type is force-up
output flow control is off, input flow control is off
==============================
Configured Access Interfaces
==============================
Ready 599
Not-Ready 0
-----------------------------
TOTAL 599
==============================
PPPoE Sessions
==============================
Complete 9702
Incomplete 406
-----------------------------
TOTAL 10108
==============================
Flow Control
==============================
Limit 2000
In Flight 1918
Dropped 363126
Disconnected 19472
Successful 15705
Can't see what could be limiting the connections.
you are hitting the in flight flow control limit. that means the maximum number of embryonic connections. it is like what CAC used to do in IOS.
RP/0/RSP0/CPU0:A9K-BNG#show pppoe summary total location 0/rsP0/CPU0
....<output omitted>...
==============================
Flow Control
==============================
Limit 2000
In Flight 0
Dropped 0
Disconnected 1
Successful 9
You can adjust the inflight window, default 2000 via:
RP/0/RSP0/CPU0:A9K-BNG(config)#pppoe in-flight-window 2000
xander
Xander,
By default, it came with a 200 limit flight window, and subscribers connected until 8300 subscribers. Then I changed the in-flight-windows to 2000. And nothing else changed.
My current config is:
!
pppoe in-flight-window 2000
!
show pppoe summary total location 0/rsP0/CPU0
Thu Feb 9 08:15:50.686 UTC
==============================
Configured Access Interfaces
==============================
Ready 0
Not-Ready 0
-----------------------------
TOTAL 0
==============================
PPPoE Sessions
==============================
Complete 0
Incomplete 0
-----------------------------
TOTAL 0
==============================
Flow Control
==============================
Limit 2000
In Flight 0
Dropped 0
Disconnected 0
Successful 0
show pppoe summary total location 0/0/CPU0
Thu Feb 9 08:16:04.556 UTC
==============================
Configured Access Interfaces
==============================
Ready 599
Not-Ready 0
-----------------------------
TOTAL 599
==============================
PPPoE Sessions
==============================
Complete 0
Incomplete 0
-----------------------------
TOTAL 0
==============================
Flow Control
==============================
Limit 2000
In Flight 0
Dropped 777903
Disconnected 86550
Successful 17024
Sorry, I forgot to mention... I'm expecting this router to connect around 16k.
What should be a coherent value for flight window?
hard to say not knowing the backend. if there are 2000 sessions in flight, I would want to see where the stagnation is, is that because of radius timeouts/delays/waiting for authentication. it would mean that the radius server is not capable of driving the sessions we offer, so increasing that number to 3 or 4 thousand is only aggravating that problem.
best to see is via the radius stat what the server's performance is and optimize the incoming session rate based on that. if we see an average response time of say 2 seconds but our timeout is 1 second, it would mean that we are sending a lot of duplicate requests that have to be handled.
it could also be that there is a lot of (re)negotiation on LCP/NCP going on, retransmissions or etc. we need to look for the ppp disconnect reason, if any, to see if there is an optimization being done.
the session throttling we do at the incoming is effectively protecting the backends and ensuring that sessions we are committed to have a chance of succeeding, depending on timeouts (LCP, retrans/max-configure) auth timeout and ncp timeout would need to match what the sessions and radius are able to cover. otherwise we still cant complete sessions we committed to.
cheers!
xander
One thing I've noted when the limit was reached out, was duplicated auth requests arriving on the freeradius server which is coherent to what you said would happen.
Maybe I could simulate the issue limiting PADI reaching the server (dropping access interface ports a while).
Do you think I could adjust some sort of timeout to wait a bit more for freeradius answers? What could I do to accomplish this?
I'm working with a freeradius server with 8GB RAM and a MySQL server with 12GB RAM with good processors. I will try again today, and I'll monitor resource usage as radius stats.
This what I got from show radius:
Global dead time: 0 minute(s)
Number of Servers: 1
Server: x.x.x.x/1812/1813 is UP
Address family: IPv4
Total Deadtime: 0s Last Deadtime: 0s
Timeout: 5 sec, Retransmit limit: 3
Quarantined: No
Authentication:
186439 requests, 0 pending, 185 retransmits
177131 accepts, 9307 rejects, 0 challenges
186 timeouts, 0 bad responses, 0 bad authenticators
0 unknown types, 0 dropped, 1001 ms latest rtt
Throttled: 0 transactions, 0 timeout, 0 failures
Estimated Throttled Access Transactions: 0
Maximum Throttled Access Transactions: 0
Automated TEST Stats:
0 requests, 0 timeouts, 0 response, 0 pending
Accounting:
41969 requests, 0 pending, 21317 retransmits
36140 responses, 27146 timeouts, 0 bad responses
0 bad authenticators, 0 unknown types, 0 dropped
862 ms latest rtt
Throttled: 0 transactions, 0 timeout, 0 failures
Estimated Throttled Accounting Transactions: 0
Maximum Throttled Accounting Transactions: 0
Automated TEST Stats:
0 requests, 0 timeouts, 0 response, 0 pending
What do you think about those stats?
Hi,
HW resources should be ok for your radius server.
You can check again with "htop" on linux what is the current CPU usage. This tool is great.
You could also add a second radius and the BNG will load balance between them.
If you want to be sure that it's the radius server fault you can create a subscriber policy map without authentication, just activate dyn map.
You need to set up the dyn templet with the pool, ipv4 unumbered etc
If all session come up you will know who is the bad guy
I'm shaping/polycing each subscriber.
Quoting you Xander:
Do you mean it will be a limit of 8K because I'm shaping subscribers?
"So the total sum of sessions on that MPA can NOT exceed 32k in a mod80. There is no per port limit. However if you have shaping QOS, you need to configure the QOS resources such as to where you need them on which port. Because the parent shapers are in 8k chunks, and if you have one interface with 10k sessions, you need to assign 2 chunks to that interface. This chunking stuff only applies to sessions that have shaping needs (policing not affected by this)."
Would that be the problem? and what would be "assign 2 chunks to that interface"?
Thank you
Find answers to your questions by entering keywords or phrases in the Search bar above. New here? Use these resources to familiarize yourself with the community: