
ME3600 RP and BFD affected by loop (DoS) on CE LAN with non-intelligent switches

Hello all

We have implemented a metro network running MPLS L3 VPN across more than 100 sites.

The ring topology connects several ME-3600X-24FS-M switches running IOS 12.2(52)EY2.

The issue we are facing is caused by the fact that most customers connected to the metro do not have intelligent L2/L3 switches or routers and use the L3 interface on the ME as the default gateway for their LANs. Not a good design, I would say, but we cannot change that at the moment.

L3 interface on ME (PE) to LAN SW (CE):

interface GigabitEthernet0/7
 description CONEXAO CE XXX
 port-type nni
 no switchport
 ip vrf forwarding GLOBAL
 ip address 10.xx.0.1 255.255.255.0

The IP address above is the default gateway for the customer LAN.

The links between ME are configured this way:

description UPLINK ME-ME
 port-type nni
 no switchport
 mtu 1600
 ip address 10.XX.YY.WW 255.255.255.252
 no ip redirects
 ip ospf authentication message-digest
 ip ospf message-digest-key 2 md5 7 135740425E5E547B7A76786166381C2324
 ip ospf network point-to-point
 ip ospf mtu-ignore
 mpls ip
 bfd interval 70 min_rx 70 multiplier 3
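
For reference, with interval 70, min_rx 70 and multiplier 3, BFD declares a neighbor down after roughly 70 ms x 3 = 210 ms without received control packets. The routing protocols also have to be registered as BFD clients; a minimal sketch of that part (the OSPF process ID, BGP AS number and neighbor address below are placeholders, not our real values):

router ospf 1
 bfd all-interfaces
!
router bgp 65000
 neighbor 10.XX.YY.WW fall-over bfd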

When a loop happens in a customer LAN, for instance by connecting the same cable to two different interfaces on the same switch, the RP CPU goes above 40%, BFD drops the sessions between the MEs, and the OSPF and BGP adjacencies are also torn down.

We are pursuing ways to reduce the impact of this DoS, but so far no definitive solution has been found.

1) Control Plane Policing cannot be implemented, because most of the CoPP features available in IOS 15.2(2)S and above are not present in IOS 12.2EY.

2) Storm control can apparently be enabled on L3 interfaces (strange, because it should be L2 only). We tested it in a lab for broadcast, but it only works for certain types of traffic generated by the loop (CDP and STP are not included). For the shutdown action, see the error-disable recovery sketch after this list.

interface GigabitEthernet0/1
 port-type nni
 no switchport
 no ip address
 storm-control broadcast level 6.00
 storm-control action shutdown

3) We are also considering changing the BFD parameters on the uplinks to the other MEs, but so far we have not found values that would avoid the incident.
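
Regarding the storm-control shutdown action in point 2, the port ends up error-disabled once the threshold is hit, so some form of automatic recovery is needed for the customer port to come back on its own. A minimal sketch, assuming errdisable recovery for the storm-control cause is supported on 12.2EY (something we would still need to verify):

errdisable recovery cause storm-control
errdisable recovery interval 300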

Can anyone please advise on possible solutions to prevent customer LANs from causing this type of issue on the ME3600?

Regards

Pedro

9 Replies

Hi Pedro,

I noticed the same issue while testing in the lab with the ME3800X. As soon as we have a spanning-tree loop, our BFD sessions between the two ME3800X start bouncing, even with storm-control configured on the interfaces. After a lot of trial and error, I found a workaround:

(1) Change the values of the default CoPP settings:

platform qos policer cpu queue 1 10000000 1000000

Queue 1 is used for the routing protocols, and the default CIR is 1M. I tried 5M, but at least in my tests I got the best results with 10000000 1000000.

(2) The above CoPP change improved things dramatically, but BFD was still bouncing every 10 minutes. I changed the BFD timers to bfd interval 50 min_rx 300 multiplier 3; originally the min_rx was set to 50.
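
On the BFD-enabled uplinks the change looks roughly like this (the interface name is just an assumption for illustration); with min_rx 300 and a multiplier of 3 the session now tolerates roughly 900 ms without received BFD packets, instead of the 150 ms it had with min_rx 50:

interface GigabitEthernet0/2
 bfd interval 50 min_rx 300 multiplier 3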

Although this is a workaround, it still doesn't make sense that customer traffic can affect control-plane traffic; this is a pretty serious issue. CoPP needs to be enhanced on the ME3800 to allow proper policing policies, as at the moment it only allows creating a MAC-based ACL, which is not sufficient.

Regards

Mathias

Why can't you add an ingress policer at the edge to drop violating traffic before any rewrite?

A standard inbound policer was already applied on the service instance, which polices the traffic down to the committed customer CIR of 100 Mbps, but the problem still occurs. Customer traffic on a service instance shouldn't affect the control plane of the switch.
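
For context, the inbound policer on the service instance looks roughly like this; a minimal sketch only, as the policy-map name, interface, VLAN and exact police syntax are illustrative rather than our actual configuration:

policy-map CUSTOMER-100M
 class class-default
  police cir 100000000
!
interface GigabitEthernet0/3
 service instance 10 ethernet
  encapsulation dot1q 10
  service-policy input CUSTOMER-100M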

Regards,

Mathias

True. Customer traffic shouldn't affect this. Is it a DoS attack?

Did you find any solution?

Not yet - only a workaround so far by changing the Control Plane Policing settings:

platform qos policer cpu queue 1 10000000 1000000

and under the relevant BFD enabled interfaces, changing the timers to:

bfd interval 50 min_rx 300 multiplier 3

We have a TAC case open and waiting on Cisco to reproduce this in the lab. The two commands above mitigate the problem, but this should really be fixed in IOS.

We used the same workaround, but this is not a scalable solution. If this happens again, are we going to keep increasing the CoPP policer for queue 1 and the BFD timers?

In the meantime, TAC says that all the routing protocols and BFD are using queue 1, and that multicast and OSPF are using queue 7.

Yeah, it took a lot of trial and error on my part to find out which CoPP queue and setting was needed to create a temporary workaround, but for sure this can't be the final solution.

I wonder why storm control is not shutting down the gig interface when the broadcast storm reaches level 6. According to Cisco, storm control probes the traffic every 200 ms. With the BFD timers set to 600 ms, storm control should shut down the offending interface before queue 1 starts dropping the excess BFD packets. Thoughts?
