OSPF P2MP Issue

terrygwazdosky · ‎01-18-2016

We're running OSPF point-to-multipoint on two different asynchronous CES clouds so that we can use neighbor statements to define the bandwidth of slower neighbors. Up until now it has been working great, but now I'm experiencing an odd issue with one node on one of the clouds.

Recently I noticed that one node shows the dead timer expiring for all of its neighbors at random times. OSPF is never down for more than a second or two before the adjacencies re-establish. The other neighbors do not log dead timer expirations, but they do show OSPF going FULL with the problematic neighbor: %OSPF-5-ADJCHG: Process 1, Nbr 10.226.1.56 on GigabitEthernet0/1 from LOADING to FULL, Loading Done.
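For anyone who wants to line up flaps like this against a provider's event timeline, here is a minimal Python sketch that pulls the fields out of ADJCHG lines and counts how often each neighbor comes back to FULL. The regular expression is based only on the single log line quoted above, so treat the pattern as an assumption; other IOS releases or the "detail" logging option may format the message differently. The function names are made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the one ADJCHG line quoted in this post (an assumption,
# not an official format definition).
ADJCHG = re.compile(
    r"%OSPF-5-ADJCHG: Process (?P<process>\d+), "
    r"Nbr (?P<nbr>\S+) on (?P<intf>\S+) "
    r"from (?P<old>\S+) to (?P<new>\S+), (?P<reason>.+)"
)


def parse_adjchg(line):
    """Return a dict of fields for an ADJCHG line, or None if it does not match."""
    match = ADJCHG.search(line)
    return match.groupdict() if match else None


def count_transitions_to_full(lines):
    """Count, per neighbor router ID, how often an adjacency reached FULL."""
    counts = Counter()
    for line in lines:
        event = parse_adjchg(line)
        if event and event["new"] == "FULL":
            counts[event["nbr"]] += 1
    return counts


sample = [
    "%OSPF-5-ADJCHG: Process 1, Nbr 10.226.1.56 on GigabitEthernet0/1 "
    "from LOADING to FULL, Loading Done",
]
print(count_transitions_to_full(sample))
```

Feeding it the syslog from several routers on the segment would show whether the re-establishments cluster around the same moments, which points at the shared transport rather than at one box.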

The problematic node is an ASR1006. The interface does not go down and is not showing any errors.

Here are the things I've tried so far that have not helped:

Opened a trouble ticket with our CES service provider, but they have not been able to find an issue on their end.
Increased the dead interval from 3 to 5 seconds on all nodes.
Removed a shaping service policy from the interface.
Replaced all the cables involved.
Upgraded the IOS from asr1000rp1-adventerprisek9.03.10.00.S.153-3.S to asr1000rp1-adventerprisek9.03.16.01a.S.155-3.

Here is the relevant config:

interface GigabitEthernet1/0/0
 bandwidth 50000
 ip address 10.226.126.1 255.255.255.224
 no ip redirects
 no ip proxy-arp
 ip flow monitor my-monitor input
 ip ospf authentication message-digest
 ip ospf message-digest-key 1 md5 abcdefg12345
 ip ospf network point-to-multipoint
 ip ospf dead-interval 5
 ip ospf hello-interval 1
 load-interval 30
 negotiation auto

!

router ospf 1
 router-id 10.226.1.56
 ispf
 log-adjacency-changes detail
 auto-cost reference-bandwidth 10000
 timers lsa arrival 80
 passive-interface default
 no passive-interface GigabitEthernet1/0/0
 network 10.226.126.0 0.0.0.31 area 0
 neighbor 10.226.126.13 cost 2000
 neighbor 10.226.126.12 cost 3333
 neighbor 10.226.126.30 cost 200
 neighbor 10.226.126.11 cost 2000
 neighbor 10.226.126.2 cost 5000
 neighbor 10.226.126.3 cost 3333
 neighbor 10.226.126.4 cost 5000
 neighbor 10.226.126.5 cost 3333
 neighbor 10.226.126.6 cost 1000
 neighbor 10.226.126.7 cost 5000
 neighbor 10.226.126.8 cost 5000
 neighbor 10.226.126.9 cost 5000
 neighbor 10.226.126.10 cost 5000

Thank you for any input/insight you can provide.

Rolf Fischer · ‎01-18-2016

Hi,

I don't see the 'non-broadcast' keyword in the 'ip ospf network point-to-multipoint' line, have you configured this interface to use multicast hellos?

Rolf

terrygwazdosky · ‎01-18-2016

Rolf - yes, it is using multicast hellos.

Rolf Fischer · ‎01-18-2016

I remembered this older post: https://supportforums.cisco.com/discussion/12279701/issue-ospf-point-multipoint-over-ces-cloud. There you used unicast. Have you changed that on all routers on that segment?

Communication between the routers should work in either case but the per-neighbor cost assignment normally requires the non-broadcast type.

terrygwazdosky · ‎01-18-2016

Wow, good memory. :) At the time I had tried both to get around the issue I was having. Once you helped me with the proxy arp fix (thanks again!) I tried both and settled on multicast hellos so that I didn't have to define all neighbors on each node, just the ones that have lower bandwidth. This particular node is one of two with 50Mb connections and the rest vary.

Here's the "show ip ospf interface" output which shows the neighbor costs are correct:

GigabitEthernet1/0/0 is up, line protocol is up
Internet Address 10.226.126.1/27, Area 0, Attached via Network Statement
Process ID 1, Router ID 10.226.1.56, Network Type POINT_TO_MULTIPOINT, Cost: 200
Topology-MTID    Cost    Disabled    Shutdown      Topology Name
        0           200       no          no            Base
Transmit Delay is 1 sec, State POINT_TO_MULTIPOINT
Timer intervals configured, Hello 1, Dead 5, Wait 5, Retransmit 5
    oob-resync timeout 40
    Hello due in 00:00:00
Supports Link-local Signaling (LLS)
Cisco NSF helper support enabled
IETF NSF helper support enabled
Can be protected by per-prefix Loop-Free FastReroute
Can be used for per-prefix Loop-Free FastReroute repair paths
Index 1/6/6, flood queue length 0
Next 0x0(0)/0x0(0)/0x0(0)
Last flood scan length is 1, maximum is 37
Last flood scan time is 0 msec, maximum is 3 msec
Neighbor Count is 13, Adjacent neighbor count is 13
    Adjacent with neighbor 192.168.255.209
     Cost in topology Base, MTID-0 is 5000
    Adjacent with neighbor 10.21.255.255
     Cost in topology Base, MTID-0 is 5000
    Adjacent with neighbor 10.20.255.255
     Cost in topology Base, MTID-0 is 5000
    Adjacent with neighbor 10.12.255.255
     Cost in topology Base, MTID-0 is 5000
    Adjacent with neighbor 192.168.255.204
     Cost in topology Base, MTID-0 is 1000
    Adjacent with neighbor 192.168.255.197
     Cost in topology Base, MTID-0 is 3333
    Adjacent with neighbor 192.168.255.206
     Cost in topology Base, MTID-0 is 5000
    Adjacent with neighbor 192.168.255.205
     Cost in topology Base, MTID-0 is 3333
    Adjacent with neighbor 192.168.255.203
     Cost in topology Base, MTID-0 is 5000
    Adjacent with neighbor 10.122.255.255
     Cost in topology Base, MTID-0 is 2000
    Adjacent with neighbor 10.226.1.9
     Cost in topology Base, MTID-0 is 200
    Adjacent with neighbor 10.6.255.255
     Cost in topology Base, MTID-0 is 3333
    Adjacent with neighbor 10.102.255.255
     Cost in topology Base, MTID-0 is 2000
Suppress hello for 0 neighbor(s)
Cryptographic authentication enabled
    Youngest key id is 1
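As a sanity check on the numbers above, here is a small Python sketch of the usual IOS cost arithmetic (reference bandwidth divided by interface bandwidth, truncated, minimum 1). It reproduces the interface cost of 200 from "bandwidth 50000" and "auto-cost reference-bandwidth 10000". The second function works backwards from each neighbor cost to the bandwidth it would correspond to if it had been derived the same way; that is an inference for illustration, since the actual neighbor speeds are not stated in this thread. The function names are hypothetical.

```python
def ospf_cost(reference_mbps, bandwidth_kbps):
    """Cost = reference bandwidth / interface bandwidth, truncated.

    reference_mbps is in Mbit/s (as in 'auto-cost reference-bandwidth'),
    bandwidth_kbps is in kbit/s (as in the interface 'bandwidth' command).
    The result is clamped to the 16-bit OSPF cost range 1..65535.
    """
    cost = (reference_mbps * 1000) // bandwidth_kbps
    return max(1, min(cost, 65535))


def implied_bandwidth_mbps(reference_mbps, cost):
    """Bandwidth in Mbit/s that a given cost would correspond to (assumption)."""
    return reference_mbps / cost


REFERENCE_MBPS = 10000  # auto-cost reference-bandwidth 10000

# Interface cost for 'bandwidth 50000' (50 Mbit/s)
print(ospf_cost(REFERENCE_MBPS, 50000))

# Bandwidth implied by each configured neighbor cost
for cost in (200, 1000, 2000, 3333, 5000):
    print(cost, round(implied_bandwidth_mbps(REFERENCE_MBPS, cost), 1))
```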

Rolf Fischer · ‎01-18-2016

Thanks, I just wanted to be sure that the network types match.

I always thought (for whatever reason) that the neighbor cost command doesn't work on broadcast interfaces. Obviously that's not true, and the documentation is clear on this point.

Unfortunately I don't have a good idea how to troubleshoot this issue. The hello interval is only 1 second, so debugging OSPF packets from a particular neighbor to the buffer is probably not advisable.

terrygwazdosky · ‎01-21-2016

I got the config from reading this article: http://www.netcraftsmen.com/using-ospf-point-to-multipoint-on-ethernet/. Since implementing it, I've found conflicting documentation from Cisco, most of it saying this should only work with unicast.

As to the issue, I tried another router and got the same result. After I told the service provider this, they looked again and found errors on one of their OC rings. So it looks like this mystery is solved.

Rolf Fischer · ‎01-21-2016

Thanks for coming back and telling us how you solved the issue.

And thanks for teaching me something new about the IOS OSPF implementation ;)

Your configuration looked good, and after your troubleshooting steps I would have focussed on the SP network as well. I didn't want to recommend something like IP SLA tracking because your outages were very short and I always try to avoid probes with such short intervals in production environments. Another idea was to set up an additional peering between the ASR1006 and another router on this segment with another routing protocol (for instance iBGP) and let BFD do the link monitoring, but I've never done this over a VPLS.