Solved: Traffic is dropping after reload because of AAR policy

dijix1990 · ‎07-07-2023

vManage: 20.9.3

cEdges: 20.9.3a

I noticed that with an AAR policy applied, all traffic is dropped for 15-20 minutes after the device reloads.

I ran a packet trace and saw DROP 483 (SdwanDataPolicyDrop):

show platform packet-trace packet 1
Packet: 1           CBUG ID: 1
Summary
  Input     : GigabitEthernet0/0/0.920
  Output    : GigabitEthernet0/0/0.920
  State     : DROP 483 (SdwanDataPolicyDrop)
  Timestamp
    Start   : 560305080124 ns (07/08/2023 05:12:19.338387 UTC)
    Stop    : 560305192379 ns (07/08/2023 05:12:19.338500 UTC)
Path Trace
  Feature: IPV4(Input)
    Input       : GigabitEthernet0/0/0.920
    Output      : <unknown>
    Source      : 172.26.98.4
    Destination : 172.18.7.22
    Protocol    : 1 (ICMP)
  Feature: CFT
    API                   : cft_handle_pkt
    packet capabilities   : 0x0000018c
    input vrf_idx         : 0
    calling feature       : STILE
    direction             : Input
    triplet.vrf_idx       : 6
    triplet.network_start :  0x100bf92
    triplet.triplet_flags : 0x00000000
    triplet.counter       : 26
    cft_bucket_number     : 1313395
    cft_l3_payload_size   : 64
    cft_pkt_ind_flags     : 0x00000000
    cft_pkt_ind_valid     : 0x00000931
    tuple.src_ip          : 172.26.98.4
    tuple.dst_ip          : 172.18.7.22
    tuple.src_port        : 5060
    tuple.dst_port        : 51060
    tuple.vrfid           : 4
    tuple.l4_protocol     : ICMP
    tuple.l3_protocol     : IPV4
    vrf_nums              : 1
    pkt_sb.num_flows      : 0
    pkt_sb.tuple_epoch    : 26
    returned cft_error    : 14
    returned fid          : 0
  Feature: NBAR
    Packet number in flow: N/A
    Classification state: Final
    Classification name: ping
    Classification ID: 1404 [CANA-L7:479]
    Candidate classification sources:
      N/A
    Classification visibility name: ping
    Classification visibility ID: 1404 [CANA-L7:479]
    Number of matched sub-classifications: 0
    Number of extracted fields: 0
    Is PA (split) packet: False
    Is FIF (first in flow) packet: False
    TPH-MQC bitmask value: 0x0
    Source MAC address: 70:0B:4F:FF:C7:C1
    Destination MAC address: 00:87:64:80:06:30
    Traffic Categories:
      ms-office-365/category: unset
      ms-office-365/service-area: unset
      sdavc/feed-id:   0
      webex/region:   0
  Feature: SDWAN App Route Policy
    VPN ID       : 15
    VRF          : 6
    Policy Name  : _VPN-12_Branch-Voice_AAR-VOIP-BRANCH_VPN-10-11_15_Branch_AAR-DATA-BRANCH-VPN-10-11_15_Branch (CG:3)
    Seq          : 1
    Req SLA      : Default (1)
    Act SLA      : __all_tunnels__ (0)
    Policy Flags : 0x21
    Fallback to best Path : no
    SLA Strict   : Yes
    Actual Color : Undetermined (0)
    Preferred Color : biz-internet public-internet  (0x30)
    Tunnel Match Reason : MATCHED_NONE_SLA_STRICT
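
When the same check has to be repeated over many traced packets, the handful of fields that matter in this trace can be pulled out with a short script. The following is a minimal Python sketch, assuming the trace text has been saved as a string; `summarize_trace` is a hypothetical helper name, not part of any Cisco tooling.

```python
def summarize_trace(trace_text):
    """Collect a few 'key : value' fields from packet-trace text (hypothetical helper)."""
    wanted = ("State", "Req SLA", "Act SLA", "SLA Strict", "Tunnel Match Reason")
    summary = {}
    for line in trace_text.splitlines():
        # Split on the first colon only, so values that contain colons stay intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in wanted and key not in summary:
            summary[key] = value.strip()
    return summary


# Lines copied from the trace in this post.
sample = """\
  State     : DROP 483 (SdwanDataPolicyDrop)
    Req SLA      : Default (1)
    Act SLA      : __all_tunnels__ (0)
    SLA Strict   : Yes
    Tunnel Match Reason : MATCHED_NONE_SLA_STRICT
"""

for field, value in summarize_trace(sample).items():
    print(f"{field}: {value}")
```

The combination to look for is a requested SLA class (`Req SLA`) that differs from the actual one (`Act SLA`) together with `SLA Strict : Yes`.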

I use AAR to force VoIP traffic onto the MPLS transport and to keep all other traffic off MPLS.

sh sdwan policy from-vsmart
from-vsmart sla-class Default
 loss    25
 latency 300
 jitter  100
from-vsmart sla-class Realtime
 loss    1
 latency 150
 jitter  30
from-vsmart app-route-policy _VPN-12_Branch-Voice_AAR-VOIP-BRANCH_VPN-10-11_13_15-16_Branch_AAR-DATA-BRANCH
 vpn-list VPN-10-11_13_15-16_Branch
  sequence 1
   match
    source-data-prefix-list aar-data-global
    destination-ip          0.0.0.0/0
   action
    sla-class       Default
    sla-class strict
    sla-class preferred-color biz-internet public-internet
 vpn-list VPN-12_Branch-Voice
  sequence 1
   match
    source-ip      10.10.0.0/16
    destination-ip 10.10.0.0/16
   action
    backup-sla-preferred-color biz-internet public-internet
    sla-class       Realtime
    no sla-class strict
    sla-class preferred-color mpls
from-vsmart lists vpn-list VPN-10-11_13_15-16_Branch
 vpn 10-11
 vpn 13
 vpn 15-16
from-vsmart lists vpn-list VPN-12_Branch-Voice
 vpn 12
from-vsmart lists data-prefix-list aar-data-global
 ip-prefix 172.16.0.0/12
 ip-prefix 192.168.0.0/19
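
How this policy can lead to MATCHED_NONE_SLA_STRICT can be sketched with a toy model. This is not Cisco's forwarding logic; it only assumes that a tunnel is usable for an SLA class once it has been placed in that class, that strict mode drops when no tunnel qualifies, and that non-strict mode falls back to every available tunnel. The `Tunnel` class, `select_tunnels`, and the post-classification state are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Tunnel:
    color: str
    sla_indexes: set = field(default_factory=lambda: {0})  # 0 = __all_tunnels__


def select_tunnels(tunnels, required_sla, preferred_colors, strict):
    """Toy tunnel selection for one app-route sequence (assumed logic)."""
    in_sla = [t for t in tunnels if required_sla in t.sla_indexes]
    if in_sla:
        preferred = [t for t in in_sla if t.color in preferred_colors]
        return preferred or in_sla
    if strict:
        return []  # nothing meets the SLA class and strict is set: drop
    return list(tunnels)  # non-strict: spread over every tunnel, MPLS included


colors = ("biz-internet", "public-internet", "mpls")
preferred = {"biz-internet", "public-internet"}

# Right after reload the stats show sla-class-index 0 only.
after_reload = [Tunnel(c) for c in colors]
print([t.color for t in select_tunnels(after_reload, 1, preferred, strict=True)])
print([t.color for t in select_tunnels(after_reload, 1, preferred, strict=False)])

# Hypothetical state once the tunnels have been placed in class 1 (Default).
classified = [Tunnel(c, {0, 1}) for c in colors]
print([t.color for t in select_tunnels(classified, 1, preferred, strict=True)])
```

In this model the strict case returns no tunnel until classification has happened, while the non-strict case also returns the mpls tunnel, which matches both symptoms reported in this thread.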

Kanan Huseynli · ‎07-14-2023

Based on our investigation, it looks like misbehavior.

Remote device fails -> BFD sessions go down -> the local device still tries to build tunnels using the previously known device information -> the local device keeps counting SLA parameters for the next poll intervals and includes them in the SLA measurement. This happens because of OMP graceful restart (known TLOCs are not purged when OMP peering is down, which is reasonable).

The misbehavior is that remote devices still include those poll intervals in the calculation while BFD is down (100% loss).
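
The effect of one such interval can be illustrated with a toy calculation. It assumes the mean loss is a plain average over the last `multiplier` poll intervals; that averaging rule and the function name are assumptions for illustration, not confirmed device behavior. The 25% limit is the loss threshold of the Default SLA class from this thread.

```python
from collections import deque

DEFAULT_LOSS_LIMIT = 25  # loss threshold of SLA class "Default" in this thread


def mean_loss_history(interval_losses, multiplier):
    """Mean loss after each poll interval, averaged over the last `multiplier` intervals."""
    window = deque(maxlen=multiplier)
    means = []
    for loss in interval_losses:
        window.append(loss)
        means.append(sum(window) / len(window))
    return means


# One interval measured while BFD was down (100% loss), then two clean intervals.
for n, mean in enumerate(mean_loss_history([100, 0, 0], multiplier=2), start=1):
    verdict = "within" if mean <= DEFAULT_LOSS_LIMIT else "outside"
    print(f"after interval {n}: mean loss {mean:.0f}% -> {verdict} Default SLA")
```

Under these assumptions a single 100% interval keeps the tunnel outside the Default class until it ages out of the averaging window.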

HTH,
Please rate and mark as an accepted solution if you have found any of the information provided useful.


dijix1990 · ‎07-08-2023

It happens when I enable the Strict/Drop option. If I change it to Load Balance, traffic passes, but then I notice that the MPLS transport is used for more than just VoIP traffic.

Kanan Huseynli · ‎07-08-2023

Did you run the packet trace on the reloaded device or on one of the remote devices? And what are your BFD parameters (poll interval and multiplier)?


dijix1990 · ‎07-08-2023

I did it after rebooting. When vManage and the cEdges were on version 20.7 it worked normally, but now, after a reload, the device waits for the poll interval. Stupid behaviour after rebooting... Cisco SD-WAN has become worse (that's my opinion; I compare it with VMware, which I use for 100 branches).

The default poll interval is 10 minutes, so traffic does not pass for 10 minutes after a reboot.

Kanan Huseynli · ‎07-09-2023

So the rebooted device and the device where you ran the trace are the same, right?

What I suspect: when a remote device fails, the tunnels (BFD) to that device go down, as expected. But any local device that had information about the reloaded remote node also counts poll-interval results (% loss) for the previously known tunnels (TLOC to TLOC).


dijix1990 · ‎07-09-2023

Yes, it's the same. I repeated it on several devices: C8000V, 1111X, 4331. They all behave the same way: after a reload, traffic is dropped until one poll interval expires. It happens only when the AAR action is Strict/Drop.

Kanan Huseynli · ‎07-09-2023

If it is the same device, this is strange behavior. For remote sites it would be understandable because of the previously known BFD information and the poll interval, but for the local router (where the reboot happens) it is strange.

Please share the "show sdwan app-route stats" and "show sdwan app-route sla" outputs taken immediately after the reboot.


dijix1990 · ‎07-09-2023

I'll try to do it today (show sdwan app-route stats)

Should the "show sdwan app-route sla" command show anything special?

show sdwan app-route sla
                                                      APP PROBE
INDEX   NAME                  LOSS  LATENCY  JITTER   CLASS ID   APP PROBE CLASS       FALLBACK BEST TUNNEL
-------------------------------------------------------------------------------------------------------------------------------------------
0       __all_tunnels__       0     0        0        0          None                  None
1       Default               25    300      100      0          None                  None
2       Realtime              1     150      30       0          None                  None

Kanan Huseynli · ‎07-11-2023

No, this command only shows the SLA class definitions. We can then compare them with the actual tunnel values to see whether a tunnel meets the SLA or not.
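
That comparison can be sketched in a few lines of Python. The thresholds are the ones shown by "show sdwan app-route sla" in this thread; treating a tunnel as compliant when every mean value is less than or equal to the threshold is an assumption about the comparison, and `classes_met` is a hypothetical helper name.

```python
# Thresholds copied from the "show sdwan app-route sla" output in this thread.
SLA_CLASSES = {
    "Default": {"loss": 25, "latency": 300, "jitter": 100},
    "Realtime": {"loss": 1, "latency": 150, "jitter": 30},
}


def classes_met(mean_loss, mean_latency, mean_jitter, classes=SLA_CLASSES):
    """Names of the SLA classes whose limits all hold (assumes <= comparison)."""
    measured = {"loss": mean_loss, "latency": mean_latency, "jitter": mean_jitter}
    return [
        name
        for name, limits in classes.items()
        if all(measured[metric] <= limit for metric, limit in limits.items())
    ]


print(classes_met(0, 1, 0))      # a clean tunnel
print(classes_met(5, 200, 40))   # hypothetical degraded tunnel
print(classes_met(50, 10, 1))    # hypothetical tunnel with heavy loss
```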

Reboot the device and share the output of "show sdwan app-route stats" taken immediately afterwards.


dijix1990 · ‎07-11-2023

show sdwan app-route stats remote-system-ip 10.80.100.102
app-route statistics 10.20.10.10 10.10.10.10 ipsec 12346 12346
 remote-system-ip         10.80.100.102
 local-color              biz-internet
 remote-color             public-internet
 sla-class-index          0
 fallback-sla-class-index None
 app-probe-class-list None
  mean-loss    0
  mean-latency 0
  mean-jitter  0
  interval 0
   total-packets     0
   loss              0
   average-latency   0
   average-jitter    0
   tx-data-pkts      0
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0
  interval 1
   total-packets     0
   loss              0
   average-latency   0
   average-jitter    0
   tx-data-pkts      0
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0
app-route statistics 10.30.10.10 10.10.10.10 ipsec 12346 12346
 remote-system-ip         10.80.100.102
 local-color              public-internet
 remote-color             public-internet
 sla-class-index          0
 fallback-sla-class-index None
 app-probe-class-list None
  mean-loss    0
  mean-latency 0
  mean-jitter  0
  interval 0
   total-packets     0
   loss              0
   average-latency   0
   average-jitter    0
   tx-data-pkts      0
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0
  interval 1
   total-packets     0
   loss              0
   average-latency   0
   average-jitter    0
   tx-data-pkts      0
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0
app-route statistics 192.168.1.198 192.168.1.219 ipsec 12346 12346
 remote-system-ip         10.80.100.102
 local-color              mpls
 remote-color             mpls
 sla-class-index          0
 fallback-sla-class-index None
 app-probe-class-list None
  mean-loss    0
  mean-latency 0
  mean-jitter  0
  interval 0
   total-packets     0
   loss              0
   average-latency   0
   average-jitter    0
   tx-data-pkts      0
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0
  interval 1
   total-packets     0
   loss              0
   average-latency   0
   average-jitter    0
   tx-data-pkts      0
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0

I ran this immediately after the restart, and there were no packets until the first poll interval had passed. If I change the AAR action to Load Balance, there are packets immediately after the restart.
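
To compare several captures of this output over time, the text can be turned into a structure first. The following is a minimal Python sketch that assumes the layout shown above (one "app-route statistics" line per tunnel followed by two-token "key value" lines); `parse_app_route_stats` and `total_packets` are hypothetical helper names, and the sample is trimmed from the output in this post.

```python
def parse_app_route_stats(text):
    """Parse 'show sdwan app-route stats' text into one dict per tunnel."""
    tunnels, current, interval = [], None, None
    for raw in text.splitlines():
        parts = raw.split()
        if parts[:2] == ["app-route", "statistics"]:
            # New tunnel block: the two tunnel endpoint addresses follow.
            current = {"src_ip": parts[2], "dst_ip": parts[3], "intervals": {}}
            tunnels.append(current)
            interval = None
        elif current is None or len(parts) != 2:
            continue  # command echo, blank lines, or anything unexpected
        elif parts[0] == "interval":
            interval = current["intervals"].setdefault(int(parts[1]), {})
        else:
            value = int(parts[1]) if parts[1].isdigit() else parts[1]
            (current if interval is None else interval)[parts[0]] = value
    return tunnels


def total_packets(tunnel):
    """Sum of the total-packets counter over all stored intervals."""
    return sum(i.get("total-packets", 0) for i in tunnel["intervals"].values())


sample = """\
show sdwan app-route stats remote-system-ip 10.80.100.102
app-route statistics 10.20.10.10 10.10.10.10 ipsec 12346 12346
 remote-system-ip         10.80.100.102
 local-color              biz-internet
 sla-class-index          0
  mean-loss    0
  interval 0
   total-packets     0
   loss              0
  interval 1
   total-packets     0
   loss              0
"""

tunnels = parse_app_route_stats(sample)
for t in tunnels:
    print(t["local-color"], "sla-class-index", t["sla-class-index"],
          "total-packets", total_packets(t))
```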

Kanan Huseynli · ‎07-11-2023

Please share several outputs taken within the 10-minute time frame.

For example, outputs after 3, 5, and 8 minutes.

Are the tunnels (BFD) up during the first 10 minutes (the poll interval)?

What are the BFD configuration parameters?

Interval 10 minutes, multiplier 2?


dijix1990 · ‎07-11-2023

After startup all of the BFD sessions are up. BFD was configured with an interval of 5 minutes and a multiplier of 2.

Exactly 5 minutes later I can see the packet counters increase:

show sdwan app-route stats remote-system-ip 10.80.100.102
app-route statistics 10.20.10.10 10.10.10.10 ipsec 12346 12346
 remote-system-ip         10.80.100.102
 local-color              biz-internet
 remote-color             public-internet
 sla-class-index          0
 fallback-sla-class-index None
 app-probe-class-list None
  mean-loss    1
  mean-latency 1
  mean-jitter  0
  interval 0
   total-packets     166
   loss              3
   average-latency   1
   average-jitter    0
   tx-data-pkts      809
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0
  interval 1
   total-packets     0
   loss              0
   average-latency   0
   average-jitter    0
   tx-data-pkts      0
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0
app-route statistics 10.30.10.10 10.10.10.10 ipsec 12346 12346
 remote-system-ip         10.80.100.102
 local-color              public-internet
 remote-color             public-internet
 sla-class-index          0
 fallback-sla-class-index None
 app-probe-class-list None
  mean-loss    0
  mean-latency 1
  mean-jitter  0
  interval 0
   total-packets     167
   loss              1
   average-latency   1
   average-jitter    0
   tx-data-pkts      1134
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0
  interval 1
   total-packets     0
   loss              0
   average-latency   0
   average-jitter    0
   tx-data-pkts      0
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0
app-route statistics 192.168.1.198 192.168.1.219 ipsec 12346 12346
 remote-system-ip         10.80.100.102
 local-color              mpls
 remote-color             mpls
 sla-class-index          0
 fallback-sla-class-index None
 app-probe-class-list None
  mean-loss    0
  mean-latency 0
  mean-jitter  0
  interval 0
   total-packets     166
   loss              1
   average-latency   0
   average-jitter    0
   tx-data-pkts      169
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0
  interval 1
   total-packets     0
   loss              0
   average-latency   0
   average-jitter    0
   tx-data-pkts      0
   rx-data-pkts      0
   ipv6-tx-data-pkts 0
   ipv6-rx-data-pkts 0

Kanan Huseynli · ‎07-12-2023

You don't have any rx-data, which is strange.
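
The same observation can be expressed as a quick check. This is a minimal Python sketch using the tx-data-pkts and rx-data-pkts counters from interval 0 of the output above; `one_way_tunnels` is a hypothetical helper name, and a zero receive counter is only a hint, not proof that return traffic is missing.

```python
# Counters copied from interval 0 of the output above.
observed = [
    {"local_color": "biz-internet", "tx_data_pkts": 809, "rx_data_pkts": 0},
    {"local_color": "public-internet", "tx_data_pkts": 1134, "rx_data_pkts": 0},
    {"local_color": "mpls", "tx_data_pkts": 169, "rx_data_pkts": 0},
]


def one_way_tunnels(rows):
    """Local colors of tunnels that sent data packets but received none."""
    return [
        row["local_color"]
        for row in rows
        if row["tx_data_pkts"] > 0 and row["rx_data_pkts"] == 0
    ]


print(one_way_tunnels(observed))
```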

What do you see on the remote node (10.80.100.102)? Please share the same output, the BFD state, and so on.


dijix1990 · ‎07-12-2023

I'll check and share with you

Hm, maybe this is new behaviour for AAR strict? It depends on the first poll interval. I mean, if I change the poll interval to 10 minutes, traffic is not sent until the first poll interval has passed.

Kanan Huseynli · ‎07-12-2023

I don't think so, because it is a "bad user experience" when the customer has to wait for the poll interval. Let's see what happens on the remote node after the local router reboots. Check the same output on the remote node (system IP 10.80.100.102) while the local device is rebooting, after the reboot, and within the first poll interval.

I suspect the problem is in the return traffic. This can also be verified with a capture on the remote device.
