Cisco SD-Access Practical Deployment Best Practices - Scenarios

Karthik Kumar Thatikonda · 08-22-2019

Purpose of the document
Reference Topology
Use-Case
Symptom
Diagnosis
Solution
Additional References

Purpose of the document

This document describes general recommendations and best practices for designing and deploying Cisco SD-Access technology. It assumes that the reader has a general understanding of Cisco's SD-Access for Distributed Campus architecture, its components, and its operation.

Reference Topology

Figure1: Illustration of Reference logical topology use-case

Use-Case

In the scenario above (Figure1), a user in Fabric Site1 with IP address 192.168.6.148 is trying to reach an internet destination, 100.X.X.X. In this use-case, internet access for all users in Fabric Site1 is provided via the Fabric Site2 Borders. This is a typical user-to-internet flow in SD-Access for Distributed Campus using the SD-Access Transit architecture. To arrive at the best practice recommendation, this document splits the use-case into the following parts:

Symptom
Diagnosis
Solution

Symptom

Imagine that the specific traffic flow described above is not working. However, the users in Fabric Site1 are still able to access the other DC/DHCP/DDI resources, which are shown logically in the upper-left corner of the reference topology.

Diagnosis

Further examination of the problem shows that the internet traffic towards the user is being dropped on the Fabric Site2 Borders.

Figure2: Illustration of the actual data packet flow

Data Packet Flow:

Ideally, at a high level, traffic must first go over a VXLAN tunnel from the Fabric Site2 Borders to the Fabric Site1 Borders, and then traverse another VXLAN tunnel from the Fabric Site1 Borders to the Fabric Edge / Access where that specific user (192.168.6.148) is attached. To keep the focus on the packet flow, the reference logical topology is further simplified as shown above (Figure2).

Now, let us walk through the high-level troubleshooting steps during the problem state, starting with some basic questions:

Is the underlay network routing between RLOC1 and RLOC2 working? Yes! Ping works!

RLOC1#ping 10.4.30.7
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.4.30.7, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/2 ms

RLOC2#ping 10.4.30.11
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.4.30.11, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/2 ms

Since RLOC2 is where the problem is, can we check if the overlay is fine? No! See below.
Let us examine the map-cache for the user on RLOC2:

RLOC2#sh lisp instance-id 4101 ipv4 map-cache

LISP IPv4 Mapping Cache for EID-table default (IID 4097), 2 entries

192.168.6.148/32, uptime: 00:19:28, expires: 23:40:31, via map-reply, self, complete

  Locator     Uptime    State      Pri/Wgt     Encap-IID

  10.4.30.10  00:19:28  route-rejec 10/10        4101 

Note the locator state in the output above: route-reject (displayed as route-rejec).
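When many map-cache entries need to be checked, the same inspection can be scripted. The following is a minimal sketch, assuming the locator rows look like the ones in this document's output. The function name `unusable_locators` and the regular expression are illustrative and not part of any Cisco tooling.

```python
import re

# Matches a locator row such as:
#   10.4.30.10  00:19:28  route-rejec 10/10        4101
# The layout is assumed from the sample output in this document.
LOCATOR_ROW = re.compile(
    r"^\s*(?P<locator>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<uptime>\S+)\s+(?P<state>\S+)\s+(?P<pri_wgt>\d+/\d+)"
)


def unusable_locators(show_output: str) -> list[tuple[str, str]]:
    """Return (locator, state) pairs whose state is anything other than 'up'."""
    found = []
    for line in show_output.splitlines():
        match = LOCATOR_ROW.match(line)
        if match and match.group("state") != "up":
            found.append((match.group("locator"), match.group("state")))
    return found


sample = """\
192.168.6.148/32, uptime: 00:19:28, expires: 23:40:31, via map-reply, self, complete
  Locator     Uptime    State      Pri/Wgt     Encap-IID
  10.4.30.10  00:19:28  route-rejec 10/10        4101
"""
print(unusable_locators(sample))  # [('10.4.30.10', 'route-rejec')]
```

The EID line is skipped because the address is followed by a prefix length rather than whitespace, so only locator rows are evaluated.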

Let us check the underlay routing again towards RLOC1 IP 10.4.30.11

RLOC2#sh ip route 10.4.30.11
% Subnet not in table

RLOC2#sh ip route 
<trimmed>

Gateway of last resort is 10.4.1.30 to network 0.0.0.0

B* 0.0.0.0/0 [20/0] via 10.4.1.30, 03:42:35

Since underlay routing is working via the default route, you would assume the overlay would just work. Not really! See below:

RLOC2#sh run | sec router lisp

router lisp

<trimmed>

 ipv4 locator reachability exclude-default

The above CLI means that, from the Fabric Site2 Border's perspective, if the RLOC (the Fabric Site1 Border) is reachable only via the default route, the locator is not considered reachable, and packets are prevented from being encapsulated into the VXLAN tunnel.
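The effect of this setting can be illustrated with a small model. This is a simplified sketch of the behavior described in this document, not the actual IOS-XE implementation. The function names `covering_route` and `locator_reachable` and the `exclude_default` parameter are hypothetical.

```python
import ipaddress


def covering_route(rloc, routes):
    """Longest-prefix match: the most specific route containing rloc, or None."""
    address = ipaddress.ip_address(rloc)
    candidates = [ipaddress.ip_network(route) for route in routes]
    matches = [network for network in candidates if address in network]
    return max(matches, key=lambda network: network.prefixlen, default=None)


def locator_reachable(rloc, routes, exclude_default=True):
    """Model of the check: a locator covered only by 0.0.0.0/0 is rejected
    when exclude_default is set (assumed semantics, per this document)."""
    route = covering_route(rloc, routes)
    if route is None:
        return False
    if exclude_default and route.prefixlen == 0:
        return False
    return True


# Problem state: RLOC1 is covered only by the default route.
print(locator_reachable("10.4.30.11", ["0.0.0.0/0"]))                   # False
# After a specific route towards RLOC1 is present.
print(locator_reachable("10.4.30.11", ["0.0.0.0/0", "10.4.30.11/32"]))  # True
```

In this model, ping succeeds in both cases because the default route still forwards the packets. Only the LISP locator check distinguishes the two.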

Solution

At the time of writing, the solution is simple. As explained in the previous section, one design consideration to be aware of is that the underlay route between the Site Borders in different Fabric Sites must not be learned via the DEFAULT route.

Let us now fix the problem on RLOC2. For simplicity's sake, and just as an example, let us use a /32 route towards RLOC1:
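The original does not show the configuration that was applied. As a hedged sketch, the helper below renders static host-route commands of the kind that would produce the routing entry shown next. The next hops 10.4.1.30 and 10.4.1.26 are taken from that output, while the function name `host_route_commands` and the exact commands used on the device are assumptions.

```python
import ipaddress


def host_route_commands(rloc, next_hops):
    """Render one 'ip route' line per next hop for a /32 route to rloc."""
    # IPv4Address raises ValueError if an address is malformed.
    target = ipaddress.IPv4Address(rloc)
    return [
        f"ip route {target} 255.255.255.255 {ipaddress.IPv4Address(hop)}"
        for hop in next_hops
    ]


for line in host_route_commands("10.4.30.11", ["10.4.1.30", "10.4.1.26"]):
    print(line)
# ip route 10.4.30.11 255.255.255.255 10.4.1.30
# ip route 10.4.30.11 255.255.255.255 10.4.1.26
```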

RLOC2#sh ip route 10.4.30.11
Routing entry for 10.4.30.11/32
Known via "static", distance 1, metric 0
Routing Descriptor Blocks:
* 10.4.1.30
Route metric is 0, traffic share count is 1
10.4.1.26
Route metric is 0, traffic share count is 1

Now, let us examine the overlay state for the user in the Fabric Site1

RLOC2#sh lisp instance-id 4101 ipv4 map-cache 192.168.6.148 
LISP IPv4 Mapping Cache for EID-table default (IID 4097), 3 entries

192.168.6.148/32, uptime: 1d02h, expires: 21:14:18, via map-reply, self, complete
Sources: map-reply
State: complete, last modified: 1d02h, map-source: 10.4.30.8
Active, Packets out: 95847(7472810 bytes) (~ 00:00:00 ago)
Locator      Uptime  State   Pri/Wgt Encap-IID
10.4.30.11 1d02h   up        10/10 4101
Last up-down state change: 1d02h, state change count: 1
Last route reachability change: 1d02h, state change count: 1
<trimmed>

The users in Fabric Site1 are now able to access the internet. Everything is working now!

Additional References

Additional materials related to the overall solution can be found at https://www.cisco.com/c/en/us/solutions/design-zone/networking-design-guides/digital-network-architecture-design-guides.html
