Strategies for Optimizing Network Performance

CiscoNet Training Solutions · ‎01-20-2014

Overview
TCP Window Scaling
Increase Link Bandwidth
Fabric Enabled Line Cards
Cisco Express Forwarding (CEF)
Server Load Balancing
Routed Access Layer
No Oversubscription
Quality of Service (QoS)
Cisco Multichassis Link Aggregation
OSPF Graceful Restart
Minimize WAN Protocols
Cisco WAAS Optimizer
Jumbo Frames
802.11n Standard
Device Memory
SSD Network Server Drives
Performance Routing (PfR)
Network Redundancy
Routing Protocol Tuning
Web Page Optimization

Overview

This article discusses some of the most effective solutions for increasing and optimizing network performance. In addition, the strategies improve availability and scalability. Performance, availability and scalability are foundational design requirements for any enterprise network.

TCP Window Scaling

The purpose of TCP Window Scaling is to increase the TCP receive window (RWIN) beyond the traditional 64 KB maximum (65,535 bytes) imposed by the 16-bit window field. The window scale option raises the maximum RWIN to approximately 1 GB (1,073,725,440 bytes) for performance optimization. Each host advertises its scale factor, a shift count from 0 to 14, in the SYN segment during the TCP 3-way handshake, and the advertised window is multiplied by 2 raised to that shift count for the rest of the session. The larger TCP window increases throughput on fast, high latency WAN links. The Window Scaling feature is defined in RFC 1323 (since obsoleted by RFC 7323) and is part of the operating system (Windows and Linux) TCP stack implementations. It fixes performance problems on WAN links that have a high bandwidth delay product (BDP). For instance deploying Gigabit Ethernet across a long haul circuit with a round trip time of 150 msec would require an RWIN of roughly 18.75 MB (1,000,000,000 bps x 0.150 sec / 8 = 18,750,000 bytes).
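The bandwidth delay product arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not part of any TCP stack: the function names are hypothetical, and it assumes the round trip time is known in whole milliseconds.

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value of the 16-bit TCP window field
MAX_SHIFT = 14                # largest window scale shift count allowed


def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth delay product in bytes for a link speed and round trip time."""
    # bits per second * milliseconds / 1000 gives bits in flight; / 8 gives bytes.
    return bandwidth_bps * rtt_ms // 8000


def window_scale_shift(required_window: int) -> int:
    """Smallest shift count whose scaled window covers the required window."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < required_window and shift < MAX_SHIFT:
        shift += 1
    return shift


# Gigabit Ethernet over a 150 msec round trip, as in the example above.
required = bdp_bytes(1_000_000_000, 150)
shift = window_scale_shift(required)
print(required, shift, MAX_UNSCALED_WINDOW << shift)
```

For the Gigabit example the required window is 18,750,000 bytes, which needs a shift count of 9 (a maximum window of 33,553,920 bytes).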

Increase Link Bandwidth

The most effective solution for increasing network capacity is to increase WAN link bandwidth. The most over-utilized links are often the WAN circuits that company traffic traverses to access the data center. The company WAN is deployed with much lower bandwidth than the campus network, which is typically designed with GE (1000 Mbps) and 10 GE (10000 Mbps) uplinks at all layers. For instance consider a fast WAN link such as a T3 circuit (45 Mbps). That is approximately 22 times less bandwidth than a campus Gigabit uplink. The result of over-utilized WAN links is often increased queuing delays and packet loss.

Fabric Enabled Line Cards

The fabric enabled line cards use Cisco Express Forwarding with a FIB and adjacency table on the PFC card for high speed switching. The performance of the dCEF720 line card is optimized when deployed on a Cisco 6500 chassis with a Supervisor Engine 720, a 40 Gbps (2 x 20 Gbps) switch fabric channel connection and an onboard DFC module. The forwarding rate is 48 Mpps per line card and 720 Mpps per chassis. The aggregate fabric switching capacity is 720 Gbps. Some dCEF720 line cards include the Cisco WS-X6708-10GE-3C and the WS-X6716-10GE-3C. The WS-X6708-10GE-3C has an ASIC for every 2 switch ports for a low oversubscription of 2:1, and the WS-X6716-10GE-3C has a single ASIC for every 4 switch ports for a 4:1 oversubscription.

Cisco Express Forwarding (CEF)

Routing packets in software on the route processor is much slower and more processor (CPU) intensive than hardware forwarding. Cisco Express Forwarding switches packets in hardware, including the Layer 2 and Layer 3 rewrites. This feature is supported on most Cisco routers and multilayer switches for optimizing performance. The MSFC (route processor) builds the routing table in software (control plane) and derives from it an optimized forwarding table called the FIB. Each FIB entry consists of a destination prefix and a next hop address. The FIB is pushed to the PFC forwarding engine (data plane) and to any DFC for switching packets in hardware. The MSFC also builds a Layer 2 adjacency table, consisting of the next hop address and its MAC address, from the FIB and the ARP table. The adjacency table is pushed to the PFC and any DFC modules as well. Each FIB entry points to a Layer 2 adjacency table entry, so together they hold all of the information needed to forward a packet. The Layer 2 frame and Layer 3 packet information is rewritten before the packet is forwarded to the switch port queue. When a routing change occurs the MSFC updates the routing table, then updates the FIB and adjacency table and pushes those changes to the PFC and DFC modules. Some network services cannot be hardware switched and as a result must be software switched by the route processor.
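The relationship between the FIB and the adjacency table can be illustrated with a small Python sketch. This only models the lookup logic: the prefixes, next hops, MAC addresses and interface names are invented for illustration, and real forwarding engines perform the longest prefix match in hardware rather than with a linear scan.

```python
import ipaddress

# Hypothetical FIB: destination prefix -> next hop address.
fib = {
    ipaddress.ip_network("0.0.0.0/0"): "192.168.1.254",
    ipaddress.ip_network("10.0.0.0/8"): "192.168.1.1",
    ipaddress.ip_network("10.1.0.0/16"): "192.168.1.2",
}

# Hypothetical adjacency table: next hop -> (next hop MAC, egress interface).
adjacency = {
    "192.168.1.254": ("0000.0c00.00fe", "Gi1/3"),
    "192.168.1.1": ("0000.0c00.0001", "Gi1/1"),
    "192.168.1.2": ("0000.0c00.0002", "Gi1/2"),
}


def forward(destination: str):
    """Return (prefix, next hop, MAC, interface) for the longest matching prefix."""
    address = ipaddress.ip_address(destination)
    matches = [prefix for prefix in fib if address in prefix]
    if not matches:
        return None  # no route; a real device would drop or punt the packet
    best = max(matches, key=lambda prefix: prefix.prefixlen)
    next_hop = fib[best]
    mac, interface = adjacency[next_hop]  # the FIB entry points at its adjacency
    return str(best), next_hop, mac, interface


print(forward("10.1.2.3"))
print(forward("10.9.9.9"))
print(forward("8.8.8.8"))
```

The first lookup matches all three prefixes and selects 10.1.0.0/16 because it is the most specific; the last falls through to the default route.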

Server Load Balancing

The primary vendors that have server load balancing solutions for the enterprise include F5 and Cisco. Server load balancers are used to optimize the available capacity across all servers. In addition latency is decreased by selecting servers based on performance metrics. The F5 load balancer appliance is called BIG IP Local Traffic Manager (LTM). LTM is an application proxy load balancer with distributed performance optimization and high availability. Various models include 1600, 2000, 3600 and 3900 series appliances. The 4000, 6900, 8900, 10000 and 11000 appliances have an add-on module option. The Virtual Edition is available for VMware and Microsoft hypervisor software. The following is a summary of F5 BIG IP LTM features. Some routing protocols allow for load balancing of traffic across equal and unequal cost links as well.

Routed Access Layer

Companies are starting to migrate routing to the campus access layer. The traditional campus multilayer model does routing at the distribution and core switches, with the Layer 2 / Layer 3 boundary at the distribution switch. The access layer in the traditional model uses STP for convergence and to maintain a loop free Layer 2 topology. The advantages of a routed access layer are faster convergence with routing protocols, load balancing and ease of management. Deploying a routing protocol such as OSPF or EIGRP at the access layer switches provides faster convergence with a more deterministic design. The routed access layer uses equal cost Layer 3 links to the distribution layer and ECMP for load balancing. Convergence is now provided by the routing protocol, as with the multilayer distribution and core switches. In addition, when one of the equal cost links fails, ECMP moves traffic to the remaining link without waiting for routing protocol convergence. Spanning VLANs across multiple access switches isn't permitted with a routed access layer design. To span multiple switches use Cisco 6500 switches with VSS for connectivity to the access switches. The default gateway is now the multilayer access switch, eliminating the need for a first hop redundancy protocol such as HSRP. The access switch VLANs that typically terminate at the distribution switch now terminate at the access switch. The ip helper-address command is moved from the distribution switch to the multilayer access switch for DHCP relay services. Each point to point uplink from a multilayer access switch to a distribution switch is deployed with a /30 subnet mask. A convergence event is comprised of detecting the link or node failure, selecting a new route path and updating the routing and CEF tables. The OSPF and EIGRP convergence time for a link or node failure can be optimized with the following recommendations for campus switching.
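As an illustration of the /30 addressing for point to point uplinks, the following Python sketch carves a block into /30 subnets and assigns the two usable addresses in each to the access and distribution ends. The 10.255.0.0/28 block and the function name are assumptions made for the example, not values from any particular design.

```python
import ipaddress
from itertools import islice


def allocate_uplinks(block: str, count: int):
    """Return (subnet, access address, distribution address) for each uplink."""
    network = ipaddress.ip_network(block)
    links = []
    # Each /30 has exactly two usable host addresses, one per end of the link.
    for subnet in islice(network.subnets(new_prefix=30), count):
        access_ip, distribution_ip = subnet.hosts()
        links.append((str(subnet), str(access_ip), str(distribution_ip)))
    return links


for link in allocate_uplinks("10.255.0.0/28", 2):
    print(link)
```

A /28 block yields four /30 subnets, so it covers two access switches that each have two uplinks.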

No Oversubscription

Network switches have three primary types of oversubscription: ASIC, switch fabric and uplink. ASIC oversubscription is determined by the number of switch ports assigned to each ASIC. The ASIC forwards packets between the line card and the switch fabric. A line card that has no oversubscription (1:1) has a single ASIC for each switch port. The switch port isn't sharing the ASIC link to the switch fabric with other ports, and as a result there is no contention or packet loss at the ASIC. An example of this is the WS-X6704 line card with 4 switch ports and 4 ASICs. Switch fabric oversubscription occurs when the line card aggregate port capacity is greater than the connection to the switch fabric. The actual switch fabric channel speed varies with each line card. A switch fabric that has no oversubscription is called non-blocking. That occurs with line cards that have aggregate port capacity less than or equal to the fabric connection. Switch uplink oversubscription is determined by the ratio of the switch port or line card aggregate capacity to the switch uplink capacity. The oversubscription of switch uplinks applies to all switches forwarding traffic. Switch uplinks have the most oversubscription of any component. For instance a 48 port 3750X access switch will typically use a single Gigabit uplink. That is a 48:1 oversubscription of traffic between the access switch ports and the GE uplink. Uplink oversubscription increases with the Cisco 4500 and 6500 switches, where multiple line cards share what is sometimes only two GE uplinks to a core switch. The migration to 10 GE uplinks with EtherChannel is being deployed to decrease uplink oversubscription.
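The oversubscription ratios quoted in this article follow from a simple division of aggregate port capacity by upstream capacity. A minimal Python sketch, with a hypothetical function name and the port counts taken from the examples in this article:

```python
def oversubscription_ratio(port_count: int, port_gbps: float, upstream_gbps: float) -> float:
    """Aggregate port capacity divided by the upstream (fabric or uplink) capacity."""
    return port_count * port_gbps / upstream_gbps


# 48 port Gigabit access switch with a single GE uplink.
print(f"{oversubscription_ratio(48, 1, 1):g}:1")
# 8 port and 16 port 10 GE line cards with a 40 Gbps fabric connection.
print(f"{oversubscription_ratio(8, 10, 40):g}:1")
print(f"{oversubscription_ratio(16, 10, 40):g}:1")
# The same access switch after migrating to a 2 x 10 GE EtherChannel uplink.
print(f"{oversubscription_ratio(48, 1, 20):g}:1")
```

A result of 1 or less means the upstream connection is not oversubscribed (non-blocking).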

Quality of Service (QoS)

The purpose of implementing quality of service (QoS) is to allocate the available network bandwidth to various traffic classes in order to manage performance and optimize bandwidth usage. The default network queuing is First In First Out (FIFO). Ingress and egress packets are placed in FIFO queues as they arrive and are then forwarded to the interface hardware ring. There is no prioritization of packets or assignment of traffic classes with FIFO. Deploying QoS won't necessarily prevent packet loss on a network that requires additional bandwidth, because QoS does not increase the amount of aggregate bandwidth available to network traffic. What it does is manage the available bandwidth by assigning it to various traffic types. It decides which packets are prioritized and how packets are dropped during times of network congestion. This is important for delay sensitive voice and video traffic. It is also possible to classify data according to specific business requirements and to mark down bulk traffic and Internet traffic. Cisco QoS provides various techniques for managing network traffic. QoS applies only in the context of minimizing the effects of network congestion and is implemented as part of a performance management strategy. Some of the most popular QoS tools include packet classification and marking, low latency queuing, traffic shaping, rate limiting and policing. The correct techniques for packet classification, marking, queuing and traffic shaping must be selected to improve network performance. The performance requirements should determine the strategies employed for prioritizing and managing traffic. Cisco QoS best practices are recommended for deployment to your network infrastructure. Consider doing a network assessment that analyzes network design, device platforms, current performance issues and required SLAs before deploying QoS.
It is important as well to deploy QoS only where it is needed and not to over manage traffic. Start at the access layer and only deploy necessary QoS as you move toward the network core. Maintain markings through all transit devices and focus your QoS on WAN links where bottlenecks often occur. Consider other performance strategies in addition to QoS tools. QoS can help alleviate performance problems where there are link speed mismatches, such as at WAN links. Policing and traffic shaping can manage oversubscription problems, however it is preferable to fix the oversubscription issues with upgrades and network design changes.
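The difference between FIFO and priority queuing during congestion can be illustrated with a short Python sketch. This is a simplified strict priority scheduler, not Cisco's low latency queuing implementation: the traffic class names and priority values are assumptions for the example, and a real priority queue is also policed so that it cannot starve the other classes.

```python
import heapq
from itertools import count

# Hypothetical traffic classes; a lower number is served first.
PRIORITY = {"voice": 0, "video": 1, "data": 2, "bulk": 3}


class PriorityScheduler:
    """Dequeue packets by class priority, FIFO within each class."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie breaker that preserves arrival order

    def enqueue(self, traffic_class: str, packet: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[traffic_class], next(self._arrival), packet))

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


scheduler = PriorityScheduler()
for traffic_class, packet in [("bulk", "bulk-1"), ("data", "data-1"),
                              ("voice", "voice-1"), ("data", "data-2"),
                              ("voice", "voice-2")]:
    scheduler.enqueue(traffic_class, packet)

order = [scheduler.dequeue() for _ in range(len(scheduler))]
print(order)
```

With FIFO the packets would leave in arrival order, starting with bulk-1. Here both voice packets leave first even though they arrived later, which is the behavior delay sensitive traffic needs when the link is congested.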

Cisco Multichassis Link Aggregation

The purpose of Cisco Multichassis link aggregation is to create a single logical chassis from multiple switches. With stacking and VSS that creates a single shared control plane and data plane, while vPC peers keep independent control planes but present a single logical link to attached devices. The effect is increased switching throughput and uplink throughput from access switches. In addition the single logical topology eliminates the need for STP and minimizes unicast and multicast traffic. The virtual chassis optimizes traffic flows between the access layer and distribution layer. The primary Cisco techniques include Stacking, Virtual Switching System (VSS) and Virtual Port Channel (vPC). The 3750 switches employ switch stacking while the 6500 switches use VSS and Nexus switches use vPC. The following is a description of the Multichassis Link Aggregation techniques available with Cisco switches.

OSPF Graceful Restart

The router performing a graceful restart uses stateful switchover (SSO) with Non-Stop Forwarding (NSF) to minimize failover and convergence time. Cisco routers and switches have separate control and data planes. The data plane forwards packets while the control plane manages routing and control protocols. All routers have a route processor. The route processor of a multilayer switch is the Supervisor Engine. The primary and standby route processors synchronize state tables to optimize failover time. The purpose of SSO is to dynamically synchronize stateful information from the primary to the standby route processor. This includes the CEF FIB and adjacency tables, Layer 2 control protocols and configuration files. Any time state information changes, the standby route processor is updated. This allows for a 0 to 3 second switchover to the standby route processor when the primary route processor fails.

Minimize WAN Protocols

Minimizing protocol handoffs across the company network will decrease processing delay, interface errors and the amount of QoS mapping required. In addition fewer encapsulations between different Campus/WAN protocols will increase throughput. For example a router with Ethernet and serial interfaces has to strip off the Ethernet header and encapsulate packets with a serial header before forwarding across the serial link. Metro Ethernet forwards packets using standard Ethernet encapsulation. Deploying Metro Ethernet is more advantageous than multiple WAN protocols. For increased distance between branch offices and the data center, standardize on Metro Ethernet and Packet over SONET (PoS). That is preferred over multiple TDM and Frame Relay services.

Cisco WAAS Optimizer

The WAAS appliances are deployed on WAN links for optimizing bandwidth and accelerating application traffic. There are a variety of WAAS platforms with features and performance ratings designed for each office and traffic profile. The newer models are called Wide Area Virtualization Engines (WAVE) and run Cisco WAAS software. The Cisco WAVE 294 and WAVE 594 are appliances for the branch office. The Cisco WAVE 694 and WAVE 7541 appliances are deployed at distribution and core office WAN links. The Cisco WAVE 7571 and WAVE 8541 are data center appliances. The maximum recommended WAN link speed is based on the appliance maximum optimized throughput.

Jumbo Frames

Jumbo frames are supported on many Cisco switch and router platforms, typically on Gigabit and 10 Gigabit interfaces. The 9000 byte jumbo frame substantially decreases network device utilization (processing) because the same data is carried in fewer frames. In addition performance is optimized with increased packet efficiency and fewer ACKs required per session. The Unix NFS protocol used for file sharing uses 8192 byte read/write data blocks, so each block fits in a single jumbo frame. This is a specific advantage for Unix servers, however all equipment between source and destination must support jumbo frames. Devices in the path that don't support jumbo frames will either fragment the packets (routers) or drop the frames (Layer 2 switches). Deploying TCP Offload at the server network interface card is recommended to process the larger frame size more effectively.
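The efficiency gain from jumbo frames can be estimated with a short Python calculation. The sketch assumes IPv4 and TCP headers without options (40 bytes) and 38 bytes of Ethernet overhead per frame (preamble, header, FCS and inter-frame gap); the function names are hypothetical and the NFS example ignores RPC and NFS headers.

```python
ETHERNET_OVERHEAD = 38  # preamble 8 + header 14 + FCS 4 + inter-frame gap 12
IP_TCP_HEADERS = 40     # IPv4 20 + TCP 20, assuming no options


def frames_needed(data_bytes: int, mtu: int) -> int:
    """Number of frames required to carry data_bytes of TCP payload."""
    payload = mtu - IP_TCP_HEADERS
    return -(-data_bytes // payload)  # integer ceiling division


def wire_efficiency(mtu: int) -> float:
    """Fraction of the bits on the wire that are application payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


for mtu in (1500, 9000):
    print(mtu, frames_needed(1_000_000_000, mtu), round(wire_efficiency(mtu), 3))

# An 8192 byte NFS data block at each MTU.
print(frames_needed(8192, 1500), frames_needed(8192, 9000))
```

Under these assumptions a 1 GB transfer needs 684,932 standard frames but only 111,608 jumbo frames, so each device in the path processes roughly one sixth as many frames.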

802.11n Standard

The 802.11n wireless standard approved in 2009 defines much faster data rates of 300 to 600 Mbps from wireless client to access point. 802.11n access points typically connect to the network switch with a Gigabit Ethernet (1000 Mbps) uplink, so throughput increases both from client to access point and from access point to network switch. The standard operates in both the 2.4 GHz and 5 GHz bands with effective new performance enhancements such as multiple input multiple output (MIMO) antennas and channel bonding.

Device Memory

Network devices use memory for various purposes and memory utilization is a key performance metric. Peak memory utilization should not exceed 80% of total memory, and average utilization over a 5 minute interval should not exceed 70%. Deploying the maximum amount of memory available is a best practice for all network devices and servers to optimize performance.

SSD Network Server Drives

The disk subsystem is defined as the disk drives, controller hardware and software used to manage disk operations. The disk drives are used to store application software, operating systems and employee data files. The disk drive is most often the component of a network server with the highest latency compared with memory, CPU and network interface card. The primary disk types available today include Serial Advanced Technology Attachment (SATA), Serial Attached SCSI (SAS) and Solid State Drives (SSD). Note that SATA and SAS refer to actual protocol specifications for managing disk data transfer. SATA drives are high capacity (disk space), low cost drives with the lowest performance of the three. They are deployed to branch offices, SMB applications and backup servers where latency isn't a factor. SAS drives have higher throughput, lower latency and higher IOPS than SATA drives. The cost per GB of disk space is higher than SATA drives and there is typically less disk capacity. The SAS drive is enterprise class and deployed for data center office applications, middleware servers, web applications and small databases. The enterprise market is deploying SSD drives for optimized performance and capacity. SSD drives are available with SAS interfaces and SATA interfaces. There are no moving parts with SSD and as a result they have the lowest latency and access time. The drive is comprised of persistent flash memory and has the highest IOPS and throughput of any drive. The SSD drive is however the most expensive drive per GB of disk space. Companies will typically select SSD drives only for data center server farms where key applications reside. That would include the busiest data center file servers, large databases, virtualization, Java applications and cloud applications. Today most of the data storage servers are centralized at the data center.
The typical storage area network (SAN) is designed to integrate various components that maximize throughput and capacity while being cost effective.

Performance Routing (PfR)

The purpose of performance routing (PfR) is to optimize available bandwidth and best path selection for packet forwarding across the company WAN. Most companies today have deployed backup links and sometimes multiple links for WAN connectivity. Performance routing provides for effective load balancing to maximize available bandwidth. In addition there is dynamic best path selection based on granular real time monitoring of performance metrics.

Network Redundancy

This refers to the aggregate fault tolerance of a network at all layers of the OSI model. That starts at the physical layer with link redundancy and extends up to the application layer with server clustering and load balancing. Most companies today specify what amount of uptime they require for effective business operations. This is expressed as an annual percentage SLA. Most enterprises will target somewhere between 97% and 99.99% uptime, not including planned outages. There are change management windows defined for various planned outages. There is link, module, default gateway, router, firewall, circuit, ISP, telco, data, server and power redundancy.
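To put the SLA percentages in perspective, the following Python sketch converts an annual uptime target into the unplanned downtime it allows. It assumes a 365 day year and a hypothetical function name.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(uptime_percent: float) -> float:
    """Unplanned downtime allowed per year for a given uptime percentage."""
    return (100 - uptime_percent) / 100 * HOURS_PER_YEAR


for target in (97.0, 99.0, 99.9, 99.99):
    hours = annual_downtime_hours(target)
    print(f"{target}% uptime allows {hours:.2f} hours ({hours * 60:.0f} minutes) per year")
```

A 97% target allows about 263 hours of downtime per year, while 99.99% allows less than one hour, which is why the higher targets require redundancy at every layer.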

Application Tuning

Improper application tuning is a contributing factor to server processing delays. There is latency that occurs with each disk access. The application should read and write large data blocks and assign properly sized memory for application queues. There are recommendations from the application vendor for tuning TCP protocol features to optimize performance. It should be noted that some application developers write their application to manage some TCP settings. The best practice recommendation is to let the TCP stack manage that. It is recommended to consolidate SQL requests into fewer, larger blocks of data and to index the database to minimize server processing. On a fast, low latency network the Nagle algorithm interacts badly with delayed ACKs, which increases latency and slows application response time. Real time applications should not use Nagle either. The use of Nagle is a safeguard against badly written applications that write small packets. The optimized solution is to increase the application data block size for disk reads and writes and for writes to the TCP stack. Nagle should be enabled for low bandwidth links (< 256 Kbps) where required for specific applications that write small packet sizes.
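For the case where a real time application does need Nagle disabled, the standard mechanism is the TCP_NODELAY socket option. The Python sketch below shows the option being set on a new socket; the function name is hypothetical, and whether disabling Nagle is appropriate depends on the application and link as discussed above.

```python
import socket


def open_low_latency_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled (TCP_NODELAY)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value disables Nagle so small writes are sent immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = open_low_latency_socket()
# Reads back non-zero when Nagle is disabled on this socket.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY set:", nodelay != 0)
sock.close()
```

Applications that leave Nagle enabled get much of the same benefit by assembling a complete message in a buffer and handing it to the TCP stack in a single large write.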

Bidirectional Forwarding Detection (BFD)

This is a newer link failure detection protocol used with Layer 3 routing protocols at routers for rapid detection of a link or node (router) failure. The BFD protocol is configured on each router where the link status is monitored. The BFD protocol sends hello packets to its neighbor router and when a link or node failure occurs, it is detected faster than the routing protocol. The routing protocol is notified by the BFD process to start route convergence immediately. Cisco Express Forwarding must be enabled on the routers.

Design Bottlenecks

Some of the most fundamental network performance problems occur with design bottlenecks. They include link, module and device bottlenecks. Link bottlenecks often occur at WAN circuits and server to switch uplinks. Module bottlenecks can occur at distribution and core switches. Device bottlenecks often occur at aggregation WAN routers and distribution switches.

Routing Protocol Tuning

The default OSPF hello interval is 10 seconds on broadcast and point to point links such as Ethernet and 30 seconds on non-broadcast (NBMA) networks such as Frame Relay. Hello packets are sent to neighbor routers at regular intervals to detect link or node failure. The dead timer defaults to 4 times the hello interval and is used by OSPF to declare a neighbor down when no hello packets have been received. The OSPF Fast Hello feature supports subsecond hello intervals. This is done by setting the dead timer to its minimum value of 1 second and configuring a hello multiplier, which is the number of hello packets sent during that 1 second. For instance setting the dead timer to 1 second with a multiplier of 4 creates 250 msec hello packets. Hello and dead timer settings must match between neighbors on a link. With Fast Hello the dead interval must still match, while the hello multiplier may differ between neighbors.
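The Fast Hello arithmetic is simply the 1 second dead interval divided by the hello multiplier. A small Python sketch, with a hypothetical function name and an accepted multiplier range of 3 to 20 that is assumed from Cisco IOS documentation:

```python
FAST_HELLO_DEAD_INTERVAL_MS = 1000  # dead interval is fixed at 1 second


def fast_hello_interval_ms(hello_multiplier: int) -> float:
    """Hello interval in msec when the dead interval is set to 1 second."""
    if not 3 <= hello_multiplier <= 20:  # assumed valid range
        raise ValueError("hello multiplier must be between 3 and 20")
    return FAST_HELLO_DEAD_INTERVAL_MS / hello_multiplier


for multiplier in (4, 5, 10, 20):
    print(multiplier, fast_hello_interval_ms(multiplier))
```

On Cisco IOS the corresponding interface command is believed to be `ip ospf dead-interval minimal hello-multiplier 4`; verify the syntax against the documentation for your software release before deploying it.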

Web Page Optimization

The following are best practices for optimizing web page delivery and decreasing the size of web pages.
• Configure cache control so that web pages are cached in the local browser cache, which optimizes bandwidth usage.
• Minimize the size of graphic (picture) files.
• Use persistent connections, which allow new HTTP and HTTPS (SSL) requests to reuse a single TCP connection. The advantage is that they don't have to perform a new TCP connection handshake and send the connection overhead for each new request.
• Enable persistent caching on the client web browser for non HTML content such as CSS, images and JavaScript. The HTML user data is not cached.
• Deploy a proxy server to improve performance by caching HTTPS static pages with shared non-user data.
• Encrypt packets in hardware for faster processing, using the hardware accelerated encryption available with router modules.
• Optimize bandwidth usage with Gzip dynamic HTML compression.
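
Two of the practices above, Gzip compression and cache control, can be sketched with the Python standard library. This is an illustration only: the function name, the sample page and the one day max-age value are assumptions, and a production web server would normally handle both through its own configuration.

```python
import gzip


def build_response(body_text: str, is_static: bool):
    """Return (headers, compressed body) for a hypothetical HTTP response."""
    body = gzip.compress(body_text.encode("utf-8"))
    headers = {
        "Content-Encoding": "gzip",
        "Content-Length": str(len(body)),
        # Static shared content may be cached; per-user HTML is not.
        "Cache-Control": "public, max-age=86400" if is_static else "no-store",
    }
    return headers, body


page = "<html><body>" + "<p>Network performance</p>" * 200 + "</body></html>"
headers, body = build_response(page, is_static=False)
print(len(page.encode("utf-8")), "bytes before,", len(body), "bytes after")
print(headers["Cache-Control"])
```

Repetitive HTML markup compresses well, so the compressed body is a small fraction of the original size.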

Surjeet Singh · ‎09-15-2015

Nice doc for network engineer and Designer both !!

Regards,

Surjeet