Sandeep Singh

Introduction

This document explains some best practices and common mistakes observed in UCS setups and provides recommendations. This list is not exhaustive but is useful for any UCS deployment.

 

UCS Best Practices

Networking:

 

  • Use End Host Mode (EHM) where possible to allow for simple deployment methodologies
  • Utilize default pinning (round robin) for servers to uplinks
  • Leverage host-based methods to enable traffic that is east-west in nature to be switched locally on an FI
  • Use appropriate MTU sizes
  • Use Port Channels for connectivity to upstream resources; pinned server interfaces will take advantage of the Port Channel hashing algorithm
  • Match native VLAN tags between the Fabric Interconnect and northbound switch ports; this tag does not need to match the vNIC trunk parameters
  • Prune unnecessary VLANs from upstream interfaces
  • Enable STP PortFast with port type “edge trunk” on upstream switch interfaces
  • Use meaningful names for VLANs within UCS
  • Reserve several VLANs for use internally for FCoE traffic
  • Use System QoS to prioritize traffic between environments at the vNIC level; this guarantees minimum bandwidth but does not rate-limit those interfaces
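
The minimum-guarantee behavior of System QoS can be illustrated with simple arithmetic. The sketch below assumes that each enabled class is guaranteed a share equal to its weight divided by the sum of all enabled weights; the function name, class names, and weights are hypothetical and are not taken from a real configuration.

```python
def guaranteed_shares(weights):
    """Return the minimum bandwidth share (percent) per QoS class.

    Assumption: share = class weight / sum of all enabled class weights.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one class must have a positive weight")
    return {name: round(100 * weight / total, 1) for name, weight in weights.items()}


# Hypothetical classes and weights, for illustration only.
example_weights = {"fc": 5, "vmotion": 2, "best-effort": 3}
print(guaranteed_shares(example_weights))
```

Because these figures are minimums rather than caps, a class can use more than its share when the other classes are idle.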

 

Storage:

 

  • Leverage Boot from SAN to enable a stateless infrastructure
  • Utilize default pinning (round robin) for server vHBA to uplinks
  • Leverage Boot Policies to distribute initial boot traffic across backend storage array ports
  • Use VSAN names and IDs that correlate to upstream fabrics (where applicable)
  • Leverage redundant SAN fabric switches
  • Use host-based multipathing for redundancy and load-balancing
  • Pre-provision SAN resources where applicable to reduce deployment time and enhance stateless infrastructure
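
One way to picture the Boot Policy recommendation above is to give each policy the same set of storage array ports in a different order, so that initial boot traffic does not land on a single port. The sketch below is a simplification that ignores the per-fabric (vHBA A/B) split of a real Boot Policy; the function name, policy names, and port labels are hypothetical.

```python
def rotated_boot_orders(array_ports, policy_count):
    """Build one boot target order per policy, each starting at a different array port."""
    if not array_ports:
        raise ValueError("array_ports must not be empty")
    orders = {}
    for i in range(policy_count):
        start = i % len(array_ports)
        # Rotate the list so policy i tries port `start` first.
        orders[f"boot-policy-{i + 1}"] = array_ports[start:] + array_ports[:start]
    return orders


# Hypothetical array port labels, for illustration only.
for name, order in rotated_boot_orders(["array-port-0", "array-port-1", "array-port-2"], 4).items():
    print(name, order)
```

Service Profiles would then be spread evenly across the resulting policies.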

 

Compute:

 

  • Use Service Profiles to manage compute resources logically
  • Map Service Profiles to blades with a deterministic approach; distributing compute resources across I/O modules and chassis uplinks limits fault domains and improves utilization
  • Leverage a deterministic logical addressing scheme for UUID, WWNN/WWPN, and MAC addresses to represent a given compute resource
  • Name Service Profiles to match a particular use case
  • Use policies such as Boot policies for service profiles
  • Leverage management and host firmware packages to manage needs on a per-Service Profile basis
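
A deterministic mapping of Service Profiles to blades, as recommended above, could be as simple as filling chassis in round-robin order so that consecutive profiles never share a chassis. This is a minimal sketch of one such scheme, not a prescribed method; the function name, profile names, and the default of 8 slots per chassis are assumptions to adjust for the actual hardware.

```python
def place_profiles(profile_names, chassis_count, slots_per_chassis=8):
    """Assign each profile a (chassis, slot) pair, spreading profiles across chassis first."""
    if chassis_count < 1:
        raise ValueError("chassis_count must be at least 1")
    if len(profile_names) > chassis_count * slots_per_chassis:
        raise ValueError("more profiles than available slots")
    placement = {}
    for i, name in enumerate(profile_names):
        chassis = i % chassis_count + 1   # rotate through the chassis
        slot = i // chassis_count + 1     # move to the next slot after each full rotation
        placement[name] = (chassis, slot)
    return placement


# Hypothetical profile names, for illustration only.
print(place_profiles(["esx-01", "esx-02", "esx-03", "esx-04", "esx-05"], chassis_count=2))
```

The same index `i` could also seed the UUID, WWNN/WWPN, and MAC suffixes so that every identifier of a compute resource points back to the same position.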

 

Management:

 

  • Manage orchestration aspects through API scripts and tools, enabling a consistent methodology across UCS instances
  • Build UCS instances through provisioning scripts, allowing for simple, rapid, and consistent deployments of new UCS

 

Observations and Recommendations

Observation 1: Chassis uplink connections are mismatched; chassis 2, 3, and 5 use discrete links while chassis 1 and 4 use port channels

 

Explanation: Chassis uplinks were originally configured to use link grouping, which creates port channels between the fabric interconnects and the chassis. Later, this policy was changed to use discrete links. Three chassis have been reacknowledged since then, so those three now use discrete links while the others still use port channels.

 

Recommendation: Standardize on port channels for every chassis.
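
A small audit script can flag this kind of drift. The sketch below works on a hand-built inventory that mirrors the chassis described in this observation; the function name and the mode labels `"port-channel"` and `"discrete"` are hypothetical and are not UCS Manager output.

```python
def chassis_needing_change(link_modes, desired="port-channel"):
    """Return the sorted chassis IDs whose uplink mode differs from the desired one."""
    return sorted(chassis for chassis, mode in link_modes.items() if mode != desired)


# Inventory mirroring the observation: chassis 2, 3, and 5 use discrete links.
observed = {
    1: "port-channel",
    2: "discrete",
    3: "discrete",
    4: "port-channel",
    5: "discrete",
}
print(chassis_needing_change(observed))
```

As the explanation notes, a policy change only takes effect on a chassis once it is reacknowledged, so the flagged chassis are the ones to revisit after the policy is set back to port channels.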

 

Observation 2: MAC address pools are not defined separately for the A side and B side fabrics

 

Explanation: Per Cisco best practices, MAC address pools should be created with separate address ranges for the A side fabric and the B side fabric. Separate ranges simplify troubleshooting because a unique bit in the address identifies whether a vNIC belongs to the A or B side.

 

Recommendation: Create separate MAC address pools for each fabric and reconfigure the existing vNICs to use them.
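
One possible convention is to encode the fabric in the fourth octet of the address. The sketch below assumes the commonly used 00:25:B5 prefix and uses a whole hex digit (A or B) rather than a single bit as the fabric marker; the function name and the `domain_id` digit are hypothetical parts of this particular layout.

```python
UCS_MAC_PREFIX = "00:25:B5"  # assumed prefix; substitute the one used in your environment


def mac_pool_range(fabric, domain_id, size):
    """Return (first, last) MAC addresses of a pool whose fourth octet encodes the fabric."""
    fabric = fabric.upper()
    if fabric not in ("A", "B"):
        raise ValueError("fabric must be 'A' or 'B'")
    if not 0 <= domain_id <= 0xF:
        raise ValueError("domain_id must fit in one hex digit")
    if not 1 <= size <= 0x10000:
        raise ValueError("size must be between 1 and 65536")
    fourth_octet = f"{domain_id:X}{fabric}"  # e.g. "1A" = domain 1, fabric A
    last = size - 1
    high, low = last >> 8, last & 0xFF
    first_mac = f"{UCS_MAC_PREFIX}:{fourth_octet}:00:00"
    last_mac = f"{UCS_MAC_PREFIX}:{fourth_octet}:{high:02X}:{low:02X}"
    return first_mac, last_mac


print(mac_pool_range("A", 1, 256))
print(mac_pool_range("B", 1, 256))
```

With this layout, a MAC address seen in an upstream switch table shows at a glance which fabric the vNIC is attached to.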

 

Observation 3: Empty default pools exist and generate faults

 

Explanation: UCS Manager includes empty default pools in its out-of-the-box configuration. Because these pools cannot be renamed, customers often create custom pools for servers, WWNs, MAC addresses, iSCSI IP addresses, and IQNs instead. A fault is raised and persists until each default pool either contains entries or is removed.

 

Recommendation: Remove the empty default server, MAC, WWN, iSCSI IP, and IQN pools.

 

Observation 4: External authentication is not used

 

Explanation: Although TACACS+ servers are defined in the system, the audit logs indicate that external authentication is not in use. External authentication is recommended for role-based access and to capture the name of the administrator making changes in the audit logs sent via syslog.

 

Recommendation: Configure external authentication using an existing LDAP, TACACS+, or RADIUS infrastructure. Verify that logon usernames appear in the audit logs sent via syslog.

 

Observation 5: Blade 3/5 DIMMs 29 and 30 are missing or invalid

 

Explanation: A blade server has missing or invalid memory DIMMs, which causes the physical blade to be marked as inoperable. The DIMMs may not be seated properly, or another hardware issue may be present. The issue must be repaired before the blade and its associated service profile can operate normally.

 

Recommendation: Work with Cisco TAC on next steps, including possible DIMM replacement.

 

Observation 6: Chassis links are distributed across ASICs on the fabric interconnects

 

Explanation: Chassis cabling has been distributed across the fabric interconnect ports so that each link of a pair connects to a port served by a different back-end processor chip (ASIC) within the fabric interconnect. This configuration does not follow best practices for connectivity using chassis port channels.

 

Recommendation: Move the links from the separated ports to ports within the same ASIC, so that all links of a chassis port channel terminate on one ASIC.
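
A cabling plan can be checked for this condition before anything is moved. The sketch below assumes that consecutive fabric interconnect ports are grouped onto ASICs in fixed-size blocks; the block size depends on the fabric interconnect model, so the value 8 used here is only a placeholder, and the function names and cabling data are hypothetical.

```python
def port_group(port, ports_per_asic):
    """Return the zero-based ASIC group for a 1-based port number."""
    return (port - 1) // ports_per_asic


def chassis_spanning_asics(chassis_links, ports_per_asic):
    """Return sorted chassis IDs whose links land on more than one ASIC group."""
    return sorted(
        chassis
        for chassis, ports in chassis_links.items()
        if len({port_group(p, ports_per_asic) for p in ports}) > 1
    )


# Hypothetical cabling: chassis ID -> fabric interconnect ports in use.
cabling = {1: [1, 2], 2: [8, 9], 3: [17, 18]}
print(chassis_spanning_asics(cabling, ports_per_asic=8))
```

In this example only chassis 2 is flagged, because ports 8 and 9 fall into different blocks of eight.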

 

Observation 7: Call Home, SNMP, syslog, and SEL policy monitoring are not properly configured

 

Explanation: The monitoring tools built into UCS are either not configured or not being used properly. Call Home, which sends proactive email alerts, is not turned on or configured. SNMP traps are not being sent to a monitoring station. SEL logs from the blades are not captured and stored centrally. These items are critical for monitoring the systems for faults and for troubleshooting errors that occur.

 

Recommendation: Configure Call Home for proactive email alerts of faults and errors. Configure SNMP traps, and compile the UCS SNMP MIBs on the monitoring station so that traps are parsed properly for proactive notification of faults and errors. Direct system audit log output to a remote collector via syslog. Configure the SEL policy to capture blade SEL logs for troubleshooting and to clear the logs automatically when they are full.

 

Observation 8: BIOS policies are using all platform-default settings

 

Explanation: A BIOS policy is defined for the existing service profile templates, but it does not change any setting from the blade's platform default. Cisco can provide specific recommendations for BIOS settings based on the installed OS. These settings should be applied through appropriate BIOS policies.

 

Recommendation: Configure BIOS policies to enable or disable features according to Cisco best practices.

 

Related Information

Frequently Asked Questions for UCS

Troubleshooting and Clearing UCS Faults

UCS Failure Scenarios Testing using CLI
