UCS Best Practices and Common Recommendations for Configuration and Management

Sandeep Singh · 03-31-2013

Introduction

This document explains some best practices and common mistakes observed in UCS setups and provides recommendations. This list is not exhaustive but is useful for any UCS deployment.

UCS Best Practices

Networking:

Use End Host Mode (EHM) where possible to allow for simple deployment methodologies
Use default (round-robin) pinning of servers to uplinks
Use host-based methods to keep east-west traffic switched locally on a single Fabric Interconnect (FI)
Use appropriate MTU sizes
Use Port Channels for connectivity to upstream resources; pinned server interfaces will take advantage of the Port Channel hashing algorithm
Match the native VLAN between the Fabric Interconnect uplinks and the northbound switch ports; it does not need to match the vNIC trunk settings
Prune unnecessary VLANs from upstream interfaces
Enable STP PortFast (port type “edge trunk”) on upstream switch interfaces
Use meaningful names for VLANs within UCS
Reserve several VLAN IDs for internal FCoE use
Use System QoS to prioritize traffic between environments at the vNIC level; this guarantees minimum bandwidth but does not rate-limit those interfaces
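
The difference between a guaranteed minimum and a rate limit can be illustrated with a little arithmetic. The following is a minimal sketch, assuming each enabled QoS system class receives a minimum share proportional to its weight (weight divided by the sum of all weights). The class names and weights are illustrative values only, not settings taken from any particular deployment.

```python
def guaranteed_shares(weights):
    """Return each class's guaranteed minimum bandwidth share in percent.

    Assumption: the share is weight / sum(weights). A class may still use
    more than its share when other classes are idle, because the share is
    a minimum and not a cap.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one class needs a positive weight")
    return {name: 100.0 * weight / total for name, weight in weights.items()}


# Illustrative weights only.
example_weights = {"platinum": 10, "best-effort": 5, "fc": 5}
shares = guaranteed_shares(example_weights)
print(shares)  # {'platinum': 50.0, 'best-effort': 25.0, 'fc': 25.0}
```

Under these assumed weights, the platinum class is guaranteed half of the link under congestion, yet any class can consume idle bandwidth left by the others.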

Storage:

Leverage Boot from SAN to enable a stateless infrastructure
Use default (round-robin) pinning of server vHBAs to uplinks
Leverage Boot Policies to distribute initial boot traffic across backend storage array ports
Use VSAN names and IDs that correlate to upstream fabrics (where applicable)
Leverage redundant SAN fabric switches
Use host-based multipathing for redundancy and load-balancing
Pre-provision SAN resources where applicable to reduce deployment time and enhance stateless infrastructure

Compute:

Use Service Profiles to manage compute resources logically
Map Service Profiles to blades with a deterministic approach; distributing compute resources across I/O modules and chassis uplinks limits fault domains and improves utilization
Leverage a deterministic logical addressing scheme for UUID, WWNN/WWPN, and MAC addresses to represent a given compute resource
Name Service Profiles to match a particular use case
Use policies, such as boot policies, to standardize Service Profile settings
Use management and host firmware packages to control firmware versions on a per-Service Profile basis
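
One way to picture a deterministic mapping of Service Profiles to blades is a simple round-robin across chassis, so that consecutive members of a group land in different chassis. The sketch below is only an illustration of that idea. The function name, the profile names, and the assumption of eight slots per chassis are hypothetical.

```python
def place_profiles(profiles, chassis_ids, slots_per_chassis=8):
    """Assign profiles to (chassis, slot) pairs, rotating across chassis.

    Assumption: every chassis offers the same number of usable slots
    (8 here, chosen purely for illustration).
    """
    capacity = len(chassis_ids) * slots_per_chassis
    if len(profiles) > capacity:
        raise ValueError("not enough slots for the requested profiles")
    placement = {}
    for index, name in enumerate(profiles):
        chassis = chassis_ids[index % len(chassis_ids)]  # rotate chassis
        slot = index // len(chassis_ids) + 1             # next slot per pass
        placement[name] = (chassis, slot)
    return placement


# Hypothetical profile names.
placement = place_profiles(["esx-01", "esx-02", "esx-03", "esx-04"], [1, 2])
print(placement)
# {'esx-01': (1, 1), 'esx-02': (2, 1), 'esx-03': (1, 2), 'esx-04': (2, 2)}
```

Because the result depends only on the order of the inputs, rerunning the mapping reproduces the same layout, which is the property a deterministic approach is after.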

Management:

Manage orchestration aspects through API scripts and tools, enabling a consistent methodology across UCS instances
Build UCS instances through provisioning scripts, allowing for simple, rapid, and consistent deployments of new UCS instances
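
As an illustration of script-driven management, the sketch below builds a few request documents in the style of the UCS Manager XML API (login, class query, logout) using only the Python standard library. It deliberately stops short of sending anything over the network. The helper names, credentials, and cookie value are hypothetical, and the method and attribute names should be checked against the XML API documentation for the UCS Manager release in use.

```python
import xml.etree.ElementTree as ET


def build_request(method, **attributes):
    """Serialize a single-element XML API request document."""
    return ET.tostring(ET.Element(method, attributes), encoding="unicode")


def login_request(username, password):
    # aaaLogin returns a session cookie that later requests must carry.
    return build_request("aaaLogin", inName=username, inPassword=password)


def resolve_class_request(cookie, class_id):
    # configResolveClass asks for all managed objects of one class,
    # for example "faultInst" for the current faults.
    return build_request(
        "configResolveClass",
        cookie=cookie,
        classId=class_id,
        inHierarchical="false",
    )


def logout_request(cookie):
    return build_request("aaaLogout", inCookie=cookie)


# Placeholder credentials and cookie, for illustration only.
login_xml = login_request("admin", "example-password")
faults_xml = resolve_class_request("example-cookie", "faultInst")
print(login_xml)
print(faults_xml)
```

Keeping request construction in small functions like these makes it easier to run the same queries against every UCS instance, which is where the consistency benefit comes from.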

Observations and Recommendations

Observation 1: Chassis uplink connections are mismatched; chassis 2, 3, and 5 use discrete links, while chassis 1 and 4 use port channels

Explanation: Chassis uplinks were originally configured to use link grouping, which creates port channels between the fabric interconnects and the chassis. The policy was later changed to use discrete links. The three chassis that have been reacknowledged since the change now use discrete links, while the other two still use port channels.

Recommendation: Standardize on using port channels for each chassis.
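
A mismatch like this is easy to spot once the link mode of each chassis has been collected. The following minimal sketch assumes the modes have already been gathered into a dictionary (by hand or by a script); the dictionary layout and the mode labels are assumptions made for illustration, while the chassis numbers mirror the observation above.

```python
def chassis_not_using(link_modes, wanted="port-channel"):
    """Return the chassis IDs whose uplink mode differs from the wanted one."""
    return sorted(cid for cid, mode in link_modes.items() if mode != wanted)


# Link modes as described in the observation above.
observed = {
    1: "port-channel",
    2: "discrete",
    3: "discrete",
    4: "port-channel",
    5: "discrete",
}
print(chassis_not_using(observed))  # [2, 3, 5]
```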

Observation 2: MAC address pools are not defined separately for the A-side and B-side fabrics

Explanation: Per Cisco best practices, MAC address pools should be created with separate address ranges for the A-side fabric and the B-side fabric. Separate ranges simplify troubleshooting because a designated part of each address identifies whether a vNIC belongs to the A side or the B side.

Recommendation: Create separate MAC address pools for fabric A and fabric B, and reconfigure the existing vNICs to use them.
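
One possible addressing layout is sketched below. It assumes a 00:25:B5 prefix, a fourth byte that identifies the UCS domain, and a fifth byte of 0A or 0B that identifies the fabric; all three choices, along with the function name, are illustrative assumptions and not a prescribed scheme.

```python
def mac_pool_range(fabric, domain_id=1, size=256, oui="00:25:B5"):
    """Return (first, last) MAC addresses for one fabric's pool.

    Assumed layout: OUI : domain byte : fabric byte (0A or 0B) : host byte.
    """
    fabric_byte = {"A": 0x0A, "B": 0x0B}[fabric.upper()]
    if not 0 <= domain_id <= 0xFF:
        raise ValueError("domain_id must fit in one byte")
    if not 1 <= size <= 256:
        raise ValueError("size must be between 1 and 256 in this layout")
    prefix = f"{oui}:{domain_id:02X}:{fabric_byte:02X}"
    return f"{prefix}:00", f"{prefix}:{size - 1:02X}"


print(mac_pool_range("A"))  # ('00:25:B5:01:0A:00', '00:25:B5:01:0A:FF')
print(mac_pool_range("B"))  # ('00:25:B5:01:0B:00', '00:25:B5:01:0B:FF')
```

With a layout like this, the fifth byte of any MAC address seen on the network immediately shows which fabric the vNIC belongs to.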

Observation 3: Empty default pools exist and generate faults

Explanation: Empty default pools are included in the out-of-the-box UCS Manager configuration. Because these pools cannot be renamed, customers often create custom pools for WWNs, MAC addresses, servers, iSCSI IP addresses, and IQNs, and leave the defaults empty. A fault is raised for each empty default pool and persists until the pool is populated or removed.

Recommendation: Remove the empty default server, MAC, WWN, iSCSI IP, and IQN pools.

Observation 4: External authentication is not used

Explanation: Although TACACS+ servers are defined in the system, the audit logs indicate that external authentication is not in use. External authentication is recommended for role-based access control and for capturing the name of the administrator who made each change in the audit logs sent via syslog.

Recommendation: Configure external authentication using an existing LDAP, TACACS+, or RADIUS infrastructure. Verify that login usernames appear in the audit logs sent via syslog.

Observation 5: DIMMs 29 and 30 on blade 3/5 are missing or invalid

Explanation: A blade server has missing or invalid memory DIMMs, so the physical blade is marked as inoperable. The DIMMs may not be seated properly, or another hardware issue may be present. The issue must be repaired before the blade and its associated service profile can operate normally.

Recommendation: Work with Cisco TAC on next steps, which may include reseating or replacing the DIMMs.

Observation 6: Chassis links are distributed across ASICs on the fabric interconnects

Explanation: Chassis cabling has been distributed across the fabric interconnect ports so that each link of a pair connects to a port served by a different back-end processor chip (ASIC) within the fabric interconnect. This configuration does not follow best practices for connectivity using chassis port channels.

Recommendation: Move the links from the separated ports to ports within the same ASIC.
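
Whether a set of ports shares an ASIC can be checked with simple integer arithmetic once the port grouping is known. The sketch below assumes that consecutive ports are served by one ASIC in groups of eight (1-8, 9-16, and so on). That group size is an assumption that varies by fabric interconnect model, so it is exposed as a parameter and should be confirmed against the hardware documentation.

```python
def port_group(port, group_size=8):
    """Return the zero-based index of the port group containing this port."""
    if port < 1:
        raise ValueError("port numbers start at 1")
    return (port - 1) // group_size


def same_port_group(ports, group_size=8):
    """True if every port falls into the same assumed ASIC port group."""
    return len({port_group(p, group_size) for p in ports}) == 1


print(same_port_group([1, 2, 3, 4]))  # True: all within ports 1-8
print(same_port_group([7, 9]))        # False: ports 1-8 and 9-16
```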

Observation 7: Call Home, SNMP, syslog, and SEL policy monitoring are not properly configured

Explanation: The monitoring tools built into UCS are either not configured or not used properly. Call Home, which sends proactive email alerts, is not enabled or configured. SNMP traps are not being sent to a monitoring station. SEL logs from the blades are not captured and stored centrally. These components are critical for monitoring the systems for faults and for troubleshooting errors that occur.

Recommendation: Configure Call Home for proactive email alerts on faults and errors. Configure SNMP traps, and compile the UCS SNMP MIBs so that the traps are parsed properly for proactive notification of faults and errors. Direct system audit log output to a remote collector via syslog. Configure the SEL policy to capture blade SEL logs for troubleshooting and to clear the logs automatically when they are full.

Observation 8: BIOS policies use platform-default settings throughout

Explanation: A BIOS policy is defined for the existing service profile templates, but it does not change any settings from the blade's platform defaults. Cisco can provide specific recommendations for BIOS settings based on the installed OS. These settings should be applied in appropriate BIOS policies.

Recommendation: Configure BIOS policies to enable or disable features according to Cisco best practices.

Related Information

Frequently Asked Questions for UCS

Troubleshooting and Clearing UCS Faults

UCS Failure Scenarios Testing using CLI