CiscoLive! Europe just came to a close at the ExCel London. The show network was an entire Cisco-on-Cisco affair. Cisco IT built a pure Cisco data, voice, and video network including full-on IPv6, IP video surveillance complete with analytics, and green technology based on Energywise. Managing all of this was an integrated Cisco NMS solution. In this blog post, I will tell my tale of the Cisco-on-Cisco network management story at CiscoLive! London.
I arrived at the ExCel London on Sunday, January 30 around noon. I met up with three of my NMS colleagues: Tejas Shah (LAN Management Solution Technical Marketing Engineer), Stuart Parham (Consulting Systems Engineer), and Cengiz Savas (Systems Engineer). While the network (including the network management) had been pre-staged, we needed to bring the NMS servers online. We proceeded into the bowels of the convention center to the IDF that held our makeshift data center. There was a rack with multiple UCS servers. These servers housed important services such as the Cisco Network Registrar DNS and DHCP server, Cisco Secure ACS server, and the Energywise Orchestrator.
One of the UCS C200 servers housed our main network management applications. This server was running VMware ESX 4.1 with virtual machines for CiscoWorks LAN Management Solution 4.0 and Cisco Unified Communication Management Suite 8.5 (pre-release). The server had two eight-core CPUs with 16 GB of RAM. Each VM ran Windows Server 2003 and was assigned one vCPU and 4 GB of RAM. We had some spare VMs just in case. Since we had two NMS servers running on the same physical server, we wanted to configure ESX with one management interface for kernel activity and a port channel with two Gigabit Ethernet ports for VM network activity. We also needed to move the VMs into the out-of-band management VLAN. This was a secure VLAN only allowing local traffic.
Just behind the server rack, on the far wall of the data center, was the full network topology diagram.
The network design was a multi-tier core, distribution, and access in a non-blocking configuration. Each device had an IP address in the secure management VLAN. All told, there would be 140 switches, two Cisco Unified Communication Managers, and about 60 phones to be managed throughout the convention center. No problem for LMS and CUCMS.
Once the UCS server was properly configured on the management VLAN, we proceeded to the pre-show Network Operations Center to complete the setup. When we arrived at the NOC, we saw that the team had already set up their own NMS solution with PRTG and Kiwi Syslog. We had our work cut out for us to convince the network staff that we had something better.
We got to work configuring LMS and CUCMS. The first task was to discover the network. We obtained the SNMP read-only community string for the switches, then configured LMS Discovery using the CDP module. The first pass found quite a few devices, but took a very long time to complete. We found the reason for this was that some devices still had "public" configured as the community string. We adjusted the Discovery credentials so that we could pick up these errant devices, and then we re-ran Discovery. This time, it completed more quickly.
Now that we had the network discovered, we needed to start fixing some of the problems we had noticed. Besides the community string mismatches, the LMS server was not configured as a syslog host. Unfortunately, we did not have any read-write credentials. We still needed to convince the network staff that we could provide value. Fortunately, there is a lot you can do with read-only SNMP credentials. I went to the LMS Fault Monitor, and noticed we had found some events on the network already. Some of them looked rather serious.
We were seeing that some of the Xenpaks on the core VSS were experiencing a high temperature alert. At the same time, one of the network engineers sitting next to us was trying to correlate devices to serial numbers (something LMS could do with ease). Tejas exported a custom report correlating device names to serial numbers and gave it to the engineer. He then took the temperature alert information to the network team, and they were very interested in LMS. Tejas then started to explain what else we could do if we had full read-write access. He talked about configuration archival, compliance management, and the ability to quickly deploy configuration changes to all devices. They agreed to provide the credentials and look into the problems that LMS had already started to report.
No sooner did we have read-write access to the devices than the network team wanted our help. Besides the problematic SNMP credentials, some of the devices did not have the proper TACACS+ secret key configured. To find those devices, we configured a baseline compliance template to make sure all devices had the following global configuration:
+ tacacs-server key [KEY]
We ran a compliance check and found four devices to be non-compliant. A few clicks later, and we had the compliant configuration pushed down to those devices.
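The idea behind that check can be illustrated outside of LMS. The following Python sketch is not how LMS implements baseline compliance. It only assumes that each device's running configuration is available as text, and the device names, sample configurations, key value, and the function name `find_noncompliant` are all hypothetical.

```python
def find_noncompliant(configs, required_line):
    """Return sorted device names whose config lacks the required global line.

    Only unindented lines are compared, so a matching sub-mode line
    does not count as the global command.
    """
    return sorted(
        name
        for name, text in configs.items()
        if required_line not in (line.rstrip() for line in text.splitlines())
    )


# Made-up sample configurations, for illustration only.
sample_configs = {
    "access-sw-01": "hostname access-sw-01\ntacacs-server key EXAMPLE\n",
    "access-sw-02": "hostname access-sw-02\ntacacs-server key WRONG\n",
    "access-sw-03": "hostname access-sw-03\n",
}

print(find_noncompliant(sample_configs, "tacacs-server key EXAMPLE"))
```

With these sample inputs, the second and third devices are reported: one has a different key and the other has no key at all.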
Meanwhile, June Zheng (CUCMS product manager) worked on setting up CUCMS. She discovered the two CUCMs, the MGCP gateway, and phones that were currently online. CUCMS performed perfectly. It found all of the devices without any issues, and started monitoring each node for problems.
Back to LMS, we wanted to make sure LMS could see all of the network events, so we deployed a job to add the LMS server's IP as a syslog server to each device. For this, we walked one of the network team engineers through configuring Netconfig's syslog template to add the new logging host. It was a good opportunity to show off the ease and power of LMS.
It was around this time the network team started to notice some problems with multicast. Given all of the video at CiscoLive!, they were generating a lot of multicast and broadcast traffic. This was triggering storm events on the switches. They needed to deploy modified multicast and broadcast storm control policies to all access ports and all distribution uplinks. This sounded like another perfect job for baseline compliance. We confirmed that all access ports had the command switchport mode access configured. Because of this, we defined the following baseline prerequisite for our multicast template:
Sub-mode: interface [#((Fast)|(Gigabit))Ethernet.*#]
+ switchport mode access
For ports that match that prerequisite, we would apply the following template:
+ storm-control multicast level pps 2000
+ storm-control broadcast level pps 2000
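To make the prerequisite logic concrete, here is a rough Python sketch of the same idea: look only at interfaces that match the pattern and already carry the prerequisite command, then report which required commands are missing. This is an illustration under stated assumptions, not the LMS implementation. It assumes sub-commands are indented by one space as in a typical running configuration, and the sample configuration and function names are hypothetical.

```python
import re

# Same interface pattern as the prerequisite above.
INTERFACE = re.compile(r"^interface ((Fast)|(Gigabit))Ethernet.*")
PREREQUISITE = "switchport mode access"
REQUIRED = [
    "storm-control multicast level pps 2000",
    "storm-control broadcast level pps 2000",
]


def interface_blocks(config_text):
    """Yield (interface line, [sub-commands]) for each matching interface."""
    header, body = None, []
    for line in config_text.splitlines():
        if line.startswith(" ") and header is not None:
            body.append(line.strip())
            continue
        if header is not None:
            yield header, body
        header, body = (line, []) if INTERFACE.match(line) else (None, [])
    if header is not None:
        yield header, body


def missing_commands(config_text):
    """Map each access port to the required commands it lacks."""
    result = {}
    for header, body in interface_blocks(config_text):
        if PREREQUISITE not in body:
            continue  # prerequisite not met, so the template does not apply
        missing = [cmd for cmd in REQUIRED if cmd not in body]
        if missing:
            result[header] = missing
    return result


# Made-up sample configuration, for illustration only.
sample = """interface FastEthernet0/1
 switchport mode access
 storm-control broadcast level pps 2000
!
interface FastEthernet0/2
 switchport mode access
 storm-control multicast level pps 2000
 storm-control broadcast level pps 2000
!
interface GigabitEthernet0/1
 switchport mode trunk
!
"""

print(missing_commands(sample))
```

One caveat: a literal string comparison like this is naive. A switch may display the value in a different form than it was entered (for example, 2k rather than 2000), so a real check would need to normalize such values first.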
For the distribution uplinks, we had to remove the existing statically configured policy. We created the following template to do that:
Sub-mode: interface [#((Fast)|(Gigabit)|(Tengigabit))Ethernet.*#]
- storm-control multicast level 1.00
We ran the compliance report, then applied the required commands to the non-compliant devices. At this point, it was very close to midnight, and I worried that I would not be able to catch the train back to the hotel. I packed up to leave, but Tejas, who was staying at a hotel within walking distance, remained to continue to refine the network configuration.
Remember those devices with the community string of "public"? Tejas used LMS to deploy compliant configurations to those devices so that they received the correct community strings. Even though the management network was restricted, we wanted to be extra careful when it came to management access, so Tejas also deployed an ACL for the community strings. When it was all said and done, the configuration (for the access switches at least) looked like the following.
no service pad
service timestamps debug datetime msec
service timestamps log datetime msec
no service password-encryption
enable secret 5 <removed>
aaa authentication login default group tacacs+ local
aaa session-id common
system mtu routing 1500
vtp domain networkers
vtp mode transparent
ip domain-name events-cisco.com
ip name-server 172.16.14.5
ip dhcp snooping vlan 2-4,6-13,15,19,21-25,30-38,42-43
no ip dhcp snooping information option
ip dhcp snooping
energywise domain clive01 security shared-secret 0 <removed>
energywise role switch
energywise management security shared-secret 0 <removed>
energywise allow query save
energywise endpoint security shared-secret 0 <removed>
crypto pki trustpoint TP-self-signed-3891214336
crypto pki certificate chain TP-self-signed-3891214336
certificate self-signed 01
errdisable recovery cause bpduguard
errdisable recovery interval 30
spanning-tree mode rapid-pvst
spanning-tree extend system-id
vlan internal allocation policy ascending
interface range FastEthernet0/1 - 47
description *** Access Port ***
switchport access vlan 10
switchport mode access
switchport port-security maximum 5
storm-control broadcast level pps 1k
storm-control multicast level pps 2k
no cdp enable
no cdp tlv server-location
no cdp tlv app
spanning-tree bpduguard enable
description *** To Distribution Switch ***
switchport trunk encapsulation dot1q
switchport mode trunk
ip dhcp snooping trust
no ip address
description *** OOB Management Interface ***
ip address 172.100.100.X 255.255.255.0
ip default-gateway 18.104.22.168
ip http server
ip http secure-server
ip tacacs source-interface Vlan250
ip access-list standard RESTRICT_SNMP
permit 22.214.171.124 0.0.0.255
ip sla enable reaction-alerts
snmp-server community <removed> RW RESTRICT_SNMP
snmp-server community <removed> RO RESTRICT_SNMP
tacacs-server host 126.96.36.199
tacacs-server key <removed>
banner motd ^C
## Cisco CPOC Networkers Team ##
## UNAUTHORIZED ACCESS IS PROHIBITED ##
## All sessions to this device are being monitored. ##
## If unauthorized access is detected, your address ##
## will be logged and the authorities will be ##
## notified to take appropriate actions. ##
line con 0
line vty 0 4
line vty 5 15
ntp clock-period 36029214
ntp server 188.8.131.52
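The RESTRICT_SNMP access list in the configuration above relies on a wildcard mask: bits set to 1 in the mask are ignored, and bits set to 0 must match. A small Python sketch of that matching rule follows. The addresses are documentation-range placeholders rather than the show's real addressing, and the function name `wildcard_permits` is hypothetical.

```python
import ipaddress


def wildcard_permits(source, base, wildcard):
    """Return True if `source` matches `base` under an IOS-style wildcard mask.

    Bits set to 1 in the wildcard are ignored; bits set to 0 must match.
    """
    src = int(ipaddress.IPv4Address(source))
    net = int(ipaddress.IPv4Address(base))
    care = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (src & care) == (net & care)


# Placeholder addresses, for illustration only.
print(wildcard_permits("192.0.2.25", "192.0.2.0", "0.0.0.255"))     # True
print(wildcard_permits("198.51.100.25", "192.0.2.0", "0.0.0.255"))  # False
```

Because a standard access list ends with an implicit deny, any SNMP source that does not match the single permit line is refused, even if it knows the community string.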
The hub for network operations was an octagon-shaped "fishbowl" NOC in the middle of the World of Solutions. On one side was the entrance to the NOC. One side was set up with a glass window and cute signs like "Don't feed the engineers." On the remaining six sides were monitors highlighting the network management applications being used to manage the production show network. Let's take a brief stroll around the NOC.
Here I am next to the entry way.
One of the monitors was showing a slide show of the Cisco-on-Cisco story highlighting the technologies being used in the NOC and in the network.
The next monitor was showing CUCMS managing our CUCM cluster. This cluster consisted of one publisher, one subscriber, and one MGCP gateway. The cluster serviced all of the public phones in the venue. We were offering customers five free minutes of calling to anywhere in the world.
The IPv6 team was using Munin to measure IPv6 end hosts and traffic. CiscoLive! London set records in terms of the number of IPv6 end hosts. Because operating systems like Mac OS X and Windows 7 have IPv6 enabled by default, users were getting IP addresses and accessing the Internet without ever knowing the difference.
Not to be outdone, we set up a poller in LMS to monitor traffic on the IPv6 router.
Our next stop is Energywise. In addition to using LMS to manage Energywise, we had Energywise Orchestrator running to monitor the power consumption of the network. While we were not powering off devices or ports, we were able to measure the overall power consumption of the show.
That brings us to LMS. More on the uses of LMS later.
Finally, you cannot do a trade show like this without wireless. The wireless team was using a controller-based wireless network and managing it with the Wireless Control System (WCS). Using WCS, they were able to spot areas of weak coverage and isolate interference issues.
Throughout the show, we regularly relied on LMS to report faults and measure capacity. At one point, the IP video surveillance team reported that they were getting choppy or hanging video from one set of access switches hanging off of a particular distribution switch. Using Topology Services in LMS, we traced the layer 2 path from the switch to which the camera connected to one of the hosts reporting the problem. We then used Netshow to look at errors on each of the ports. This showed that one of the uplink ports was reporting thousands of output errors. Replacing the Xenpak on this port corrected the problem. After that, we set up a performance poller to watch the distribution uplinks for errors.
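The logic of watching uplinks for errors comes down to comparing counter samples between polls. The Python sketch below shows that comparison in isolation. It is not the LMS poller; it assumes the output error counters have already been collected into dictionaries, and the port names, counter values, and function name `rising_error_ports` are hypothetical.

```python
def rising_error_ports(previous, current, threshold=0):
    """Return {port: delta} for ports whose error counter grew by more than threshold.

    Ports seen for the first time get a delta of zero, and a counter that
    went backwards (for example after a reload) is ignored rather than flagged.
    """
    deltas = {}
    for port, count in current.items():
        delta = count - previous.get(port, count)
        if delta > threshold:
            deltas[port] = delta
    return deltas


# Made-up samples from two consecutive polls, for illustration only.
previous_poll = {"TenGigabitEthernet1/1": 100, "TenGigabitEthernet1/2": 5}
current_poll = {"TenGigabitEthernet1/1": 4100, "TenGigabitEthernet1/2": 5}

print(rising_error_ports(previous_poll, current_poll))
```

Alerting on the change between polls rather than on the absolute counter value keeps a port with old, stable errors from being flagged over and over.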
For capacity, we monitored the number of end hosts in User Tracking. Since we were not using dynamic User Tracking, we had to rely on acquisitions. The peak number of end hosts recorded for the show was around 5800.
Cisco network management stood up well at CiscoLive!. We were able to use it to find faults, maintain consistent configurations, and track capacity. By adopting a Cisco-on-Cisco methodology for shows like CiscoLive!, we show off the power of our products as well as see areas where we can continue to improve. I am looking forward to taking what I've learned to my next CiscoLive! in Las Vegas. I hope to see you there.