Cisco Collapsed Core - Network Issues

Nicholas Beard
Level 1

Hi guys,

I have been tasked with auditing an existing Cisco infrastructure over the last few days, as the customer is experiencing issues with the network.  Before I discuss the current configuration and topology, please see below the list of issues being experienced:

1.  DHCP requests appear to time out before eventually succeeding, leading to users being unable to log in and a message confirming that the Domain Controller for the domain is not available.

2.  Desktops experience frequent network dropouts at random times.

3.  Outlook and Exchange connectivity drops intermittently.

4.  Users are frequently asked to re-enter their credentials at certain points of the day to re-authenticate their Outlook and intranet sessions.

Please bear with me on this, as it is an extremely complicated network which has been very poorly configured.  I will do my best to describe how the network is interconnected and what the exact topology is.  There was recently a move towards segmenting the network from a single flat VLAN (1) into multiple VLANs/segments.  This has since been abandoned, leaving the network in a further complicated state with VLANs 1 and 3 appearing to cross over each other.

Servers

Servers are a mix of physical HP servers and blade servers within a C7000 blade chassis with HP ProCurve 6120XG blade switch interconnects.  The servers run VMware ESXi 4.1 within a multi-cluster VMware vSphere 4.1 environment.  Although not specifically relevant, there are approximately 200 VMs set up and live across a mix of Citrix XenServer and Microsoft platforms.  Citrix XenServer is primarily used for provisioning desktops as a VDI technology, as the company is migrating to thin clients.

The Distribution/Core Layer

The core comprises two stacked Cisco 3750G switches.  The stack has 18 SVIs configured for inter-VLAN routing and uses EIGRP to route dynamically between networks.  The network runs a public IP address schema internally on a /16 network; as mentioned previously, there was an attempt to break away from this which has since been abandoned, so the flat VLAN on a /16 network is still present.  There are approximately 23 user-specified VLANs configured.  All switch ports are configured as trunk ports with the native VLAN left at the default (1).  The VLANs are as follows: 1, 2, 3, 4, 10, 20, 25, 30, 50, 51, 99, 108, 109, 110, 116, 117, 118, 119, 120, 121, 122, 123, 124.  This stack is configured as the spanning tree root for VLANs 2-3 and 108-123 and runs only PVST (NOT rapid-pvst).
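For illustration, the core settings described would look roughly like the following (a hypothetical reconstruction; the interface number and priority value are invented, not taken from the audit):

spanning-tree mode pvst
! Root bridge for VLANs 2-3 and 108-123
spanning-tree vlan 2-3,108-123 priority 4096
!
interface GigabitEthernet1/0/1
 ! 3750 trunks need the encapsulation set before trunk mode
 switchport trunk encapsulation dot1q
 switchport mode trunk
 ! No "switchport trunk native vlan" line, so untagged frames fall into VLAN 1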

Blade Server Access Layer

As mentioned above, the blade chassis are interconnected with an HP ProCurve 6120XG blade switch and uplink directly to the Cisco 4510R switch mentioned in the next section.  The ProCurve switch has three VLANs configured (1, 124 and 106), with VLAN 1 being the untagged/native VLAN.

The Server Access Layer

The server access layer consists of a Cisco 4510R switch with redundant supervisors and 3 x 48-port line cards.  The switch has 11 SVIs configured for inter-VLAN routing and uses EIGRP to route dynamically between networks.  All servers connect to this switch via 4 Gbps EtherChannels with a mix of "switchport mode access" plus "switchport access vlan 3" or "switchport mode trunk" configurations.  This is the first instance where the VLAN 1 and VLAN 3 crossover begins to occur: the port channels of the servers running in trunk mode do not specify a native VLAN and therefore default to VLAN 1, while the port channels of the servers running in access mode do specify VLAN 3.  Finally, there is a dual-gigabit port channel to the core 3750 stack, again set up as a trunk without a native VLAN specified, therefore defaulting to 1.  The configuration matches at the core end of the uplink, with the port channel set up as a trunk with no native VLAN specified, so defaulting to 1.

The VLANs on this switch are as follows: 1, 2, 3, 4, 10, 20, 25, 30, 50, 51, 99, 100, 101, 102, 103, 106, 124, 125, 126, 127.  This switch is configured as the spanning tree root for VLANs 100-103 and 124-127.
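To make the crossover concrete, the two server-facing port-channel styles described above would look roughly like this (channel numbers are hypothetical):

! Style 1: access-mode server EtherChannel, pinned to VLAN 3
interface Port-channel10
 switchport mode access
 switchport access vlan 3
!
! Style 2: trunk-mode server EtherChannel with no native VLAN set,
! so the server's untagged traffic lands in VLAN 1
interface Port-channel11
 switchport mode trunk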

The Desktop Access Layer

The desktop access layer comprises a mix of Cisco 2950, 2960 and 3560 switches.  There is no Layer 3 activity on these switches, and all of the VLANs listed above span every access switch.  Each access switch uplinks to the Cisco 3750G stack via a single 1 Gbps uplink configured as a trunk port; some prune VLANs from the trunk and some allow all VLANs.  Some of the trunk links specify "switchport trunk native vlan 3" and some are simply left as "switchport mode trunk", leaving the native VLAN at the default of 1.  The majority of ports are configured as trunk ports, some with native VLAN 3 and some left with the default of VLAN 1; the remaining ports are configured as access ports, again some in VLAN 3 and some in VLAN 1.
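A hypothetical side-by-side of the two uplink styles just described (interface numbers invented):

! Uplink style A: native VLAN explicitly moved to 3
interface GigabitEthernet0/1
 switchport mode trunk
 switchport trunk native vlan 3
!
! Uplink style B: native VLAN left at the default of 1
interface GigabitEthernet0/2
 switchport mode trunk
!
! With both styles in play, untagged frames land in VLAN 3 on some links
! and VLAN 1 on others; CDP will typically log native VLAN mismatches.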

Many of the access switches are experiencing output discards and receive errors on a multitude of ports, indicating a large volume of traffic filling the switch port buffers and subsequently causing the switches to drop frames.  There is also a group of around 15 switches averaging CPU usage of 50% and over across a 72-hour period, diagnosed with the "show proc cpu history" command.  The individual process responsible for the highest CPU usage during this time is the "HULC LED Process", and during troubleshooting I did notice several interfaces flapping up and down.  This was, however, extremely rare, and in most instances the ports were fine.

One interesting note came from the IOS software versions: the switches experiencing the high CPU usage were ALL running version 12.2(55)SE3 or greater, while the switches not experiencing this issue were running version 12.2(35)SE5 or lower.

VOIP

Finally, there are VoIP phones present on the network with desktops plugged into the back of them, which may explain why the vast majority of switch ports are configured as trunks; the voice VLAN is then tagged from the phone.  QoS is set up on all switches, but I have not spent much time investigating it and I am suspicious as to whether it is configured correctly.  There are currently no issues with VoIP phone performance.
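For reference, the usual Cisco approach for a phone-with-desktop port is an access port with a voice VLAN rather than a full trunk; a minimal sketch, with hypothetical VLAN numbers:

interface FastEthernet0/10
 switchport mode access
 switchport access vlan 10    ! data VLAN for the desktop (example number)
 switchport voice vlan 50     ! the phone tags voice traffic into this VLAN
 spanning-tree portfast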

Routing

As mentioned previously, all Layer 3 switches use EIGRP to exchange routing topology.  Desktops are configured with the 3750 core VLAN 3 SVI as their default gateway, while servers are configured with the 4510R server access VLAN 3 SVI as their default gateway.  Although VLAN 1 and VLAN 3 are separate at Layer 2, the subnetting and routing place both within the same /16 range, and they therefore overlap.
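A minimal sketch of the gateway/EIGRP arrangement described, with invented 172.16.0.0/16 addressing standing in for the public range and an assumed EIGRP AS number:

! On the 3750 core stack (desktops point at this SVI)
interface Vlan3
 ip address 172.16.0.1 255.255.0.0    ! placeholder for the public /16
!
router eigrp 100                      ! AS number assumed
 network 172.16.0.0 0.0.255.255
 no auto-summary
!
! Hosts in VLAN 1 draw addresses from the same /16, so the L3 subnet
! boundary no longer matches the L2 VLAN boundary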

Conclusion

What I plan to recommend is to continue segmenting the network into a structured VLAN design using a subnetted 172.16.0.0/16 range.  As far as the problems go, I fully believe the sheer size of VLANs 1 and 3 and the misconfiguration of access and trunk ports are causing the majority of the issues, along with the facts that the VLANs span all access switches, spanning tree is in PVST mode (causing slow topology convergence), and the voice VLANs are not configured appropriately.  I would also recommend moving to a proper three-tier switching infrastructure with the 4510R configured as the core, the 3750 stack as the distribution and the remaining switches as the access layer.

It would be greatly appreciated if anybody else could read through this and confirm any other worries or insights they may have (if any further information is required, or if I have left anything glaring out of the write-up, please let me know).  This would help me thoroughly present my findings to the customer and also leave me with the peace of mind that it has been peer reviewed by experts in the community.

Many Thanks

6 Replies

Jon Marshall
Hall of Fame

Nicholas

Firstly, if VoIP is running okay but there are intermittent timeouts with normal apps, then either they are very lucky or, I suspect, QoS has been set up with VoIP in mind, perhaps to the detriment of the other apps.

This reminds me of a thread you posted a while back about the 3-tier architecture.  I agree totally that you need to segment the network with VLANs; this is, as you say, probably the biggest cause of the problems.  And yes, plan to move from public IP space to private IP space.  Don't underestimate how time consuming this can be, though.  Clients are easy, but you need to understand all the apps on the network, as there may be some that use hardcoded IP addresses, and changing the IP address on a server could break that communication.  Also bear in mind that some servers may need L2 adjacency, so be careful if you plan to segment the servers into VLANs as well.

Where I would disagree is the 3-tier architecture proposal.  As I said in the previous thread, although you obviously didn't agree, a 3-tier architecture makes sense in a campus environment, i.e. where the core is used to interconnect the separate buildings.  But in a single building, what do you gain from this?  For example:

1) Where do you propose to connect the servers?  Certainly not the core in the Cisco design.  So you have moved the 4500 to the core and now you have to connect the servers to something else.

2) What would you send over the core?  If you are in a single building you could:

i) connect the servers to dedicated switches which then connect to the core, in which case you have merely introduced another hop to get from clients to servers; or

ii) connect the servers to dedicated switches which connect to the distribution switches, in which case what is the purpose of the core layer, since all your inter-VLAN routing will take place at the distribution layer?

It's not that having a core is wrong; it can be very useful, and if this client has multiple sites to interconnect with MAN connections etc. then it becomes vital.  But I can't see where you are proposing to connect the servers if you move the 4500 to the core.

If you could outline what advantages you think you would get from a 3-tier setup, it may be that I am missing something.

On the subject of the 4500: have you checked for oversubscription on this switch?  It all depends on the supervisor/modules/chassis you have, but a 4500 with a SupV, for example, is limited to 6 Gbps per slot, which can be extremely limiting.  You may need to recommend an upgrade for these switches.

Other recommendations:

1) Limit VLANs to specific switches if possible.

2) Use Rapid STP, assuming all switches support it.

3) Potentially downgrade the IOS on your access-layer switches if needed (assuming you don't need features available only in the later releases).

4) The usual L2 housekeeping: don't use VLAN 1, make the native VLAN an unused VLAN and don't create an L3 SVI for it, and use a separate VLAN for management of the switches (see the sketch below).
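A minimal sketch of points 2 and 4, with example VLAN numbers and a placeholder management address (none of these values come from the thread):

spanning-tree mode rapid-pvst            ! Rapid PVST+, where supported
!
vlan 999
 name UNUSED-NATIVE
vlan 900
 name MGMT
!
interface GigabitEthernet0/1
 switchport mode trunk
 switchport trunk native vlan 999        ! unused VLAN, no SVI created for it
!
interface Vlan900
 ip address 10.0.90.2 255.255.255.0      ! management SVI, placeholder address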

The one strange thing (strange because it shouldn't work that well) is that VoIP seems to be okay.  VoIP packets are very sensitive to delay etc., and since you have clients actually dropping off the network you would think VoIP would also suffer.  I strongly recommend looking at the QoS in more detail to see exactly what has been done with VoIP; it may be that other queues are getting starved of resources.
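On the 3750/3560 platforms, the per-queue drop counters are a reasonable place to start; these are standard IOS show commands, offered as a suggestion rather than taken from the thread:

show mls qos                                 ! is QoS globally enabled?
show mls qos interface gi1/0/1 statistics    ! per-queue enqueue/drop counters
show mls qos maps cos-output-q               ! CoS-to-egress-queue mapping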

That said, even if you only segmented the network, I suspect this would make a big difference.

Jon

Jon,

Thanks once again for your exceptional knowledge in this area and your prompt response.  I must stress that this is a different customer from the one in the previous post, and therefore the requirement for a three-tier core/distribution/access model may indeed prove beneficial; there are numerous outbuildings here with multiple switching cabinets.

Thankfully, I am only in a position to make recommendations and am not performing the overhaul myself, as this could be quite an in-depth and time-consuming process.  You are indeed correct regarding the servers, as there are numerous Microsoft-based servers here that would not react kindly to having their IP address information changed.

What I did not mention previously is the requirement to support multiple (50+) remote sites directly connected to this network over 10 Mbps and 100 Mbps LES circuits.  These currently make use of two Cisco 2821 routers directly connected to the 3750 core switch stack.

With regard to troubleshooting oversubscription of the 4510R switch, I have spent some time checking its CPU and memory usage statistics using commands such as "show proc cpu sorted", "show proc cpu history", "show platform health" and "show int counters errors".  The switch does not indicate any issues occurring at all.  I believe the switch is capable of supporting up to 48 Gbps per slot, but thanks for identifying this, as it is something I will double check.

I think that moving to a strategically segmented Layer 2 network, removing the VLAN sprawl, applying best practice to access and trunk ports, and moving to a private address range will significantly improve their experience.  Once this has been completed, we could more effectively troubleshoot any remaining problems.

Nick

Nick

I think that moving to a strategically segmented Layer 2 network, removing the VLAN sprawl, applying best practice to access and trunk ports, and moving to a private address range will significantly improve their experience.  Once this has been completed, we could more effectively troubleshoot any remaining problems.

Yes, I agree, doing that would be a very good first step.  There may then be further issues, but as you say it would make troubleshooting a lot easier.

You say there are 50+ sites all connected via 10/100 Mbps LES links.  I'm confused, as LES links tend to be point-to-point, so how can 50 sites be terminated on just two 2821 routers?  Alternatively, the remote sites could be using LES circuits to connect to a cloud, but then 50 sites running at either 10 or 100 Mbps could overload the main site links.  What speed links do you have on the 2821 routers?

Is this a future requirement or is it there now?  Can you clarify the current state of the LES links?

Jon


Nicholas Beard wrote:

switch does not indicate any issues occurring at all.  I believe the switch is capable of supporting up to 48 Gbps per slot, but thanks for identifying this, as it is something I will double check.

You need the "right" chassis, supervisor and line cards to obtain 48 Gbps per slot.  Ditto for 24 Gbps per slot.  Otherwise, as Jon mentions, you're limited to 6 Gbps per slot.

PS:

You might also check the hashing algorithm being used for the EtherChannel between the 4500 and the 3750G stack.

Thanks for the response; this is something I am going to confirm today.  With regard to the hashing algorithm, src-dst-mac is currently in use.


With src-dst-mac you might want to check individual link loads.  If you're routing between the core 3750G stack and the 4500, the source and destination MACs should stay the same, and then only one link would be used.  For EtherChannel links between servers and the 4500, here too the MACs on both sides would be the same, and again only one link would be used.

Some of the later switches offer src-dst-ip, which might remediate a load balancing issue.
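On a 3750 stack, for example, the hash method is a single global setting; a sketch using standard IOS commands (not taken from the thread itself):

! Check the current EtherChannel hash method
show etherchannel load-balance
!
! Move from src-dst-mac to src-dst-ip so routed flows spread across
! the bundle members (global config; affects all port channels)
configure terminal
 port-channel load-balance src-dst-ip
end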

PS:

From what you've described, I doubt this alone is the cause of the poor network performance.
