Cisco Collapsed Core - Network Issues

Nicholas Beard
Level 1

Hi guys,

I have been tasked with auditing an existing Cisco infrastructure over the last few days, as the customer is experiencing issues with the network.  Before I discuss the current configuration and topology, please see below the list of issues being experienced:

1.  DHCP requests appear to time out before eventually succeeding, leading to users being unable to log in and a message confirming that the Domain Controller for the domain is not available.

2.  Desktops experience frequent network dropouts at random times.

3.  Outlook and Exchange connectivity drops intermittently.

4.  Users are frequently asked to re-enter their credentials at certain points of the day to re-authenticate their Outlook and intranet sessions.

Please bear with me on this, as it is an extremely complicated network which has been very poorly configured.  I will do my best to describe how the network is interconnected and what the exact topology is.  There was recently a move towards segmenting the network from a single flat VLAN (1) into multiple VLANs/segments.  This has since been abandoned, leaving the network in a further complicated state with VLANs 1 and 3 appearing to cross over each other.

Servers

Servers are a mix of physical HP servers and blade servers within a C7000 blade chassis with HP ProCurve 6120XG blade switch interconnects.  The servers run VMware ESXi 4.1 within a multi-cluster VMware vSphere 4.1 environment.  Although not specifically relevant, there are approximately 200 VMs set up and live across a mix of Citrix XenServer and Microsoft platforms.  Citrix XenServer is primarily used for provisioning desktops as a VDI technology, as the company is migrating to thin clients.

The Distribution/Core Layer

The core comprises two stacked Cisco 3750G switches.  The stack has 18 SVIs configured for inter-VLAN routing and uses EIGRP to route dynamically between networks.  The network runs a public IP address schema internally on a /16 network; as mentioned previously, there was an attempt to break away from this which has since been abandoned, so the flat VLAN on a /16 network is still present.  There are approximately 23 user-specified VLANs configured.  All switch ports are configured as trunk ports with the native VLAN left at the default (1).  The VLANs are as follows: 1, 2, 3, 4, 10, 20, 25, 30, 50, 51, 99, 108, 109, 110, 116, 117, 118, 119, 120, 121, 122, 123, 124.  This stack is configured as the spanning tree root for VLANs 2-3 and 108-123 and runs only PVST (NOT rapid-pvst).
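For illustration, the core settings described would look roughly like the following (a hypothetical reconstruction; the interface number and priority value are invented, not taken from the audit):

spanning-tree mode pvst
! Root bridge for VLANs 2-3 and 108-123
spanning-tree vlan 2-3,108-123 priority 4096
!
interface GigabitEthernet1/0/1
 ! 3750 trunks need the encapsulation set before trunk mode
 switchport trunk encapsulation dot1q
 switchport mode trunk
 ! No "switchport trunk native vlan" line, so untagged frames fall into VLAN 1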

Blade Server Access Layer

As mentioned above, the blade chassis are interconnected with an HP ProCurve 6120XG blade switch and uplink directly to the Cisco 4510R switch mentioned in the next section.  The ProCurve switch has three VLANs configured (1, 124 and 106), with VLAN 1 being the untagged/native VLAN.

The Server Access Layer

The server access layer consists of a Cisco 4510R switch with redundant supervisors and 3 x 48-port line cards.  The switch has 11 SVIs configured for inter-VLAN routing and uses EIGRP to route dynamically between networks.  All servers connect to this switch via 4 Gbps EtherChannels with a mix of "switchport mode access" plus "switchport access vlan 3" or "switchport mode trunk" configurations.  This is the first instance where the VLAN 1 and VLAN 3 crossover begins to occur: the port channels of the servers running in trunk mode do not specify a native VLAN and therefore default to VLAN 1, while the port channels of the servers running in access mode do specify VLAN 3.  Finally, there is a dual-gigabit port channel to the core 3750 stack, again set up as a trunk without a native VLAN specified, therefore defaulting to 1.  The configuration matches at the core end of the uplink, with the port channel set up as a trunk with no native VLAN specified, so defaulting to 1.

The VLANs on this switch are as follows: 1, 2, 3, 4, 10, 20, 25, 30, 50, 51, 99, 100, 101, 102, 103, 106, 124, 125, 126, 127.  This switch is configured as the spanning tree root for VLANs 100-103 and 124-127.
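To make the crossover concrete, the two server-facing port-channel styles described above would look roughly like this (channel numbers are hypothetical):

! Style 1: access-mode server EtherChannel, pinned to VLAN 3
interface Port-channel10
 switchport mode access
 switchport access vlan 3
!
! Style 2: trunk-mode server EtherChannel with no native VLAN set,
! so the server's untagged traffic lands in VLAN 1
interface Port-channel11
 switchport mode trunk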

The Desktop Access Layer

The desktop access layer comprises a mix of Cisco 2950, 2960 and 3560 switches.  There is no Layer 3 activity on these switches, and all of the VLANs listed above span every access switch.  Each access switch uplinks to the Cisco 3750G stack via a single 1 Gbps uplink configured as a trunk port; some prune VLANs from the trunk and some allow all VLANs.  Some of the trunk links specify "switchport trunk native vlan 3" and some are simply left as "switchport mode trunk", leaving the native VLAN at the default of 1.  The majority of ports are configured as trunk ports, some with native VLAN 3 and some left with the default of VLAN 1; the remaining ports are configured as access ports, again some in VLAN 3 and some in VLAN 1.
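A hypothetical side-by-side of the two uplink styles just described (interface numbers invented):

! Uplink style A: native VLAN explicitly moved to 3
interface GigabitEthernet0/1
 switchport mode trunk
 switchport trunk native vlan 3
!
! Uplink style B: native VLAN left at the default of 1
interface GigabitEthernet0/2
 switchport mode trunk
!
! With both styles in play, untagged frames land in VLAN 3 on some links
! and VLAN 1 on others; CDP will typically log native VLAN mismatches.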

Many of the access switches are experiencing output discards and receive errors on a multitude of ports, indicating a large volume of traffic filling the switch port buffers and subsequently causing the switches to drop frames.  There is also a group of around 15 switches averaging CPU usage of 50% and over across a 72-hour period, diagnosed with the "show proc cpu history" command.  The individual process responsible for the highest CPU usage during this time is the "HULC LED Process", and during troubleshooting I did notice several interfaces flapping up and down.  This was, however, extremely rare, and in most instances the ports were fine.

One interesting note came from the IOS software versions: the switches experiencing the high CPU usage were ALL running version 12.2(55)SE3 or greater, while the switches not experiencing this issue were running version 12.2(35)SE5 or lower.

VOIP

Finally, there are VoIP phones present on the network with desktops plugged into the back of them, which may explain why the vast majority of switch ports are configured as trunks; the voice VLAN is then tagged from the phone.  QoS is set up on all switches, but I have not spent much time investigating it and I am suspicious as to whether it is configured correctly.  There are currently no issues with VoIP phone performance.
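For reference, the usual Cisco approach for a phone-with-desktop port is an access port with a voice VLAN rather than a full trunk; a minimal sketch, with hypothetical VLAN numbers:

interface FastEthernet0/10
 switchport mode access
 switchport access vlan 10    ! data VLAN for the desktop (example number)
 switchport voice vlan 50     ! the phone tags voice traffic into this VLAN
 spanning-tree portfast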

Routing

As mentioned previously, all Layer 3 switches use EIGRP to exchange routing topology.  Desktops are configured with the 3750 core VLAN 3 SVI as their default gateway, while servers are configured with the 4510R server access VLAN 3 SVI as their default gateway.  Although VLAN 1 and VLAN 3 are separate at Layer 2, the subnetting and routing place both within the same /16 range, and they therefore overlap.
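A minimal sketch of the gateway/EIGRP arrangement described, with invented 172.16.0.0/16 addressing standing in for the public range and an assumed EIGRP AS number:

! On the 3750 core stack (desktops point at this SVI)
interface Vlan3
 ip address 172.16.0.1 255.255.0.0    ! placeholder for the public /16
!
router eigrp 100                      ! AS number assumed
 network 172.16.0.0 0.0.255.255
 no auto-summary
!
! Hosts in VLAN 1 draw addresses from the same /16, so the L3 subnet
! boundary no longer matches the L2 VLAN boundary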

Conclusion

What I plan to recommend is to continue segmenting the network into a structured VLAN design using a subnetted 172.16.0.0/16 range.  As far as the problems go, I fully believe the sheer size of VLANs 1 and 3 and the misconfiguration of access and trunk ports are causing the majority of the issues, along with the facts that the VLANs span all access switches, spanning tree is in PVST mode (causing slow topology convergence), and the voice VLANs are not configured appropriately.  I would also recommend moving to a proper three-tier switching infrastructure with the 4510R configured as the core, the 3750 stack as the distribution and the remaining switches as the access layer.

It would be greatly appreciated if anybody else could read through this and confirm any other worries or insights they may have (if any further information is required, or if I have left anything glaring out of the write-up, please let me know).  This would help me thoroughly present my findings to the customer and also leave me with the peace of mind that it has been peer reviewed by experts in the community.

Many Thanks

6 Replies

Jon Marshall
Hall of Fame

Nicholas

Firstly, if VoIP is running okay but there are intermittent timeouts with normal apps, then either they are very lucky or, I suspect, QoS has been set up with VoIP in mind, perhaps to the detriment of the other apps.

This reminds me of a thread you posted a while back about the 3-tier architecture.  I agree totally that you need to segment the network with VLANs; this is, as you say, probably the biggest cause of the problems.  And yes, plan to move from public IP space to private IP space.  Don't underestimate how time consuming this can be, though.  Clients are easy, but you need to understand all the apps on the network, as there may be some that use hardcoded IP addresses, and changing the IP address on a server could break that communication.  Also bear in mind that some servers may need L2 adjacency, so be careful if you plan to segment the servers into VLANs as well.

Where I would disagree is the 3-tier architecture proposal.  As I said in the previous thread, although you obviously didn't agree, a 3-tier architecture makes sense in a campus environment, i.e. where the core is used to interconnect the separate buildings.  But in a single building, what do you gain from this?  For example:

1) Where do you propose to connect the servers?  Certainly not the core in the Cisco design.  So you have moved the 4500 to the core and now you have to connect the servers to something else.

2) What would you send over the core?  If you are in a single building you could:

i) connect the servers to dedicated switches which then connect to the core, in which case you have merely introduced another hop to get from clients to servers; or

ii) connect the servers to dedicated switches which connect to the distribution switches, in which case what is the purpose of the core layer, since all your inter-VLAN routing will take place at the distribution layer?

It's not that having a core is wrong; it can be very useful, and if this client has multiple sites to interconnect with MAN connections etc. then it becomes vital.  But I can't see where you are proposing to connect the servers if you move the 4500 to the core.

If you could outline what advantages you think you would get from a 3-tier setup, it may be that I am missing something.

On the subject of the 4500: have you checked for oversubscription on this switch?  It all depends on the supervisor/modules/chassis you have, but a 4500 with a SupV, for example, is limited to 6 Gbps per slot, which can be extremely limiting.  You may need to recommend an upgrade for these switches.

Other recommendations:

1) Limit VLANs to specific switches if possible.

2) Use Rapid STP, assuming all switches support it.

3) Potentially downgrade the IOS on your access-layer switches if needed (assuming you don't need features available only in the later releases).

4) The usual L2 housekeeping: don't use VLAN 1, make the native VLAN an unused VLAN and don't create an L3 SVI for it, and use a separate VLAN for management of the switches (see the sketch below).
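A minimal sketch of points 2 and 4, with example VLAN numbers and a placeholder management address (none of these values come from the thread):

spanning-tree mode rapid-pvst            ! Rapid PVST+, where supported
!
vlan 999
 name UNUSED-NATIVE
vlan 900
 name MGMT
!
interface GigabitEthernet0/1
 switchport mode trunk
 switchport trunk native vlan 999        ! unused VLAN, no SVI created for it
!
interface Vlan900
 ip address 10.0.90.2 255.255.255.0      ! management SVI, placeholder address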

The one strange thing (strange because it shouldn't work that well) is that VoIP seems to be okay.  VoIP packets are very sensitive to delay etc., and since you have clients actually dropping off the network you would think VoIP would also suffer.  I strongly recommend looking at the QoS in more detail to see exactly what has been done with VoIP; it may be that other queues are getting starved of resources.
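On the 3750/3560 platforms, the per-queue drop counters are a reasonable place to start; these are standard IOS show commands, offered as a suggestion rather than taken from the thread:

show mls qos                                 ! is QoS globally enabled?
show mls qos interface gi1/0/1 statistics    ! per-queue enqueue/drop counters
show mls qos maps cos-output-q               ! CoS-to-egress-queue mapping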

That said, even if you only segmented the network, I suspect this would make a big difference.

Jon

Jon,

Thanks once again for your exceptional knowledge in this area and your prompt response.  I must stress that this is a different customer from the one in the previous post, and therefore the requirement for a three-tier core/distribution/access model may indeed prove beneficial; there are numerous outbuildings here with multiple switching cabinets.

Thankfully, I am only in a position to make recommendations and am not performing the overhaul myself, as this could be quite an in-depth and time-consuming process.  You are indeed correct regarding the servers, as there are numerous Microsoft-based servers here that would not react kindly to having their IP address information changed.

What I did not mention previously is the requirement to support multiple (50+) remote sites directly connected to this network over 10 Mbps and 100 Mbps LES circuits.  These currently make use of two Cisco 2821 routers directly connected to the 3750 core switch stack.

With regard to troubleshooting oversubscription of the 4510R switch, I have spent some time checking its CPU and memory usage statistics using commands such as "show proc cpu sorted", "show proc cpu history", "show platform health" and "show int counters errors".  The switch does not indicate any issues occurring at all.  I believe the switch is capable of supporting up to 48 Gbps per slot, but thanks for identifying this, as it is something I will double check.

I think that moving to a strategically segmented Layer 2 network, removing the VLAN sprawl, applying best practice to access and trunk ports, and moving to a private address range will significantly improve their experience.  Once this has been completed, we could more effectively troubleshoot any remaining problems.

Nick

Nick

I think that moving to a strategically segmented Layer 2 network, removing the VLAN sprawl, applying best practice to access and trunk ports, and moving to a private address range will significantly improve their experience.  Once this has been completed, we could more effectively troubleshoot any remaining problems.

Yes, I agree, doing that would be a very good first step.  There may then be further issues, but as you say it would make troubleshooting a lot easier.

You say there are 50+ sites all connected via 10/100 Mbps LES links.  I'm confused, as LES links tend to be point-to-point, so how can 50 sites be terminated on just two 2821 routers?  Alternatively, the remote sites could be using LES circuits to connect to a cloud, but then 50 sites running at either 10 or 100 Mbps could overload the main site links.  What speed links do you have on the 2821 routers?

Is this a future requirement or is it there now?  Can you clarify the current state of the LES links?

Jon


Nicholas Beard wrote:

switch does not indicate any issues occurring at all.  I believe the switch is capable of supporting up to 48 Gbps per slot, but thanks for identifying this, as it is something I will double check.

You need the "right" chassis, supervisor and line cards to obtain 48 Gbps per slot.  Ditto for 24 Gbps per slot.  Otherwise, as Jon mentions, you're limited to 6 Gbps per slot.

PS:

You might also check the hashing algorithm being used for the EtherChannel between the 4500 and the 3750G stack.

Thanks for the response; this is something I am going to confirm today.  With regard to the hashing algorithm, src-dst-mac is currently in use.


With src-dst-mac you might want to check individual link loads.  If you're routing between the core 3750G stack and the 4500, the source and destination MACs should stay the same, and then only one link would be used.  For EtherChannel links between servers and the 4500, here too the MACs on both sides would be the same, and again only one link would be used.

Some of the later switches offer src-dst-ip, which might remediate a load balancing issue.
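On a 3750 stack, for example, the hash method is a single global setting; a sketch using standard IOS commands (not taken from the thread itself):

! Check the current EtherChannel hash method
show etherchannel load-balance
!
! Move from src-dst-mac to src-dst-ip so routed flows spread across
! the bundle members (global config; affects all port channels)
configure terminal
 port-channel load-balance src-dst-ip
end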

PS:

From what you've described, I doubt this alone is the cause of the poor network performance.
