Routing Loop due to high utilization of Router

Muhammad Mubashir Ali Khan · ‎01-01-2013

I've come across a very odd topology to deal with, where everything is connected to everything without proper use of VLANs.

- I have an L3 switch and a 3745 router at the core

- Both carry the same subnet to an L2 distribution switch that connects the server farm within that subnet (i.e. the 1.1.3.x subnet)

- The L2 switch works as a passive switch, so another network, say 1.1.2.x, has been plugged into it as well.

- This L2 switch extends to other switches without any VLAN or STP configuration and distributes the 1.1.3.x network.

Periodically and unexpectedly, the router starts hanging and utilization goes beyond 80%, yet nothing in "sh proc cpu" appears to be eating router resources. It's quite difficult to observe a pattern, as it's random. Does anyone have any idea what's going on and how to go about troubleshooting it?

Nagendra Kumar Nainar · ‎01-02-2013

Hi,

Did I understand your topology correctly: you have a server farm connected to a single L2 switch, which is also connected to an L3 switch and a 3745 router? From your description, it appears that the 3745 or the L3 switch acts as the gateway for devices/servers in the 1.1.3.x network. But which is your gateway for the 1.1.2.x network?

Does your L3 switch act as a pure L3 device (doing only routing), or does it act as an L2/L3 switch?

If the L3 switch is acting as L2/L3, or if you have a cluster of switches in your distribution layer (connecting the server farm and the router/L3 switch), there is a possibility of an L2 loop, but as far as I understand, I don't think there will be any intermittent routing loop. What routing protocol are you using between the L3 switch and the router?

It is not a good design to have different subnets in the same VLAN. If they are on a single switch, it will not cause any loops. But if multiple switches are involved, my suggestion would be to redesign with VLAN segregation.

-Nagendra

Muhammad Mubashir Ali Khan · ‎01-02-2013

Nagendra,

Many thanks for your reply. I will answer your post point by point.

1. You are right, my server farm is connected in this fashion, as depicted in the diagram attached. The gateway for the server farm is 1.1.3.240 (the 3745). For 1.1.2.X, the 3745 is again the gateway.

2. My L3 switch acts as both L2 and L3.

3. I'm not sure whether it's a loop or just high CPU utilization. What I can tell you is that you are right: we have a cluster of switches here that are running in passive mode. I'm using EIGRP all over my enterprise.

My take on it is that maybe STP is creating this chaos. Every once in a while, when we perform some activity on the 1.1.3.X network, say rebooting a machine or installing another server, everything goes berserk. It literally seems like a broadcast storm.

One more thing: we have a server, 1.1.3.234, with gateway 1.1.3.240, attached to an L2 switch in the datacenter. This L2 switch is connected to our meshed L2 switches, which are connected to both the L3 switch and the 3745. On the datacenter L2 switch, the port this server is connected to flaps amber continuously. I've hardcoded the parameters, but it still does so. Maybe this is the cause? Suggestions?

Joseph W. Doherty · ‎01-02-2013

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

It doesn't take much LAN traffic (e.g. single 100 Mbps duplex) to max out a 3745. Are you positive it's not just traffic loading? Is the high CPU loading mostly "interrupt" usage?
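To put a rough number on that traffic load, here is a minimal Python sketch of the theoretical packet rate of a 100 Mbps full-duplex Ethernet link. It is only back-of-the-envelope arithmetic: the function name `max_pps` is made up for the example, and it says nothing about the 3745's actual forwarding capacity, which should be checked against Cisco's published performance figures for the platform.

```python
# Ethernet per-frame overhead on the wire: 7-byte preamble + 1-byte start
# delimiter + 12-byte inter-frame gap = 20 bytes in addition to the frame.
WIRE_OVERHEAD_BYTES = 20


def max_pps(link_bps, frame_bytes):
    """Theoretical maximum frames per second in one direction (hypothetical helper)."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_per_frame


# Minimum, mid-size and maximum standard Ethernet frame sizes.
for size in (64, 512, 1518):
    one_way = max_pps(100_000_000, size)
    print(f"{size:>4}-byte frames: {one_way:,.0f} pps per direction, "
          f"{2 * one_way:,.0f} pps full duplex")
```

Small frames are the worst case for a software-forwarding router, since the cost is per packet rather than per byte.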

Muhammad Mubashir Ali Khan · ‎01-02-2013

Joseph,

Thanks a lot for your post. Correct, I'm not counting LAN traffic as a culprit. I believe interrupt usage can be seen by issuing the "show processes cpu" command. There's absolutely no utilization in the mentioned columns; the highest was in decimals, if I remember correctly. But overall, the router was bogged down with a staggering 85% utilization.

Joseph W. Doherty · ‎01-03-2013

So, to confirm, you see something like:

CPU utilization for five seconds: 84%/81%; one minute: 61%; five minutes: 61%

Interrupt CPU is the second percentage value for five seconds. Process CPU is the first percentage value less the second. If the delta is only 2 or 3%, the router is spending almost all of that CPU forwarding packets. A high interrupt value reflects a high traffic load relative to the capacity of the device.

If you see something like:

CPU utilization for five seconds: 84%/11%; one minute: 61%; five minutes: 61%

The large delta shows heavy process CPU, and what's using it should be visible in the per-process usage statistics that follow (for the same command). Depending on which process(es) are consuming the CPU, the high usage might be mitigated.
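The split described above can be sketched in a few lines of Python. This is only an illustration, not a Cisco tool: the function name `split_cpu` and the "interrupt" / "process" labels are made up for the example, and the input is assumed to be the header line of "show processes cpu" in the format quoted above.

```python
import re

# Matches the header line of "show processes cpu", e.g.
# "CPU utilization for five seconds: 84%/81%; one minute: 61%; five minutes: 61%"
CPU_LINE = re.compile(r"five seconds:\s*(\d+)%/(\d+)%")


def split_cpu(line):
    """Return (total, interrupt, process, dominant) for the five-second sample.

    Hypothetical helper: 'dominant' is simply whichever share is larger.
    """
    match = CPU_LINE.search(line)
    if match is None:
        raise ValueError("not a 'show processes cpu' header line")
    total, interrupt = int(match.group(1)), int(match.group(2))
    process = total - interrupt  # process CPU = first value less the second
    dominant = "interrupt" if interrupt >= process else "process"
    return total, interrupt, process, dominant


samples = (
    "CPU utilization for five seconds: 84%/81%; one minute: 61%; five minutes: 61%",
    "CPU utilization for five seconds: 84%/11%; one minute: 61%; five minutes: 61%",
)
for sample in samples:
    print(split_cpu(sample))
```

For the first sample the process share is 3%, pointing at traffic load; for the second it is 73%, so the per-process rows are the place to look.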