Please Put All Your Eggs in One Basket!

bolick · ‎03-11-2019

Way back in the early 2000s, we noticed a trend in branch routing. Customers started asking for a “branch in a box.” The problem they were having was that the complexity of networks of the day required lots of stuff[1]. They needed separate devices for routing, switching, VoIP, firewalling, WAN optimizing, app serving and lots of other stuff. All of these devices took up space, added complexity and, sometimes most importantly, added noise to the office.

Like any good company interested in selling lots of stuff (er, serving our customers’ needs), Cisco responded with the 2600/3600 and later the ISR series of products. ISR stands for Integrated Services Router, which is a perfect name for what this device did[2]. The ISR could integrate routing, switching, call control, firewall, application hosting, WAAS and even charge your cell phone[3]. Naturally, the naysayers[4] immediately responded with cries of “too many eggs in one basket” and “your branch will fail and you’ll have to move in with your parents if you do this!” While that objection might feel intuitive, it shows a fundamental misunderstanding of reliability statistics[5]. Let me show you why putting everything into a single device is actually better for the network than using separate dedicated devices.

First, let’s think about what we mean by a “critical component” of the network. Something is critical if its failure causes a significant or complete reduction in functionality in the network. The LAN switch that your devices connect to is a critical component. The same is true for the WAN router, but it may or may not be true for the WAN optimizer, app server, call control or other service if the branch can still function, possibly with a backup service in the cloud. Let’s assume that we have 5 of these “critical” components in a typical branch office: switch, server, router, WAN optimizer, and firewall. If any of these fail we kick the customers out and send the staff home. Each of these devices has a reliability associated with it: the probability that it gets through a given year without failing. Perfect reliability would be 1.0, and reliability of 0.8 in a given year is good for a complex piece of compute hardware with power supplies and fans, so let’s use that for all devices[6]. Because the branch only works when every one of these devices works, this is known as serial reliability, and the reliability of the entire branch is the reliability of all devices multiplied together. In this example that would be 0.8⁵ or 0.33 for the system. That means the annual failure rate for the branch is 67%!
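For anyone who wants to check the arithmetic, here is a minimal Python sketch of the serial calculation. The function name `series_reliability` is just an illustration, and the 0.8 figure is the same made-up number used above.

```python
from math import prod

def series_reliability(reliabilities):
    """Reliability of a chain in which every device must work."""
    return prod(reliabilities)

# Five critical devices, each assumed to be 0.8 reliable over a year
branch = series_reliability([0.8] * 5)

print(f"branch reliability:  {branch:.2f}")      # 0.33
print(f"annual failure rate: {1 - branch:.0%}")  # 67%
```

Notice that every extra device in the list can only pull the product down, never up.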

So how can we get this number higher? Well, if we take a look at the reliability of components in most of these devices, it’s pretty obvious what causes most failures: fans and power supplies[7]. Here’s a look at the Mean Time Between Failures (MTBF), annual failure rate, and uptime for common components in any system. As you can see, the things we can actually control when building a device are largely the power supply and fan reliability, including adding redundant power supplies and fans.

Component	MTBF (hours unless given in years; higher is better)	Annual Failure Rate	Uptime (assumes 24 hrs to repair)
ECC Memory	~455 years	0.22%	>99.999%
CPU	~300 years	0.33%	>99.999%
Hard Drive	730,500	1.2%	99.995%
SSD	137 years	0.73%	99.995%
Power Supply	400,000	2.2%	99.99%
Fans	280,000	3.1%	99.99%
Business WAN	584,400	1.5%	99.995%
Residential WAN	265,636	3.3%	99.99%
Power Main	6,738	130%	99.5%
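The columns in the table are related to each other, and a rough Python sketch shows how. This assumes the annual failure rate is simply hours-per-year divided by MTBF and that uptime is MTBF / (MTBF + repair time); the 8,766-hour year is my assumption (365.25 days), chosen because it appears to be consistent with the table. The function names are hypothetical.

```python
HOURS_PER_YEAR = 8766  # assumption: 365.25 days * 24 hours

def annual_failure_rate(mtbf_hours):
    """Expected failures per year for a given MTBF in hours."""
    return HOURS_PER_YEAR / mtbf_hours

def uptime(mtbf_hours, repair_hours=24):
    """Fraction of time in service, given a fixed time to repair."""
    return mtbf_hours / (mtbf_hours + repair_hours)

# MTBF values (in hours) taken from the table above
for name, mtbf in [("Power Supply", 400_000), ("Fans", 280_000), ("Power Main", 6_738)]:
    print(f"{name}: AFR {annual_failure_rate(mtbf):.1%}, uptime {uptime(mtbf):.3%}")
```

A rate above 100%, as with the power main, just means you should expect more than one failure per year. The table rounds its uptime figures, so don’t expect the last digit to match exactly.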

So, if we can eliminate those high-failure-rate components from most of the devices in our branch and add redundancy in the place we’re “integrating” them, it might look something like this. The reliability of 4 of the devices went up because they no longer have their own fan or power supply, while the reliability of the router also went up because it has redundant fans and power supplies[8]. In this scenario reliability improves to 0.62 with an annual failure rate of 38%. While that’s definitely better, it shows that no matter how reliable a device is, the series reliability will decrease when you add it to the network.
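To see why redundant fans and power supplies matter so much, here is a small Python sketch that looks only at those two parts, using the failure rates from the table. It is an illustration under those assumptions, not a reconstruction of the 0.62 figure, since the per-device numbers behind that result aren’t listed in this post. The helper names `series` and `parallel` are mine.

```python
from math import prod

def series(reliabilities):
    """All parts must work."""
    return prod(reliabilities)

def parallel(reliabilities):
    """At least one part must work (the formula in footnote 8)."""
    return 1 - prod(1 - r for r in reliabilities)

FAN = 1 - 0.031  # table: 3.1% annual failure rate
PSU = 1 - 0.022  # table: 2.2% annual failure rate

# One fan and one power supply, both required
single = series([FAN, PSU])

# Two fans and two power supplies, one of each is enough
redundant = series([parallel([FAN, FAN]), parallel([PSU, PSU])])

print(f"single fan + PSU:    {single:.4f}")
print(f"redundant fan + PSU: {redundant:.4f}")
```

With these numbers, the fan-and-power portion goes from roughly a 5% chance of failing in a year to well under 1%.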

See? Putting all your eggs in one basket is a really good thing. This isn’t rocket science. Data centers and cloud providers figured all this out a long time ago. In the early days of server virtualization there were plenty of detractors worried about replacing multiple physical servers with a single hypervisor. Today it is assumed that putting all of your VMs in a single, reliable server or cloud provider is going to improve reliability rather than reduce it. The same is true for critical network services.

What if that level of reliability isn’t good enough for you? That’s where network design can really help you out by using redundant physical hardware. Statistically, a parallel arrangement, where the branch stays up as long as either system is working, is a much better option than a serial one. If you assume two redundant integrated systems of 0.62 reliability, the parallel reliability is 0.85 with an annual failure rate of only 15%! Of course, there’s additional cost and complexity involved in a dual router setup, but if availability of the network is important, nothing beats redundant hardware.
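Here is a quick Python sketch of that parallel calculation, extended to show what a third or fourth redundant system would buy you. The function name `parallel_reliability` is hypothetical, and 0.62 is the rounded figure from above. Plugging the rounded 0.62 into the formula gives about 0.856, so the 0.85 quoted above presumably comes down to rounding.

```python
def parallel_reliability(reliabilities):
    """Probability that at least one of the redundant systems survives."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1 - r)
    return 1 - all_fail

pair = parallel_reliability([0.62, 0.62])
print(f"two systems: {pair:.3f}")

# Diminishing returns as more identical systems are added
for n in range(1, 5):
    r = parallel_reliability([0.62] * n)
    print(f"{n} system(s): reliability {r:.3f}, annual failure rate {1 - r:.1%}")
```

Each added system helps less than the one before it, which is why two is the usual stopping point.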

Disclaimer: The actual numbers used in this blog post are for demonstration purposes only and were largely pulled straight from the south end of a north-bound burro. PLEASE do not use them for any real-world planning unless your resumé is in tip-top shape.

[1] “Lots of stuff” is a highly technical quantity greater than 2 but less than “tons of stuff.”

[2] Rejected names were Branch Universal Gateway (BUG), Services Unifying Controller (SUC), and Branch Unifying Terminal (BUT). (Guess which one of those was real.)

[3] Seriously. Why did you think we put 2 USB ports on there?

[4] Mostly companies that are now out of business, funnily enough.

[5] Or a full understanding of competitive marketing.

[6] This number is just to demonstrate a point. Cisco devices (especially ISRs) typically have much greater reliability but those numbers are secret and stuff.

[7] Power mains and WAN circuits are both WAY less reliable than anything here but adding dual WAN circuits and diverse paths along with backup power is beyond the space they let me have for a single blog.

[8] This is known as parallel redundancy, and the combined reliability is equal to: 1 − [(1 − R₁)(1 − R₂)…(1 − Rₙ)]. It’s the same whether you’re talking about redundant fans, redundant switches, or redundant routers.

Alan Ng'ethe · ‎03-14-2019

Thanks for the article.

[2] I wonder what the rationale was behind being able to charge a phone via an ISR router.

[7] That has been my experience, too.

bolick · ‎03-15-2019

The phone charger feature was an added benefit to adding USB ports to the router. We didn't build it that way, but it does come in handy from time to time. ;-)

Thanks for the feedback!

Matt
