Understanding published MTBF numbers

stuartkendrick
Level 1

Does anyone have insights into interpreting the per-platform MTBF numbers which Cisco publishes?

Model          MTBF (Hours)
WS-C4503-E     1,064,279
WS-C4506-E     710,119
WS-C4507R+E    248,630
WS-C4510R+E    179,714

The way I interpret this: if I load all (4) of these boxes full of line cards (or, in the case of the 4507 and 4510, perhaps with line cards plus a redundant Sup) and run them for years, I can expect hardware failures to show up roughly six times more frequently on the WS-C4510R+E than on the WS-C4503-E. And I assume this rate is driven by a combination of the number of ports and Sup cards (the more components in a box, the higher the likelihood that something will fail).

I suppose an alternate interpretation would involve the chassis / backplane itself -- perhaps this fails more frequently on the Cat4510 than on the Cat4503. Or perhaps the rates are driven by the power supplies, perhaps these fail at a higher rate on the C4510.

Does anyone have insight into this?

Moving to the stackable lines, I see for example that the Cat 2960X-48FPD-L is associated with an MTBF of 233,370 hours.

So, if I want to compare the MTBF for an IDF serviced by:
- (1) Cat4510 equipped with (8) line cards
- (8) Cat2960X stacked together
would I say:
Cat4510 179,714 hours
Cat2960X 29,171 hours

[The second number derived as follows: 233,370 hours per switch / 8 switches = 29,171 hours.]
Of course, the Cat2960X calculation does not include the (8) stacking modules nor the (9) stacking cables, so that's a gap.

?

--sk

9 Replies

Joseph W. Doherty
Hall of Fame

I believe the computation for actual failure impact is a bit more complicated than you're figuring, as the chassis is only one component.

If there's a chain of devices, you need to compute for the whole chain.  In general, if items are in series (for a potential failure), system MTBF decreases.  If items are in parallel, MTBF increases.

Although you're asking about 4500s, this whitepaper, http://www.cisco.com/c/en/us/products/collateral/switches/catalyst-6500-series-switches/prod_white_paper0900aecd801c5cd7.pdf, for 6500s, might help you figure actual device MTBF.  (NB: the whitepaper doesn't show two ingress cards for full single-chassis redundancy; if it did, its availability would be better than what's shown.  Also note, the three example configurations use different values for SW upgrade MTTR.)
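
As a rough sketch of those series / parallel rules in a few lines of Python (my own illustration, not from the whitepaper; it assumes constant, i.e. exponential, failure rates, and identical non-repairable components for the parallel case):

def series_mtbf(mtbf_hours):
    # System fails when ANY component fails: the failure rates add up.
    total_lambda = sum(1.0 / m for m in mtbf_hours)
    return 1.0 / total_lambda

def parallel_mtbf(mtbf_single, n):
    # System fails only when ALL n identical components have failed.
    # Mean of the maximum of n exponential lifetimes: MTBF * (1 + 1/2 + ... + 1/n)
    return mtbf_single * sum(1.0 / k for k in range(1, n + 1))

print(series_mtbf([233370] * 8))   # ~29,171 h: eight members treated as a chain
print(parallel_mtbf(233370, 2))    # ~350,055 h: a fully redundant pair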

OK, I've read the Cat6500 white paper, which sketches out how hard this is.

I also read up on calculating MTBF with multiple components:

http://answers.google.com/answers/threadview?id=390140

And now I'm ready to try my hand at this.

Recall:

233,370 hours:  MTBF for Catalyst 2960X-48FPD-L
http://www.cisco.com/c/en/us/products/collateral/switches/catalyst-2960-x-series-switches/data_sheet_c78-728232.html
•    Assume that this figure covers a stand-alone Cat2960X, for a total port count of (48).

179.714 hours:  MTBF for Catalyst WS-C4510R+E:  
http://www.cisco.com/c/en/us/products/collateral/switches/catalyst-4500-series-switches/product_data_sheet0900aecd801792b1.html
•    Assume that this figure covers a Cat4510R equipped with redundant power supplies and (8) line cards, for a total port count of (384).

Lambda 1 = 1,000,000 / 179.714 = 5564
Lambda 2 = 1,000,000 / 179.714 = 5564
Lambda 3 = 1,000,000 / 179.714 = 5564
Lambda 4 = 1,000,000 / 179.714 = 5564
Lambda 5 = 1,000,000 / 179.714 = 5564
Lambda 6 = 1,000,000 / 179.714 = 5564
Lambda 7 = 1,000,000 / 179.714 = 5564
Lambda 8 = 1,000,000 / 179.714 = 5564

Lambda Stack = 5564 * 8 = 44,512
MTBF Stack = 1,000,000 / 44,512 = 22 hours

Since this calculation ignores Stacking Cables and Stacking Modules, the actual figure would be much lower.

Now, stepping back for a moment – I don’t believe this answer.  I don’t believe that an (8) Member Catalyst 2960X Stack will, typically, experience a hardware failure every 22 hours.

So, clearly, I'm fumbling something.  Has anyone else tackled this problem and can offer insights as to where I'm derailing?

--sk

Yes, I believe your stack MTBF issue is, you're treating each stack member as a serial component whose failure takes down the whole stack.  If a stack member fails, the whole stack does not fail.

OK, my first error was typing a decimal point instead of a comma ... that threw off the arithmetic by a factor of a thousand.  And I swapped the Cat2960X MTBF with the Cat4500 MTBF.

Here is a revised calculation:

Lambda 1 = 1,000,000 / 233,370 = 4.3
Lambda 2 = 1,000,000 / 233,370 = 4.3
Lambda 3 = 1,000,000 / 233,370 = 4.3
Lambda 4 = 1,000,000 / 233,370 = 4.3
Lambda 5 = 1,000,000 / 233,370 = 4.3
Lambda 6 = 1,000,000 / 233,370 = 4.3
Lambda 7 = 1,000,000 / 233,370 = 4.3
Lambda 8 = 1,000,000 / 233,370 = 4.3

 

Lambda Stack = 4.3 * 8 = 34.4

MTBF Stack = 1,000,000 / 34.4 = 29,070 hours

And to your point, this approach assumes that the failure of any Member takes out the entire Stack ... I suspect that this is happening on one of my Stacks now ... that a failing Member starts a cascade of events which results in the entire Stack hanging ... but generally this is not true:  generally, a single Member can fail without impacting the rest of the Stack.

So another approach would be to simply take the MTBF of a single Member -- 233,370 hours -- and divide by (8) (eight Members in a Stack):

233,370 / 8 = 29,171 hours

Perhaps by chance, this number is awfully similar to the previous approach.

Both approaches ignore the influence of Stacking Cables & Modules.
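
Just to sanity-check that arithmetic in Python (my own sketch; failure rates expressed in failures per million hours, as above, and every Member assumed necessary for the Stack to count as "up"):

member_mtbf = 233370                  # hours, Cat2960X-48FPD-L
lam_member = 1000000 / member_mtbf    # ~4.285 failures per million hours
lam_stack = 8 * lam_member            # ~34.3 failures per million hours
stack_mtbf = 1000000 / lam_stack      # ~29,171 hours
print(lam_member, lam_stack, stack_mtbf)

# Equivalent shortcut: member_mtbf / 8 = 29,171 hours -- with identical members
# in series the two approaches always agree, so the match above is not chance.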

And then, for comparison, I propose that the MTBF of a fully-loaded Cat4510 (eight line cards) would be ~179,000 hours.

So, comparing a (384) port Cat2960X IDF with a Cat4510 IDF:

~29,000 hours MTBF vs ~179,000 hours MTBF

With the Cat2960X MTBF likely lower than that figure, on account of the missing Stacking hardware.

This seems at least plausible.  But I realize this is a complex space, and I'm naive to it.  Have I stumbled somewhere?

--sk

I believe you're not yet on the right track for a stack calculation.

On a stack (assuming it works correctly - i.e. one stack member shouldn't take down the whole stack), additional stack members should increase MTBF for the stack as a whole (especially useful if you connect to two [or more] stack members).

However, as a stack may not function correctly if it loses two stack members (that could partition the stack), you might compute a stack's MTBF treating just two stack members as redundant components, provided the stack has two or more members.
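
One hedged way to put numbers on that "fails when it loses two members" model, assuming identical members, constant failure rates, and no repair (my own sketch, not a Cisco method):

member_mtbf = 233370                 # hours
lam = 1.0 / member_mtbf
n = 8

time_to_first = 1.0 / (n * lam)              # ~29,171 h until some member fails
time_to_second = 1.0 / ((n - 1) * lam)       # ~33,338 h more until a second fails
stack_mtbf = time_to_first + time_to_second  # ~62,510 h until the stack loses two
print(stack_mtbf)

# Modeling just two members as a simple redundant pair instead gives
# 1.5 * member_mtbf ~= 350,055 h; factor in prompt replacement of the first
# failed member and the number climbs far higher still.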

I appreciate your helping me think through this, Joseph.

- Seems to me that one way of thinking about MTBF is "how long can I run until the first hardware element fails".  So, for example, how long can I run until the first power supply fails (I'm picking power supply since, anecdotally, I would claim that fans / power supplies are the things which fail most frequently).  From that perspective, then I claim my ~29,000 hours calculation is on the right track:  the more power supplies / fans I own, the more likely that at least one of them will fail on any given day.

- I propose that the way you are thinking about MTBF is:  "how long can I run until the entire Stack is down?"  And yes, I can see how adding Members to a Stack might *increase* MTBF -- as each Member fails, another Member takes over the Master role, keeping the Stack functioning. In fact, in some sense, the Stack is still up & running just fine even if (7) of its (8) Members are down.  [Although a user attached to a failed Member might struggle to agree with this perspective.]

- My mgmt cares about service disruption to end-users; they are less comforted by my claim of "Well, most folks are still running fine; it's just the (48) users attached to the failed Member who are out of action" and more focused on "How do we get those (48) users up & running?"  So I am focused on Service Disruption ... i.e. "MTTSD", Mean Time To Service Disruption.

- For the Cat2960X, failure of a power supply means service disruption, so again, that ~29,000 figure seems to me to be on the right track.  For most other platforms, equipped with redundant power supplies ... I would then say ... well, MTBF to the failure of the first power supply ... isn't particularly interesting.  In fact, at that point, I would only care if *both* power supplies were down at the same time, causing service disruption.

- Still, given my perspective of wanting to calculate MTTSD, I see several challenges that at this moment I do not know how to surmount:

(a) Does that C4K 179,000 number represent MTBF including power supplies?  Or MTTSD (the chance of a failed power supply causing a Service Disruption being fairly small, given a dual-power supply chassis)?

(b) How to include the effect of failing Stacking Cables & Modules?  I suspect that the Cat2960X 233,000 hour figure does *not* include this influence?

(c) How to include my own experience ... in my experience, entire Stacks are severely degraded by the (intermittent) failure of a single element ... typically a failing Member ... I'm working a TAC case right now where we suspect intermittently failing RAM / TCAM in one of the Members ... sometimes a failing Stacking Cable or Module.  Is my experience just an outlier (I started managing network gear in 1991), in which case, we ignore this contribution.  Or is my experience fairly typical, in which case, I very much want to include the effect that the more Members one adds to the Stack, the more likely that the entire Stack will suffer consequences when a single element degrades.

How would you push back on this thinking?

--sk

Yes, we've been treating "the stack" as the element we're trying to compute an MTBF for, but you're correct, an individual stack member failure may impact devices that connect to it, though so would a line card failure on a chassis.  (Oh, and not too long ago, I had half of the ports of a line card fail - suspect the hardware controlling that part of the line card failed.)

I agree, power supplies are often the more failure-prone components, and so a chassis that can run off one of two power supplies is more failure-proof than a stack member which has only one power supply.  However, to address that, "better" stack devices often support two power supplies per stack member.  Another "improvement" is stack members that can share power between stack members.  Such improvements try to improve MTBF of low-end switches relative to high-end switch chassis, but low end is just that.  Usually the only major advantage a stack of something like 2960Xs offers over a 4500 chassis is lower cost.

well, MTBF to the failure of the first power supply ... isn't particularly interesting.  In fact, at that point, I would only care if *both* power supplies were down at the same time, causing service disruption.

Actually the MTBF for the failure of a power supply is very interesting.  Assuming there are two, each has the same value, and they are redundant, you can then calculate the expected MTBF for the pair, which, as you also say, is what you're interested in, as that is what would cause a service disruption.
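
A hedged sketch of that pair calculation, in Python.  The per-supply MTBF and the MTTR below are made-up placeholders (Cisco doesn't publish a per-PS MTBF here), and the formula is the standard approximation for a repairable redundant pair:

ps_mtbf = 500000     # hours -- hypothetical, NOT a published Cisco number
mttr = 24            # hours to swap a failed supply -- also an assumption

pair_mtbf = ps_mtbf ** 2 / (2 * mttr)   # ~5.2 billion hours
print(pair_mtbf)

# In other words, as long as a failed supply actually gets replaced promptly,
# losing both supplies of one chassis at the same time is a very rare event.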

(a) Yes, the likelihood of overall device failure due to failure of both redundant power supplies is very low, assuming you also replace a failed PS ASAP.

(b) Don't know, unless Cisco can tell us what their MTBF is, or whether it has already been folded into the device's quoted MTBF.

(c) Yea, again, that's one of the not well documented differences between low end platforms and more expensive platforms.

Of course, when dealing with MTBF, while trying to protect your users, remember these numbers are engineering "guesses" based on statistics.

Unfortunately, even if you chose one device with an MTBF that is 10x better than another device, it doesn't really guarantee the 10x better device won't actually fail before the other.

In other words, as push back on your thinking: don't worry so much about MTBF being precise; worry more about what can cause a stoppage, and how, by design, you can minimize it happening and/or mitigate its impact.

For example, say we need 40 user ports.  We could connect all of them to one 48 port switch, which subjects all those users to the MTBF of one switch.  However, if we connect them to a stack of two 24 port switches, only half the users are impacted if a switch fails.  Further, if we use three 24 port switches, not only would just one third of your users be impacted by a switch member outage, but you can repatch the failed switch's ports to the two other switch members.  I.e. that one third of users might only be impacted for minutes rather than hours or days.
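
A back-of-the-envelope way to see that trade-off in Python (my own sketch; it assumes every box carries the same 233,370 hour MTBF, which 24 port models won't exactly match, and that a failed switch strands only its own users):

HOURS_PER_YEAR = 8766
mtbf = 233370
users = 40

for n_switches in (1, 2, 3):
    failures_per_year = n_switches * HOURS_PER_YEAR / mtbf
    users_per_failure = users / n_switches
    print(n_switches, round(failures_per_year, 3), round(users_per_failure, 1))

# More boxes means more failures per year, but each failure strands fewer
# users; the real wins are the smaller blast radius per incident and the spare
# ports on surviving members you can repatch to within minutes.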

Or, if your 40 users were dual homed to two 48 port stacked switches (assuming the stack worked as it should), your users shouldn't see an outage at all (although they might notice a service degradation).

So again, I'm only suggesting, effort spent determining MTBF to the nth degree might be better devoted to other areas of design.

Ahh, I see.  Yes, you've changed my mind; I *am* interested in MTBF figures for power supplies now (not published for most gear, regrettably).

OK, I'm beginning to glimpse how hard this is.  Minimally, I need to know a lot about how those MTBF guesstimates are calculated ... that's hard information to dig up, as far as I can tell.  Plus, I would need more MTBF figures, minimally around power supplies.

BTW:  The point of the exercise isn't actually improving uptime; it is helping mgmt understand what they win (and lose) as they traverse the cost chain, from Cat2960X to Cat35XX to Cat4500. They don't like the fact that a lot of our time goes to fixing Cat2960X problems; they want us spending our time elsewhere.  As low-tech folks, they have been, quite naturally, focused on acquisition cost; I want to help them make a more nuanced trade-off, between acquisition cost and on-going staff time, as we make decisions on what to buy for the next building.

I think I'll end up doing something like this:

* Stackable approach tends to save you $$ both on acquisition and SmartNet

* But you spend more on operations, as you've bought way more parts.  For example, a chassis-based IDF contains (2) power supplies, while a Stackable-based IDF (given our port densities) contains (8) with the Cat2960X (or (16) with a higher-value stackable).  More parts, more failures.  One can mitigate this by buying a Stackable model which sports dual power supplies ... the power supply failure rate doubles, but the urgency of replacing a failed PS goes down.

* Similarly, trouble-shooting tends to be more complex, consume staff time, and inflict more end-user downtime (we seem to hit these "a single component affects the whole Stack" issues more than our Cisco Account Team believes is typical).  Complex failures on chassis tend to involve the Sup card -- and the fix tends to be fairly simple:  replace the Sup card.  [Although the whole IDF is down during this process.]  Complex failures on Stackables are hard ... there are (8) Members each with their own brains, (8) Stacking Cables, and (8) Stacking Modules, and binary search on an intermittent problem takes a lot of both calendar & staff time to resolve (although typically at least some ports / users are unaffected during this process).

* Perhaps reliability improves as one climbs the value chain in Stackables (Cat2960X ... Cat3650... Cat3850) ... but I don't know how to assess this, so I won't go there.

Joseph -- thanx for helping me think through this.  If you have further input, do let me know; else, I'll ride off to tilt at other windmills.

--sk

Yup, what you say makes sense.  I've often been involved in similar issues concerning using Cisco equipment or brand X.  The latter often has a lower CAPEX, but the OPEX can be more than the CAPEX difference, especially when the business, or major portions of it, are impacted.
