Talk:Myth of the nines

This article was nominated for deletion on June 29, 2006. The result of the discussion was keep.

Inconvenient outages

"Another factor is that computer systems do actually tend to experience outages at the most inconvenient time, when the system is most heavily used." Source? Crid 07:36, 16 December 2005 (UTC)

Murphy's law?
Seriously, though, it's rather obvious that systems under heavy load are more prone to failure than systems under light load, because heavy loads exercise boundaries of system tolerances (they run hotter, consume more memory, use more disk space, process many different inputs that may trigger pathological cases in software, etc.) Furthermore, a system under heavy load is most probably under heavy load because there's a lot of work to be done, so an outage at these times is more inconvenient than an outage at a time when the system is picking its nose.
Per Wikipedia:Guide to writing better articles#State the obvious, we could put a note in to this effect if this line of reasoning is not obvious. JRM · Talk 01:24, 22 December 2005 (UTC)
The insight is more implicit than obvious... The word "inconvenient" is what bugs me. Boolean logic is not concerned with what we feel is tolerable. Murphy's law is an attitude problem. But computer science is inherently technical and seems like black magic to many people. (That's why I got to this page.) So perhaps superstitious rhetoric should be avoided. Anybody who pursues 5 nines for a living is probably all about the real world. Anyway, I trust your judgment. Crid 01:24, 25 December 2005 (UTC)
For completeness, the mention of Murphy's law wasn't supposed to be a rational explanation... I see what you mean, though: the sentence seems to appeal to a sort of mystic intuition about "how these things work", which is really not necessary; there are perfectly objective explanations for why systems fail at those times humans would consider most inconvenient.
What's worse from a writer's point of view, however, is that the fact is irrelevant. It doesn't tie in with anything. It is not "another factor" (of what?), it's just a tidbit about subjective perception of failure. I've removed it on those grounds. JRM · Talk 00:31, 26 December 2005 (UTC)

What it means

delivering its service to the user 99.999% of the time it is needed.

No, no? I believe that 99.999% is a measure of the absolute number of minutes (or whatever) in a year; the phrasing we're using now isn't accurate for systems which must be usable as much as possible when the business is *not* in operation 24/7/365.25. --Baylink 18:13, 11 July 2006 (UTC)
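
For what it's worth, a rough sketch of the difference being pointed at here (the baseline hours below are assumptions chosen purely for illustration): the same 99.999% allows a very different downtime budget depending on whether it is measured against every minute of the year or only against the hours the system is actually needed.

 # Illustrative only: downtime allowed by 99.999% availability,
 # measured against two different assumed baselines.
 availability = 0.99999

 hours_24x7 = 365.25 * 24        # continuous operation, ~8766 hours/year
 hours_business = 52 * 5 * 8     # assumed: 8-hour weekdays only, 2080 hours/year

 for label, hours in [("24/7", hours_24x7), ("business hours", hours_business)]:
     allowed_minutes = hours * 60 * (1 - availability)
     print(label, round(allowed_minutes, 2), "minutes of downtime per year")
 # prints roughly 5.26 minutes for 24/7 and 1.25 minutes for business hours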

Planned vs Unplanned downtime

I've heard someone say that contracts specifying nines of uptime typically exclude periods of "planned downtime" from the count, which could in fact extend to hours or days or longer, as long as the outage is planned and scheduled. In such contracts, only "unplanned downtime" usually counts against the allotted downtime. Can anyone else confirm this convention, and maybe integrate it into the article if this is indeed a common industry interpretation? -- Bovineone 20:59, 13 August 2006 (UTC)

The measurement is relative to expected available hours. If I promise to be available 7 hours a week, and I'm available 7 hours a week, I have 100% availability (I'm fulfilling the promise completely). Rational people recognise the need to allow for error, so the "nines" refers to the degree of error that's expected. The fewer hours you promise to be available, the easier it is to meet the goal - a few systems (like most phone systems, for instance) promise to be available all the time, and have a goal of meeting that promise 99.999% of the time. That is the assumption implicit in the "myth of the nines". But if you have a service level agreement with a hosting site that allows for them to be down on Sunday afternoons for maintenance, then you don't get to subtract Sunday afternoon service interruptions from their performance, as they didn't promise it in the agreement. Lesson: always read the fine print! -- not a user... —Preceding unsigned comment added by 199.90.35.11 (talk) 21:20, 29 May 2008 (UTC)
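
A minimal sketch of the convention described above, with made-up numbers (the four-hour Sunday window and the three hours of unplanned outage are assumptions, not figures from any real SLA): scheduled maintenance is subtracted from the promised hours, and only unplanned downtime counts against the guarantee.

 # Hypothetical SLA arithmetic: availability measured against promised hours only.
 hours_in_year = 365 * 24
 planned_maintenance_hours = 52 * 4        # assumed: a four-hour window every Sunday
 promised_hours = hours_in_year - planned_maintenance_hours

 unplanned_outage_hours = 3                # assumed unplanned downtime for the year

 availability = 1 - unplanned_outage_hours / promised_hours
 print(round(availability * 100, 4), "% measured against promised hours")
 # The 208 planned hours do not count; only the 3 unplanned hours reduce the figure.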

Mathematics

Just to note explicitly a trivial point when calculating probability. We've all implicitly used the approximation that:

 (1 + x)^n  ≈  1 + nx        # for small x

This is a very accurate approximation, but its use should be noted.
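
A quick numeric check of the approximation, with x and n picked arbitrarily for illustration:

 # Compare (1 + x)**n with the linear approximation 1 + n*x for small x.
 x = -0.00001       # e.g. a per-component unavailability of 0.001%
 n = 10             # e.g. ten components in series

 exact = (1 + x) ** n
 approx = 1 + n * x
 print(exact)       # about 0.9999000045
 print(approx)      # 0.9999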

Costs

I think a reference to the costs of a [Network outage] would be interesting. However, that article is currently in such a bad state that I do not want to add it here, and I do not have the necessary information to extend it. Dbu 13:50, 22 January 2007 (UTC)

Sales pitch

More often than not, five- and six-nines availability figures are used generously by sales and marketing teams to push a product. Buyers are advised to confirm how, and over what duration, the system was monitored before these figures were attributed to it.

There is nothing "mythical" about the nines

The calculation of the uptime as a percentage is widely used in SLAs so the article is providing some useful information.

I would like to add that there is no technical reason for it to be all nines. The expression of "five nines uptime" may be popular but you could express the uptime as 99.8% or 99.98% or whatever, albeit with less geek appeal.

Anyway, under the section entitled "The myth explained", the writer is confusing the technical issues with the business issues. There is nothing misleading or controversial about guaranteeing your uptime as a percentage and stating the remedies for failing to meet the guarantee. The collateral damage caused by downtime should be well understood by the customer, but the topic for such discussions should be "risk management", not uptime calculations.

As a service provider, it is not up to me to measure the impact that a system failure will have on individual customers. They are responsible for measuring that impact on their own businesses. I offer a percentage of uptime for a given price. They then decide if my guarantee is acceptable. If it is not, I can offer a higher percentage at a higher price. The guarantee is only one side of the equation, though. The other side is the remedies to the customer if I fail to meet the terms of my uptime guarantee. This should also be covered in the SLA.

I think the article should be renamed to remove the reference to "myth", and that subsection reworked, perhaps into something that examines the uses of uptime percentages in business risk management.

Further, I would like to see the article expanded to include more information concerning probability calculations. For example, I know that a single server with no redundancy will have a downtime of no less than 2 hours if the hard disk fails. (The presumption is that 2 hours is about how long it will take to replace the disk and restore from backups.) But what's the probability that it will fail more than once a year? Or how do I calculate the uptime probability if I use two servers in a cluster? Or three? That is the kind of information that I came to this article looking for but didn't find.

Stephen Brown 13:03, 20 April 2007 (UTC)

It's easy enough to calculate the probabilities. It depends on whether the events are independent or not: here, on whether the servers act as fully redundant backups for each other, or whether they are connected, in full or in part, in a critical way. Let's assume that they are fully redundant backups. If the uptime is 80% on each server (it will illustrate things more clearly), then the probability of failure for each is 0.2. For the system to go down you need both to fail concurrently, so you multiply the probabilities: 0.2 x 0.2 gives a 0.04 chance of failure, which means 0.96 or 96% uptime. If we substitute 99.999% we get 99.99999999% (10 nines). This is a simple and idealised situation. If the systems are wholly dependent, for example in a cluster where the system fails if any component server fails, then the opposite occurs: for 80% uptime we need all our servers up for the system to be up, so 0.8 uptime for each server, and for three servers 0.8 x 0.8 x 0.8, which gives 0.512 or 51.2% uptime for the system. Substituting 99.999% we get 99.99700003% (see the small-x approximation noted above). But if you need an article describing how to calculate probabilities, then I don't think this is the place.
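
For illustration, a minimal sketch of the two cases described above (the function names are mine, and independent failures are assumed throughout):

 # Redundant (parallel) vs. dependent (serial) configurations,
 # assuming every server fails independently of the others.
 def parallel_availability(a, n):
     """System is up unless all n redundant servers are down at once."""
     return 1 - (1 - a) ** n

 def serial_availability(a, n):
     """System is up only while all n servers are up."""
     return a ** n

 a = 0.99999
 print(parallel_availability(a, 2))   # 0.9999999999 -> ten nines
 print(serial_availability(a, 3))     # ~0.9999700003 -> 99.99700003%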

For the other part, the chance of more than one failure in a year? We need more information than just the percentage uptime; we would need to know the average length of each downtime. But this gets a little tricky, because the downtime will depend on the component. A failed PSU in a hot-swap box might only take seconds to swap out and minutes for a reboot, but a failed CPU or mainboard will take longer to fix, particularly if you need to run diagnostics first. Given that each component will have different failure rates and different associated downtimes, you need to add up a lot of numbers. But the 99.999% figure alone can't give you this info.
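
One conventional way to put a number on the failures-per-year question, sketched with an assumed repair time rather than real data (the two hours comes from the disk-replacement example in the question above), is the standard relation availability = MTBF / (MTBF + MTTR):

 # Sketch: expected failures per year from availability plus an assumed MTTR,
 # using availability = MTBF / (MTBF + MTTR).
 availability = 0.99999
 mttr_hours = 2.0                 # assumed: replace the disk and restore from backups

 mtbf_hours = mttr_hours * availability / (1 - availability)
 hours_per_year = 365.25 * 24
 failures_per_year = hours_per_year / (mtbf_hours + mttr_hours)

 print(round(mtbf_hours), "hours MTBF,", round(failures_per_year, 3), "expected failures per year")
 # roughly 199998 hours MTBF and 0.044 failures per year under these assumptions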

So in answer to the questions: this info isn't here because it doesn't really have anything to do with the article. 58.107.72.85 (talk) 00:21, 28 January 2008 (UTC)

One tenth as good

The article states: "For example, a system of ten components (eg. disks, motherboard, PSU, RAM, mains-power, network...), each with 99.999% availability, only has 99.99% overall availability (ie. one tenth as good)." This seems to be wrong. 99.99 isn't one tenth of 99.999. It's only one-thousandth less than 99.999. -18:23, 22 August 2007 (UTC)

Subtract the percentages from 100% and you'll get it. 0.001% is ten times less (failures) than 0.01%, so ten times better. -- intgr #%@! 21:15, 22 August 2007 (UTC)
The fact is that the overall availability is not equal to “one tenth” of something; the calculation uses the tenth power (the resulting availability is 0.99999^10), but the result is almost the same (~99.9900004+), because (1 - ε)^n ≈ 1 - nε for small ε. --Mormegil 11:44, 26 October 2007 (UTC)
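
A quick check of the figures being discussed:

 # Ten components in series, each 99.999% available.
 per_component = 0.99999
 overall = per_component ** 10
 print(overall)                                  # about 0.9999000045, i.e. ~99.99%
 print((1 - overall) / (1 - per_component))      # about 10: unavailability is ~10x higher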