Myth of the nines

In information technology, the myth of the nines is the idea that standard measurements of availability can be misleading. Availability is sometimes described in units of nines, as in "five nines", or 99.999%. A computer system with 99.999% availability is considered highly available, delivering its service to the user 99.999% of the time it is needed.

How to calculate five nines

The number N of nines describing a system which is available a fraction A of the time is

N = -\log_{10}(1 - A).
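
For example, "five nines" corresponds to A = 0.99999:

N = -\log_{10}(1 - 0.99999) = -\log_{10}(10^{-5}) = 5.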

In practice, engineers rarely use the number of nines when modeling and measuring availability, because it is awkward to apply in formulae. More often, the unavailability is expressed as a probability (such as 0.00001), or a downtime per year is quoted. Availability specified as a number of nines appears more often in marketing documents, presumably because it looks impressive.

The following table shows the downtime allowed for a given percentage of availability, assuming the system is required to operate continuously. Service level agreements often refer to monthly downtime, to match monthly billing cycles when calculating service credits.

Availability %   Downtime per year   Downtime per month*   Downtime per week
90%              36.5 days           72 hours              16.8 hours
95%              18.25 days          36 hours              8.4 hours
98%              7.30 days           14.4 hours            3.36 hours
99%              3.65 days           7.20 hours            1.68 hours
99.5%            1.83 days           3.60 hours            50.4 min
99.8%            17.52 hours         86.4 min              20.16 min
99.9%            8.76 hours          43.2 min              10.1 min
99.95%           4.38 hours          21.6 min              5.04 min
99.99%           52.6 min            4.32 min              1.01 min
99.999%          5.26 min            25.9 s                6.05 s
99.9999%         31.5 s              2.59 s                0.605 s

*For monthly calculations, a 30-day month is used. This model does not take into account the impact an outage would have on the business if it occurred at a critical moment.
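
The figures above follow from simple arithmetic. A minimal sketch in Python, assuming the same conventions as the table (a 365-day year, a 30-day month, and a 7-day week):

import math

# Period lengths in minutes, matching the table's assumptions:
# a 365-day year, a 30-day month, and a 7-day week.
PERIODS = {
    "year": 365 * 24 * 60,
    "month": 30 * 24 * 60,
    "week": 7 * 24 * 60,
}

def nines(availability):
    """Number of nines for an availability expressed as a fraction."""
    return -math.log10(1 - availability)

def downtime_minutes(availability, period):
    """Permitted downtime, in minutes, over the given period."""
    return (1 - availability) * PERIODS[period]

# Example: "five nines"
a = 0.99999
print(nines(a))                           # ≈ 5.0 (five nines)
print(downtime_minutes(a, "year"))        # ≈ 5.26 minutes per year
print(downtime_minutes(a, "month") * 60)  # ≈ 25.9 seconds per month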

The myth explained

The myth of the nines is the implicit assumption that if the computer is operating 0.99999 of the time, then the user's business is operating 0.99999 of the time. In fact, this is often far from the truth. After an outage, the humans using the computer have to scramble to catch up: perhaps apologising to customers, calling them back, entering data written down on paper during the outage, and performing other unfamiliar chores. In the case of a drive failure, the server downtime might be small, but the time to restore from backup might be considerably longer. A computer outage of a minute might cause a business outage of hours.

A further assumption in this model is that ten outages of one minute each have the same effect on the user as one outage of ten minutes. Again, this is not usually true. If a system suffers repeated outages, the user is justified in believing that the system cannot be trusted, and may come to regard the computer as a liability. The user may measure ten one-minute outages over a period of six months as a downtime of six months, while the computer's manufacturer measures it as a downtime of ten minutes. There is no way to calculate the number of outages over a given period from the uptime percentage alone.
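
A small illustration in Python (the outage logs are hypothetical, chosen only to make the point): both patterns below yield exactly the same measured availability over six months, even though their impact on the user is very different.

# Six months of operation, measured in minutes (30-day months, as above).
period_minutes = 6 * 30 * 24 * 60

# Hypothetical outage logs: durations of individual outages, in minutes.
one_long_outage    = [10]
many_short_outages = [1] * 10

def availability(outages, period):
    """Fraction of the period during which the system was up."""
    return 1 - sum(outages) / period

print(availability(one_long_outage, period_minutes))     # ≈ 0.999961
print(availability(many_short_outages, period_minutes))  # identical figure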

Also, component availability does not carry over directly to the system. For example, a system made up of five-nines components does not have five-nines availability: every component must be up for the system to be up, so (assuming independent failures) the component availabilities must be multiplied to give the system availability. A system of ten components (e.g. disks, motherboard, PSU, RAM, mains power, network), each with 99.999% availability, has only approximately 99.99% overall availability. Conversely, a system may be built from redundant parts whose failures are independent, which increases uptime: two or more components must fail before the system fails. RAID is the classic example, where normally two or more drives must fail before the array fails. In a RAID 5 array, two drives must be down at the same time, so for drives with 99.999% availability the array's availability would be roughly 99.99999999%. Real redundant systems are further complicated by the fact that the remaining redundant components must fail within the time it takes to replace the first.
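
A sketch of both calculations in Python, using the example figures from the paragraph above and assuming independent component failures:

from math import prod

def series_availability(component_availabilities):
    """All components must be up: multiply the availabilities."""
    return prod(component_availabilities)

def redundant_availability(component_availabilities):
    """System fails only if every redundant component fails: multiply the unavailabilities."""
    return 1 - prod(1 - a for a in component_availabilities)

# Ten five-nines components in series: roughly four nines overall.
print(series_availability([0.99999] * 10))    # ≈ 0.99990 (99.99%)

# Two redundant drives, each five nines: roughly ten nines.
print(redundant_availability([0.99999] * 2))  # ≈ 0.9999999999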

Lastly, in many cases "scheduled maintenance" is not included in the availability calculation. So if the computer must be taken down to replace a failing disk, but the downtime is announced a week in advance, it "doesn't count".
