High availability

High availability is a system design protocol and associated implementation that ensures a certain degree of operational continuity during a given measurement period.

Availability refers to the ability of the user community to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.

Planned and unplanned

A distinction needs to be made between planned downtime and unplanned downtime. Typically, planned downtime is the result of maintenance that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Planned downtime events might include patches to system software that require a reboot or system configuration changes that only take effect upon a reboot. Planned downtime is generally the result of some logical, management-initiated event. Unplanned downtime events typically arise from some physical event, such as a hardware failure or environmental anomaly. Examples of unplanned downtime events include power outages, failed CPU or RAM components (or possibly other failed hardware components), an over-temperature-related shutdown, logically or physically severed network connections, catastrophic security breaches, or various application, middleware, and operating system failures.

Many computing sites exclude planned downtime from availability calculations, assuming, correctly or incorrectly, that planned downtime has little or no impact upon the computing user community. By excluding planned downtime, many systems can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and they have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements.

Percentage calculation

Availability is usually expressed as a percentage of uptime in a given year. (Shorter time periods can be used, but sites that pick artificially short measurement periods may be hiding latent problems in their systems that produce instability, leading to unplanned downtime. Longer periods are more rigorous.) In a given year, the number of minutes of unplanned downtime is tallied for a system; the aggregate unplanned downtime is divided by the total number of minutes in a year (approximately 525,600), producing a percentage of downtime; the complement is the percentage of uptime, which is what is typically referred to as the availability of the system. Common values of availability for highly available systems are:

  • 99.9% ≡ 43.8 minutes/month or 8.76 hours/year
  • 99.99% ≡ 4.38 minutes/month or 52.6 minutes/year
  • 99.999% ≡ 0.44 minutes/month or 5.26 minutes/year
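
These figures follow directly from the percentage calculation described above. The short Python sketch below (illustrative only, not part of the original article) reproduces them using the approximate figure of 525,600 minutes per year:

    # Illustrative sketch: convert an availability percentage into the
    # downtime it allows, using the approximate 525,600 minutes/year figure.
    MINUTES_PER_YEAR = 525_600
    MONTHS_PER_YEAR = 12

    def allowed_downtime(availability_percent):
        """Return (minutes/month, minutes/year) of downtime permitted."""
        downtime_fraction = 1.0 - availability_percent / 100.0
        per_year = downtime_fraction * MINUTES_PER_YEAR
        return per_year / MONTHS_PER_YEAR, per_year

    for pct in (99.9, 99.99, 99.999):
        per_month, per_year = allowed_downtime(pct)
        print(f"{pct}% -> {per_month:.2f} min/month, {per_year:.2f} min/year")
    # 99.9%   -> 43.80 min/month, 525.60 min/year (8.76 hours)
    # 99.99%  -> 4.38 min/month, 52.56 min/year
    # 99.999% -> 0.44 min/month, 5.26 min/year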

Uptime and availability are not synonymous: a system can be up but not available, as in the case of a network outage.

Measurement and interpretation

Clearly, how availability is measured is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might nonetheless have been cut off by a network failure that lasted 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% "uptime." However, given the true definition of availability, the system will be approximately 99.897% available (8751 hours of available time out of 8760 hours in a non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, while administrators might have a different perception (and, certainly in the business sense, probably an incorrect one). Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users; a true availability measure is holistic.
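
The 99.897% figure in the example above can be checked with a few lines of arithmetic (a sketch, assuming the 9-hour outage and a non-leap year):

    hours_per_year = 365 * 24      # 8,760 hours in a non-leap year
    outage_hours = 9               # network failure during peak usage
    availability = (hours_per_year - outage_hours) / hours_per_year
    print(f"{availability:.3%}")   # 99.897%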

Availability can only be determined through measurement, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. In the absence of such instrumentation, systems supporting high-volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users (and journalists!) themselves, than systems that experience periodic lulls in demand.

Closely related concepts

Recovery time is closely related to availability: it is the total time required for a planned outage, or the time required to fully recover from an unplanned outage. Recovery time can be infinite with certain system designs and failures, i.e. full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster recovery data center.

Another related concept is data availability: the degree to which databases and other information storage systems faithfully record and report system transactions. Information management specialists often focus on data availability separately in order to determine acceptable (or actual) data loss under various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss.

A service level agreement ("SLA") formalizes an organization's availability objectives and requirements.

System design for high availability

Paradoxically, adding more components to an overall system design can undermine efforts to achieve high availability, because complex systems inherently have more potential failure points and are more difficult to implement correctly. The most highly available systems hew to a simple design pattern: a single, high-quality, multi-purpose physical system with comprehensive internal redundancy running all interdependent functions, paired with a second, like system at a separate physical location. This classic design pattern is common among financial institutions, for example. The same basic design principle applies beyond computing in such diverse fields as nuclear power, aeronautics, and medical care.
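
The trade-off described above can be made concrete with the standard series/parallel availability model. The sketch below is a textbook approximation rather than anything drawn from this article, and it assumes component failures are independent: components that are all required multiply their availabilities (so adding more of them lowers the total), while redundant systems at separate locations fail only if all of them fail at once.

    # Back-of-the-envelope sketch (standard series/parallel model,
    # assuming independent failures).

    def series(*availabilities):
        """All components required: the system is up only if every one is up."""
        result = 1.0
        for a in availabilities:
            result *= a
        return result

    def parallel(*availabilities):
        """Redundant components: the system is down only if all are down."""
        downtime = 1.0
        for a in availabilities:
            downtime *= 1.0 - a
        return 1.0 - downtime

    # Chaining five 99.9% components that must all work lowers availability,
    # while pairing two like 99.9% systems at separate sites raises it.
    print(f"series of five 99.9% parts:  {series(*[0.999] * 5):.4%}")    # 99.5010%
    print(f"two redundant 99.9% systems: {parallel(0.999, 0.999):.4%}")  # 99.9999%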
