Reliable system design


Reliable system design is the practice of designing systems to achieve high levels of reliability and availability.

There is no such thing as a perfectly reliable system. Reliability engineering cannot engineer out failure modes that are not anticipated by modelling. For this reason, reliable systems are specified and designed to a non-zero failure rate, usually expressed as an availability target. Typical targets include "five nines" (99.999% availability) and "six nines" (99.9999% availability). Some life-critical systems are designed to even higher levels of performance.
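
The relationship between an availability target and the downtime it permits is simple arithmetic: an availability of A allows a fraction (1 - A) of any period to be spent unavailable. A minimal sketch in Python, using the targets named above and an average year as the period:

    # Downtime permitted per year by a steady-state availability target.
    # An availability A allows a fraction (1 - A) of the year to be spent down.

    MINUTES_PER_YEAR = 365.25 * 24 * 60

    def downtime_minutes_per_year(availability: float) -> float:
        """Unavailable minutes per year implied by an availability fraction."""
        return (1.0 - availability) * MINUTES_PER_YEAR

    for label, availability in [("five nines", 0.99999), ("six nines", 0.999999)]:
        minutes = downtime_minutes_per_year(availability)
        print(f"{label}: about {minutes:.2f} minutes of downtime per year")

By this calculation, "five nines" permits roughly five minutes of downtime per year, and "six nines" roughly half a minute.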

Engineers attempt to create reliable systems by design, rather than by over-designing without a particular goal in mind. The principal analytical tools of reliable systems design are root cause analysis and threat tree analysis, which allow real-world system failures to be investigated and the failure modes of new systems to be modelled.
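
Tree analyses of this kind decompose an undesired top-level event into contributing basic events combined through AND and OR gates, so a modelled tree can also be evaluated numerically. A minimal sketch, assuming independent basic events and purely hypothetical probabilities:

    # Evaluate a small failure tree. An AND gate requires all inputs to fail,
    # so probabilities multiply; an OR gate fails unless every input survives.
    # Basic events are assumed independent; the probabilities are hypothetical.

    def p_and(*probabilities: float) -> float:
        result = 1.0
        for p in probabilities:
            result *= p
        return result

    def p_or(*probabilities: float) -> float:
        survive = 1.0
        for p in probabilities:
            survive *= (1.0 - p)
        return 1.0 - survive

    # Hypothetical tree: service is lost if the sole disk fails, OR if both
    # redundant power feeds fail at the same time.
    disk = 0.01
    feed_a = 0.05
    feed_b = 0.05

    service_loss = p_or(disk, p_and(feed_a, feed_b))
    print(f"probability of service loss: {service_loss:.4f}")  # about 0.0125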

The main engineering approaches toward reliable systems design are:

  • eliminating single points of failure ("no single point of failure"), typically by duplicating components (see the sketch following this list).
  • engineering any remaining single points of failure to whatever level is necessary to reach the system specification.
  • adding extra system safety margins to allow for errors in modelling or implementation.
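
The effect of the first two approaches can be quantified with the standard series/parallel availability formulas: components that are all required (in series) have availabilities that multiply, while redundant components (in parallel) cause an outage only if all of them are down at once. A minimal sketch, assuming independent failures and illustrative availability figures:

    # Series/parallel availability, assuming independent component failures.
    # The availability figures are illustrative only.

    def series(*availabilities: float) -> float:
        """All components are required: availabilities multiply."""
        result = 1.0
        for a in availabilities:
            result *= a
        return result

    def parallel(*availabilities: float) -> float:
        """Service is lost only if every redundant component is down."""
        all_down = 1.0
        for a in availabilities:
            all_down *= (1.0 - a)
        return 1.0 - all_down

    single_server = 0.999                    # a single point of failure
    duplicated = parallel(0.999, 0.999)      # the same server duplicated
    print(f"single:     {single_server:.6f}")
    print(f"duplicated: {duplicated:.6f}")   # 0.999999, i.e. "six nines"

    # A chain that still contains a non-redundant component is limited by it.
    print(f"chain:      {series(duplicated, 0.999):.6f}")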

A single point of failure is any part of a system whose failure, on its own, causes an interruption of the required service. This can be as simple as a single process failing or as catastrophic as an entire computer system crashing.

Most non-critical real-world systems have many single points of failure. A typical desktop computer has only one processor, one power supply, one keyboard, and one screen, the failure of any of which will render that computer unusable. A business as a whole generally conducts its affairs so that the failure of any single desktop PC will not bring the business down. Thus, the components mentioned above are single points of failure for the PC, but not for the larger system of which the PC is a component. Similar techniques using duplicated and backup components create resilient systems for critical applications such as databases, communications networks and air traffic control systems.

However, mere use of massive redundancy does not make a system reliable, so long as there is even one single point of failure left in the system. A network where power feeds, network connections, routers, and router interconnections have all been correctly made redundant can still have a single point of failure if both routers are housed in a single rack, allowing a single spilled cup of coffee to take out both routers at once. Even eliminating every conceivable single point of failure — a so-called shared-nothing system — is not by itself enough to make a system truly resilient, as the extra redundancy may make the system vulnerable to Byzantine failure modes.
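
The cost of such a residual single point of failure can be seen numerically: however many redundant routers share the rack, the availability of the whole cannot exceed that of the rack itself. A minimal sketch, using the same parallel formula as above, with failures assumed independent apart from the shared rack and illustrative figures throughout:

    # Redundant routers behind a shared rack: the rack is a residual single
    # point of failure and caps the achievable availability. Figures illustrative.

    def parallel(*availabilities: float) -> float:
        all_down = 1.0
        for a in availabilities:
            all_down *= (1.0 - a)
        return 1.0 - all_down

    router = 0.999
    rack = 0.9995        # shared power, cooling, and the occasional spilled coffee

    for n in (1, 2, 4):
        routers_up = parallel(*([router] * n))
        system = routers_up * rack     # both the routers and the rack must be up
        print(f"{n} router(s): {system:.6f}")

    # Beyond the first duplicate, adding routers barely helps: the result
    # converges to the rack's own availability of 0.9995.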
