Byzantine fault tolerance
From Wikipedia, the free encyclopedia
Byzantine fault tolerance is the name given to a sub-field of error tolerance research, inspired by The Byzantine Generals' Problem, which is a generalized version of the Two Generals' Problem.
The object of Byzantine fault tolerance is to be able to defend against a Byzantine failure, in which a component of some system not only behaves erroneously, but also fails to behave consistently when interacting with multiple other components. Correctly functioning components of a Byzantine fault tolerant system will be able to reach the same group decisions regardless of Byzantine faulty components. There are upper bounds on the percentage of traitorous or unreliable components, however.
Contents |
[edit] Byzantine failures
A Byzantine failure (or Byzantine fault) is an arbitrary fault that occurs during the execution of an algorithm by a distributed system. It encompasses those faults that are commonly referred to as "crash failures" and "send and omission failures". When a Byzantine failure has occurred, the system may respond in any unpredictable way, unless it is designed to have Byzantine fault tolerance.
These arbitrary failures may be loosely categorized as follows:
- a failure to take another step in the algorithm, also known as a crash failure;
- a failure to correctly execute a step of the algorithm; and
- arbitrary execution of a step other than the one indicated by the algorithm.
For example, if the output of one function is the input of another, then small round-off errors in the first function can produce much larger errors in the second. If the second function were fed into a third, the problem could grow even larger, until the values produced are worthless. Another example is in compiling source code. One minor syntactical error early on in the code can produce large numbers of perceived errors later, as the compiler gets out-of-phase with the lexical and syntactic information in the source program.
Steps are taken by processes, the abstractions that execute the algorithms. A faulty process is one that at some point exhibits one of the above failures. A process that is not faulty is correct.
The Byzantine failure assumption models real-world environments in which computers and networks may behave in unexpected ways due to hardware failures, network congestion and disconnection, as well as malicious attacks. Byzantine failure-tolerant algorithms must cope with such failures and still satisfy the specifications of the problems they are designed to solve. Such algorithms are commonly characterized by their resilience t, the number of faulty processes with which an algorithm can cope.
Many classic agreement problems, such as the Byzantine Generals Problem, have no solution unless t < n / 3, where n is the number of processes in the system.
The Two Generals' Problem is a specific case which assumes that processes are reliable but communication between processes is not reliable.
[edit] Origin
Byzantine refers to the Byzantine Generals' Problem, an agreement problem in which generals of the Byzantine Empire's army must decide unanimously whether or not to attack some enemy army. The problem is complicated by the geographic separation of the generals, who must communicate by sending messengers to each other, and by the presence of traitors amongst the generals. These traitors can act arbitrarily in order to achieve the following aims: trick some generals into attacking; force a decision that is not consistent with the generals' desires, e.g. forcing an attack when no general wished to attack; or so confusing some generals that they never make up their minds. If the traitors succeed in any of these goals, any resulting attack is doomed, as only a concerted effort can result in victory.
The Byzantine Generals problem describes a group of generals, each commanding a division of the Byzantine army, encircling a city. These generals wish to formulate a plan for attacking the city. In its simplest form, the generals must only decide whether to attack or retreat. Some generals may prefer to attack, while others prefer to retreat. The important thing is that every general agrees on a common decision, for a halfhearted attack by a few generals would become a rout and be worse than a coordinated attack or a coordinated retreat.
The problem is complicated by the presence of traitorous generals who may not only cast a vote for a suboptimal strategy, they may do so selectively. For instance, if nine generals are voting, four of whom support attacking while four others are in favor of retreat, the ninth general may send a vote of retreat to a few generals, and a vote of attack to the rest. Those who received a retreat vote from the ninth general will retreat, while the rest will attack (which may not go well for the attackers).
Byzantine fault tolerance can be achieved if the loyal (non-faulty) generals have a unanimous agreement on their strategy. Note that if the source general is correct, all loyal generals must agree upon that value. Otherwise, the choice of strategy agreed upon is irrelevant.
[edit] Solutions
Several solutions were originally described by Lamport, Shostak, and Pease in 1982. They began by noting that the Generals Problem can be reduced to solving a "Commander and Lieutenants" problem where Loyal Lieutenants must all act in unison and that their action must correspond to what the Commander ordered in the case that the Commander is Loyal. Roughly speaking, the Generals vote by treating each others' orders as votes.
- One solution considers scenarios in which messages may be forged, but which will be Byzantine Fault Tolerant as long as the number of traitorous generals does not equal or exceed one third. The impossibility of dealing with one-third or more traitors ultimately reduces to proving that the 1 Commander + 2 Lieutenants problem cannot be solved if the Commander is traitorous. The reason is, if we have three commanders, A, B, and C, and A is the traitor: when A tells B to attack and C to retreat, and B and C sends messages to each other, forwarding A's message, neither B nor C can figure out who is the traitor, since it isn't necessarily A - the other commander could have forged the message purportedly from A. It can be shown that if n is the number of generals in total, and t is the number of traitors in that n, then there are solutions to the problem only when n is greater than or equal to 3 times t + 1.
- A second solution requires unforgeable signatures (in modern computer systems, this may be achieved through public key cryptography), but maintains Byzantine Fault Tolerance in the presence of an arbitrary number of traitorous generals.
- Also presented is a variation on the first two solutions allowing Byzantine Fault Tolerant behavior in some situations where not all generals can communicate directly with each other.
[edit] See also
[edit] References
- Lamport, Shostak, and Pease (1982). "Byzantine Fault Tolerance" (PDF). ACM Transaction on Programming Languages and Systems.
- L. Lamport, R. Shostak, and M. Pease (July 1982). "The Byzantine Generals Problem". ACM Trans. Programming Languages and Systems 4 (3): 382-401.
[edit] External links
- Ocean Store replicates data with a Byzantine fault tolerant commit protocol.
- Byzantine Quorum Systems Quorum systems for Byzantine-fault tolerant replication.
- Practical Byzantine Fault Tolerance