These topics are from Chapter 8 (Agreement Protocols) in Advanced Concepts in OS.
Assume:
Agreement problem is not solvable in an asynchronous system, even for single-processor failures.
Synchronous model allows detection of first two kinds of failures.
Byzantine failures may be due to hardware or software failures, or due to malicious attacks.
All non-faulty processors must agree on value(s) from a non-faulty processor
Byzantine agreement is the most basic one.
Algorithms to solve the other problems can be constructed from an algorithm to solve the Byzantine agreement problem, though more direct algorithms may also exist.
We will see some algorithms for solving the Byzantine agreement problem that fall within these bounds. However, we will also see that the algorithms are fairly complex. This should naturally lead one to think twice when designing a system, to see if there is a way to avoid creating situations that require agreement.
See the following simple example with 3 processors, from the text. The arrows indicate state information made available to other nodes. In the first case, processor A initiates the agreement protocol and processor B is maliciously faulty.
C sees that B has decided for 0 and A has decided for 1. To satisfy the Byzantine agreement problem, C must decide for 1, since A is not faulty and A has decided for 1. This implies that the algorithm followed by C (and hence by any non-faulty non-initiating processor) must break ties in favor of the initiating processor.
The next case is where the processor A is a traitor, and reports different values to B and C.
B thinks A has decided for 0 and C thinks A has decided for 1. If the algorithm breaks ties in favor of the initiator, C must decide for 1. However, B must follow the same algorithm, and so it must decide for 0. This means we have no agreement between the two non-faulty processors.
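The two cases above can be replayed in a few lines of Python. This is only an illustrative sketch: the function name decide and the variable names are hypothetical, and the rule it encodes is the "break ties in favor of the initiator" rule derived from the first case.

```python
def decide(from_initiator, relayed_by_peer):
    """Rule forced by case 1: with only two reports, any disagreement
    is a tie, and ties are broken in favor of the initiator."""
    return from_initiator

# Case 1: A is loyal and sends 1; faulty B tells C that the value is 0.
c_view_case1 = (1, 0)            # (value from A, value relayed by B)
c_case1 = decide(*c_view_case1)

# Case 2: A is a traitor and sends 0 to B and 1 to C; B and C relay honestly.
b_view_case2 = (0, 1)            # (value from A, value relayed by C)
c_view_case2 = (1, 0)            # (value from A, value relayed by B)
b_case2 = decide(*b_view_case2)
c_case2 = decide(*c_view_case2)

# C cannot tell the two cases apart, so it must decide the same way in both.
print(c_view_case1 == c_view_case2)
print(b_case2, c_case2)
```

C's view is the same pair of reports in both cases, which is why no choice of rule for C can satisfy the first case without breaking agreement in the second.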
Proof of the full theorem generalizes this reasoning to a larger number of processors.
This is called the "Oral Message" algorithm, because the conditions correspond to what we would expect if messages are delivered orally, in person, by pairwise conversations between the parties involved in the consensus.
If there are no traitors, achieving agreement is easy:
S is the set of generals for which we want agreement.
The commander i sends a value v directly to every lieutenant j ∈ S - {i}.
For each lieutenant j ∈ S - {i}, let vj be the value lieutenant j receives from the commander i, or else RETREAT if he receives no value. Lieutenant j initiates OM(m-1, S - {i}) (recursively) with value vj, acting as commander.
The notation vj here helps us to remember that j received the value vj from i in the previous round, and j is asking the other generals to agree on this fact. At the end of these recursive executions, every loyal lieutenant j ∈ S - {i} has agreed on a set of pairs (k,vk), one for each k ∈ S - {i}.
When Step 2 has been completed by all lieutenants, each lieutenant j tabulates the pairs it received in Step 2 (its own pair containing the original value from its commander, and the other pairs containing the values obtained from the other lieutenants through the recursive invocations of OM(m-1, S - {i})) and agrees on the value v = majority ({(k,vk) | k ∈ S - {i}}) that is in the majority of those pairs, to be the result of OM(m, S).
One feature of this algorithm that some people have found confusing is the way in which the results of the recursive algorithms are combined. That is, the values must be retained and then combined, by taking the majority, after the entire round has completed.
Another feature that some people have found confusing is that there must be an arbitrary rule, such as choosing the lower value, to break ties. Since traitors may not send messages, there also must be a default value, such as 0, that is used for all generals from which no pair is received. Likewise, if there is no majority, a default value must be used for the result of OM(m,S). So long as all loyal generals agree on the tie-breaking rule and the default value, there will still be consensus among the loyal generals.
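The three steps and the majority rule can be sketched as a small recursive simulation. This is a minimal sketch under stated assumptions, not a definitive implementation: values are 0 or 1, DEFAULT = 0 stands in for the agreed default, every general always sends something (so the "no message" case never arises), and the names om, majority, and lie are hypothetical. The lie function is one arbitrary traitor behaviour chosen for illustration.

```python
from collections import Counter

DEFAULT = 0  # assumed default, used when no value has a strict majority

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else DEFAULT

def om(m, commander, lieutenants, value, traitors, lie):
    """Simulate OM(m); return {lieutenant: value it settles on}."""
    # Step 1: the commander sends a value to every lieutenant.
    # A traitorous commander may send a different value to each one.
    received = {j: lie(commander, j, value) if commander in traitors else value
                for j in lieutenants}
    if m == 0:
        return received
    # Step 2: each lieutenant j acts as commander in OM(m-1) among the
    # remaining lieutenants; relayed[k][j] is what k concludes j received.
    relayed = {k: {} for k in lieutenants}
    for j in lieutenants:
        others = [k for k in lieutenants if k != j]
        sub = om(m - 1, j, others, received[j], traitors, lie)
        for k in others:
            relayed[k][j] = sub[k]
    # Step 3: only after all recursive runs finish does each lieutenant take
    # the majority of its own value and the values retained from Step 2.
    return {k: majority([received[k]] + list(relayed[k].values()))
            for k in lieutenants}

def lie(sender, receiver, value):
    # Hypothetical traitor behaviour: the value sent depends on the receiver.
    return receiver % 2

# Four generals, commander 0 proposes 1, at most one traitor, so m = 1.
loyal_commander = om(1, 0, [1, 2, 3], 1, traitors={3}, lie=lie)
traitor_commander = om(1, 0, [1, 2, 3], 1, traitors={0}, lie=lie)
print({k: v for k, v in loyal_commander.items() if k != 3})
print(traitor_commander)
```

The entries computed for traitorous lieutenants carry no meaning and are filtered out before printing in the first run. Note that Step 3 is placed after the loop over j, reflecting the point above that values are retained and combined only once the whole round has completed.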
To understand this algorithm, it helps to start with the case that the commander i is loyal. In that case, each lieutenant j will receive the same value v from i. The loyal ones can simply accept the value v and it will not matter what the traitors do.
However, since there is no way for a lieutenant j to tell whether the commander i is a traitor, one must assume that he may be one. To protect against the commander sending different values to the different lieutenants, the lieutenants must hold a ballot to reach consensus on what message the commander sent to each one of them. The rest of the algorithm is the procedure for that ballot.
Since the messages are transmitted "orally" (not broadcast), the lieutenants must all exchange information about what they received in the previous round, before they can hold the ballot. The ballot would still be easy if we could trust every processor to report accurately what it received. However, we must allow for the possibility that some lieutenants are traitors, and so will report different things to different other lieutenants. That is why we need to do a Byzantine agreement on each of the messages that was sent to a lieutenant in the previous round.
When we get to the recursive invocation of OM(m-1,S-{i}), it is not obvious that we have reduced the problem sufficiently to satisfy the preconditions for OM(m-1,S-{i}). There are two possibilities:
The second case is dealt with by the Validity Lemma, which is stated and proven below. This lemma guarantees that if the commander is loyal, OM(m,S) can tolerate up to k traitors if | S | > 2k+m. We will explain this lemma in more detail below, using the original theorems and proofs of Lamport, Shostak, and Pease.
Lemma: For any m and k, OM(m,S) satisfies the Validity Condition if there are more than 2k+m processors and at most k of them are traitors.
Proof:
The proof is by induction on m. As a basis for the induction, we consider the case of OM(0). The Validity Condition only specifies what must happen if the commander is loyal. It is easy to see that if the commander is loyal OM(0) satisfies the Validity Condition, since all the processes get the same value v and agree upon that. We therefore can assume the lemma is true for OM(m-1) and prove that it is true for OM(m), m > 0.
For the induction step, we have m ≥ 1. In Step 1, the loyal commander i sends a value v to all the other processors. At Step 2, each loyal lieutenant j applies OM(m-1,S-{i}). Since we are assuming that | S | > 2k + m, we have | S -{i} | > 2k + (m-1), so we can apply the induction hypothesis to conclude that every loyal lieutenant agrees on the value vj=v for each invocation of OM(m-1,S-{i}) by a loyal commander j. Since there are at most k traitors, and | S -{i} | > 2k + (m-1) ≥ 2k, a majority of the lieutenants in S -{i} are loyal. Hence, when each loyal lieutenant gets to Step 3 it will find that a majority of the values it has tabulated equal v, and so it will agree to the value v. This confirms the Validity Condition.
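The counting argument in the induction step can be written out as a short chain of inequalities. The notation below (set difference written with a backslash, and the count symbols) is introduced here for illustration only.

```latex
\begin{align*}
|S| > 2k + m
  &\implies |S \setminus \{i\}| = |S| - 1 > 2k + (m - 1) \ge 2k
  && \text{(since } m \ge 1\text{)} \\
\#\{\text{loyal lieutenants}\}
  &\ge |S \setminus \{i\}| - k > 2k - k = k
  \ge \#\{\text{traitorous lieutenants}\}
\end{align*}
```

For instance, with m = 1 and k = 1 the lemma requires | S | > 3; with four processors and a loyal commander, at least two of the three lieutenants are loyal, which is a majority.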
Theorem: For any m, OM(m,S) satisfies the Validity and Agreement Conditions if there are more than 3m generals and at most m of them are traitors.
Proof:
The proof is by induction on m, similar to that of the Validity Lemma. As a basis for the induction, we consider the case of OM(0). If there are no traitors, it is easy to see that OM(0) satisfies the Validity and Agreement Conditions. We therefore can assume the theorem is true for OM(m-1) and prove that it is true for OM(m), m > 0.
For the induction step, we have m ≥ 1. We consider two cases, depending on whether the commander is a traitor.
Do an example, for 4 processors, interactively
Round 1: processor A executes OM(1), where processor C (in red) is faulty.
Round 2: processors B, C, and D execute OM(0). Dashed lines indicate messages sent during the previous round.
Round 1: processor A executes OM(1), where processor A is faulty.
Round 2: processors B, C, and D execute OM(0).
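The two 4-processor scenarios can be traced with a short two-round simulation of OM(1). This is a sketch under assumptions that are not taken from the figures: the initial value is 1, the default for "no majority" is 0, and the particular lies told by the faulty processor (faulty_c and faulty_a below) are hypothetical choices.

```python
from collections import Counter

DEFAULT = 0  # assumed default when no value has a strict majority

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else DEFAULT

def om1(commander, lieutenants, value, send):
    """Two-round OM(1); send(src, dst, v) is what src actually transmits."""
    # Round 1: the commander sends its value to each lieutenant.
    round1 = {j: send(commander, j, value) for j in lieutenants}
    # Round 2: each lieutenant relays what it received to the others (OM(0)),
    # then takes the majority of its own value and the relayed values.
    decisions = {}
    for j in lieutenants:
        relayed = [send(k, j, round1[k]) for k in lieutenants if k != j]
        decisions[j] = majority([round1[j]] + relayed)
    return decisions

def faulty_c(src, dst, v):
    # Assumed behaviour: C inverts whatever it relays; the rest are honest.
    return 1 - v if src == "C" else v

def faulty_a(src, dst, v):
    # Assumed behaviour: A sends 1 to B and 0 to C and D; the rest are honest.
    if src == "A":
        return 1 if dst == "B" else 0
    return v

case1 = om1("A", ["B", "C", "D"], 1, faulty_c)  # lieutenant C faulty
case2 = om1("A", ["B", "C", "D"], 1, faulty_a)  # commander A faulty
print(case1["B"], case1["D"])
print(case2)
```

In the first case the loyal lieutenants B and D both settle on A's value; in the second, B, C, and D settle on a common value even though A sent them conflicting ones.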