These topics are from Chapter 13 (Fault Tolerance) in Advanced Concepts in OS.
If a site fails, other sites may block until it recovers and completes its role in the protocol.
For example, if the coordinator fails in state w1, after sending COMMIT_REQUEST, the cohorts remain stuck until it recovers, waiting for it to follow up with an abort or commit message.
What happens in other failure cases? For example, what if a cohort fails in state wi?
Compare the impact in terms of locking effects if one cohort fails versus if the coordinator fails.
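The coordinator-failure case above can be illustrated with a minimal sketch of a single cohort's state machine. The state names q, w, a, c follow these notes; the message names COMMIT and ABORT and the helpers run_cohort and is_blocked are assumptions made for illustration, and the branch where the cohort votes to abort is omitted.

```python
# Hypothetical cohort transition table for 2-phase commit.
# Keys are (current state, message received); values are the next state.
# The "vote no" branch (q -> a) is left out to keep the sketch short.
COHORT_TRANSITIONS = {
    ("q", "COMMIT_REQUEST"): "w",  # cohort agrees and waits for the outcome
    ("w", "COMMIT"): "c",
    ("w", "ABORT"): "a",
}


def run_cohort(messages):
    """Feed messages to a cohort starting in q; return the state it ends in."""
    state = "q"
    for msg in messages:
        # A message with no matching transition leaves the state unchanged.
        state = COHORT_TRANSITIONS.get((state, msg), state)
    return state


def is_blocked(state):
    """With no messages left to deliver, a cohort still in w is blocked."""
    return state == "w"


# Coordinator stays up: the cohort reaches a final state.
normal = run_cohort(["COMMIT_REQUEST", "COMMIT"])

# Coordinator fails after COMMIT_REQUEST: no follow-up message ever arrives.
stuck = run_cohort(["COMMIT_REQUEST"])

print(normal, is_blocked(normal))
print(stuck, is_blocked(stuck))
```

In the second run the cohort ends in w, holding its locks, and nothing in its own transition table lets it leave that state without hearing from the coordinator.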
How to get there?
How do we achieve reliable point-to-point communication?
How do we detect failure of a site?
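One common answer to the failure-detection question is a timeout on periodic heartbeat messages. The sketch below is a hedged illustration of that idea, not the textbook's mechanism: the class HeartbeatMonitor, its methods, and the 3-second timeout are all hypothetical. Time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Suspects a site once no heartbeat has arrived for `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # site id -> time of most recent heartbeat

    def record_heartbeat(self, site, now):
        """Note that `site` was heard from at time `now`."""
        self.last_seen[site] = now

    def suspected(self, now):
        """Return the sorted ids of sites silent for longer than the timeout."""
        return sorted(
            site for site, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record_heartbeat(1, now=0.0)
monitor.record_heartbeat(2, now=0.0)
monitor.record_heartbeat(1, now=2.0)  # site 2 stays silent
print(monitor.suspected(now=4.0))     # prints [2]
```

A timeout can only suspect a failure. It cannot distinguish a crashed site from one that is slow or cut off by the network.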
Concurrency sets are an abstraction of what one site knows about the possible states of other sites.
Suppose site 1 initiates the commit protocol, and sites 2 and 3 respond.
Note that we cannot have a1 in C(q2), since site 1 must wait for responses from all of the other sites before it makes the transition from state w1.
Note that C(w2) contains both an abort state and a commit state for site 3. This means that it is unsafe at this point for site 2 to take any independent recovery action, because site 3 might choose a different action. For this reason, site 2 must block until it receives a message from the coordinator.
Match these up with the state diagrams above, and see why the sets contain the elements they do.
If C(si) contains both commit and abort states, then site i cannot decide to abort the transaction, since some other site may be in a commit state.
It cannot commit, either, since some other site may be in the abort state.
Therefore, site i must block.
If a protocol contains a local state of a site with both abort and commit states in its concurrency set, then under independent recovery conditions it is not resilient to an arbitrary single failure.
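The blocking test described above can be written as a small check on a concurrency set. In this sketch the two example sets for site 2 are assumptions read off the simplified 2-phase commit diagram (states q, w, a, c), chosen to agree with the observations in these notes that a1 is not in C(q2) and that C(w2) holds both an abort and a commit state for site 3. The function name must_block is hypothetical.

```python
def must_block(concurrency_set):
    """True if the set holds both an abort state and a commit state.

    States are strings such as "a3" or "c1": a letter for the kind of
    state followed by the site number.
    """
    has_abort = any(s.startswith("a") for s in concurrency_set)
    has_commit = any(s.startswith("c") for s in concurrency_set)
    return has_abort and has_commit


# Illustrative concurrency sets for site 2 (assumed, not copied from the text).
C = {
    "q2": {"q1", "w1", "q3", "w3", "a3"},
    "w2": {"w1", "a1", "c1", "q3", "w3", "a3", "c3"},
}

for state, others in sorted(C.items()):
    print(state, "must block" if must_block(others) else "may act alone")
```

Under these assumed sets, q2 is safe for independent action (site 2 can only abort from there), while w2 is the state that forces site 2 to block.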
The state diagrams in the text are a further abstraction from the ones shown above, with fewer states. The relationship is shown in the picture below.
The state f1 and the transitions to it are eliminated, states a1 and c1 are made into final states, and the states ai and bi are merged.
Conceptually, the elimination of state f1 amounts to modifying the protocol so that the coordinator does not block to wait for ACK messages.
One can then argue that merging states ai and bi is an allowable further simplification, since the only effect of the transition from ai to bi is to send the ACK that is now ignored.
The simplified diagram is no longer a complete description of a fault-tolerant protocol. Without the ACK messages from everyone, the Coordinator does not know that the Cohorts have caught up, and so cannot safely go on with its next computation.
However, the simplified diagram does make a clearer separation between abort states and commit states, which is the main focus of our interest.
Therefore, in the analysis below of whether the protocol permits independent recovery from failures, we will follow the textbook and use the simplified diagram. Alternate diagrams are provided at some points, via links.
The 3-phase commit protocol splits state wi, thereby eliminating the problem of having both abort and commit states in the concurrency set of state wi.
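The effect of splitting the wait state can be checked mechanically. The sketch below is a hedged illustration: both tables are assumptions, giving for each kind of local state the kinds of state another site might occupy at the same time, and should be checked against the textbook's diagrams. The name p for the new state introduced by the split and the function violating_states are hypothetical.

```python
# Assumed concurrency sets by kind of state (site numbers dropped).
# q = initial, w = wait, p = the extra state created by splitting w,
# a = abort, c = commit.
TWO_PC = {
    "q": {"q", "w", "a"},
    "w": {"q", "w", "a", "c"},
    "a": {"q", "w", "a"},
    "c": {"w", "c"},
}

THREE_PC = {
    "q": {"q", "w", "a"},
    "w": {"q", "w", "p", "a"},
    "p": {"w", "p", "c"},
    "a": {"q", "w", "a"},
    "c": {"p", "c"},
}


def violating_states(table):
    """Return the states whose concurrency set has both abort and commit."""
    return sorted(
        state for state, others in table.items()
        if "a" in others and "c" in others
    )


print(violating_states(TWO_PC))    # prints ['w']
print(violating_states(THREE_PC))  # prints []
```

With these assumed sets, only the wait state of 2-phase commit mixes abort and commit; after the split, w can see abort but not commit, and p can see commit but not abort.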
As with the 2-Phase Commit, the state diagrams in the textbook for the 3-Phase Commit Protocol are simplified. There is no Coordinator state to receive the ACK messages generated when the Cohort makes the transition from wi to ai.
The following version includes the full state set.
Rules 1 and 2 are sufficient for designing commit protocols resilient to a single site failure during a transaction.