Lecture 16: 3-Phase Commit

These topics are from Chapter 13 (Fault Tolerance) in Advanced Concepts in OS.

Topics for Today

review of 2-Phase Commit
3-Phase Commit

2-Phase Commit Blocking Problem

If a site fails, other sites may block until it recovers and completes its role in the protocol.

For example, if the coordinator fails in state w₁, after sending COMMIT_REQUEST, the sites will be stuck waiting for the coordinator to follow up with an abort or commit message until the coordinator recovers.

What happens in other failure cases? For example, suppose a cohort fails in state w_i?

Compare the impact in terms of locking effects if one cohort fails versus if the coordinator fails.

Nonblocking Commit Protocols Needed

objective: site failures should not cause blocking of other sites
operational sites agree immediately on one action
failed sites later perform the same action
independent recovery: a site can decide how to recover based solely on its local state at the time the failure is detected

How to get there?

Analysis: What Causes Blocking in the 2-Phase Commit?

assume point-to-point communication is reliable
assume failure of a communication partner's site can be detected, via the network

How do we achieve reliable point-to-point communication?

How do we detect failure of a site?

Terminology

Concurrency sets are an abstraction of what one site knows about the possible states of other sites.

Concurrency Set
C(s_i) = { s_j | s_j may be concurrent with s_i }
Sender Set
S(s) = { i | in state s we may receive a message from site i }
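
To make the definition of a concurrency set concrete, here is a minimal Python sketch. It assumes a global state is written as a tuple of local state names, one per site, and it computes C(s) by collecting the local states of the other sites in every global state where s appears. The function name concurrency_set and the sample list are illustrative assumptions. The sample is a hand-picked subset of global states, not an exhaustive enumeration of what the protocol can reach.

```python
def concurrency_set(global_states, site, local_state):
    """Collect the local states of other sites that co-occur with local_state.

    global_states: iterable of tuples, one local state name per site.
    site: 0-based index of the site that is in local_state.
    """
    result = set()
    for g in global_states:
        if g[site] == local_state:
            result.update(s for j, s in enumerate(g) if j != site)
    return result


# Hand-picked sample of global states (site 1, site 2, site 3).
# This is an assumption for illustration, not a complete reachability analysis.
sample = [
    ("q1", "q2", "q3"),
    ("w1", "q2", "q3"),
    ("w1", "w2", "q3"),
    ("w1", "w2", "w3"),
    ("w1", "w2", "a3"),
    ("a1", "w2", "a3"),
    ("c1", "w2", "w3"),
]

# Site 2 has index 1 in each tuple.
print(sorted(concurrency_set(sample, 1, "w2")))
```

A complete analysis would generate the list of global states from the protocol's state diagrams instead of writing it by hand.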

Concurrency Set Examples: 3 sites

Suppose site 1 initiates the commit protocol, and sites 2 and 3 respond.

C(q₁) = { q₂, q₃ }

C(q₂) = { q₁, w₁, q₃, a₃, w₃ }

Note that we cannot have a₁ in C(q₂), since site 1 must wait for responses from all of the other sites before it makes the transition from state w₁.

C(w₁) = { q₂, a₂, w₂, q₃, a₃, w₃ }

C(w₂) = { w₁, a₁, c₁, q₃, a₃, w₃ }

Note that C(w₂) contains both abort states (a₁, a₃) and a commit state (c₁). This means that it is unsafe at this point for site 2 to take any independent recovery action, because another site might have taken a different action. For this reason, site 2 must block until it receives a message from the coordinator.

C(c₁) = { w₂, c₂, w₃, c₃ }

C(f₁) = { b₂, c₂, b₃, c₃ }

Match these up with the state diagrams above, and see why the sets contain the elements they do.

Conditions that Cause Blocking

If C(s_i) contains both commit and abort states, then site i cannot decide to abort the transaction, since some other site may be in a commit state.

It cannot commit, either, since some other site may be in the abort state.

Therefore, site i must block.

If a protocol contains a local state of a site with both abort and commit states in its concurrency set, then under independent recovery conditions it is not resilient to an arbitrary single failure.
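
The blocking condition can be checked mechanically. The sketch below assumes the naming convention of these notes, where a state whose name starts with "c" is a commit state and one that starts with "a" is an abort state. The helper names kind and must_block are hypothetical. The three sets are copied from the concurrency set examples for the 3-site case.

```python
def kind(state):
    """Classify a state by its name (assumed convention: c* commit, a* abort)."""
    if state.startswith("c"):
        return "commit"
    if state.startswith("a"):
        return "abort"
    return "other"


def must_block(cset):
    """True if the concurrency set holds both a commit and an abort state."""
    kinds = {kind(s) for s in cset}
    return "commit" in kinds and "abort" in kinds


# Concurrency sets as listed in the 3-site examples.
C_q2 = {"q1", "w1", "q3", "a3", "w3"}
C_w2 = {"w1", "a1", "c1", "q3", "a3", "w3"}
C_c1 = {"w2", "c2", "w3", "c3"}

for name, cset in [("q2", C_q2), ("w2", C_w2), ("c1", C_c1)]:
    print(name, "must block" if must_block(cset) else "can decide locally")
```

Only w₂ is reported as blocking, which matches the argument above: a site in q₂ can safely abort and a site that sees only commit states can safely commit.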

Simplified FSM Model of the 2-Phase Commit Protocol

The state diagrams in the text are a further abstraction from the ones shown above, with fewer states. The relationship is shown in the picture below.

The state f₁ and the transitions to it are eliminated, states a₁ and c₁ are made into final states, and the states a_i and b_i are merged.

Conceptually, the elimination of state f₁ amounts to modifying the protocol so that the coordinator does not block to wait for ACK messages.

One can then argue that merging states a_i and b_i is an allowable further simplification, since the only effect of the transition from a_i to b_i is to send the ACK that is now ignored.

The simplified diagram is no longer a complete description of a fault-tolerant protocol. Without the ACK messages from everyone, the Coordinator does not know that the Cohorts have caught up, and so cannot safely go on with its next computation.

However, the simplified diagram does make a clearer separation between abort states and commit states, which is the main focus of our interest.

Therefore, we will follow the textbook and use the simplified diagram in the analysis below of whether the protocol permits independent recovery from failures. Alternate diagrams are provided at some points, via links.

Simplified State Diagram used in Text

Concurrency Sets

3-Phase Commit Protocol

The 3-phase commit protocol splits state w_i, thereby eliminating the problem of having both abort and commit states in the concurrency set of state w_i.
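
One way to picture the split is as a cohort transition function. The sketch below is a simplified illustration and not the textbook's diagram: the buffer state name "p" and the message names AGREED, PREPARE, ACK and COMMIT are assumptions, and only COMMIT_REQUEST is named in these notes. It follows the simplified form, so no ACK is sent on an abort. The point it shows is that a cohort in w can move to a or to p but never directly to c.

```python
def cohort_step(state, message, agree=True):
    """Return (next_state, reply) for a cohort in an assumed, simplified 3PC.

    States: q (initial), w (wait), p (assumed buffer state), a (abort), c (commit).
    """
    if state == "q" and message == "COMMIT_REQUEST":
        # The cohort votes; a "no" vote aborts locally right away.
        return ("w", "AGREED") if agree else ("a", "ABORT")
    if state == "w" and message == "ABORT":
        return ("a", None)
    if state == "w" and message == "PREPARE":
        # Extra phase: acknowledge and wait in the buffer state.
        return ("p", "ACK")
    if state == "p" and message == "COMMIT":
        return ("c", None)
    raise ValueError(f"no transition from {state!r} on {message!r}")


# Walk one cohort through a successful run.
state = "q"
for msg in ["COMMIT_REQUEST", "PREPARE", "COMMIT"]:
    state, reply = cohort_step(state, msg)
    print(msg, "->", state, "reply:", reply)
```

Because commit is reachable only from p, and abort is not reachable from p in this sketch, no single state has both outcomes one step away.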

Concurrency Sets with 3-Phase Commit

As with the 2-Phase Commit, the state diagrams in the textbook for the 3-Phase Commit Protocol are simplified. There is no Coordinator state to receive the ACK messages generated when the Cohort makes the transition from w_i to a_i.

The following version includes the full state set.

Failure & Timeout Rules

Rule 1: for each nonfinal state s, if C(s) contains a commit state, then add a failure transition from s to a commit state; otherwise, add a failure transition from s to an abort state.
Rule 2: for each nonfinal state s, if site j is in S(s) and site j has a failure transition to a commit (abort) state, then add a timeout transition from s to a commit (abort) state.
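
The two rules can be read as small functions. This is a sketch under stated assumptions: the function names are hypothetical, commit states are recognized by a name starting with "c", and the second rule is given one possible reading in which every sender in S(s) must point to the same outcome. The set {"p1", "c1"} and the mapping {1: "commit"} are made-up inputs for illustration, while {"q2", "q3"} is C(q₁) from the 3-site examples.

```python
def failure_transition(cset):
    """First rule: go to commit if C(s) contains a commit state, else abort."""
    return "commit" if any(s.startswith("c") for s in cset) else "abort"


def timeout_transition(sender_set, failure_target):
    """Second rule (one reading): follow the failure target of the senders.

    sender_set: the sites in S(s).
    failure_target: maps a site to "commit" or "abort", the outcome of that
    site's failure transition.
    """
    targets = {failure_target[j] for j in sender_set}
    if len(targets) != 1:
        raise ValueError("no single timeout target for this sender set")
    return targets.pop()


print(failure_transition({"q2", "q3"}))          # C(q1) from the examples
print(failure_transition({"p1", "c1"}))          # hypothetical set
print(timeout_transition({1}, {1: "commit"}))    # hypothetical inputs
```

Raising an error on disagreement is a design choice of this sketch; the rules as stated do not say what to do when senders in S(s) have different failure transitions.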

3-Phase Commit Protocol with Failure Transitions

3-Phase Protocol Theorem

Rules 1 and 2 are sufficient for designing commit protocols resilient to a single site failure during a transaction.

Multiple Site Failure Theorems

There is no protocol using independent recovery that is resilient to arbitrary failures by two sites.
There is no protocol resilient to network partitioning when messages are lost.
There is no protocol resilient to multiple network partitionings.
