Lecture 17: Voting Protocol
These topics are from Chapter 13 (Fault Tolerance) in Advanced
Concepts in OS
Topics for Today
Voting Protocols
- replicated data, at multiple sites
- each site has some number of votes
- access to replicated data requires a majority of votes
- votes determine which version is the current one
Static Voting
- replicated data, at multiple sites
- each file access requires obtaining a lock
- reader-writer locks are supported
- every site has a lock manager
- every file has a version number = number of changes made
- each replica has some number of votes
- vote allocation is on stable storage
- reads and writes require a quorum
Voting Algorithm
For a read or write request initiated by site i:
- issue Lock_Request to local lock manager
- local lock manager eventually grants request and then
sends Vote_Request to all sites
at site j:
- on receipt of Vote_Request from i, issue Lock_Request to
local lock manager
- if the lock request is granted by the local lock manager,
send the version number VNj and the number of votes Vj
of its replica to site i
at site i:
- after votes are in, perform quorum test
Read Quorum Test
Vr = Σ Vj (j ∈ P) ≥ r
Where P is the set of sites that replied and r is the number of
votes required to read.
Write Quorum Test
Vw = Σ Vj (j ∈ Q) ≥ w
Where M = max{VNj | j ∈ P} is the largest version number
reported in the vote,
Q = {j ∈ P | VNj = M} includes only the votes that
correspond to that version number, and w is the number of votes
required to write.
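The two tests can be sketched as follows. This is only an illustration: it assumes the read test counts the votes of every site in P against a threshold r, and the write test counts only the votes of the sites in Q against a threshold w. The names read_quorum_ok, current_sites, and write_quorum_ok, and the sample replies, are hypothetical.

```python
from typing import Dict, Set, Tuple

# replies maps site id j -> (version number VNj, votes Vj) for every j in P.
Replies = Dict[int, Tuple[int, int]]

def read_quorum_ok(replies: Replies, r: int) -> bool:
    """Read test: the votes of all sites in P must reach r."""
    return sum(votes for _, votes in replies.values()) >= r

def current_sites(replies: Replies) -> Set[int]:
    """Q: the sites in P whose replica has the largest version number M."""
    if not replies:
        return set()
    m = max(vn for vn, _ in replies.values())
    return {j for j, (vn, _) in replies.items() if vn == m}

def write_quorum_ok(replies: Replies, w: int) -> bool:
    """Write test: only the votes of sites in Q count toward w."""
    return sum(replies[j][1] for j in current_sites(replies)) >= w

# Hypothetical replies: sites 1 and 3 hold version 7, site 4 is stale.
replies = {1: (7, 1), 3: (7, 2), 4: (6, 1)}
can_read = read_quorum_ok(replies, 3)
can_write = write_quorum_ok(replies, 3)
```

Here site 4 contributes to the read test but not to the write test, because its replica is out of date.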
Voting Algorithm (continued)
at site i:
- If quorum test fails, issue Release_Lock to local manager
and all sites in P that returned positive votes
- If quorum test succeeds, check whether local copy is current.
If not, obtain a fresh copy from another site.
- For a read, just use the local copy.
- For a write, update the local copy.
Then update VNi and
send the updates and VNi to all the sites in Q.
- Issue a Release_Lock to local manager and all the sites in P
at other sites:
- on receiving update messages, update own local copies
- on Release_Lock, release all local locks
Vote Assignment
If v is the total number of votes,
we want to choose r and w such that
r + w > v and
w > v/2
Why?
Consequences
- no obsolete copies are updated due to a write operation
- the current local replicas have at least w votes
- every read quorum and write quorum overlap by at least one site
- there cannot be simultaneous writes on distinct sets of replicas
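The overlap consequences can be checked by brute force on a small example. The sketch below enumerates every set of sites that reaches a given vote threshold and tests whether all read/write and write/write pairs share a site. The vote assignment and the helper names are made up for illustration.

```python
from itertools import combinations

def quorums(votes, threshold):
    """All sets of sites whose votes total at least `threshold`."""
    sites = list(votes)
    return [set(c)
            for n in range(1, len(sites) + 1)
            for c in combinations(sites, n)
            if sum(votes[s] for s in c) >= threshold]

def valid_assignment(votes, r, w):
    """The two conditions: r + w > v and w > v/2."""
    v = sum(votes.values())
    return r + w > v and 2 * w > v

def always_overlap(groups_a, groups_b):
    """True if every set in groups_a shares a site with every set in groups_b."""
    return all(a & b for a in groups_a for b in groups_b)

votes = {1: 1, 2: 1, 3: 2, 4: 1}   # hypothetical assignment, v = 5

# r = 3, w = 3 satisfies both conditions, so quorums must intersect.
good_rw = always_overlap(quorums(votes, 3), quorums(votes, 3))

# r = 2, w = 3 violates r + w > v: {1, 2} can read while {3, 4} writes.
bad_rw = always_overlap(quorums(votes, 2), quorums(votes, 3))
```

With r = 2 and w = 3 a reader can assemble a quorum that misses the latest write, which is the situation r + w > v rules out.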
How many temporary site failures can we tolerate?
What happens when a site comes back on line after failing?
What happens if the network is partitioned?
Tuning Example
Site | Votes | Read Access Time |
1 | 1 | 75ms |
2 | 1 | 750ms |
3 | 2 | 750ms |
4 | 1 | 100ms |
If r=1 and w=5, the read access time is 75ms and the
write access time is 750 ms. Any single site failure will prevent
writes.
If r=3 and w=3, the access times are unchanged, but
writes are still possible with a single site failure.
If site 4 is more reliable, we can further improve
reliability by readjusting the votes as follows.
Site | Votes | Read Access Time |
1 | 1 | 75ms |
2 | 1 | 750ms |
3 | 1 | 750ms |
4 | 2 | 100ms |
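The failure claims in this example can be checked mechanically. The sketch below only asks whether the surviving sites still hold enough votes for a quorum after any one site fails. It does not model access times, and the function name is hypothetical.

```python
def survives_single_failure(votes, quorum):
    """True if, whichever single site fails, the rest still hold `quorum` votes."""
    total = sum(votes.values())
    return all(total - lost >= quorum for lost in votes.values())

original = {1: 1, 2: 1, 3: 2, 4: 1}   # first vote table
adjusted = {1: 1, 2: 1, 3: 1, 4: 2}   # votes after the readjustment

w5_original = survives_single_failure(original, 5)   # w = 5
w3_original = survives_single_failure(original, 3)   # w = 3
w3_adjusted = survives_single_failure(adjusted, 3)   # w = 3, new votes
```

Both tables tolerate any single failure when w = 3. The readjustment changes which site's failure costs two votes, moving that role to the site assumed to be more reliable.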
Dynamic Voting Protocols
- Change the set of sites that can form a majority
- Change the distribution of votes
Dynamic Vote Reassignment
- number of votes per site changes
- two kinds:
- group consensus on new assignment
- autonomous reassignment, ratified by majority of sites
What are the strengths and weaknesses of each?
Autonomous Vote Reassignment
- each site i has vector Vi representing its belief of the
global vote assignment
- Vi[j] is how many votes i thinks j is entitled to have
- each site i has version-number vector Ni
- Ni[j] is the version number of Vi[j]
- each site i has vector vi representing the votes it has seen
- vi[j] is how many votes i sees j is trying to cast
Vote Increasing Protocol
When site i wants to increase Vi[i]:
- send Vi and Ni along with new vote value x to all
communicating sites
- wait for a majority of sites to respond
- if a majority is collected, update Vi[i] to the
new value and increment Ni[i].
When site j receives a vote-increasing request from
site i with Vi, Ni, and x:
- Vj[i] = x
- Nj[i] = Ni[i] + 1
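A minimal sketch of both sides of this exchange, with direct method calls standing in for messages. The Site class, the increase_votes function, and the has_majority callback are hypothetical. The callback is a placeholder for whatever majority test the initiator applies to the responses.

```python
class Site:
    def __init__(self, sid, votes):
        self.sid = sid
        self.V = dict(votes)             # V[j]: votes this site believes j has
        self.N = {j: 0 for j in votes}   # N[j]: version number of V[j]

    def on_increase_request(self, i, V_i, N_i, x):
        """Receiver j: record i's new vote value and its next version number."""
        self.V[i] = x
        self.N[i] = N_i[i] + 1
        return self.sid                  # acknowledgement back to site i

def increase_votes(site, x, reachable, has_majority):
    """Initiator i: ask the reachable sites, then commit only with a majority."""
    acks = [peer.on_increase_request(site.sid, site.V, site.N, x)
            for peer in reachable]
    if not has_majority(acks):
        return False
    site.V[site.sid] = x
    site.N[site.sid] += 1
    return True

# Three sites, one vote each; site 1 raises its own vote to 2.
initial = {1: 1, 2: 1, 3: 1}
s1, s2, s3 = Site(1, initial), Site(2, initial), Site(3, initial)
# Assumption for the demo only: a majority means more than half of all sites,
# counting the initiator itself.
ok = increase_votes(s1, 2, [s2, s3], lambda acks: 2 * (len(acks) + 1) > 3)
```

After the call, initiator and receivers agree on both the new vote value and its version number, because each receiver stores Ni[i] + 1 and the initiator increments Ni[i] once on success.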
Vote Decreasing Protocol
When site i wants to decrease Vi[i]:
- set Vi[i] to the new value
- increment Ni[i]
- send Vi and Ni to the other sites
When site j receives a vote-decreasing request from
site i with Vi and Ni:
- Vj[i] = Vi[i]
- Nj[i] = Ni[i]
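The decrease needs no approval, so it can be sketched as two small functions: one that applies the change locally and builds the message, and one that applies the message at a receiver. The function names and the message layout are assumptions made for illustration.

```python
def decrease_votes(V_i, N_i, i, x):
    """Initiator i: lower its own vote first, then return the message to send."""
    if x >= V_i[i]:
        raise ValueError("new value must be smaller than the current vote")
    V_i[i] = x
    N_i[i] += 1
    return i, dict(V_i), dict(N_i)   # copies, as if sent over the network

def on_decrease(V_j, N_j, msg):
    """Receiver j: adopt i's new vote value and version number as sent."""
    i, V_i, N_i = msg
    V_j[i] = V_i[i]
    N_j[i] = N_i[i]

# Site 1 drops from 3 votes to 1; site 2 learns of it.
V1, N1 = {1: 3, 2: 1}, {1: 4, 2: 0}
V2, N2 = {1: 3, 2: 1}, {1: 4, 2: 0}
on_decrease(V2, N2, decrease_votes(V1, N1, 1, 1))
```

Unlike the increase, the receiver copies Ni[i] unchanged, because the initiator has already incremented it before sending.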
Vote Collecting Protocol
- for each reply Vj and Nj received by site i:
- vi[j] = Vj[j]
- if Vj[j] > Vi[j] or (Vj[j] < Vi[j] and Nj[j] > Ni[j]) then
Vi[j] = Vj[j]; Ni[j] = Nj[j]
end if;
- if site j did not respond to site i:
- find k ∈ G such that Nk[j] = max{Np[j] | p ∈ G},
where G is the set of all sites that replied to i; that is,
find the site that has the latest information on the votes
assigned to site j
- vi[j] = Vk[j]; Vi[j] = Vk[j]; Ni[j] = Nk[j]
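The collecting rules can be sketched as a single function over dictionaries. This is an illustration under two assumptions: the replies include site i's own vectors (site i trivially answers itself), and ties in the version number are broken arbitrarily. The name collect_votes and the sample data are hypothetical.

```python
def collect_votes(V, N, replies, all_sites):
    """
    V, N     : site i's vectors Vi and Ni, updated in place
    replies  : {j: (Vj, Nj)} for every site j in G that replied
    Returns vi, the votes site i sees each site casting.
    """
    v = {}
    for j, (Vj, Nj) in replies.items():
        v[j] = Vj[j]
        if Vj[j] > V[j] or (Vj[j] < V[j] and Nj[j] > N[j]):
            V[j], N[j] = Vj[j], Nj[j]
    if not replies:
        return v
    for j in all_sites:
        if j in replies:
            continue
        # Site j is silent: trust the replier with the newest version for j.
        k = max(replies, key=lambda p: replies[p][1][j])
        Vk, Nk = replies[k]
        v[j] = V[j] = Vk[j]
        N[j] = Nk[j]
    return v

# Site 1 collects; site 2 replies with newer information; site 3 is silent.
V1, N1 = {1: 1, 2: 1, 3: 1}, {1: 0, 2: 0, 3: 0}
replies = {
    1: (dict(V1), dict(N1)),
    2: ({1: 1, 2: 2, 3: 3}, {1: 0, 2: 1, 3: 2}),
}
v1 = collect_votes(V1, N1, replies, [1, 2, 3])
```

Site 1 ends up adopting site 2's view of both site 2 and the silent site 3, since site 2 holds the higher version number for each.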
Deciding the Outcome
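Let K be the set of all sites, and G be the set
of sites that responded to the ballot.
TOT = Σ Vi[k] (k ∈ K), the total votes as site i believes them to be
RCVD = Σ vi[k] (k ∈ G), the votes site i has collected
Site i has a majority iff RCVD > TOT/2.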
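As a sketch, the decision is a comparison of two sums. It assumes TOT is the sum of Vi[k] over all sites in K and RCVD is the sum of vi[k] over the responders in G. The function name and sample vectors are hypothetical.

```python
def has_majority(V_i, v_i, G):
    """Site i's decision: RCVD > TOT/2, using integer arithmetic."""
    tot = sum(V_i.values())            # TOT: all votes site i believes exist
    rcvd = sum(v_i[k] for k in G)      # RCVD: votes cast by the responders
    return 2 * rcvd > tot

V_i = {1: 1, 2: 2, 3: 3}   # believed assignment, TOT = 6
v_i = {1: 1, 2: 2, 3: 3}   # votes seen

exactly_half = has_majority(V_i, v_i, {1, 2})   # RCVD = 3
more_than_half = has_majority(V_i, v_i, {2, 3}) # RCVD = 5
```

Exactly half of the votes is not a majority, so two disjoint groups can never both succeed.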
Vote Increasing Policies
All of the above leaves open the question of when a site should
try to increase or decrease its vote.
This is normally done in response to detection of an apparent
failure.
- overthrow technique -- one site increases its vote
- alliance technique -- all active sites increase their votes