Lecture #14: Recovery
Recovery in Distributed/Concurrent Systems
- Lost messages, orphan messages, and livelock
- Strongly Consistent Set of Checkpoints
- Consistent Set of Checkpoints
A Simple Method for taking a Consistent Set of Checkpoints
- Assumption: checkpoint, send/recv are atomic.
- Take a checkpoint after sending every message.
- The set of the most recent checkpoints is always consistent. Why?
Is it strongly consistent?
- What is the main problem with this approach? What if a checkpoint is taken
only after every K messages sent? Is the set still consistent?
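The two definitions can be tried out on small examples. The following is a minimal sketch, not part of the lecture material: the Checkpoint class, the helper names, and the message ids are assumptions made for illustration. A set of checkpoints is treated as consistent when no message is recorded as received without being recorded as sent (no orphan messages), and as strongly consistent when, in addition, no message is recorded as sent without being recorded as received (no lost messages).

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    # Hypothetical record of what one process's checkpoint remembers.
    sent: set = field(default_factory=set)      # ids of messages recorded as sent
    received: set = field(default_factory=set)  # ids of messages recorded as received


def _all_sent(checkpoints):
    return set().union(*(c.sent for c in checkpoints.values()))


def _all_received(checkpoints):
    return set().union(*(c.received for c in checkpoints.values()))


def orphan_messages(checkpoints):
    """Messages recorded as received but not recorded as sent."""
    return _all_received(checkpoints) - _all_sent(checkpoints)


def lost_messages(checkpoints):
    """Messages recorded as sent but not recorded as received."""
    return _all_sent(checkpoints) - _all_received(checkpoints)


def is_consistent(checkpoints):
    return not orphan_messages(checkpoints)


def is_strongly_consistent(checkpoints):
    return not orphan_messages(checkpoints) and not lost_messages(checkpoints)


# P checkpoints right after sending m1; Q's latest checkpoint predates receipt.
after_every_send = {"P": Checkpoint(sent={"m1"}), "Q": Checkpoint()}

# With K = 2, P has sent m1 but not yet checkpointed; Q received m1,
# sent m2 and m3, and then checkpointed.
after_every_k = {
    "P": Checkpoint(),
    "Q": Checkpoint(sent={"m2", "m3"}, received={"m1"}),
}

print(is_consistent(after_every_send), is_strongly_consistent(after_every_send))
print(is_consistent(after_every_k), orphan_messages(after_every_k))
```

The first example suggests why checkpointing after every send never records a receive without its send, and the second shows one way that property can break once checkpoints are taken less often.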
Synchronous Checkpointing Algorithm (Koo and Toueg)
- make some simplifying assumptions
- processes communicate by exchanging messages through channels
- channels are FIFO; end-to-end protocols cope with message loss due to
rollback recovery
- comm. failures do not partition the network
- uses two kinds of checkpoints: tentative and permanent
Synchronous Checkpointing: Phase 1
- initiator: take tentative checkpoint
ask other processes to take tentative checkpoint
- other processes: can respond `yes' or `no'
- initiator: decide to make checkpoints permanent
if everyone has responded `yes'
Synchronous Checkpointing: Phase 2
- initiator: inform all processes of Phase 1 decision
(commit or abort checkpoint)
- others: act accordingly
Between taking a tentative checkpoint and the commit/abort of that
checkpoint, a process must hold back messages.
Does this guarantee we have a strongly
consistent state? Can you construct an example that
shows we can still have lost messages?
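The two phases can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions: the Process class, its fields (willing, holding), and checkpoint_round are hypothetical names, message passing is replaced by direct method calls, and failures during the protocol are not modelled.

```python
class Process:
    """Hypothetical process with one permanent and one tentative checkpoint."""

    def __init__(self, name, willing=True):
        self.name = name
        self.willing = willing   # whether this process answers 'yes' in Phase 1
        self.state = 0           # stand-in for the application state
        self.permanent = None    # last permanent checkpoint
        self.tentative = None    # tentative checkpoint, if any
        self.holding = False     # True while outgoing messages are held back

    def take_tentative(self):
        """Phase 1: take a tentative checkpoint and vote."""
        if not self.willing:
            return False
        self.tentative = self.state
        self.holding = True      # hold back messages until the decision arrives
        return True

    def commit(self):
        """Phase 2, commit: the tentative checkpoint becomes permanent."""
        if self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.holding = False

    def abort(self):
        """Phase 2, abort: discard the tentative checkpoint."""
        self.tentative = None
        self.holding = False


def checkpoint_round(initiator, others):
    """Run both phases; return True if the checkpoints were made permanent."""
    # Phase 1: initiator takes a tentative checkpoint and asks the others.
    votes = [initiator.take_tentative()]
    votes += [p.take_tentative() for p in others]
    decision = all(votes)
    # Phase 2: initiator informs everyone; each process acts accordingly.
    for p in [initiator, *others]:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


a, b, c = Process("A"), Process("B"), Process("C")
a.state, b.state, c.state = 10, 20, 30
print(checkpoint_round(a, [b, c]), a.permanent, b.permanent, c.permanent)

c.willing = False
a.state = 11
print(checkpoint_round(a, [b, c]), a.permanent)
```

In the second round one process answers `no', so no permanent checkpoint changes, which is the all-or-none behaviour the protocol aims for.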
Synchronous Checkpointing: Properties
- all or none of the processes take permanent checkpoints
- there is no record of a message being received but not sent
Checkpoints may be taken unnecessarily. Give an example.
Can these unnecessary checkpoints be avoided? A scheme is described
in the book. Main idea:
- Record all messages sent and received after the last checkpoint.
(last_recv(x, y), first_sent(x, y))
- When X requests Y to take a tentative checkpoint, X sends last_recv(x, y), the
last message received from Y, with the request. Y takes a tentative checkpoint
only if the last message received by X from Y was sent no earlier than the first
message Y sent to X after its last checkpoint (last_recv(x, y) >= first_sent(y, x)).
- When a process takes a checkpoint, it will ask all other
processes that sent messages to the process to take checkpoints.
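The test that Y applies to an incoming request can be written as a one-line comparison. The sketch below assumes that messages carry monotonically increasing labels so that they can be compared with >=, and uses None to mean "no such message"; both conventions, and the function name, are assumptions made for illustration.

```python
def needs_tentative_checkpoint(last_recv_xy, first_sent_yx):
    """Decide whether Y must take a tentative checkpoint on X's request.

    last_recv_xy:  label of the last message X received from Y, or None.
    first_sent_yx: label of the first message Y sent to X after Y's last
                   checkpoint, or None if Y has sent nothing to X since then.
    """
    if last_recv_xy is None or first_sent_yx is None:
        # X has recorded nothing from Y, or Y has sent nothing new to X,
        # so X's checkpoint cannot depend on state Y has not yet saved.
        return False
    return last_recv_xy >= first_sent_yx


print(needs_tentative_checkpoint(5, 3))     # X saw a message sent after Y's checkpoint
print(needs_tentative_checkpoint(2, 3))     # everything X saw predates Y's checkpoint
print(needs_tentative_checkpoint(5, None))  # Y sent nothing to X since its checkpoint
```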
Rollback Recovery: Phase 1
- initiator: check whether all processes are willing to
restart from last checkpoints
- others: may reply `yes' or `no'
Rollback Recovery: Phase 2
- initiator: propagate go/nogo decision to all processes
- others: carry out the decision of the initiator
Between request to rollback and decision, no one sends
other messages
Rollback Recovery: Properties
- all or none of the processes restart from checkpoints
- after rollback, all processes resume in a consistent state
Unnecessary rollbacks can occur: a technique similar to the one used
when taking checkpoints can eliminate them. Discuss.
Disadvantages of Synchronous Approach
- checkpoint algorithm generates message traffic
- synchronization delays are introduced
These costs may seem high if failures between checkpoints
are unlikely.
Asynchronous Approach
- Take multiple local checkpoints independently
- After a failure, try to find a consistent set of checkpoints
among those that have been taken recently
- All incoming messages between local checkpoints are logged
- pessimistic approach: log each message before processing
- optimistic approach: buffer messages & log in batches
Why is the second approach called optimistic?
What are the advantages and disadvantages of each approach?
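The difference between the two logging policies can be seen in a few lines. This is a minimal sketch under stated assumptions: a Python list stands in for stable storage, the class names and the batch_size parameter are hypothetical, and crash() simply discards whatever is still in volatile memory.

```python
class PessimisticLogger:
    """Log each message to stable storage before processing it."""

    def __init__(self):
        self.stable = []  # stand-in for stable storage

    def on_message(self, msg, process):
        self.stable.append(msg)  # one (slow) stable write per message
        process(msg)


class OptimisticLogger:
    """Buffer messages in volatile memory and log them in batches."""

    def __init__(self, batch_size=3):
        self.stable = []   # stand-in for stable storage
        self.buffer = []   # volatile buffer, lost on a crash
        self.batch_size = batch_size

    def on_message(self, msg, process):
        self.buffer.append(msg)
        process(msg)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        self.stable.extend(self.buffer)  # one stable write per batch
        self.buffer.clear()

    def crash(self):
        """Simulate a failure: return the messages that were never logged."""
        lost = list(self.buffer)
        self.buffer.clear()
        return lost


handled = []
opt = OptimisticLogger(batch_size=3)
for m in ["m1", "m2", "m3", "m4"]:
    opt.on_message(m, handled.append)
print(opt.stable, opt.crash())
```

In this run the fourth message was processed but not yet logged when the crash occurred, so it cannot be replayed from the log; the pessimistic logger avoids that at the cost of a stable write for every message.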
Juang & Venkatesan Asynchronous Checkpointing Algorithm
make some simplifying assumptions
- communication channels are reliable
- communication channels are FIFO
- communication channels have no buffer size limits
- message transmission delay is bounded
- underlying system is event-driven, with locally timestamped events
(monotonically increasing numbers). Each event consists of the following:
waiting for a message, processing the message, changing the process state,
and sending a number of messages.
basic idea:
- At each event, a triplet {s, m, msgs_sent} is put in the log. s is the
state, m is the message causing the event, msgs_sent is the set of messages sent.
- Two data structures are used: RCVD(i, j, checkpoint) -- the number of
messages received by processor i from processor j at checkpoint, SENT(i, j,
checkpoint) -- the number of messages sent from i to j at checkpoint.
- Use the message send/recv counts to determine the point to rollback.
Algorithm: all nodes run the same recovery algorithm (how to make
this happen?)
At processor i:
- If i is a processor that is recovering from a failure, checkpoint = the latest event logged in the stable storage.
- else checkpoint = latest event that took place.
- for k = 1 to N do
  - send ROLLBACK(i, SENT(i, j, checkpoint)) to all neighbors j
  - wait for ROLLBACK messages from all neighbors
  - for every ROLLBACK(j, c) received
    - if (RCVD(i, j, checkpoint) > c) then
      - find the latest event e such that RCVD(i, j, e) = c
      - checkpoint = e
In each iteration, at least one processor will rollback to its final
recovery point unless current recovery point is consistent
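The iteration above can be simulated on a small example. The sketch below is an illustration, not the published algorithm: it runs all processors in lockstep rounds inside one function instead of exchanging real ROLLBACK messages, and the names event, recover, logs, and neighbors are assumptions. Each log entry holds the cumulative SENT and RCVD counts after that event, with entry 0 being the initial state.

```python
def event(sent=None, rcvd=None):
    """Hypothetical log entry: cumulative message counts after an event."""
    return {"sent": sent or {}, "rcvd": rcvd or {}}


def recover(logs, checkpoint, neighbors):
    """Return the recovery point (log index) chosen by each processor.

    logs[i][e]    : counts at processor i after event e
    checkpoint[i] : starting point (latest stable event for a failed
                    processor, latest event otherwise)
    neighbors[i]  : processors that i exchanges messages with
    """
    checkpoint = dict(checkpoint)
    for _ in range(len(logs)):               # for k = 1 to N
        # Every processor sends ROLLBACK(i, SENT(i, j, checkpoint)).
        inbox = {i: [] for i in logs}
        for i in logs:
            for j in neighbors[i]:
                c = logs[i][checkpoint[i]]["sent"].get(j, 0)
                inbox[j].append((i, c))
        # Every processor handles the ROLLBACK messages it received.
        for i in logs:
            for j, c in inbox[i]:
                if logs[i][checkpoint[i]]["rcvd"].get(j, 0) > c:
                    # Latest event e with RCVD(i, j, e) = c.
                    checkpoint[i] = max(
                        e for e in range(checkpoint[i] + 1)
                        if logs[i][e]["rcvd"].get(j, 0) == c
                    )
    return checkpoint


# P0 sends to P1; P1 receives it and sends to P0 and P2; both receive.
logs = {
    0: [event(), event(sent={1: 1}), event(sent={1: 1}, rcvd={1: 1})],
    1: [event(), event(sent={0: 1, 2: 1}, rcvd={0: 1})],
    2: [event(), event(rcvd={1: 1})],
}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

# P0 fails with only its initial state on stable storage.
print(recover(logs, {0: 0, 1: 1, 2: 1}, neighbors))
```

In this example P1 rolls back in the first round because it recorded a message P0 no longer remembers sending, and P2 follows in the second round, which is why more than one iteration is needed.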
UNIX file system and file system error recovery (fsck)
Unix Filesystem Structure
Information from M. J. Bach's Design of the Unix Operating System.
For fault tolerance, redundant copies of the superblock are
stored on different cylinders and different platters of the disk
drive. This reduces the chance that a disk media failure will
result in corruption of the entire file system.
Contents of Superblock
- the size of the filesystem
- the number of free blocks in the file system
- a list of free blocks available on the file system
- the index of the next free block in the free block list
- the size of the inode list
- the number of free inodes in the file system
- a list of free inodes in the file system
- the index of the next free inode in the free inode list
- lock fields for the free block and free inode lists
- a flag indicating that the super block has been modified
Contents of an Inode
- file owner IDs (individual and group)
- file type (regular, directory, character special, block special, FIFO)
- file access permissions
- file access times (file modified, file accessed, inode modified)
- incoming link count
- table of contents (disk addresses of direct and indirect blocks -- see below)
- file size
Relationship of inodes, Direct Blocks, and Indirect Blocks
Disk Error Recovery: fsck
The program fsck checks a filesystem for inconsistencies,
and then attempts to repair them.
- block belongs to more than one inode
- block belongs to an inode and the list of free blocks
- block is not on free list and not in a file
- non-zero link count but not in any directories
- free inode found in directory
- in general: more/less directory links than link count
- format of inode incorrect
- count of free blocks in super block does not match the number on disk
- count of free inodes in super block does not match the number on disk
How could each of these situations arise?
How might each of these situations be repaired?
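Several of the checks listed above amount to cross-referencing three tables. The following sketch shows the idea on an in-memory model; it is not how fsck is implemented. The function name, the dictionary layout, and the simplification that every block number from 0 to total_blocks - 1 is a data block are assumptions made for illustration.

```python
from collections import Counter


def check_filesystem(total_blocks, inodes, free_blocks, directory_links):
    """Return a list of (problem, number) pairs found in a toy file system.

    inodes          : {inode number: {"blocks": [...], "links": n}}
    free_blocks     : block numbers on the free block list
    directory_links : one inode number per directory entry
    """
    problems = []
    owners = Counter(b for ino in inodes.values() for b in ino["blocks"])
    free = set(free_blocks)

    for b in range(total_blocks):
        if owners[b] > 1:
            problems.append(("block belongs to more than one inode", b))
        if owners[b] >= 1 and b in free:
            problems.append(("block in an inode and on the free list", b))
        if owners[b] == 0 and b not in free:
            problems.append(("block not on free list and not in a file", b))

    refs = Counter(directory_links)
    for ino, meta in inodes.items():
        if refs[ino] != meta["links"]:
            problems.append(("directory links differ from link count", ino))
    for ino in refs:
        if ino not in inodes:
            problems.append(("free inode found in directory", ino))
    return problems


report = check_filesystem(
    total_blocks=5,
    inodes={1: {"blocks": [0, 1], "links": 1},
            2: {"blocks": [1, 2], "links": 2}},
    free_blocks=[2, 3],
    directory_links=[1, 2, 7],
)
for problem in report:
    print(problem)
```

Detecting the inconsistency is the easy half; choosing a repair (for example, which of two inodes keeps a shared block) depends on how the situation arose.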