Lecture #14: Recovery
Recovery in Distributed/Concurrent Systems

Lost messages, orphan messages, and livelock
Strongly Consistent Set of Checkpoints
Consistent Set of Checkpoints

A Simple Method for taking a Consistent Set of Checkpoints

Assumption: checkpoint, send/recv are atomic.
Take a checkpoint after sending every message.
The set of the most recent checkpoints is always consistent. Why? Is it strongly consistent?
What is the main problem with this approach? Take a checkpoint after every K messages sent? Is it still consistent?
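
The two definitions can be made concrete with a small checker. This is only a sketch under assumptions not in the notes: each checkpoint is modelled as the set of message ids it records as sent and as received, and the names `Message`, `classify`, `is_consistent`, and `is_strongly_consistent` are made up for illustration. An orphan message is recorded as received but not as sent; a lost message is recorded as sent but not as received.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    msg_id: str
    sender: str
    receiver: str


def classify(checkpoints, messages):
    """Split messages into orphans and lost messages for a set of checkpoints.

    checkpoints maps a process name to {"sent": set, "received": set},
    the message ids that process's checkpoint records (assumed model).
    """
    orphans, lost = [], []
    for m in messages:
        sent = m.msg_id in checkpoints[m.sender]["sent"]
        received = m.msg_id in checkpoints[m.receiver]["received"]
        if received and not sent:
            orphans.append(m.msg_id)   # receive recorded, send not recorded
        elif sent and not received:
            lost.append(m.msg_id)      # send recorded, receive not recorded
    return orphans, lost


def is_consistent(checkpoints, messages):
    """Consistent: no orphan messages."""
    orphans, _ = classify(checkpoints, messages)
    return not orphans


def is_strongly_consistent(checkpoints, messages):
    """Strongly consistent: no orphan messages and no lost messages."""
    orphans, lost = classify(checkpoints, messages)
    return not orphans and not lost


# P checkpoints right after sending m1; Q's latest checkpoint predates the receive.
msgs = [Message("m1", "P", "Q")]
after_send = {
    "P": {"sent": {"m1"}, "received": set()},
    "Q": {"sent": set(), "received": set()},
}
print(is_consistent(after_send, msgs), is_strongly_consistent(after_send, msgs))
```

Trying a few other checkpoint placements with this checker is one way to approach the questions above.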

Synchronous Checkpointing Algorithm (Koo and Toueg)

make some simplifying assumptions
- processes communicate by exchanging messages through channels
- channels are FIFO; end-to-end protocols cope with message loss due to rollback recovery
- communication failures do not partition the network
use 2 kinds of checkpoint
- tentative
- permanent

Synchronous Checkpointing: Phase 1

initiator: take tentative checkpoint
ask other processes to take tentative checkpoint
other processes: can respond `yes' or `no'
initiator: decide to make checkpoints permanent if everyone has responded `yes'

Synchronous Checkpointing: Phase 2

initiator: inform all processes of Phase 1 decision
(commit or abort checkpoint)
others: act accordingly

Between taking its tentative checkpoint and learning the commit/abort decision, each process must hold back messages.
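
The two phases can be sketched as a single-threaded simulation. This is a rough illustration, not the Koo and Toueg algorithm itself: direct method calls stand in for the request and reply messages, and the `Process` class, its `willing` flag, and `synchronous_checkpoint` are hypothetical names.

```python
class Process:
    def __init__(self, name, willing=True):
        self.name = name
        self.willing = willing   # assumed: whether this process answers "yes"
        self.state = 0           # stand-in for the process's local state
        self.tentative = None
        self.permanent = None
        self.holding = False     # True while messages must be held back

    def take_tentative(self):
        """Phase 1: take a tentative checkpoint and vote yes, or vote no."""
        if not self.willing:
            return False
        self.tentative = self.state
        self.holding = True
        return True

    def decide(self, commit):
        """Phase 2: make the tentative checkpoint permanent, or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.holding = False


def synchronous_checkpoint(initiator, others):
    # Phase 1: the initiator checkpoints and collects the votes.
    votes = [initiator.take_tentative()]
    votes += [p.take_tentative() for p in others]
    commit = all(votes)
    # Phase 2: everyone learns the decision and acts on it.
    for p in [initiator, *others]:
        p.decide(commit)
    return commit
```

A single "no" vote aborts the round, so every process keeps its previous permanent checkpoint; this is the all-or-none behaviour the algorithm aims for.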

Does this guarantee we have a strongly consistent state? Can you construct an example that shows we can still have lost messages?

Synchronous Checkpointing: Properties

all or none of the processes take permanent checkpoints
there is no record of a message being received but not sent

Checkpoints may be taken unnecessarily. Give an example.

Can these unnecessary checkpoints be avoided? A scheme is described in the book. Main idea:

Record all messages sent and received after the last checkpoint (last_recv(x, y): the last message X received from Y; first_sent(x, y): the first message X sent to Y).
When X requests Y to take a tentative checkpoint, X sends last_recv(x, y), the last message it received from Y, with the request. Y takes a tentative checkpoint only if that message was sent at or after the first message Y sent to X since Y's last checkpoint (last_recv(x, y) >= first_sent(y, x)).
When a process takes a checkpoint, it will ask all other processes that sent messages to the process to take checkpoints.
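
Y's test on receiving a request can be written in a few lines. The sketch assumes that each sender labels its messages with increasing integers starting at 1 and that a label of 0 (`BOTTOM`) means "no message"; both the labelling scheme and the function name `must_checkpoint` are assumptions made for illustration.

```python
BOTTOM = 0  # assumed label meaning "no message"; smaller than every real label


def must_checkpoint(last_recv_x_y, first_sent_y_x):
    """Decide whether Y needs a tentative checkpoint when X asks for one.

    last_recv_x_y:  label of the last message X received from Y
    first_sent_y_x: label of the first message Y sent to X since Y's last
                    checkpoint, or BOTTOM if Y has sent nothing since then
    """
    if first_sent_y_x == BOTTOM:
        return False  # nothing sent since Y's last checkpoint
    return last_recv_x_y >= first_sent_y_x


# Y sent messages 5, 6, 7 to X after its last checkpoint; X has received up to 6.
print(must_checkpoint(last_recv_x_y=6, first_sent_y_x=5))
```

If X's checkpoint records a message that Y's last checkpoint does not record as sent, Y must checkpoint as well; otherwise Y's existing checkpoint is already consistent with X's new one.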

Rollback Recovery: Phase 1

initiator: check whether all processes are willing to restart from last checkpoints
others: may reply `yes' or `no'

Rollback Recovery: Phase 2

initiator: propagate go/nogo decision to all processes
others: carry out the decision of the initiator

Between request to rollback and decision, no one sends other messages

Rollback Recovery: Properties

all or none of the processes restart from checkpoints
after rollback, all processes resume in a consistent state

Unnecessary rollbacks can occur; a technique similar to the one used when taking checkpoints can eliminate them. Discuss.

Disadvantages of Synchronous Approach

checkpoint algorithm generates message traffic
synchronization delays are introduced

These costs may seem high if failures between checkpoints are unlikely.

Asynchronous Approach

Take multiple local checkpoints independently
After a failure, try to find a consistent set of checkpoints among those that have been taken recently
All incoming messages between local checkpoints are logged
- pessimistic approach: log each message before processing
- optimistic approach: buffer messages & log in batches
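
The two logging styles differ only in when a message reaches stable storage. In the sketch below a Python list stands in for stable storage, and the class names, the `batch_size` parameter, and the `crash` method are hypothetical.

```python
class PessimisticLogger:
    """Write each message to stable storage before processing it."""

    def __init__(self):
        self.stable = []  # stands in for stable storage

    def deliver(self, msg, handler):
        self.stable.append(msg)  # log first ...
        handler(msg)             # ... then process


class OptimisticLogger:
    """Process messages immediately; write them to stable storage in batches."""

    def __init__(self, batch_size):
        self.stable = []   # stands in for stable storage
        self.buffer = []   # volatile buffer, lost on a crash
        self.batch_size = batch_size

    def deliver(self, msg, handler):
        self.buffer.append(msg)
        handler(msg)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        self.stable.extend(self.buffer)
        self.buffer.clear()

    def crash(self):
        """Simulate a failure: return the processed messages that were never logged."""
        unlogged = list(self.buffer)
        self.buffer.clear()
        return unlogged
```

With `batch_size=2`, delivering three messages and then crashing leaves the third message processed but unlogged, while the pessimistic logger pays one stable-storage write per message.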

Why is the second approach called optimistic?

What are the advantages and disadvantages of each approach?

Juang & Venkatesan Asynchronous Checkpointing Algorithm

make some simplifying assumptions

communication channels are reliable
communication channels are FIFO
communication channels have no buffer size limits
message transmission delay is bounded
the underlying system is event-driven, and events carry local timestamps (monotonically increasing numbers). Each event consists of waiting for a message, processing it, changing the process state, and sending a number of messages.

basic idea:

At each event, a triplet {s, m, msgs_sent} is put in the log: s is the state, m is the message that caused the event, and msgs_sent is the set of messages sent.
Two data structures are used: RCVD(i, j, checkpoint), the number of messages processor i has received from processor j at checkpoint, and SENT(i, j, checkpoint), the number of messages processor i has sent to processor j at checkpoint.
Use the message send/recv counts to determine the point to rollback.

Algorithm: all nodes run the same recovery algorithm (how to make this happen?). At processor i:

If i is a processor that is recovering from a failure, checkpoint = the latest event logged in stable storage.
else checkpoint = latest event that took place.
for k = 1 to N do
- send ROLLBACK(i, SENT(i, j, checkpoint)) to all neighbors j
- wait for ROLLBACK messages from all neighbors
- for every ROLLBACK(j, c) received
  - if (RCVD(i, j, checkpoint) > c) then
    - find the latest event e such that RCVD(i, j, e) = c
    - checkpoint = e

In each iteration, at least one processor will roll back to its final recovery point, unless the current recovery point is already consistent.
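
The loop above can be simulated in one process to see the counts at work. This is a sketch under assumptions that are not in the notes: each logged event stores cumulative SENT and RCVD counts, event 0 is the initial state with all counts zero, each event receives at most one message, and every round is executed in lock step. The function name `recover` and the log layout are hypothetical.

```python
def recover(logs, neighbors, start):
    """Return the recovery point (event index) chosen by each processor.

    logs[i][e] = {"sent": {j: count}, "rcvd": {j: count}}, cumulative counts
    neighbors[i] = processors that share a channel with i
    start[i] = initial candidate: the latest event in stable storage for a
               failed processor, the latest event overall for the others
    """
    cp = dict(start)
    for _ in range(len(logs)):  # N iterations
        # Every processor i sends ROLLBACK(i, SENT(i, j, checkpoint)) to neighbor j.
        outgoing = {
            (i, j): logs[i][cp[i]]["sent"].get(j, 0)
            for i in logs
            for j in neighbors[i]
        }
        # Processor i handles ROLLBACK(j, c) from each neighbor j.
        for (j, i), c in outgoing.items():
            if logs[i][cp[i]]["rcvd"].get(j, 0) > c:
                # i has received more than j remembers sending: roll back to
                # the latest event e with RCVD(i, j, e) = c.
                cp[i] = max(
                    e for e in range(cp[i] + 1)
                    if logs[i][e]["rcvd"].get(j, 0) == c
                )
    return cp


# X sent two messages to Y, then failed with only event 1 in stable storage.
logs = {
    "X": [
        {"sent": {}, "rcvd": {}},
        {"sent": {"Y": 1}, "rcvd": {}},
        {"sent": {"Y": 2}, "rcvd": {}},
    ],
    "Y": [
        {"sent": {}, "rcvd": {}},
        {"sent": {}, "rcvd": {"X": 1}},
        {"sent": {}, "rcvd": {"X": 2}},
    ],
}
neighbors = {"X": ["Y"], "Y": ["X"]}
print(recover(logs, neighbors, start={"X": 1, "Y": 2}))
```

In this example X restarts from event 1, where it has sent one message to Y, so Y must undo the receipt of the second message and rolls back to its own event 1.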

UNIX file system and file system error recovery(fsck)

Unix Filesystem Structure

Information from M. J. Bach's Design of the Unix Operating System.

For fault tolerance, redundant copies of the superblock are stored on different cylinders and different platters of the disk drive. This reduces the chance that a disk media failure will result in corruption of the entire file system.

Contents of Superblock

the size of the filesystem
the number of free blocks in the file system
a list of free blocks available on the file system
the index of the next free block in the free block list
the size of the inode list
the number of free inodes in the file system
a list of free inodes in the file system
the index of the next free inode in the free inode list
lock fields for the free block and free inode lists
a flag indicating that the super block has been modified

Contents of an Inode

file owner IDs (individual and group)
file type (regular, directory, character special, block special, FIFO)
file access permissions
file access times (file modified, file accessed, inode modified)
incoming link count
table of contents (disk addresses of direct and indirect blocks -- see below)
file size

Relationship of inodes, Direct Blocks, and Indirect Blocks
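
The way a byte offset maps onto the inode's table of contents can be sketched as follows. The layout numbers are assumptions, not taken from these notes: 10 direct entries followed by one single, one double, and one triple indirect entry, 1 KB blocks, and 256 block addresses per indirect block. The function name `locate` is hypothetical.

```python
DIRECT = 10        # assumed number of direct entries in the inode
PER_BLOCK = 256    # assumed addresses per indirect block (1 KB / 4-byte address)
BLOCK_SIZE = 1024  # assumed block size in bytes


def locate(offset):
    """Return (level, indices) describing where a file byte offset lives.

    For "direct", indices holds the inode entry number; for the indirect
    levels, indices lists the entry to follow in each successive indirect block.
    """
    block = offset // BLOCK_SIZE  # logical block number within the file
    if block < DIRECT:
        return ("direct", [block])
    block -= DIRECT
    if block < PER_BLOCK:
        return ("single", [block])
    block -= PER_BLOCK
    if block < PER_BLOCK ** 2:
        return ("double", [block // PER_BLOCK, block % PER_BLOCK])
    block -= PER_BLOCK ** 2
    if block < PER_BLOCK ** 3:
        return (
            "triple",
            [block // PER_BLOCK ** 2, (block // PER_BLOCK) % PER_BLOCK, block % PER_BLOCK],
        )
    raise ValueError("offset is beyond the maximum file size")


print(locate(350_000))
```

Small files are reached directly from the inode, and each additional level of indirection costs one extra disk read per access.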

Disk Error Recovery: fsck

The program fsck checks a filesystem for inconsistencies, and then attempts to repair them.

block belongs to more than one inode
block belongs to an inode and the list of free blocks
block is not on free list and not in a file
non-zero link count but not in any directories
free inode found in directory
in general: more/less directory links than link count
format of inode incorrect
count of free blocks in super block does not match the number on disk
count of free inodes in super block does not match the number on disk
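
Several of these checks amount to cross-referencing block ownership and link counts. The sketch below works on in-memory dictionaries rather than a real disk; it does not reflect how fsck is actually implemented, and the function names and data layout are hypothetical.

```python
from collections import Counter


def check_blocks(inodes, free_list, all_blocks):
    """Cross-check block ownership.

    inodes:     inode number -> list of block numbers it claims
    free_list:  block numbers on the free list
    all_blocks: every data block number in the file system
    """
    owners = Counter(b for blocks in inodes.values() for b in blocks)
    free = set(free_list)
    return {
        "in_more_than_one_inode": sorted(b for b, n in owners.items() if n > 1),
        "in_inode_and_free_list": sorted(b for b in owners if b in free),
        "not_free_and_not_in_file": sorted(
            b for b in all_blocks if b not in owners and b not in free
        ),
    }


def check_link_counts(link_counts, directory_entries):
    """Return inode -> (stored count, directory references) where they differ.

    link_counts:       inode number -> link count stored in the inode
    directory_entries: inode numbers referenced by directory entries
    """
    refs = Counter(directory_entries)
    return {
        ino: (stored, refs[ino])
        for ino, stored in link_counts.items()
        if stored != refs[ino]
    }


report = check_blocks(
    inodes={1: [10, 11], 2: [11, 12]},
    free_list=[12, 13],
    all_blocks=range(10, 16),
)
print(report)
```

Detecting a discrepancy is the easy part; choosing a repair (for example, which inode keeps a doubly claimed block) is the harder question posed below.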

How could each of these situations arise?

How might each of these situations be repaired?
