Goal: design modules that compute correct results in the presence of local
failures.
failure: deviation of the actual behavior of a system from the specified
behavior
error: the event that causes a failure
fault: the cause of an error
mean time to failure (MTTF), mean time to repair (MTTR)
, MTBF : MTTF + MTTR
availability: MMTF/MTBF
Design Strategy
fail-fast : Each module should either operate correctly or stop immediately
(also called fail-stop).
retry : a technique to make systems fault tolerant, works for transient
failures.
redundancy : technique to make systems fault tolerant by replicating
modules and xor-ing or voting on their outputs (duplex, triplex, pair and
spare, duplex plus repair, and triplex plus repair).
incremental redundancy - RAID.
Atomicity of programs
Goal: designing storage systems that retain data over power failures or
operator shutdown while running active applications.
volatile storage: storage medium in which data may disappear after
after a power failure of a computer, or the operator signals a shutdown.
nonvolatile storage : storage medium in which data is retained after
after a power failure of a computer, or the operator signals a shutdown.
hard disk : physical device that provides huge amount of non-volatile
storage for a low price. Unfortunately, accessing a byte on disk is slow;
replication can be used to increase the performance of a disk system
but multiple disks increase the probability of having a failed disk
so need redundancy to insure fault tolerance (e.g. RAID).
Atomic operation
Problem : make operations atomic both from the perspective of failure and
concurrency. Hard: have to think about both at the same time. We focus on failure.
atomic operation : an operation is atomic if there is no way to way that an
operation is composed from suboperations. In particular, from the
perspective of concurrency and failure recovery.
atomic disk : a disk that can tolerate partial writes (e.g.,
atomic disk algorithm : simple algorithm to make disk atomic)
Data consistency problem : problem due to the fact that multiple data
structures on disk have to be updated atomically.
Transactions
transactions : a stylized form of atomic operations: It consists of
precommit operations, a commit point, and postcommit operations, marked by
begin-transaction and end-transaction.
serializability : a correctness condition for concurrent transactions: if
several transactions run at once, the result must be identical to some
serial schedule of these transactions, i.e. insures atomicity in presence of
concurrency.
Three key transaction implementation ideas that achieve atomicity
in the presence of failures and concurrency : (1) use versions of
variables to control when changes are visible; (2)
logs contain authorative
information; all other information (e.g., cached database in main memory)
can be reconstructed from information stored in the
log.
(3) construct long high-level atomic operations from short
low-level atomic operations (e.g., atomic disk) .
log : a stable append-only datastructure containing transaction history
Write ahead undo/redo log : a set of rules for implementing transactions with
respect to failure recovery using a log
Reliable storage systems
file system : a storage system offering files (linear array of bytes),
directories, and file attributes.
e.g. Unix FFS, provides atomicity for directory ops but not file ops, recovers
with fsck; contrast to recovery in a database.
database : a storage system offering records, accessed through queries.
Support for high-degree of concurrency and fine-grained access,
high performance. Tolerates crash and media failures, provides long-term
archival storage. e.g. system R
information retrieval system : a system offering full-indexed access and
navigation, most read operations, weak update consistency e.g. web