Polytechnic University of Bucharest (ROMANIA)
INRIA Rhône-Alpes (FRANCE)
CADP (Construction and Analysis of Distributed Processes)
about 160,000 states and 17,000,000 transitions
The functioning of a distributed system can be affected by several
types of failures: software crashes, disk and hardware errors, and
power losses. The first consequence of such a failure is a data loss,
which can lead to an inconsistent state of the system. The state of a
distributed system depends upon the states of all processes in the
system and upon the communications between these processes. Therefore,
the failure of a process could propagate to other processes and affect
the global behaviour of the system. Given that safety-critical
distributed systems must be able to function even in the presence of
some failures, ensuring reliability of such systems is an important
There are two main approaches used for recovery of a consistent state of the system after a failure: forward error recovery, when an erroneous state is further used, in an attempt to generate the correct state, using error compensation techniques; and backward error recovery, when some processes are rolled back to a previous error-free state, and then the computation is restarted. The latter approach is more general, because it allows to recover from an arbitrary failure. It is based upon the notion of recovery point, which allows saving and restoring the state of a process. One of the most used techniques for obtaining recovery points is the checkpointing, in which a set of local checkpoints (saved local states) is determined during normal computation, such as upon a failure occurrence, a rolled back computation can be resumed from this set.
Two checkpointing algorithms were formally specified in LOTOS and validated using the CADP toolbox: the Sync-and-Stop (SNS) algorithm and the Chandi-Lamport (CL) algorithm. The configurations considered consist of several processes communicating through an interconnection network that may have different topologies (star, ring, bi-directional ring, and full connectivity). The processes implement a finite distributed computation and can access a stable storage, on which the checkpoints and message logs are stored. The application behaviour and the checkpoint management are modelled using a controller process.
Several variants of each checkpointing algorithm have been studied, by varying the parameters of the system (number of processes, network topology, presence/absence of failures). The Labeled Transition System model for each configuration has been constructed using the CADP tools (for the largest configurations, containing 8 processes, compositional generation techniques were used). Several correctness properties were formulated and encoded in regular alternation-free mu-calculus: deadlock and livelock freedom, successful termination, message delivery, cut consistency, safety and liveness of checkpointing. All these properties have been successfully verified on each model using the EVALUATOR 3.0 model-checker of CADP.
Formal description techniques such as LOTOS provide a useful framework
for studying distributed algorithms in general and checkpointing
algorithms in particular. Developing and validating gradually complex
specifications is possible due to the compositional analysis techniques
offered by CADP.
Gavril Godza, Valentin Cristea, and Radu Mateescu.
"Formal Specification of Checkpointing Algorithms".
Proceedings of the 13th International Conference on
Control Systems and Computer Science CSCS 13 (Bucharest,
Romania), pp. 311-317, June 2001.
Available on-line from the CADP Web site in PDF or PostScript
prof. Valentin Cristea
Polytechnic University of Bucharest
Computer Science Department
313, Splaiul Independentei, Sector 6
Tel: +40 1 410 0325
Fax: +40 1 411 2181
|This case study, amongst others, is described on the CADP Web site: http://cadp.inria.fr/case-studies