Designing a fault-tolerant system can be done at different levels of the software stack. We call general-purpose the approaches that detect and correct failures at a given level of that stack, masking them entirely to the higher levels (and ultimately to the end-user, who eventually sees a correct result despite the occurrence of failures). General-purpose approaches can target specific types of failures (e.g., message loss or message corruption), and let other types of failures hit higher levels of the software stack.
In this section, we discuss a set of well-known and recently developed protocols that provide general-purpose fault tolerance for a large set of failure types, at different levels of the software stack, but always below the application level. These techniques are designed to work in spite of the application behavior. When developing a general-purpose fault-tolerant protocol, two adversaries must be taken into account: the occurrence of failures, which hit the system at unpredictable moments, and the behavior of the application, which is designed without taking into account the risk of failure, or the fault-tolerant protocol. All general-purpose fault-tolerance techniques rely on the same idea: introduce automatically computed redundant information, and use this redundancy to mask the occurrence of failures to the higher-level application. The general-purpose technique most widely used in HPC relies on checkpointing and rollback recovery:
Parts of the execution are lost when processes are subject to failures (either because the corresponding data is lost when the failure is a crash, or because it is corrupted due to a silent error), and the fault-tolerant protocol, when catching such errors, uses past checkpoints to restore the application in a consistent state, and re-computes the missing parts of the execution. We first discuss the available techniques.

2.5.4 How checkpoints are generated

The checkpoint routine, provided by the checkpointing framework, is usually a blocking call that terminates once the serial file representing the process checkpoint is complete.
It is often beneficial, however, to be able to save the checkpoint in memory, or to allow the application to continue its progress in parallel with the I/O-intensive part of the checkpoint routine. To do so, generic techniques, like process duplication at checkpoint time, can be used if enough memory is available on the node: the checkpoint can be made asynchronous by duplicating the entire process, and letting the parent process continue its execution while the child process checkpoints and exits.
This technique relies on the copy-on-write page-duplication capability of modern operating systems to ensure that if the parent process modifies a page, the child will get its own private copy, keeping the state of the process at the time of entering the checkpoint routine. Depending on the rate at which the parent process modifies its memory, and on the amount of physical memory available on the machine, overlap between the checkpoint creation and the application progress can thus be achieved, or not.

How checkpoints are stored. A process checkpoint can be considered as completed once it is stored in a non-corruptible space.
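The fork-based approach described above can be sketched as follows. This is a minimal illustration, assuming a POSIX system: the function name `checkpoint_async` and the single-string "state" are stand-ins, where a real checkpointing framework would serialize the entire address space.

```python
import os

def checkpoint_async(path, state):
    """Fork a child that inherits a copy-on-write snapshot of the
    parent's memory; the child serializes the state and exits, while
    the parent resumes immediately. (Illustrative sketch: a real
    framework would serialize the whole address space, not one string.)"""
    pid = os.fork()
    if pid == 0:                       # child: owns the frozen snapshot
        with open(path, "w") as f:
            f.write(state)
        os._exit(0)                    # never return into the application
    return pid                         # parent: continues computing

state = "iteration=42"
child = checkpoint_async("ckpt.dat", state)
# The parent may now keep modifying its memory: copy-on-write gives the
# child private copies of any page the parent writes to, so the file
# reflects the state at fork() time.
state = "iteration=43"
os.waitpid(child, 0)                   # in practice, overlapped with work
print(open("ckpt.dat").read())         # → iteration=42
```

Note that the `waitpid` here is only for the demonstration; the whole point of the technique is that the parent does useful computation instead of waiting for the child.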
Depending on the type of failures considered, the available hardware, and the risk taken, this non-corruptible space can be located close to the original process, or very remote. For example, when dealing with low probability memory corruption, a reasonable risk consists of simply keeping a copy of the process checkpoint in the same physical memory; at the other extreme, the process checkpoint can be stored in a remote redundant file system, allowing any other node compatible with such a checkpoint to restart the process, even in case of machine shutdown.
Current state-of-the-art libraries provide transparent multiple storage points along a hierarchy of memory: [57] or [5] implement in-memory double-checkpointing strategies at the closest level, disk-less checkpointing, NVRAM checkpointing, and remote file-system checkpointing, to feature a complete collection of storage techniques. Checkpoint transfers happen asynchronously in the background, making the checkpoints more reliable as transfers progress.
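The storage hierarchy above can be sketched as a cascade of write attempts, from the fastest and least reliable level to the slowest and safest. This is a hypothetical illustration (the paths and the synchronous loop are assumptions; the libraries cited in the text push the slower copies asynchronously, in the background):

```python
import os
import tempfile

def store_checkpoint(data, levels):
    """Write the checkpoint to every storage level in order, from the
    fastest/least reliable to the slowest/most reliable, and return
    the list of levels that succeeded."""
    done = []
    for path in levels:
        try:
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)
            done.append(path)
        except OSError:
            pass                       # level unavailable: fall through
    return done

tmp = tempfile.mkdtemp()
levels = [os.path.join(tmp, "memory", "ckpt.bin"),  # stand-in for RAM disk
          os.path.join(tmp, "disk", "ckpt.bin")]    # stand-in for local disk
print(len(store_checkpoint(b"process state", levels)))  # → 2
```

The more levels hold a copy, the larger the class of failures the checkpoint survives: a RAM-disk copy survives a process crash, a local-disk copy survives a node reboot, and a remote file-system copy survives the loss of the node itself.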
2.5.5 Coordinated checkpointing

Distributed checkpointing protocols use process checkpointing and message passing to design rollback-recovery procedures at the parallel-application level. Among them, the first approach was proposed in 1984 by Chandy and Lamport, to build a possible global state of a distributed system [20]. The goal of this protocol is to build a consistent distributed snapshot of the distributed system. A distributed snapshot is a collection of process checkpoints (one per process), and a collection of in-flight messages (an ordered list of messages for each point-to-point channel).
The protocol assumes ordered, loss-less communication channels; for a given application, messages can be sent or received before or after a process took its checkpoint. A message from process p to process q that is sent by the application after the checkpoint of process p but received before process q checkpointed is said to be an orphan message. Orphan messages must be avoided by the protocol, because they are going to be re-generated by the application if it were to restart from that snapshot.
Similarly, a message from process p to process q that is sent by the application before the checkpoint of process p but received after the checkpoint of process q is said to be missing. That message must belong to the list of messages of channel p to q, or the snapshot is inconsistent. A snapshot that includes no orphan message, and for which all the saved channel messages are missing messages, is consistent, since the application can be started from that state and pursue its computation correctly.

[Figure 1: Orphan and missing messages]
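The two definitions above reduce to comparing the send and receive events of a message with the checkpoint dates of its source and destination. The sketch below makes this concrete, using hypothetical logical timestamps (the function and parameter names are illustrative, not part of the protocol):

```python
def classify(send_t, recv_t, ckpt_p, ckpt_q):
    """Classify a message from p to q with respect to a snapshot,
    given logical timestamps of the send and receive events and of
    the two process checkpoints."""
    sent_before = send_t < ckpt_p      # sent before p's checkpoint?
    recvd_before = recv_t < ckpt_q     # received before q's checkpoint?
    if not sent_before and recvd_before:
        return "orphan"                # makes the snapshot inconsistent
    if sent_before and not recvd_before:
        return "missing"               # must be saved in channel p -> q
    return "normal"                    # entirely inside or outside snapshot

print(classify(5, 6, 4, 4))   # sent and received after the checkpoints
print(classify(3, 5, 4, 4))   # sent before, received after → missing
print(classify(5, 3, 4, 4))   # sent after, received before → orphan
```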
To build such snapshots, the protocol of Chandy and Lamport works as follows (see Figure 1): any process may decide to trigger a checkpoint wave by taking its local process checkpoint (we say the process entered the checkpoint wave), and by notifying all other processes to participate in this wave (it sends them a notification message). Because channels are ordered, when a process receives a checkpoint wave notification, it can separate which messages belong to the previous checkpoint wave (messages received before the notification on that channel), and which belong to the new one (messages received after the notification).
Messages that belong to the current checkpoint wave are appended to the process checkpoint during the checkpoint, to complete the state of the distributed application with the content of the in-flight messages. Upon receiving a checkpoint wave notification for the first time, a process takes its local checkpoint, entering the checkpoint wave, and notifies all others that it did so. Once a notification per channel has been received, the local checkpoint is complete, since no message can be left in flight, and the checkpoint wave is locally complete.
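The steps above can be sketched as a toy two-process simulation, under the protocol's assumption of ordered channels (modeled here as FIFO queues). All names, the integer "state", and the scenario itself are illustrative assumptions, not an API:

```python
from collections import deque

class Process:
    """Toy participant in a Chandy-Lamport checkpoint wave."""
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.state = 0                 # toy local state: sum of payloads
        self.checkpoint = None         # saved local state, if in the wave
        self.recording = {}            # channel -> recorded in-flight msgs
        self.markers_seen = set()      # channels whose notification arrived

    def take_checkpoint(self, channels):
        # Enter the wave: save local state, notify everyone, and start
        # recording every incoming channel.
        self.checkpoint = self.state
        for q in range(self.n):
            if q != self.pid:
                channels[(self.pid, q)].append("MARKER")
                self.recording[(q, self.pid)] = []

    def receive(self, src, msg, channels):
        chan = (src, self.pid)
        if msg == "MARKER":
            if self.checkpoint is None:       # first marker: join the wave
                self.take_checkpoint(channels)
            self.markers_seen.add(chan)       # channel state now complete
        else:
            if self.checkpoint is not None and chan not in self.markers_seen:
                self.recording[chan].append(msg)  # in-flight (missing) msg
            self.state += msg

# Two processes; ordered channels are modeled as FIFO queues.
channels = {(0, 1): deque(), (1, 0): deque()}
p = [Process(0, 2), Process(1, 2)]

channels[(0, 1)].append(5)           # p0 sends 5 before any checkpoint
p[1].take_checkpoint(channels)       # p1 initiates the wave
p[1].receive(0, channels[(0, 1)].popleft(), channels)  # 5 is recorded
p[0].receive(1, channels[(1, 0)].popleft(), channels)  # marker: p0 joins
p[1].receive(0, channels[(0, 1)].popleft(), channels)  # p0's marker
print(p[1].checkpoint, p[1].recording[(0, 1)])         # → 0 [5]
```

In the final snapshot, the payload 5 is a missing message: it was sent before p0's checkpoint but received after p1's, so it is saved as part of the channel state rather than lost.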
Once all processes have completed their checkpoint wave, the checkpoint is consistent, and can be used to restart the application in a state that is consistent with its normal behavior. Different approaches have been used to implement this protocol. The main difference is in how the content of the (virtual) communication channels is saved. A simple approach, called Blocking Coordinated Checkpointing, consists in delaying the emission of application messages after entering the checkpointing wave, and moving the process checkpointing to the end of that wave, when the process