Distributed MultiThreaded CheckPointing
Overview
Distributed MultiThreaded CheckPointing (DMTCP) is a library that adds checkpointing to your code without requiring a code rewrite. DMTCP is designed to work with codes that are serial or threaded, allowing users to create restart files on the fly. DMTCP will not work with GPU or MPI codes. Make sure you have sufficient storage space for any checkpoint dumps that DMTCP creates.
Usage
DMTCP is provided as a module and can be loaded using module load dmtcp. It is recommended that users select a specific version of DMTCP and note which version they are using, as different versions of DMTCP may not be compatible with each other. For more information, see the DMTCP documentation.
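As a rough illustration of a typical workflow once the module is loaded, the sketch below starts a program under DMTCP, or resumes it if checkpoint images already exist. The executable name (my_app), the checkpoint directory, and the one-hour interval are hypothetical placeholders, and the exact options may differ between DMTCP versions, so check dmtcp_launch --help and dmtcp_restart --help for the version you loaded.

```shell
#!/bin/bash
# Minimal launch-or-restart sketch. Assumes the DMTCP module has already
# been loaded, for example:
#   module avail dmtcp            # list the versions installed on the system
#   module load dmtcp/<version>   # pin one version and record it with your job
# APP, CKPT_DIR, and CKPT_INTERVAL are hypothetical values; change them.

APP="${APP:-./my_app}"                    # hypothetical serial or threaded executable
CKPT_DIR="${CKPT_DIR:-$PWD/dmtcp_ckpt}"   # directory that receives checkpoint images
CKPT_INTERVAL="${CKPT_INTERVAL:-3600}"    # seconds between automatic checkpoints

# Checkpoint images can be large, so keep them in a location with enough space.
mkdir -p "$CKPT_DIR"

if ! command -v dmtcp_launch >/dev/null 2>&1; then
    # DMTCP is not on PATH, so there is nothing to run yet.
    echo "dmtcp_launch not found; run 'module load dmtcp' first" >&2
elif ls "$CKPT_DIR"/ckpt_*.dmtcp >/dev/null 2>&1; then
    # Checkpoint images exist: resume from them and keep checkpointing.
    dmtcp_restart --ckptdir "$CKPT_DIR" -i "$CKPT_INTERVAL" "$CKPT_DIR"/ckpt_*.dmtcp
else
    # No images yet: start the program under DMTCP control.
    dmtcp_launch --ckptdir "$CKPT_DIR" -i "$CKPT_INTERVAL" "$APP"
fi
```

Submitting the same script again after an interruption would pick up from the most recent checkpoint rather than starting over. A checkpoint can also be requested on demand from another shell with dmtcp_command --checkpoint while the program is running. Restart with the same DMTCP version that wrote the images, since images from a different version may not load.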