Distributed MultiThreaded CheckPointing

Overview

Distributed MultiThreaded CheckPointing (DMTCP) is a library that adds checkpointing to your code without requiring a code rewrite. DMTCP is designed to work with serial or threaded codes, allowing users to create restart files on the fly. It is intended for non-GPU, non-MPI codes. Make sure you have sufficient storage space for any checkpoint dumps created by DMTCP.

Usage

DMTCP is provided as a module and can be loaded using module load dmtcp. Users should select a specific version of DMTCP and note which version they are using, as different versions of DMTCP may not be compatible with each other. For more information, see the DMTCP documentation.
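As a rough illustration of the workflow described above, the following is a minimal sketch of a script that loads the module, records the DMTCP version, and either starts a run with periodic checkpoints or restarts from existing checkpoint images. The application name (./my_app), the checkpoint directory, and the one-hour interval are placeholders chosen for this example and do not come from this page. The commands dmtcp_launch and dmtcp_restart and the --ckptdir, --interval, and --version options are part of DMTCP, but check them against the version installed on your system.

```shell
#!/bin/bash
# Sketch only: APP, CKPT_DIR, and CKPT_INTERVAL are hypothetical values.
APP="./my_app"                           # placeholder serial or threaded executable
CKPT_DIR="${CKPT_DIR:-$PWD/dmtcp_ckpt}"  # placeholder checkpoint directory
CKPT_INTERVAL=3600                       # assumed: checkpoint every 3600 seconds

# Load DMTCP if the module command is available in this shell.
# To pin a version, list the installed ones with: module avail dmtcp
if type module >/dev/null 2>&1; then
    module load dmtcp
fi

mkdir -p "$CKPT_DIR"

# Show free space where the checkpoint dumps will be written.
df -h "$CKPT_DIR" || true

if command -v dmtcp_launch >/dev/null 2>&1; then
    # Keep a record of the DMTCP version next to the checkpoints,
    # since images from different versions may not be compatible.
    dmtcp_launch --version > "$CKPT_DIR/dmtcp_version.txt" 2>&1

    if ls "$CKPT_DIR"/ckpt_*.dmtcp >/dev/null 2>&1; then
        # Checkpoint images exist: resume from them.
        dmtcp_restart "$CKPT_DIR"/ckpt_*.dmtcp
    else
        # No images yet: start fresh with periodic checkpoints.
        dmtcp_launch --ckptdir "$CKPT_DIR" --interval "$CKPT_INTERVAL" "$APP"
    fi
else
    echo "dmtcp_launch not found; load the dmtcp module first" >&2
fi
```

Resubmitting the same script after an interruption resumes from the images in the checkpoint directory rather than starting over. The first run and every restart should use the same DMTCP version.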