Transferring Data on the Cluster
There are several ways to move data around the cluster. The first consideration before deciding on what technique to use is which filesystems you are moving data between and how they are connected to the cluster. For most filesystems, especially those connected to the cluster via InfiniBand, using the compute nodes themselves to move data around is your best bet. Thus before doing any data transfers you should either start up an interactive session on the cluster or put together a batch script that contains the commands you want to use to move the data. The advantage of the batch script is that it allows you to fire off the move without having to babysit an open session. You can also run multiple transfers at once, leveraging the power of the cluster. That said, be sure the filesystems you are transferring from and to can handle the parallel traffic. In general, Lustre filesystems can handle many parallel requests while NFS cannot.
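A batch script for a transfer might look roughly like the sketch below. The file name transfer.sbatch, the resource values, and the log file name are assumptions chosen for illustration, not site recommendations, and depending on your site you may also need to name a partition. The rsync line is the one described further down this page.

```shell
# Sketch only: write a small Slurm batch script that performs one transfer.
# transfer.sbatch and all resource values below are illustrative assumptions.
cat > transfer.sbatch <<'EOF'
#!/bin/bash
#SBATCH -c 1                  # one core is enough for a single rsync stream
#SBATCH -t 0-04:00            # wall time (D-HH:MM); size this to your data
#SBATCH --mem=4G              # memory for the job; an assumed value
#SBATCH -o transfer_%j.out    # job output, where --progress lines will land

rsync -avx --progress folder/ /n/holyscratch01/lab/folder/
EOF

echo "Submit with: sbatch transfer.sbatch"
```

Once submitted, the transfer runs on a compute node without an open session, and the log file can be checked at any point to see how far it has gotten.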
For actually moving the data, the following commands can be used, in order of complexity: cp and mv, rsync, and fpsync. Of these, rsync is generally the most useful.
cp and mv are standard Unix commands that will copy or move the data to a new location. They are easy and relatively straightforward to use. cp will make a second copy of the data; adding -R as an option will copy a folder recursively. mv, on the other hand, will move the data to a new location, leaving only one copy of the data at the new location. mv is also the preferred tool for renaming files and folders, as well as for moving data within a filesystem, since all it does is change the pointer to the data. The downside of cp and mv is that neither gives any indication of how well it is performing, and neither can pick up from an incomplete transfer. Thus for bulk transfers cp and mv should be avoided. Examples of cp and mv are below:
cp file.txt /n/holyscratch01/lab/.
cp -R folder /n/holyscratch01/lab/.
mv file.txt /n/holyscratch01/lab/.
mv folder /n/holyscratch01/lab/.
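To make the difference between the two commands concrete, here is a minimal sketch that runs in a throwaway directory rather than on a real cluster filesystem. The sandbox layout and file names are made up for illustration.

```shell
# Sketch only: show cp vs. mv semantics inside a temporary sandbox.
set -eu

workdir=$(mktemp -d)    # hypothetical sandbox, not a cluster path
mkdir -p "$workdir/src/folder" "$workdir/dest"
echo "data" > "$workdir/src/file.txt"
echo "more" > "$workdir/src/folder/a.txt"

# cp leaves the original in place, so two copies exist afterwards.
cp "$workdir/src/file.txt" "$workdir/dest/"
cp -R "$workdir/src/folder" "$workdir/dest/"

# mv leaves a single copy, at the new location (here also renaming the file).
mv "$workdir/src/file.txt" "$workdir/dest/moved.txt"

# List what ended up where.
ls -R "$workdir"
```

After this runs, src still holds folder (it was copied), while file.txt exists only under dest, once as the copy and once as the moved, renamed file.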
For the vast majority of transfers rsync will get the job done. We have a lengthy page on rsync here. In summary, rsync can copy entire directories and can pick up where it left off if the transfer fails for some reason. In addition, rsync is very handy for matching the contents of two directories. The most common rsync command for data transfer is as follows:
rsync -avx --progress folder/ /n/holyscratch01/lab/folder/
This will ensure that the folder is mirrored exactly over to the other filesystem. It will also make sure that the copy will not traverse symlinks to other filesystems that you do not wish to copy. Be aware though that rsync will match the time stamps between the copies, thus the transfer will look old to the scratch cleaner if you are copying to our scratch filesystems. To have rsync use the timestamp of the date you actually did the transfer add the
rsync is really great for single-stream moves, especially when you have large files. However, for very large directories, or for many files, one needs to take rsync to the next level. This is what fpsync does. fpsync is essentially a parallel rsync. It generates lists of files to transfer and then spawns rsync processes to do the transfer. You can set the total number of rsyncs that run at once, which is how your transfer gets parallelized. fpsync needs to be used with care, though, as it can overwhelm nonparallel filesystems like NFS. For transfers between Lustre filesystems, however, fpsync can move data very quickly. In general the fpsync command will be:
fpsync -n NUMRSYNC -o "RSYNC OPTIONS" -O "FPSYNC OPTIONS" /n/lablfs/folder/ /n/holyscratch01/lab/folder/
In most situations your fpsync line will look like:
fpsync -n NUMRSYNC -o "-ax" -O "-b" /n/lablfs/folder/ /n/holyscratch01/lab/folder/
Note that the fpsync logs are found in /tmp on the host you are doing the transfer on, so it's harder to get an idea of how far along fpsync is. As a general rule it is best not to set NUMRSYNC higher than the number of cores on a host. If you submit this via a job you should also wrap fpsync in srun to get the full usage, like so:
srun -c $SLURM_CPUS_PER_TASK fpsync -n $SLURM_CPUS_PER_TASK -o "-ax" -O "-b" /n/lablfs/folder/ /n/holyscratch01/lab/folder/
Here the number of CPUs you request from Slurm is the number of parallel rsyncs you want to run.
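The rule of thumb about not setting NUMRSYNC higher than the number of cores can be enforced with a small helper. This is a sketch, not part of fpsync: pick_workers is a hypothetical function name, and it assumes nproc from GNU coreutils is available on the host. It only prints the fpsync command it would run, so it is safe to try anywhere.

```shell
# Sketch: choose an fpsync worker count that never exceeds this host's cores.
# pick_workers is a hypothetical helper, not an fpsync or Slurm command.
pick_workers() {
    requested=$1
    cores=$(nproc)    # CPUs available to this process (GNU coreutils)
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}

# Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task (-c) was requested;
# fall back to a single worker when it is unset.
workers=$(pick_workers "${SLURM_CPUS_PER_TASK:-1}")

# Print the command instead of running it, so the sketch has no side effects.
echo "fpsync -n $workers -o \"-ax\" -O \"-b\" /n/lablfs/folder/ /n/holyscratch01/lab/folder/"
```

Capping the count this way keeps a script portable between hosts with different core counts, at the cost of silently running fewer rsyncs than requested on smaller hosts.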
WARNING: DO NOT USE --delete as an option for fpsync