Tips for using tar to archive data

This document assumes you are creating one or more tar archives of a directory and its contents. If your data is not all contained in a single directory, the following will not work for you as written.

Example use cases:

  • You are tar’ing up data to move to tape
  • You are creating an archive for sharing or record-keeping purposes
  • You are tar’ing up data to move to some other storage location or to transfer elsewhere

The gist of this article is to help you think about and plan the creation of a file list, a checksum file, one or many tar files, and to capture any additional metadata as needed (such as file ACLs aka FACLs).

In the examples below we will use the example path /n/mypath/scans; replace this with your own path.

Our initial example will then concentrate on an example sub-directory in /n/mypath/scans called myscans.
We recommend naming the resulting list, checksum, facl, and tar files so that their origin or purpose is obvious.
As such, our examples have names like: n-mypath-scans-myscans-041525.

An Important Note About Long-Term File Integrity

Keep your archive reasonably sized, both for long-term file integrity and with the intended destination in mind. If you need to create one very large tar file for, say, transferring to a colleague, that may make sense as the data will remain intact in its original location. But for archival purposes, very large files increase the potential for data loss from file corruption. A corrupted tar file’s contents may not be recoverable. So limit your footprint so that, should such a worst case happen, you do not lose all your data, only a portion. While this is not a common issue, it should be taken into account.

For example: Let’s say you have one terabyte (1TB) of data and you want to transfer it to tape. While a tape cartridge may hold up to 20TiB (roughly 22 TB) of data, we would caution against making a single 1TB archive. In the unlikely event of file corruption, you could lose the entire archive. Instead, you should break the task up into smaller parts of, say, 50GB, 100GB, or 200GB. If you can do so based on directories and sub-directories, even better. So it helps to arrange and plan your archiving ahead of time.
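
One way to plan the split is to look at how large each sub-directory is before archiving anything. The sketch below is only an illustration: the function name list_chunk_sizes and the default 100GB limit are hypothetical choices, and it assumes du -k reports sizes in 1024-byte blocks.

```shell
# list_chunk_sizes PARENT [LIMIT_GB]
# Print the size of each immediate sub-directory of PARENT and flag any
# that are larger than LIMIT_GB (hypothetical default: 100).
list_chunk_sizes() (
    parent=$1
    limit_gb=${2:-100}
    # du -k counts 1024-byte blocks, so 1GB is treated as 1024*1024 blocks.
    limit_kb=$((limit_gb * 1024 * 1024))
    cd "$parent" || exit 1
    for dir in */; do
        [ -d "$dir" ] || continue
        kb=$(du -sk -- "$dir" | cut -f1)
        if [ "$kb" -gt "$limit_kb" ]; then
            note="over ${limit_gb}GB, consider splitting"
        else
            note="ok"
        fi
        printf '%s\t%sKB\t%s\n' "${dir%/}" "$kb" "$note"
    done
)

# Example: list_chunk_sizes /n/mypath/scans 100
```

Sub-directories flagged as too large are candidates for being archived one level deeper, one tar file per sub-directory.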

Creating a File List

Get a file list from the directory you intend to create an archive of. You can use this to find files later and observe the original directory structure.

cd /n/mypath/scans
find myscans/ -type f > n-mypath-scans-myscans-041525.txt

Repeat as necessary.

Create a Checksum File

find myscans/ -type f -print0|xargs --null shasum > n-mypath-scans-myscans-041525.shasum

This writes a SHA-1 checksum (the shasum default) for every file under myscans to the checksum file, one line per file.

You can also print the checksums directly to the terminal, which can be useful for confirming that nothing has changed since you ran the initial checksum:
cd /n/mypath/scans
find myscans/ -type f -print0|xargs --null shasum

NOTE: If the data in the directory is modified after you’ve run the checksum and before you’ve tar’d it, then the checksums will no longer match when you later un-tar the archive and compare. If you need to tar an active filesystem, then checksum’ing will not be useful to you.

Create a FACLs File (if applicable)

If your filesystem has special ACLs applied and you want to reapply them to this data if it is restored to the same location later, you should capture the ACLs/FACLs to a file. If you’re unsure, it won’t hurt to just do this regardless.

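getfacl -R myscans/ > n-mypath-scans-myscans-041525.facl

If you later restore the data to its original location, you should be able to put the same ACLs back in place by running the following from the same directory (in our example /n/mypath/scans), since the file records paths relative to it:

setfacl --restore=n-mypath-scans-myscans-041525.facl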

Create Your tar File(s)

Where you intend to initially store your tar file(s) is up to you. If your lab space has room, that’s fine, but consider using netscratch if you have a lot of data or limited lab space. You can then move the files as needed.

Using the same model of directories as above, you can create your tar files like so:

cd /n/mypath/scans

tar -cf /path-to-store-your-tars/n-mypath-scans-myscans-041525.tar myscans

For example: tar -cf /n/netscratch/jharvard/n-mypath-scans-myscans-041525.tar myscans
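
Before moving a tar file elsewhere, it can be reassuring to confirm that it contains every file named in the file list. The sketch below is one possible check under stated assumptions: the function name check_tar_against_list is hypothetical, the tree holds only regular files and directories (symlinks would show up as differences), and file names contain no newlines or characters that tar escapes when listing.

```shell
# check_tar_against_list TARFILE LISTFILE
# Compare the regular files stored in TARFILE with the paths in LISTFILE.
# Exit status 0 means both name exactly the same files.
check_tar_against_list() (
    tarfile=$1
    listfile=$2
    tmp=$(mktemp -d) || exit 1
    # Directory members end in "/"; drop them so only file entries remain.
    tar -tf "$tarfile" | grep -v '/$' | LC_ALL=C sort > "$tmp/in-tar"
    LC_ALL=C sort "$listfile" > "$tmp/in-list"
    status=0
    # Lines starting with "<" are only in the list, ">" only in the tar file.
    diff "$tmp/in-list" "$tmp/in-tar" || status=$?
    rm -rf "$tmp"
    exit "$status"
)

# Example:
# check_tar_against_list /n/netscratch/jharvard/n-mypath-scans-myscans-041525.tar \
#     /n/mypath/scans/n-mypath-scans-myscans-041525.txt
```

This only compares names. It does not read file contents, so it complements the checksum file rather than replacing it.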

Caveats and Recommendations

Bear in mind that you will get different checksum results depending on the path used, because each line of shasum output includes the file path exactly as it was given. This is why we recommend that you cd to the parent (in our example /n/mypath/scans) of the directory you are about to tar and then use the relative path (in our example myscans).

For instance, using the full path:

find /home/mmcfee/myscans/ -type f -print0|xargs --null -o shasum|shasum
> 1519655cae31924d16e251f6040537d7e30d9a66

versus the relative path:

find myscans/ -type f -print0|xargs --null -o shasum|shasum
> 35dbb789dc1c5f820f2ead0fbcd0989501db0692

As such, we recommend always running these commands the exact same way: from the parent directory, using the relative path of the sub-directory in question.
If you cannot do this, checksum’ing may not work for you.

Storing Your Files and tar Archives

NOTE: It may be obvious, but it is worth mentioning: you cannot store the checksum file in the tar unless you plan on removing it after un-tar’ing and before re-running the checksum, because the extra file will change the checksum output.

If this method works for you and you have some or all of the companion files for your tar file, we recommend either:

A) storing those files together alongside the tar file
-or-
B) storing the companion files in a known, single location in your lab space.

Option B is better for multiple tapes (or just peace of mind): if you want to find which tar file has the files you need, you can look at your file lists rather than having to pull multiple tapes back.
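
With option B, finding the right tar file becomes a text search across the stored file lists. The sketch below assumes the lists are the .txt files created earlier and that they all sit in one directory; the function name find_in_lists is hypothetical.

```shell
# find_in_lists PATTERN LISTDIR
# Search every file list (*.txt) in LISTDIR for PATTERN, treated as a
# literal string, and print "list-file:matching-path" for each hit.
find_in_lists() (
    pattern=$1
    listdir=$2
    grep -H -F -- "$pattern" "$listdir"/*.txt
)

# Example: find_in_lists "scan_0042.tif" /n/mypath/archive-records
```

The name of the list that matches tells you which tar file to retrieve, since the list and the tar file share the same base name. The file name and directory in the example call are placeholders.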

To restore your data, reverse the original process. In this example let’s say I’m putting it back in /n/mypath/scans and the tar file is in my lab’s netscratch.

  • cd /n/mypath/scans
  • tar -xf /n/netscratch/my_lab/n-mypath-scans-myscans-041525.tar myscans
  • checksum myscans once tar has finished:
    find myscans/ -type f -print0|xargs --null shasum
  • compare that output with the original checksum file (n-mypath-scans-myscans-041525.shasum)

A Recommendation Regarding Tape

If you are storing these tar archives on tape and need to use multiple tapes, we also recommend keeping a local file which records where each tar file was stored.

That way, if you need n-mypath-scans-myscans-041525.tar you can look at this record and see that it was put on tape #2. This will make retrieval simpler and avoid wasting time pulling back both tapes.
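
Such a record can be as simple as a tab-separated text file with one line per tar file. The sketch below is one hypothetical layout (tar name, tape label, date recorded); the function names record_tar and which_tape, the file name tape-index.tsv, and the tape labels are all illustrative.

```shell
# record_tar INDEX TARNAME TAPE_LABEL
# Append one line (tar name, tape label, today's date) to INDEX.
record_tar() {
    printf '%s\t%s\t%s\n' "$2" "$3" "$(date +%Y-%m-%d)" >> "$1"
}

# which_tape INDEX TARNAME
# Print the tape label recorded for TARNAME, or nothing if it is unknown.
which_tape() {
    awk -F '\t' -v name="$2" '$1 == name { print $2 }' "$1"
}

# Example:
# record_tar tape-index.tsv n-mypath-scans-myscans-041525.tar tape-2
# which_tape tape-index.tsv n-mypath-scans-myscans-041525.tar
```

Keeping this index with the other companion files means a single lookup tells you which tape to request.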

 

© The President and Fellows of Harvard College
Except where otherwise noted, this content is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.