Bits & Bytes online Edition




BagIt - command-line tool supporting the BagIt compound data format

John Alan Kennedy, Thomas Zastrow

When transporting many small files over a network, or preparing them for archiving, it is often best to bundle all files and subdirectories into a single archive file. Traditionally, this is done with Tar or Zip archives. Container formats like BagIt offer additional advantages and can be used hand-in-hand with traditional archive tools like Tar. BagIt is a hierarchical file packaging format consisting of a data payload and metadata files, which contain checksums of the data objects as well as user-specified metadata for the collection. By using the BagIt format, users can ensure that their data collections are self-describing and can be validated after transport.

On the Draco supercomputer, the bagit tool can be enabled, and its usage information displayed, with the following commands:

    module load bagit
    bagit --help
    man bagit
    

The bagit tool can then be used to create or validate BagIt packages. In its simplest form, the bagit command takes a directory as its parameter and creates a BagIt container in that directory. In doing so, it moves all the data in the directory into a subdirectory 'data' and creates additional metadata files in the base directory of the BagIt package. Note: The bagit tool does NOT create a separate package or file; rather, it transforms the directory passed as an argument into a BagIt package in place.

As an example, take a bagItTest directory containing several data files:

    bagItTest/
    |-- Data1.hdf5
    |-- Data2.hdf5
    `-- Data3.hdf5
    

Creating a BagIt package from this directory:

    bagit bagItTest
    

Results in the following BagIt package, which contains the data payload and metadata files:

    bagItTest/
    |-- bag-info.txt
    |-- bagit.txt
    |-- data
    |   |-- Data1.hdf5
    |   |-- Data2.hdf5
    |   `-- Data3.hdf5
    |-- manifest-md5.txt
    `-- tagmanifest-md5.txt
    

By default, md5 checksums of the data files are computed and recorded in the manifest-md5.txt file. In addition, command-line options allow users to choose alternative checksum algorithms (sha1, sha256, sha512).
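
To make the resulting layout concrete, the following shell sketch performs by hand the payload move and checksum recording described above. It is only an illustration and not the bagit tool's implementation: the scratch directory, the dummy file contents and the use of the md5sum utility are assumptions, and bagit.txt, bag-info.txt and tagmanifest-md5.txt are not written.

```shell
#!/bin/sh
# Illustration only: mimics by hand what happens to the payload.
# Not the bagit tool's implementation; bagit.txt, bag-info.txt and
# tagmanifest-md5.txt are deliberately left out.
set -eu

work=$(mktemp -d)                 # scratch directory (assumption)
bag="$work/bagItTest"
mkdir "$bag"
for name in Data1 Data2 Data3; do
    printf 'dummy contents of %s\n' "$name" > "$bag/$name.hdf5"
done

# Step 1: move the payload into a 'data' subdirectory.
mkdir "$bag/data"
mv "$bag"/*.hdf5 "$bag/data/"

# Step 2: record one md5 checksum per payload file, with paths
# relative to the base directory of the package.
(cd "$bag" && md5sum data/* > manifest-md5.txt)

# Each manifest line holds a checksum followed by a payload path.
cat "$bag/manifest-md5.txt"
```

Swapping md5sum for sha256sum in step 2 would give the same two-column layout with longer digests, which is the kind of content a manifest for an alternative algorithm would hold.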

Metadata can also be specified manually on the command line or via config files (run 'bagit --help' for more information). For example, the following option adds a 'Contact Name' entry to the metadata file bag-info.txt:

    bagit --contact-name 'John Smith' bagItTest
    

Note: These metadata entries can also be set in the config file ~/.bagit.cfg; see the bagit man page for details.

Once the BagIt container has been created, the directory can be zipped or tarred for archiving or transport. Exporting the new BagIt container from Draco is easy with the help of MPCDF's DataShare service (see the article 'A simple command-line client for the MPCDF DataShare (ownCloud) service' in this Bits & Bytes edition). With the new DataShare client, this takes only a few commands:

    bagit bagItTest
    tar zcf bagItTest.tgz bagItTest
    ds put bagItTest.tgz
    

Once the BagIt package is available in DataShare, it can be shared as usual, for example with external collaborators.
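
Because the manifest lives inside the package directory, it travels with the data through the tar step. The sketch below shows this round trip with a hand-built stand-in for a bag; it is an illustration under stated assumptions (scratch directories, dummy payload, md5sum in place of the bagit tool), and the 'ds put' upload is left out, with a second local directory playing the role of the recipient.

```shell
#!/bin/sh
# Illustration only: a hand-built stand-in bag, packed and unpacked with tar.
# The upload step is omitted; "receiver" is just a second scratch directory.
set -eu

sender=$(mktemp -d)
receiver=$(mktemp -d)

# Stand-in for a bag created by the bagit tool (payload plus md5 manifest only).
mkdir -p "$sender/bagItTest/data"
printf 'dummy payload\n' > "$sender/bagItTest/data/Data1.hdf5"
(cd "$sender/bagItTest" && md5sum data/* > manifest-md5.txt)

# Sender: pack the whole bag directory into one compressed archive.
tar zcf "$sender/bagItTest.tgz" -C "$sender" bagItTest

# Receiver: unpack the archive, then list what arrived.
tar zxf "$sender/bagItTest.tgz" -C "$receiver"
(cd "$receiver" && find bagItTest -type f | sort)
```

On a system where the bagit tool is available, the recipient could then run bagit --validate on the unpacked directory.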

The bagit tool can also be used to validate BagIt-style packages, using the following command:

    bagit --validate bagItTest
    

The validation process checks that the checksums recorded in the BagIt metadata match the actual checksums of the data files within the BagIt package.
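
The checksum comparison can be sketched with standard tools. The example below is an illustration, not the bagit tool's own validation: it assumes a hand-built bag with dummy payload files and uses md5sum -c to recompute the payload checksums and compare them with manifest-md5.txt, first on an intact bag and then after one file has been altered.

```shell
#!/bin/sh
# Illustration only: reproduces the payload checksum comparison with md5sum.
# It is not the bagit tool's own validation logic.
set -u

work=$(mktemp -d)                 # scratch directory (assumption)
bag="$work/bagItTest"
mkdir -p "$bag/data"
printf 'payload one\n' > "$bag/data/Data1.hdf5"
printf 'payload two\n' > "$bag/data/Data2.hdf5"
(cd "$bag" && md5sum data/* > manifest-md5.txt)

# Intact bag: every recomputed checksum matches the manifest.
if (cd "$bag" && md5sum -c manifest-md5.txt >/dev/null 2>&1); then
    intact_result=valid
else
    intact_result=invalid
fi

# Simulate damage in transit by appending to one payload file.
printf 'corrupted\n' >> "$bag/data/Data2.hdf5"
if (cd "$bag" && md5sum -c manifest-md5.txt >/dev/null 2>&1); then
    damaged_result=valid
else
    damaged_result=invalid
fi

echo "intact bag: $intact_result, damaged bag: $damaged_result"
```

The altered file no longer matches its recorded checksum, so the second check fails while the first succeeds.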