Bagging specifications


Bag Names

Bag root directories are named by combining the institutional ID, as set in the institution's APTrust profile, with the unique identifier of the item to be preserved. Use dots as delimiters between these name parts. Dots or other special characters that normally appear in the institution ID or the item's unique ID should be truncated or converted to dashes or underscores.

Multipart bag names must end with ‘b###.of###’ where ### is the number of that bag in the bag count. Bag count sequences begin at 001.

For example, if the University of Virginia has institutional code ‘virginia.edu’ and is creating a bag for an item with the unique ID ‘uva-lib:1229365’, then the bag root directory should be named ‘virginia.edu.uva-lib_1229365’.

If this were a multipart bag with 200 parts, the first bag root directory would be named ‘virginia.edu.uva-lib_1229365.b001.of200’, the second ‘virginia.edu.uva-lib_1229365.b002.of200’, and the last ‘virginia.edu.uva-lib_1229365.b200.of200’. When tarred, these carry the .tar extension, for example ‘virginia.edu.uva-lib_1229365.b016.of200.tar’.
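
The naming rule above can be sketched in a few lines of Python. This is a minimal illustration, not APTrust code: the function name `bag_root_name` is hypothetical, and it assumes that every character in the item ID other than letters, digits, dashes, and underscores becomes an underscore, and that the institution ID is used as given (as in the ‘virginia.edu’ example).

```python
import re
from typing import Optional


def bag_root_name(institution_id: str, item_id: str,
                  part: Optional[int] = None,
                  total: Optional[int] = None) -> str:
    """Build a bag root directory name (hypothetical helper)."""
    # Assumption: special characters in the item ID become underscores.
    safe_item = re.sub(r"[^A-Za-z0-9_-]", "_", item_id)
    name = f"{institution_id}.{safe_item}"
    if part is not None and total is not None:
        if not 1 <= part <= total:
            raise ValueError("part must be between 1 and total")
        # Zero-pad to three digits; bag count sequences begin at 001.
        name += f".b{part:03d}.of{total:03d}"
    return name


print(bag_root_name("virginia.edu", "uva-lib:1229365"))
print(bag_root_name("virginia.edu", "uva-lib:1229365", 16, 200) + ".tar")
```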

We enforce bag naming conventions because when we untar bags in a staging area to validate their contents, we don't want bags untarring to the same directory and overwriting each other.

NAME | OK? | Comment
photos.tar | No | Institution name is missing.
ncsu.photos.tar | Yes* | Should untar to a directory called ncsu.photos
ncsu.edu.photos.tar | Yes* | Should untar to a directory called ncsu.edu.photos
ncsu.edu.photos.b1.tar | No | Should be b01.of10, assuming there are 10 parts to this bag.
ncsu.edu.photos.b01.of10.tar | Yes | Note the dot before "of10", and note that ".tar" comes at the end.

Because early versions of this document were unclear, some institutions uploaded bags with names like "institution.bag.tar" while others used "institution.edu.bag.tar," and the system accepted both naming schemes. The system will continue to accept both naming schemes, but for the sake of consistency, and to simplify your internal processes, you should stick with one or the other.
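
As a rough pre-upload check, the examples in the table above can be expressed as a small Python function. This is a sketch under stated assumptions, not the validator APTrust runs: `check_bag_file_name` is a hypothetical name, and the "institution name is missing" test is only a heuristic (it requires at least one dot-separated prefix before the bag name).

```python
import re

MULTIPART = re.compile(r"\.b(\d+)\.of(\d+)$")
PARTIAL = re.compile(r"\.(b\d+|of\d+)$")  # only half of a multipart suffix


def check_bag_file_name(file_name: str) -> list:
    """Return a list of problems; an empty list means the name looks fine."""
    if not file_name.endswith(".tar"):
        return ["name must end with .tar"]
    problems = []
    stem = file_name[:-len(".tar")]
    match = MULTIPART.search(stem)
    if match:
        part, total = int(match.group(1)), int(match.group(2))
        if not 1 <= part <= total:
            problems.append("part number must be between 1 and the total")
        stem = stem[:match.start()]
    elif PARTIAL.search(stem):
        problems.append("multipart suffix must look like .b01.of10")
    # Heuristic: expect "<institution>.<bag name>", so at least one dot.
    if "." not in stem:
        problems.append("institution name is missing")
    return problems


for name in ("photos.tar", "ncsu.edu.photos.b1.tar",
             "ncsu.edu.photos.b01.of10.tar"):
    print(name, check_bag_file_name(name))
```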

How Bag Names Become Object Names

On ingest, APTrust assigns an AIP (Intellectual Object) name to each bag using the following pattern:

institution_domain + '/' + bag name

For example, when APTrust ingests a bag from the University of Virginia called virginia.edu.Jefferson_Collection.tar, the intellectual object (AIP) name will be virginia.edu/virginia.edu.Jefferson_Collection.
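
The pattern can be written out as a one-line helper. This is an illustrative sketch only: `object_name` is a hypothetical function, the sample bag name is made up, and it assumes the bag name is the tar file name without its .tar extension.

```python
def object_name(institution_domain: str, tar_file_name: str) -> str:
    """Derive the intellectual object name from a tar file name."""
    # Assumption: the bag name is the file name minus the .tar extension.
    if tar_file_name.endswith(".tar"):
        bag_name = tar_file_name[:-len(".tar")]
    else:
        bag_name = tar_file_name
    return f"{institution_domain}/{bag_name}"


print(object_name("ncsu.edu", "ncsu.edu.photos.tar"))
```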

Bag Size Limits

As of January 30, 2017, the production repository will accept bags up to 5 TB in size. Here 5 TB is meant in binary notation (5 tebibytes), which is 5497558138880 bytes.

The demo repository accepts bags up to 100MB in size.
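
The size limits translate into simple byte arithmetic. The sketch below is illustrative: the constant and function names are hypothetical, and reading the demo limit of "100MB" as 100 × 2^20 bytes is an assumption, since the text does not say whether it is binary or decimal.

```python
import os

MAX_BAG_BYTES_PRODUCTION = 5 * 2**40  # 5 TiB = 5497558138880 bytes
MAX_BAG_BYTES_DEMO = 100 * 2**20      # assumption: "100MB" read as 100 MiB


def fits_size_limit(tar_path: str,
                    limit: int = MAX_BAG_BYTES_PRODUCTION) -> bool:
    """Return True if the tar file is no larger than the given limit."""
    return os.path.getsize(tar_path) <= limit
```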

File and Directory Names

  • File and folder names must follow POSIX conventions:
  • May contain virtually any printable character, except newlines, carriage returns, tabs, vertical tabs, and ASCII bells. (As of 1/30/2017. The earlier rule allowed only upper- or lower-case letters, numbers, dots, underscores, percent signs, and dashes: A–Z a–z 0–9 . _ % -)
  • Are considered case sensitive.
  • MUST NOT begin with a dash. (-)
  • May contain whitespace. (As of 1/30/2017. The earlier rule was that names MUST NOT contain whitespace.)
  • Restricted to 255 characters in length, including the extension.
  • MUST be at least 1 character in length.
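
The rules dated 1/30/2017 can be checked mechanically. The following is a minimal sketch, not APTrust's validator: the function names are hypothetical, length is counted in characters (an assumption; the text does not say whether bytes are meant), and each component of a bag-relative path is checked separately.

```python
# Newline, carriage return, tab, vertical tab, and ASCII bell.
FORBIDDEN_CHARS = set("\n\r\t\v\a")


def name_problems(name: str) -> list:
    """Check one file or folder name against the rules listed above."""
    problems = []
    if not 1 <= len(name) <= 255:
        problems.append("length must be between 1 and 255 characters")
    if name.startswith("-"):
        problems.append("must not begin with a dash")
    if FORBIDDEN_CHARS & set(name):
        problems.append("contains a forbidden control character")
    return problems


def path_problems(relative_path: str) -> dict:
    """Check every file and folder name in a bag-relative path."""
    results = {}
    for part in relative_path.split("/"):
        found = name_problems(part)
        if found:
            results[part] = found
    return results


print(path_problems("data/-drafts/photo 01.jpg"))
```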

Other considerations: Generic Files in APTrust are referenced by their URI, which is the original file path relative to the bag. This will support atomistic updating of items in the future. File and folder names should be unique across multipart bags so that every item is processed as a new file rather than treated as an update to an existing file. Although APTrust does not currently version files, you can create item versions in APTrust by including a datetime stamp in the file name when you write files to a bag. File and folder names are treated as case sensitive for processing purposes.
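
One way to follow the datetime-stamp suggestion is sketched below. The stamp format and the helper name `versioned_name` are illustrative choices, not an APTrust convention, and the function expects a timezone-aware datetime.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def versioned_name(file_path: str, when: datetime) -> str:
    """Insert a UTC timestamp before the file extension."""
    path = PurePosixPath(file_path)
    # Hypothetical stamp format; any unambiguous format would do.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return str(path.with_name(f"{path.stem}_{stamp}{path.suffix}"))


print(versioned_name("data/report.pdf",
                     datetime(2017, 1, 30, 12, 0, 0, tzinfo=timezone.utc)))
```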

Bag Structure

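Bags must have the following structure. The required items are aptrust-info.txt, bag-info.txt, bagit.txt, at least one payload manifest (manifest-md5.txt or manifest-sha256.txt), and the data directory. Others are optional. Additional notes appear below. Note the new rules on manifests!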

<institution_id.item_uid[.b###.of###]>
|    aptrust-info.txt
|    bag-info.txt
|    bagit.txt
|    manifest-md5.txt and/or manifest-sha256.txt
|    tagmanifest-md5.txt
|    tagmanifest-sha256.txt
|    [custom tag files]
\----data/
    |    [payload files]
\----[custom_tag_dir]/
    |    [custom tag files]
\----[custom_tag_dir]/
    |    [custom tag files]
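
A quick way to confirm that an unpacked bag has the required pieces of this layout is sketched below. It treats aptrust-info.txt, bag-info.txt, bagit.txt, at least one payload manifest, and the data directory as required, following the rest of this page; the function name `missing_items` is hypothetical.

```python
from pathlib import Path

REQUIRED_FILES = ("aptrust-info.txt", "bag-info.txt", "bagit.txt")
PAYLOAD_MANIFESTS = ("manifest-md5.txt", "manifest-sha256.txt")


def missing_items(bag_dir: str) -> list:
    """List required files or directories absent from an unpacked bag."""
    root = Path(bag_dir)
    missing = [name for name in REQUIRED_FILES
               if not (root / name).is_file()]
    # Either payload manifest satisfies the requirement.
    if not any((root / name).is_file() for name in PAYLOAD_MANIFESTS):
        missing.append("manifest-md5.txt or manifest-sha256.txt")
    if not (root / "data").is_dir():
        missing.append("data/")
    return missing
```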

Manifests

Before March 29, 2016, APTrust accepted and verified only the manifest-md5.txt file. We will now accept either manifest-md5.txt or manifest-sha256.txt. If you supply both, we will validate both. A manifest file is mandatory: bags without one will fail ingest.
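
Depositors can run the same kind of check locally before uploading. The sketch below assumes the standard BagIt manifest layout, in which each line holds a checksum followed by a path relative to the bag root. `verify_manifest` is a hypothetical helper, not APTrust's ingest code; it checks only the files the manifest lists and raises FileNotFoundError if a listed file is absent.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir: str, algorithm: str) -> list:
    """Return the listed paths whose checksum does not match.

    algorithm is "md5" or "sha256", matching manifest-<algorithm>.txt.
    """
    root = Path(bag_dir)
    mismatches = []
    manifest = root / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # "<checksum> <path>"; the path itself may contain spaces.
        expected, rel_path = line.split(maxsplit=1)
        digest = hashlib.new(algorithm)
        with open(root / rel_path, "rb") as handle:
            # Read in 1 MiB chunks so large payload files fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            mismatches.append(rel_path)
    return mismatches
```

If a bag supplies both manifests, calling the function once with "md5" and once with "sha256" mirrors the "validate both" behavior described above.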

Tag Manifests

We will validate all tag manifests, though they are optional. We will accept tag files not listed in the tag manifests, though obviously we cannot validate their checksums.

Custom Tag Files

As of March 29, 2016, we preserve all tag files except bagit.txt and bag-info.txt, which are regenerated when you restore a bag. Custom tag files may be in any format, including binary. We will not try to parse them, but we will validate their checksums if they are listed in the tag manifests.

Required Tag Files


bagit.txt

This is required by the BagIt specification, and should contain the following:

BagIt-Version:  0.97
Tag-File-Character-Encoding:  UTF-8

bag-info.txt

Valid APTrust bags MUST contain a bag-info.txt file with the following fields, which may be blank:

Fieldname | Example | Description
Source-Organization: | University of Virginia | This should be the human-readable name of the APTrust partner organization, for example "University of Virginia." You may be more specific if you wish, naming a particular college or library within the university, such as "Georgetown University Law Library." However, when APTrust restores bags, the source organization in the bag-info.txt file will be set to the name of the partner institution.
Bagging-Date: | 2017-07-06 | Date (YYYY-MM-DD) that the bag content was prepared for delivery, in ISO 8601 UTC format.
Bag-Count: | 1 | Two numbers separated by "of", in particular, "N of T", where T is the total number of bags in a group of bags and N is the ordinal number within the group; if T is not known, specify it as "?" (question mark). Examples: 1 of 2, 4 of 4, 3 of ?, 89 of 145.
Internal-Sender-Description: | | [Optional] Human-readable description of the contents of the bag. This will appear as the bag description in our web UI.
Internal-Sender-Identifier: | | [Optional] Internal or alternate identifier used at the sender's location. This will appear in the web UI, and you can search for bags using this identifier.
Bag-Group-Identifier: | | [Optional] Several bags may share the same Bag-Group-Identifier to indicate that they are part of the same collection, or part of the same logical group. The Bag-Group-Identifier should be a UTF-8 string or a number in string format. Starting in the second half of 2018, depositors will be able to search by Bag-Group-Identifier for Intellectual Objects in the Pharos UI (Pharos is APTrust's web interface to manage deposits and inspect deposit outcomes) and in the REST API. This is not yet implemented as of July 2018. This tag is entirely optional. If the tag is missing from the bag-info.txt file, the bag is still valid.

An example file could look like this:
Source-Organization: University of Virginia
Bagging-Date: 2018-02-27
Bag-Count: 1
Internal-Sender-Description: Twitter captures of the events of August 12 2017 in Charlottesville, VA
Internal-Sender-Identifier: AUG12
Bag-Group-Identifier: Cville-Aug-12

This file MAY contain additional fields; however, APTrust will not preserve the additional tags in bag-info.txt. Because bag-info.txt may contain a Payload-Oxum tag (the "octetstream sum" of the payload, a two-part number of the form "OctetCount.StreamCount"), APTrust regenerates this file when you restore a bag. Consider these two situations:

  1. You upload a bag containing 100 files to APTrust. Then you delete 10 of those files.
  2. You upload a bag containing 100 files to APTrust. Then you upload a new version of that same bag, overwriting 10 files.

When we restore either one of these bags, the Payload-Oxum value, which shows the number of bytes and number of files in the payload directory, will not match the Payload-Oxum of the original bag-info.txt file, and your BagIt validator will show the bag as invalid. For this reason, we regenerate bag-info.txt when restoring the bag.
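
To see why the value changes, it helps to compute it. A Payload-Oxum has the form "OctetCount.StreamCount": the total number of bytes across all payload files, then the number of payload files. The sketch below walks the data directory of an unpacked bag; `payload_oxum` is a hypothetical helper name.

```python
import os


def payload_oxum(bag_dir: str) -> str:
    """Compute "OctetCount.StreamCount" for the bag's data directory."""
    octets = 0
    streams = 0
    for folder, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            octets += os.path.getsize(os.path.join(folder, name))
            streams += 1
    return f"{octets}.{streams}"
```

Deleting or overwriting payload files changes the byte total, the file count, or both, so the value stored in the original bag-info.txt no longer matches the restored payload.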

We do preserve and restore all tag files other than bag-info.txt and bagit.txt, so if your bag includes tags that you want to preserve, put those tags into other tag files.
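
A bag-info.txt file can be checked for its required fields with a small parser. This is a sketch under assumptions: it treats Source-Organization, Bagging-Date, and Bag-Count as the required labels (the fields not marked optional above), accepts blank values, folds indented continuation lines into the previous value, and keeps only the last occurrence of a repeated label. The function names are hypothetical.

```python
REQUIRED_FIELDS = ("Source-Organization", "Bagging-Date", "Bag-Count")


def parse_tag_file(text: str) -> dict:
    """Parse "Label: value" lines into a dictionary."""
    tags = {}
    last = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and last is not None:
            # An indented line continues the previous value.
            tags[last] = (tags[last] + " " + line.strip()).strip()
        elif ":" in line:
            label, value = line.split(":", 1)
            last = label.strip()
            tags[last] = value.strip()
    return tags


def missing_fields(text: str) -> list:
    """Return required labels that do not appear in the file at all."""
    tags = parse_tag_file(text)
    return [field for field in REQUIRED_FIELDS if field not in tags]


sample = """Source-Organization: University of Virginia
Bagging-Date: 2018-02-27
Bag-Count: 1
Internal-Sender-Identifier: AUG12
"""
print(parse_tag_file(sample))
print(missing_fields(sample))
```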

aptrust-info.txt

This file MUST be present and MUST contain the following tag fields.

  • Title: Human readable title for searching and listing in APTrust. This cannot be empty.
  • Description: A human-readable description of the bag.
  • Access: One of three enumerated access conditions. [“Consortia”, “Restricted”, “Institution”]. These access restrictions describe who can see the object metadata, including the object's name and description, a list of its generic files and events. APTrust does not currently provide access to the objects themselves, except when the owning institution restores a bag it owns. In other words, no matter which access setting you choose, no other institution can access your intellectual object. The general public cannot see any information in the APTrust system.
    • Restricted: Metadata about this object is accessible to the institutional administrator (at the depositing institution) and to the APTrust admin. No one else can even see that this object exists in the repository.
    • Institution: All users at the depositing institution can see metadata about this object.
    • Consortia: All APTrust members can see this object's metadata.
  • Storage-Option: This tag is new as of July 2018. It indicates how and where you want APTrust to store your bag. Because this tag was not part of the original APTrust BagIt specification, this tag is not required. If omitted, Storage-Option defaults to "Standard," which was the only storage option that existed prior to 2018. See the section on Storage Options below for more information. The Storage-Option tag supports the following values:
    • Standard: The bag's contents will be stored in S3 in Northern Virginia and Glacier in Oregon. APTrust will perform fixity checks on the S3 files every 90 days.
    • Glacier-OH: Files will be stored ONLY in Glacier, in AWS's Ohio region, and will be encrypted during storage. APTrust will not perform any fixity checks on these files.
    • Glacier-OR: Files will be stored ONLY in Glacier, in AWS's Oregon region, and will be encrypted during storage. APTrust will not perform any fixity checks on these files.
    • Glacier-VA: Files will be stored ONLY in Glacier, in AWS's Northern Virginia region, and will be encrypted during storage. APTrust will not perform any fixity checks on these files.
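
The tag rules for aptrust-info.txt can be summarized as a short check over already-parsed tags. This is an illustrative sketch: `check_aptrust_info` is a hypothetical name, the input is assumed to be a dictionary of label/value pairs, and matching the Access and Storage-Option values with exact capitalization is an assumption.

```python
ACCESS_VALUES = {"Consortia", "Restricted", "Institution"}
STORAGE_OPTIONS = {"Standard", "Glacier-OH", "Glacier-OR", "Glacier-VA"}


def check_aptrust_info(tags: dict) -> list:
    """Return a list of problems found in parsed aptrust-info.txt tags."""
    problems = []
    if not tags.get("Title", "").strip():
        problems.append("Title is missing or empty")
    if "Description" not in tags:
        problems.append("Description is missing")
    if tags.get("Access") not in ACCESS_VALUES:
        problems.append("Access must be Consortia, Restricted, or Institution")
    # Storage-Option is not required; it defaults to Standard when omitted.
    if tags.get("Storage-Option", "Standard") not in STORAGE_OPTIONS:
        problems.append("Storage-Option is not a supported value")
    return problems


print(check_aptrust_info({"Title": "Photos", "Description": "",
                          "Access": "Institution"}))
```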


Bag Serialization

Bags serialized for use by APTrust MUST use TAR as their serialization format, MUST NOT use compression, MUST follow the file and folder naming restrictions, and MUST end with the .tar extension.
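
With Python's standard tarfile module, an uncompressed tar can be produced as sketched below. The helper name `tar_bag` is hypothetical; mode "w" writes a plain tar with no compression, and `arcname` makes the archive unpack to a directory named after the bag. The output directory should sit outside the bag directory so the archive does not include itself.

```python
import os
import tarfile


def tar_bag(bag_dir: str, output_dir: str) -> str:
    """Write <output_dir>/<bag name>.tar without compression."""
    bag_name = os.path.basename(os.path.normpath(bag_dir))
    tar_path = os.path.join(output_dir, bag_name + ".tar")
    # Mode "w" (not "w:gz" or "w:bz2") writes an uncompressed archive.
    with tarfile.open(tar_path, "w") as archive:
        archive.add(bag_dir, arcname=bag_name)
    return tar_path
```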

Quick Checklist


Valid bags meet all of the following criteria:

  • The bag was submitted as a tar file, without compression
  • Bag name follows the pattern <institution.edu>.bag_name[.b###.of###].tar.
  • Bag untars to a directory whose name matches the name of the tar file, minus the .tar extension.
  • Bag contains an md5 or sha256 manifest (or both)
  • Bag contains the data directory
  • Bag contains bagit.txt, as described above
  • Bag contains bag-info.txt as described above
  • Bag contains aptrust-info.txt as described above
  • All data files are in the manifest, and all checksums match
  • All tag files mentioned in the tag manifest are present, and checksums match (you may omit tag files from the tag manifests)
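
Two of the checklist items (no compression, and untarring to a directory that matches the tar file name) can be tested without unpacking the archive. The sketch below uses Python's tarfile module; `untars_to_matching_directory` is a hypothetical helper, and it assumes member names are stored without a leading "./".

```python
import os
import tarfile


def untars_to_matching_directory(tar_path: str) -> bool:
    """Check that every entry sits under <tar file name minus .tar>/."""
    file_name = os.path.basename(tar_path)
    if not file_name.endswith(".tar"):
        return False
    expected = file_name[:-len(".tar")]
    try:
        # Mode "r:" reads uncompressed tar only; compressed input fails.
        with tarfile.open(tar_path, "r:") as archive:
            top_levels = {member.name.split("/", 1)[0]
                          for member in archive.getmembers()}
    except tarfile.ReadError:
        return False
    return top_levels == {expected}
```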


Bagging Interest Group

The Bagging Interest Group discusses bagging best practices, problems, and concerns, and posts occasionally to the APTrust Group Space.