- 1 Using APTrust
- 1.1 Technical Support
- 1.2 Technical Workflows
- 1.2.1 Required Metadata elements
- 1.2.2 Bagging specifications
- 1.2.3 Deletion
- 1.2.4 Ingest
- 1.2.5 Restoration
- 1.2.6 Updating
- 1.3 Administrative Workflows
- 1.4 Technical Documentation
- 1.5 Partner Tools
- 1.6 Reporting
- 1.7 DPN
APTrust provides a complete demo environment where depositors can test depositing and restoring objects. We encourage you to use the demo environment first to become familiar with the system and process.
Before you do anything, you'll need access to your institution's S3 buckets and to APTrust's Web UI, Pharos.
To use APTrust's basic features, you'll receive the following for both the demo and the production environments:
- AWS keys. These let you access your receiving bucket, into which you'll upload new bags for ingest, and your restoration bucket, from which you'll download restored bags. Each member has two sets of keys: one for the demo environment, and one for the production environment. You can request AWS keys from [].
- Tools for accessing S3 buckets. Amazon maintains client libraries in commonly-used languages to help you access S3 through the language of your choice. APTrust maintains a set of partner tools for uploading to, downloading from, and listing the contents of your S3 buckets.
- A login for our Web UI. If your organization already has access to our web interfaces at https://demo.aptrust.org/ and https://repo.aptrust.org, then you have a local APTrust administrator who can set up an account for you. If no one in your organization has credentials to these sites, contact [] to get set up.
- An API key. You'll need this only if you plan on accessing our member API. Once you have a login for the Web UI, you can generate an API key from your user profile page. Click the "Generate an API key" button and store the generated key in a secure place. For help with API keys, contact []. Note that you'll need separate keys for the demo and production systems.
The bagit.txt file is required by the BagIt specification, and should contain the following:
    BagIt-Version: 0.97
    Tag-File-Character-Encoding: UTF-8
Valid APTrust bags MUST contain a bag-info.txt file with the following fields, which may be blank:
|Source-Organization:||This should be the human readable name of the APTrust partner organization. For example, "University of Virginia." You may be more specific, if you wish, specifying a specific college or library within the university, such as "Georgetown University Law Library." However, when APTrust restores bags, the source organization in the bag-info.txt file will be set to the name of the partner institution.||University of Virginia|
|Bagging-Date:||Date of bagging using ISO 8601 UTC format.||2017-07-06|
|Bag-Count:||as per specification|
|Internal-Sender-Description:||[Optional] Human readable description of the contents of the bag. This will appear as the bag description in our web UI.|
|Internal-Sender-Identifier:||[Optional] Internal or alternate identifier used at the senders location. This will appear in the web UI, and you can search for bags using this identifier.|
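As a rough illustration, the fields in the table above can be written out with a few lines of Python. This is a minimal sketch, not an APTrust tool: the function name `build_bag_info` and its parameters are hypothetical, and the default `Bag-Count` of "1 of 1" assumes a single-part bag.

```python
from datetime import datetime, timezone


def build_bag_info(source_organization, bag_count="1 of 1",
                   description="", internal_id="", bagging_date=None):
    """Return the text of a minimal bag-info.txt with the five fields above."""
    if bagging_date is None:
        # ISO 8601 calendar date in UTC, for example 2017-07-06.
        bagging_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    fields = [
        ("Source-Organization", source_organization),
        ("Bagging-Date", bagging_date),
        ("Bag-Count", bag_count),
        ("Internal-Sender-Description", description),
        ("Internal-Sender-Identifier", internal_id),
    ]
    # Tag files use one "Label: value" pair per line; values may be blank.
    return "".join(f"{label}: {value}\n" for label, value in fields)
```

Writing the returned string to bag-info.txt in the bag's top-level directory (encoded as UTF-8, to match bagit.txt) would complete the step.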
This file MAY contain additional fields; however, APTrust will not preserve the additional tags in bag-info.txt. Because bag-info.txt may contain a Payload-Oxum tag, APTrust regenerates this file when you restore a bag. Consider these two situations:
- You upload a bag containing 100 files to APTrust. Then you delete 10 of those files.
- You upload a bag containing 100 files to APTrust. Then you upload a new version of that same bag, overwriting 10 files.
When we restore either one of these bags, the Payload-Oxum value, which shows the number of bytes and number of files in the payload directory, will not match the Payload-Oxum of the original bag-info.txt file, and your BagIt validator will show the bag as invalid. For this reason, we regenerate bag-info.txt when restoring the bag.
We do preserve and restore all tag files other than bag-info.txt and bagit.txt, so if your bag includes tags that you want to preserve, put those tags into other tag files.
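To see why the value drifts, it can help to compute Payload-Oxum yourself. The sketch below assumes the BagIt convention of writing the value as the total byte count, a dot, then the file count; the function name `payload_oxum` is made up for this example.

```python
import os


def payload_oxum(data_dir):
    """Return 'bytes.files' for every file under the payload directory."""
    total_bytes = 0
    file_count = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
            file_count += 1
    return f"{total_bytes}.{file_count}"
```

Deleting or overwriting even one payload file changes the byte count, the file count, or both, so the value recorded at upload time would no longer match the restored bag.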
The aptrust-info.txt file MUST be present and MUST contain the following tag fields.
- Title: Human readable title for searching and listing in APTrust. This cannot be empty.
- Access: One of three enumerated access conditions. [“Consortia”, “Restricted”, “Institution”]. These access restrictions describe who can see the object metadata, including the object's name and description, a list of its generic files and events. APTrust does not currently provide access to the objects themselves, except when the owning institution restores a bag it owns. In other words, no matter which access setting you choose, no other institution can access your intellectual object. The general public cannot see any information in the APTrust system.
- Restricted: Metadata about this object is accessible to the institutional administrator (at the depositing institution) and to the APTrust admin. No one else can even see that this object exists in the repository.
- Institution: All users at the depositing institution can see metadata about this object.
- Consortia: All APTrust members can see this object's metadata.
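A quick local check of these two tags before upload might look like the following. This is a sketch under stated assumptions: the helper names are hypothetical, the parser ignores multi-line tag values, and comparing the Access value case-insensitively is a choice made for this example, not a documented APTrust rule.

```python
ALLOWED_ACCESS = {"consortia", "restricted", "institution"}


def parse_tag_file(text):
    """Parse simple 'Label: value' lines into a dict (no continuation lines)."""
    tags = {}
    for line in text.splitlines():
        if ":" in line:
            label, value = line.split(":", 1)
            tags[label.strip()] = value.strip()
    return tags


def check_aptrust_info(text):
    """Return a list of problems found in the text of aptrust-info.txt."""
    tags = parse_tag_file(text)
    problems = []
    if not tags.get("Title"):
        problems.append("Title is missing or empty")
    access = tags.get("Access", "")
    # Assumption: the Access value is compared without regard to case.
    if access.lower() not in ALLOWED_ACCESS:
        problems.append(f"Access is not an allowed value: {access!r}")
    return problems
```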
The following is a quick checklist for valid bags. For the detailed specification, see Bagging specifications.
Valid bags meet all of the following criteria:
- The bag was submitted as a tar file, without compression
- Bag name follows the pattern <institution.edu>.bag_name[.b###.of###].tar.
- Bag untars to a directory whose name matches the name of the tar file, minus the .tar extension.
- Bag contains an md5 or sha256 manifest (or both)
- Bag contains the data directory
- Bag contains bagit.txt, as described above
- Bag contains bag-info.txt as described above
- Bag contains aptrust-info.txt as described above
- All data files are in the manifest, and all checksums matched
- All tag files mentioned in the tag manifest are present, and checksums match (you may omit tag files from the tag manifests)
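Several of the structural items in this checklist can be checked locally with Python's standard `tarfile` module. The sketch below is illustrative only: the function name `check_tar_layout` is hypothetical, it does not verify checksums or the institution prefix in the bag name, and apt_validate remains the tool to rely on.

```python
import os
import tarfile

REQUIRED_FILES = ("bagit.txt", "bag-info.txt", "aptrust-info.txt")
MANIFESTS = ("manifest-md5.txt", "manifest-sha256.txt")


def check_tar_layout(tar_path):
    """Return a list of layout problems; an empty list means none were found."""
    filename = os.path.basename(tar_path)
    if not filename.endswith(".tar"):
        return ["file name does not end in .tar"]
    bag_name = filename[:-len(".tar")]
    try:
        # Mode "r:" reads uncompressed tar only and rejects compressed archives.
        with tarfile.open(tar_path, "r:") as tar:
            names = [member.name for member in tar.getmembers()]
    except tarfile.ReadError:
        return ["not a readable, uncompressed tar file"]
    # Some tar tools prefix entries with "./"; strip it before comparing.
    names = [n[2:] if n.startswith("./") else n for n in names]
    problems = []
    top_level = {n.split("/", 1)[0] for n in names if n}
    if top_level != {bag_name}:
        problems.append(f"tar does not unpack to a single directory named {bag_name}")
    for required in REQUIRED_FILES:
        if f"{bag_name}/{required}" not in names:
            problems.append(f"missing {required}")
    if not any(f"{bag_name}/{m}" in names for m in MANIFESTS):
        problems.append("missing md5 or sha256 manifest")
    data_dir = f"{bag_name}/data"
    if not any(n == data_dir or n.startswith(data_dir + "/") for n in names):
        problems.append("missing data directory")
    return problems
```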
Note: This feature is available only to institutional administrators.
To delete an object, an institutional administrator must use the APTrust web interface (https://demo.aptrust.org or https://repo.aptrust.org). There is no API endpoint to delete files programmatically.
To delete an intellectual object, use the red DELETE button in the Pharos Web UI.
After you click Delete, APTrust will create a delete request for each generic file in this intellectual object. You'll see the delete requests listed under the Processed Items tab of the Web UI.
You can also delete individual files:
- Locate the Intellectual Object whose file(s) you want to delete.
- Click the View Preserved Files button.
- Click on the name of the generic file you want to delete.
- Click the delete button at the bottom of the page.
You will see a single delete request under the Processed Items tab.
You can delete only objects and files that belong to your institution. You must be an institutional admin to delete a file or object. If a pending request exists to restore an intellectual object, or to send that object to DPN, the system will not allow you to delete the intellectual object or any of its files until those pending operations are complete.
To store an item in APTrust, simply bag it in valid APTrust format and upload it to your institution's APTrust receiving bucket. Your receiving bucket for the demo system is:
The demo system is for testing your workflows and bagging tools. It currently accepts bags no larger than 100MB. It will ignore files larger than that in the receiving buckets.
The receiving bucket for the production system is:
Replace <institution.domain> with your organization's domain name. For example, virginia.edu, jhu.edu, etc.
You'll need AWS keys to complete the upload. If you don't yet have them, contact firstname.lastname@example.org. You can upload bags with Amazon's S3 CLI tools, by integrating one of Amazon's S3 client libraries into your own tools, or by using APTrust's partner tools.
Monitoring the Ingest Process
APTrust periodically scans the receiving buckets and adds new bags to the ingest queue. You can follow the status of all work requests by clicking the Processed Items tab in our Web UI.
Click on an item to view status details.
You can also programmatically retrieve the status of work items using the Items endpoint of the member API.
The ingest process follows these steps:
- Fetch - We retrieve the bag (the tar file) from your receiving bucket.
- Unpack - We untar the bag.
- Validate - We make sure all files are present and match the checksums in the manifests. (The BagIt spec allows you to include some custom tag files without mentioning them in the tag manifests. If we find these files, we allow them, but we can't validate their checksums.)
- Store - We copy the files to long-term storage in Virginia.
- Record - We tell Pharos what we ingested, recording all generic files and PREMIS events.
- Replicate - We copy all files to Glacier in Oregon and tell Pharos where we put them.
- Cleanup - If all succeeded, we delete the original tar file from your receiving bucket.
Once a bag enters the work queue, ingest can take anywhere from one minute to several hours. The two factors that affect ingest time are bag size and system load. When APTrust is flooded with ingest requests, several hours may elapse between your bag appearing in the work queue and Step 1 of the ingest process. Bags containing large amounts of data always take a long time to process, because we have to retrieve the bag from the receiving bucket and calculate md5 and sha256 checksums on all of the bag's contents.
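Computing both digests is what makes large bags slow, and the same work applies when you build manifests before upload. The following sketch reads each file once and feeds both hashes from the same chunks; it is an illustration in Python, not APTrust's ingest code, and the function name and chunk size are arbitrary choices.

```python
import hashlib


def md5_and_sha256(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha256_hex) for a file, reading it only once."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()
```

Reading in chunks keeps memory use flat regardless of file size, so the cost is dominated by disk and network throughput.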
Uploading Multiple Versions of the Same Bag
We currently support DPN ingest for APTrust member institutions who are also members of DPN.
More complete documentation will be coming soon.
Restoration of content can be triggered by using the Pharos web interface or an API call with the appropriate object identifier. This will initiate a process that repackages the objects back into a bag using the current APTrust BagIt profile. This will allow depositors to maintain a single set of packaging/restoration scripts based on the current bag format. As with the submission bags, content in the data directory will always remain as it was sent to APTrust in the first place. Tag files and top level files that are part of the wrapping bag will always reflect the current APTrust BagIt profile so depositors do not have to keep track of what bag version was used in the past for a specific item. This achieves the following goals:
- Allow institutional admins to request that a specific Intellectual Object and all active child Generic File object files are repackaged for download to their local institution.
- To provide a simple downloadable distribution package that conforms to the current APTrust BagIt profile so depositors only have to maintain a means for translating the current version of the APTrust BagIt spec.
- To return the exact same bits for files deposited in APTrust for preservation.
- To return files in the data directory with the same name and relative path used in the original submission bag.
When content restoration is requested, a distribution package will be created consisting of the original files and metadata written to a BagIt bag conforming to the current APTrust BagIt profile. The files in the data directory will be the exact same name and bits that were sent to APTrust in the submission bag and the metadata written to tag files in the bag adhering to the current APTrust BagIt format. This allows partners to only have to be able to parse a bag in the current APTrust format and give us the flexibility to migrate our content models and metadata more freely in the future.
Note: The files in the restored bag will not have the same owner id, group id, and permissions as the files in the original submission.
Restoration using Pharos Web UI
To restore a bag, locate the intellectual object you want to retrieve and click the Restore Object button.
This adds a restoration task to the work queue, which you can monitor by clicking the Processed Items tab, or by calling the Items endpoint of the Member API.
The restoration process can take anywhere from a few minutes to a few hours, depending on the amount of traffic our system is handling and the size of the bag. Larger bags always take longer to restore, because we calculate md5 and sha256 checksums on the bag's contents.
When we restore a bag, we retrieve all of the intellectual object's files from long-term storage, verify the checksums, reassemble the bag, write the manifests, tar up the bag, and leave it in your restoration bucket. When that's done, the Processed Items list will show your bag with a green background, with Action = Restore, Stage = Resolve, and Status = Success.
It's up to you then to retrieve and delete the item from your restoration bucket. Your restoration bucket for the demo system is:
For the production system, it's:
Replace <institution.domain> with your organization's domain name. For example, virginia.edu, jhu.edu, etc. You can download your bags using Amazon's S3 CLI tools, or by integrating one of Amazon's S3 client libraries into your own tools, or by using APTrust's partner tools.
Format and Content of Restored Bags
According to the "Other Files" section of our old documentation, APTrust did not preserve or restore custom tag files. That changed as of March 29, 2016.
When you restore a bag that was ingested after March 29, 2016, the version you get back will have the same contents and format as the bag you originally uploaded, except:
- Individual files you deleted through our Web UI will not be restored.
- The restored bag will include the following manifests, even if they were not present in the original bag:
All tag files will be listed in the tag manifests, even those that were omitted from tag manifests in the original upload.
If you uploaded multiple versions of the same bag with the same name, you'll see the following:
- The restored bag contains the latest (last uploaded) version of each file.
- Files that were not included in the later versions of a bag you uploaded multiple times, but were present in earlier versions, will be present in the restored version unless you manually deleted them through our Web UI. (Our policy is not to delete your content. You have to do that deliberately.)
- We regenerate the bag-info.txt file to prevent possible conflicts in the Payload-Oxum value. Because files in the bag may have been updated or deleted between initial ingest and restoration, the Payload-Oxum value of the original bag-info.txt file may no longer be valid, so we regenerate the entire bag-info.txt file, including only the minimum required tags. This means you may lose some valuable tag data. If you want to preserve tags, put them in a tag file other than bag-info.txt and bagit.txt.
Bags ingested prior to March 29, 2016, will be restored with the same contents as above, except:
- No custom tag files will be restored, because we didn't preserve them.
APTrust does not version bags. If you want to keep multiple versions of a bag, use a naming convention. For example:
    virginia.edu.bag_of_photos
    virginia.edu.bag_of_photos_V2
    virginia.edu.bag_of_photos_V3
When you upload a bag that has the same name as an existing bag, this is what happens:
- If a file in the new bag has the same name as a file in the old bag and the size, the md5 checksum, or the sha256 checksum has changed, we overwrite the old file with the new one. You cannot recover the old file.
- If a file in the new bag has the same name as a file in the old bag and the size and checksums have not changed, we do nothing.
- If a file in the new bag did not exist in the old bag, we save it.
- If a file in the old bag is not present in the new bag, we do not delete it.
This table shows what happens when you upload a new version of a previously ingested bag.
|Old Bag||New Bag||What is stored||Reason|
|bag-info.txt||bag-info.txt (changed)||new version||Contents in new version have changed|
|data/document.pdf||data/document.pdf (unchanged)||old version||The document did not change|
|(file not present)||data/new_image.jpg||new version||File did not exist in old bag, but it's here now|
|data/old_image.jpg||(file not present)||old version||Although this file has been deleted from the new bag, we will not assume you want to delete it from storage. File deletion must be a deliberate act of the depositor.|
This update policy has three important implications:
- If you want to delete files from an ingested bag / intellectual object, you must do that deliberately. Currently, you can delete only through our Web UI.
- When you restore the bag described in the table above, you'll get back both old_image.jpg and new_image.jpg (unless you manually delete one of them before you restore).
- You can update metadata in a bag by uploading only the metadata, as long as there's at least one file in the data directory and the bag is otherwise valid. This may be useful for bags that contain 100GB of data and 100KB of frequently-updated metadata.
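The update rules can be expressed as a small merge function. The sketch below models each bag as a dictionary mapping file path to a (size, md5, sha256) tuple; the function name `apply_upload` and the action labels are invented for this illustration and do not reflect APTrust's internal code.

```python
def apply_upload(stored, new_bag):
    """Merge a re-uploaded bag into what is already stored.

    Both arguments map file path -> (size, md5, sha256).
    Returns (resulting_files, actions), where actions maps path -> label.
    """
    result = dict(stored)
    actions = {}
    for path, fingerprint in new_bag.items():
        if path not in stored:
            actions[path] = "saved"          # new file, did not exist before
            result[path] = fingerprint
        elif stored[path] != fingerprint:
            actions[path] = "overwritten"    # size or a checksum changed
            result[path] = fingerprint
        else:
            actions[path] = "unchanged"      # identical file, nothing to do
    for path in stored:
        if path not in new_bag:
            actions[path] = "kept"           # never deleted implicitly
    return result, actions
```

Running it on the four rows of the table yields "overwritten" for bag-info.txt, "unchanged" for data/document.pdf, "saved" for data/new_image.jpg, and "kept" for data/old_image.jpg, so a later restore would return both images.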
Defining a collection
EC2 Instances: Serve as the foundational server environment where all non-native AWS services, software, and applications are deployed. Major components of the infrastructure running on EC2 instances are:
- Exchange (Content Processing Scripts): Go language services that manage work queues to monitor for the arrival of content, process content, register metadata with Pharos in its PostgreSQL database, and move content to preservation storage. A further set of Go scripts manages work queues for file life-cycle processing (fixity) and for restoration, which re-packages intellectual objects and related generic files back into an APTrust BagIt bag.
- PostgreSQL: Relational database used to back the Pharos application and contain local or workflow data about object processing, user data for admin interface authentication, as well as generic file metadata. Database services are provided by AWS RDS.
APTrust services run on apt-demo-services for our demo environment, and on apt-prod-services for our production environment. Each environment has separate receiving, restoration, and storage areas in S3.
- Pharos - Our Rails app provides a “registry” describing what IntellectualObjects and GenericFiles are in our repository, along with checksums for those files and PREMIS events describing actions taken on those objects and files. Pharos also keeps track of WorkItems, which are requests to do something with an object or file. DPN replication and restore requests have their own table in Pharos, dpn_work_items. Finally, Pharos stores blobs of compressed JSON in the WorkItemState table, which is described below. Pharos stores its data in a Multi-AZ Postgres RDS instance.
- S3 (Northern VA) - Partners upload new bags to the receiving buckets here, which we query every hour or so. Partners download restored bags from the restore buckets here, which is where our apt_restore service drops restored bags for depositors. The bucket called aptrust.preservation.storage is where we store ingested files for the long term. That bucket is accessible to the APTrust admin account only. The buckets are distinguished by use:
- Receiving Buckets: Each APTrust member has an individual S3 bucket designated for the upload of submission packages to APTrust and to facilitate the hand-off of content. Access to each bucket is restricted to the designated institution, which has PUT and LIST permissions, and to the APTrust processing scripts, which have full permissions.
- Preservation Bucket: A single S3 bucket is used for central preservation storage. Files to be preserved are placed here with pointers to the file stored in the corresponding Pharos object along with any relevant metadata.
- Restoration Buckets: Each APTrust member has an individual S3 bucket designated for the download of distribution packages for restoration. Access to each bucket is restricted to the designated institution, which has LIST, GET, and DELETE permissions on that bucket, and to the APTrust processing scripts, which have full access.
- Glacier (Oregon) - This is where we store replication copies of all ingested files, and is accessible to the APTrust admin account only.
- apt-prod-service & apt-demo-services - These servers run the processes that perform ingest, file deletion, bag restoration, DPN ingest, and ongoing fixity checks.
- NSQ (runs on apt-prod-service and apt-demo-services) - cron jobs like apt_queue query Pharos for outstanding WorkItems and push the WorkItem IDs into the proper NSQ topics. For example, the ID of a WorkItem requesting a deletion goes into the apt_file_delete topic in NSQ.
- NSQ pushes WorkItem IDs to the workers that subscribe to its channels. apt_file_delete subscribes to the apt_file_delete channel, so it gets the IDs of file deletion WorkItems.
- The workers query Pharos to get the full WorkItem record associated with the ID that NSQ pushed to them. The workers also query Pharos for data associated with that WorkItem, including:
- The IntellectualObject associated with the WorkItem, if there is one.
- The GenericFile associated with the WorkItem, if there is one.
- The WorkItemState associated with the item, if there is one.
For example, each of the ingest workers adds information to an ingest manifest as it does its work. The first worker, apt_fetch, records where on the local file system it stored the tar file it just downloaded. It also records the names and checksums of all the GenericFiles it found inside the tar file, and any validation errors it encountered.
When a service like apt_fetch is done working on a WorkItem, it converts its manifest to JSON and sends that data back to Pharos to be saved in the WorkItemState table. Pharos compresses the data before saving it, because it can get quite large, especially in the case of ingest, and tends to compress to about 10% of its full size.
When the next worker picks up the WorkItem, it pulls the WorkItemState record from Pharos and from that it knows all it needs to know to do its job intelligently. For example, both apt_store, which copies files to long-term storage, and apt_record, which tells Pharos what files have been stored, can and do stop work on an item after encountering too many transient errors. (Transient errors are almost always network errors or problems with disk I/O.) They will record the WorkItemState in Pharos before requeuing the task in NSQ.
The next worker to pick up the task then loads the WorkItemState from Pharos and knows what work has been done and what has not. For example, it’s common for the system to record 500 of an IntellectualObject’s 1000 files on the first ingest attempt before running into some network problem. The next apt_store worker to pick up the ingest request will know it doesn’t have to store the first 500 files, and it starts its work at file #501.
Most services write to two logs in the /mnt/efs/apt/logs directory: one called <service>.log, and one called <service>.json. The .log file is meant to be human-readable. The .json file is meant to be machine-readable. In cases where a service was not able to send a WorkItem’s JSON state back to Pharos (in the form of a WorkItemState object), you should be able to find the JSON in the .json log. A special tool called apt_json_extractor can quickly pull individual JSON records out of very large JSON logs.
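The save-state-then-resume pattern described above can be sketched in a few lines. This is a simplified Python illustration, not the Go code that Exchange runs: the function names, the "stored" key, and the use of zlib for compression are all assumptions made for the example.

```python
import json
import zlib


def save_state(state):
    """Serialize a work item's state to compressed JSON bytes."""
    return zlib.compress(json.dumps(state).encode("utf-8"))


def load_state(blob):
    """Inverse of save_state: decompress and parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def store_files(files, state, store_one, max_errors=3):
    """Store every file not yet recorded in state["stored"].

    Gives up after max_errors transient failures so the task can be
    requeued. Returns True once every file has been stored.
    """
    stored = set(state.get("stored", []))
    errors = 0
    for name in files:
        if name in stored:
            continue  # completed on an earlier attempt
        try:
            store_one(name)
        except OSError:
            # Treat I/O and network failures as transient.
            errors += 1
            if errors >= max_errors:
                break
            continue
        stored.add(name)
    state["stored"] = sorted(stored)
    return all(name in stored for name in files)
```

A worker that returns False would persist `save_state(state)` and requeue the task; the next worker calls `load_state` and skips everything already listed as stored.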
The member-api allows depositors to programmatically query for status of events, files, intellectual objects or items in a work queue.
The Software Development Styleguide documents practices and policies that APTrust engineering staff adheres to.
APTrust Partner Tools can help you validate bags and manage your AWS buckets. Use the links below to download the tools.
The version 2.0 tools are identical to the version 1.03 tools, except for apt_validate, which can now validate both APTrust and DPN bags. These packages include json configuration files that tell the validator how to validate APTrust and DPN bags.
You won't need to create or validate DPN bags. APTrust does that for you when you push items through us to DPN. The DPN validation config is included as an example of how to write a configuration file to validate bags whose requirements differ from the standard APTrust requirements. Although the configuration does not cover every possible BagIt option, it is a solid first step toward a fast, stand-alone, configurable bag validation tool. It runs on Mac, Linux, and Windows, has no installation dependencies, and is known to have good performance on large bags.
For help with the new validator, simply run the apt_validate command with no options or parameters.
Each archive contains the five executable files listed below. There is no installer. Simply put the executables where you want them and run them from the command line.
|apt_validate||validate bags (tarred or untarred) before uploading them for ingest (Version 1.03 has a bug that marks some bags with invalid tag manifests as valid. This validator also does not identify some invalid file names. Version 2.0 fixes these bugs.)|
|apt_upload||upload bags to your receiving buckets for ingest|
|apt_list||list the contents of your receiving and restoration buckets|
|apt_download||download restored bags from your restoration buckets|
|apt_delete||delete restored bags from your restoration buckets|
All of the tools except apt_validate require a simple config file with five name-value pairs. The config file format and requirements are the same for 1.03 and 2.0 tools. Note that quotes are optional, and comment lines begin with a hash mark.
    # Config for apt_upload and apt_download
    AwsAccessKeyId = 123456789XYZ
    AwsSecretAccessKey = THIS KEY INCLUDES SPACES AND DOES NOT NEED QUOTES
    ReceivingBucket = 'aptrust.receiving.university.edu'
    RestorationBucket = "aptrust.restore.university.edu"
    DownloadDir = "/home/josie/downloads"
If you prefer not to put your AWS keys in the config file, you can put them into environment variables called AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
ReceivingBucket is the name of the S3 bucket that will hold your uploaded APTrust bags that are awaiting ingest.
RestorationBucket is the name of the S3 bucket that will hold your restored APTrust bags.
DownloadDir is the name of the local directory in which to save files downloaded from your APTrust restoration bucket. The APTrust config currently does not expand ~ to your home directory, so use an absolute path to be safe.
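If you want to reuse the same config file from your own scripts, the format is simple enough to parse by hand. The sketch below follows the rules stated above (name-value pairs, optional quotes, hash-mark comments); the function name is hypothetical, and letting environment variables fill in only when the file omits a key is an assumption about precedence, not documented behavior of the partner tools.

```python
import os

ENV_FALLBACKS = {
    "AwsAccessKeyId": "AWS_ACCESS_KEY_ID",
    "AwsSecretAccessKey": "AWS_SECRET_ACCESS_KEY",
}


def parse_partner_config(text, environ=None):
    """Parse 'Name = value' lines into a dict, honoring # comments and quotes."""
    environ = os.environ if environ is None else environ
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        value = value.strip()
        # Quotes are optional; strip one matching pair if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        config[name.strip()] = value
    # Assumption: environment variables are used only for keys missing from the file.
    for name, env_name in ENV_FALLBACKS.items():
        if not config.get(name) and environ.get(env_name):
            config[name] = environ[env_name]
    return config
```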
If you save your config file as ~/.aptrust_partner.conf in Linux/Mac or as %HOMEPATH%\.aptrust_partner.conf under Windows, you will not have to specify a --config option when you run the tools. Otherwise, run the tools with the --config option pointing to the full path of your configuration file.
Help
You can view any program's built-in documentation by passing the --help flag.
If you run into problems, send a message to help [at] aptrust.org.
APTrust reporting takes several forms, including standard reports, alerts, and email notifications.
At present, there is one major report available for viewing or download as a pdf. The report provides a breakdown of the content of a single institution. It includes:
- number of ingested intellectual objects,
- number of ingested generic files,
- total number of premis events generated for the institution's content,
- total number of work items generated for the institution's content,
- total number of bytes preserved,
- average file size for that institution,
- a breakdown of the amount of content ingested by file type.
The report can be found at https://repo.aptrust.org/reports/overview/:institution_identifier on the production site and at https://demo.aptrust.org/reports/overview/:institution_identifier on the demo site, under the 'Reports' tab of the top navigation.
There are a variety of alerts available through the web application for institutional administrators to view. The types of alerts available for viewing are:
- Failed fixity checks
- Failed ingests
- Failed restorations
- Failed deletions
- Failed DPN ingests
- Stalled DPN replications
- Stalled work items
Currently (as of September 2017) in development are email notifications that will inform institutional administrators of any failed fixity checks or successful intellectual object restorations. To avoid clogging inboxes, the emails will arrive once a day as a batch of failed fixity checks or successful restorations, rather than as one email per event.
APTrust is a node in the Digital Preservation Network (DPN).