- 1 Using APTrust
- 1.1 New Institution Ramp-up
- 1.2 Technical Support
- 1.3 Technical Workflows
- 1.3.1 Required Metadata elements
- 1.3.2 Bagging specifications
- 1.3.3 Deletion
- 1.3.4 Ingest
- 1.3.5 Restoration
- 1.3.6 Updating
- 1.3.7 Ongoing Fixity Checks
- 1.4 Administrative Workflows
- 1.5 Adminonly_Technical Documentation
- 1.6 Technical Documentation
- 1.6.1 Architecture
- 1.6.2 Security
- 1.6.3 Introduction
- 1.6.4 Security Principles
- 1.6.5 Authentication & Administration
- 1.6.6 Auditing
- 1.6.7 Business Continuity
- 1.6.8 Risk Management, Threats, and Mitigations
- 1.6.9 Internal Attack
- 1.6.10 Economic Failure
- 1.6.11 Organizational Failure
- 1.7 Relevant Documents
- 1.8 Partner Tools
- 1.9 Reporting
- 1.10 DPN
APTrust provides a complete demo environment where depositors can test depositing and restoring objects. We encourage you to use the demo environment first to become familiar with the system and process.
To use APTrust's basic features, you'll receive the following for both the demo and the production environments.
- AWS keys. These let you access your receiving bucket, into which you'll upload new bags for ingest, and your restoration bucket, from which you'll download restored bags. Each member has two sets of keys: one for the demo environment, and one for the production environment. You can request AWS keys from [].
- Tools for accessing S3 buckets. Amazon maintains client libraries in commonly-used languages to help you access S3 through the language of your choice. APTrust maintains a set of partner tools for uploading to, downloading from, and listing the contents of your S3 buckets.
- A login for our Web UI. If your organization already has access to our web interfaces at https://demo.aptrust.org/ and https://repo.aptrust.org, then you have a local APTrust administrator who can set up an account for you. If no one in your organization has credentials to these sites, contact [] to get set up.
- An API key. An API key is a code passed by programs calling an application programming interface (API) to identify the calling program. You'll need this only if you plan on accessing our member API. Once you have a login for the Web UI, you may generate an API key from your user profile page. Click the "Generate an API key" button and note the generated key in a secure place. For help, contact [], and note that you'll need separate keys for the demo and production systems.
New Institution Ramp-up
Welcome to APTrust! By now you have signed the terms and conditions of deposit and are ready to start. To make your ramp-up to using APTrust as quick and easy as possible we have created a checklist to get you started.
|#||Step||You should||APTrust staff will|
|1||Membership agreement||Sign and submit the membership agreement to APTrust|
|2||Provide user information||Provide the names, institutional email addresses, and roles (admin or depositor) the individuals will fill. An "admin" user can manage other users and approve deletion requests from other users. We employ a two-person deletion policy that requires two users (an admin and a regular user) to delete data.||Use that information to set up accounts on the repository frontend and AWS backend systems.|
|3||Provide credentials to the institutional users||Make sure to store credentials in a safe manner.||Provide:
- Each institutional user (admin or depositor) with an account on the repository frontend system, "Pharos" (APTrust's web interface for managing deposits and inspecting deposit outcomes): https://demo.aptrust.org [demo system] and https://repo.aptrust.org [production system]. Note: only admin users can delete files or objects.
- Each institutional user with individual AWS credentials for depositing and restoring objects to S3 buckets. Each user will receive an email with a link to an encrypted note containing the credentials for the AWS S3 restore and deposit buckets. The note destroys itself after being opened for the first time.|
|4||Read the documentation||Read the documentation on how to use APTrust.|
|5||Log in to Pharos||Now that you have read the documentation, it's a good time to take a look at the repository frontend, Pharos. You may have noticed that you haven't received credentials for Pharos. That is because you will create them yourself. Go to each system (https://demo.aptrust.org and https://repo.aptrust.org), enter the email address you provided to us, and click the "Forgot Password?" link below.|
|6||Create an API key||Once you are able to log in to Pharos, you can create an API key in order to use the repository's API. (You may not need one if you don't plan to access the repository programmatically.) Log in to Pharos and click "View Profile" under your name in the top right corner. The API key will appear at the top. Make sure to store it safely, as you won't be able to read it again. If you lose the key, you will have to generate a new one.|
|7||Make a test deposit||Use the "Easy Store" or "Partner Tools" bagging tools to create a test bag to be deposited.|
|8||Perform a test restore||Use the Pharos web interface or an API call with the appropriate object identifier.|
The bagit.txt file is required by the BagIt specification, and should contain the following:
BagIt-Version: 0.97
Tag-File-Character-Encoding: UTF-8
Valid APTrust bags MUST contain a bag-info.txt file with the following fields, which may be blank:
Source-Organization: University of Virginia
Bagging-Date: 2018-02-27
Bag-Count: 1
Internal-Sender-Description: Twitter captures of the events of August 12 2017 in Charlottesville, VA
Internal-Sender-Identifier: AUG12
This file MAY contain additional fields; however, APTrust will not preserve the additional tags in bag-info.txt. Because bag-info.txt may contain a Payload-Oxum tag (the "octetstream sum" of the payload: a two-part number of the form "OctetCount.StreamCount", where OctetCount is the total number of octets (8-bit bytes) across all payload file content and StreamCount is the total number of payload files; intended for machine consumption), APTrust regenerates this file when you restore a bag. Consider these two situations:
- You upload a bag containing 100 files to APTrust. Then you delete 10 of those files.
- You upload a bag containing 100 files to APTrust. Then you upload a new version of that same bag, overwriting 10 files.
When we restore either one of these bags, the Payload-Oxum value, which shows the number of bytes and number of files in the payload directory, will not match the Payload-Oxum of the original bag-info.txt file, and your BagIt validator will show the bag as invalid. For this reason, we regenerate bag-info.txt when restoring the bag.
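For reference, a Payload-Oxum value can be computed locally like this (a minimal sketch assuming the standard BagIt layout with a data/ payload directory):

```python
import os

def payload_oxum(bag_dir):
    """Compute the Payload-Oxum ("OctetCount.StreamCount") for an unpacked bag:
    total bytes across all payload files, and the number of payload files."""
    octets, streams = 0, 0
    data_dir = os.path.join(bag_dir, "data")
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            octets += os.path.getsize(os.path.join(root, name))
            streams += 1
    return f"{octets}.{streams}"
```

Comparing this value against the Payload-Oxum in a restored bag's regenerated bag-info.txt is a quick way to confirm the payload you got back is the payload you expect.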
We do preserve and restore all tag files other than bag-info.txt and bagit.txt, so if your bag includes tags that you want to preserve, put those tags into other tag files.
The aptrust-info.txt file MUST be present and MUST contain the following tag fields.
- Title: Human readable title for searching and listing in APTrust. This cannot be empty.
- Description: A human-readable description of the bag.
- Access: One of three enumerated access conditions. [“Consortia”, “Restricted”, “Institution”]. These access restrictions describe who can see the object metadata, including the object's name and description, a list of its generic files and events. APTrust does not currently provide access to the objects themselves, except when the owning institution restores a bag it owns. In other words, no matter which access setting you choose, no other institution can access your intellectual object. The general public cannot see any information in the APTrust system.
- Restricted: Metadata about this object is accessible to the institutional administrator (at the depositing institution) and to the APTrust admin. No one else can even see that this object exists in the repository.
- Institution: All users at the depositing institution can see metadata about this object.
- Consortia: All APTrust members can see this object's metadata.
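As a quick sanity check before bagging, the required aptrust-info.txt tags above can be validated with a short script (a hypothetical helper; the authoritative rules are in the bagging specification):

```python
def validate_aptrust_info(text):
    """Check the required tags in an aptrust-info.txt file, per the rules above.
    Returns a list of error strings; an empty list means the file looks valid."""
    tags = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            tags[key.strip()] = value.strip()
    errors = []
    if not tags.get("Title"):
        errors.append("Title is required and cannot be empty")
    if tags.get("Access") not in ("Consortia", "Restricted", "Institution"):
        errors.append("Access must be one of Consortia, Restricted, Institution")
    return errors
```

Running this over your tag files before tarring the bag catches the most common rejection causes (an empty Title or a misspelled Access value) before upload.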
Below is a quick checklist for valid bags. For the detailed specification, see Bagging specifications.
Valid bags meet all of the following criteria:
- The bag was submitted as a tar file, without compression
- Bag name follows the pattern <institution.edu>.bag_name[.b###.of###].tar.
- Bag untars to a directory whose name matches the name of the tar file, minus the .tar extension.
- Bag contains an md5 or sha256 manifest (or both)
- Bag contains the data directory
- Bag contains bagit.txt, as described above
- Bag contains bag-info.txt as described above
- Bag contains aptrust-info.txt as described above
- All data files are in the manifest, and all checksums matched
- All tag files mentioned in the tag manifest are present, and checksums match (you may omit tag files from the tag manifests)
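Parts of this checklist can be automated. For example, the bag-name rule can be checked with a small script (the regex below is an illustrative reading of the naming pattern, not an official validator, and it checks only the file name, not the bag's contents):

```python
import re

# <institution.edu>.bag_name[.b###.of###].tar
BAG_NAME = re.compile(
    r"^[a-z0-9\-]+(\.[a-z0-9\-]+)+"   # institution domain, e.g. virginia.edu
    r"\.[\w\-\.]+?"                   # the bag name itself
    r"(\.b\d+\.of\d+)?"               # optional multipart suffix .b###.of###
    r"\.tar$"
)

def is_valid_bag_name(filename):
    """True if filename follows <institution.edu>.bag_name[.b###.of###].tar."""
    return BAG_NAME.match(filename) is not None
```

A check like this is worth wiring into your upload scripts, since a misnamed tar file will fail ingest only after it has been uploaded and queued.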
! This feature is available only to institutional administrators. !
In order to delete an object, an institutional administrator has to access the APTrust web interface (https://demo.aptrust.org or https://repo.aptrust.org). There is no API endpoint to delete files programmatically. APTrust also uses a two-step deletion system, which means all deletion requests must be confirmed via email.
After you click Delete, the web interface will send an email to other institutional administrators at your institution, asking for someone to confirm the deletion request via a link provided in the email. Once the link has been clicked, APTrust will create a delete request for each generic file in this intellectual object. You'll see the delete requests listed under the Processed Items tab of the Web UI.
Important Note: If at all possible, the system will send the request to a user other than the one who made the request, ensuring all deletion requests are seen by at least two people. However, if your institution has only one institutional administrator, they will be able to both request and confirm the deletion of an intellectual object or generic file. It is therefore highly encouraged that all institutions have at least two institutional administrators.
You can also delete individual files:
- Locate the Intellectual Object whose file(s) you want to delete.
- Click the View Preserved Files button.
- Click on the name of the generic file you want to delete.
- Click the delete button at the bottom of the page.
- Confirm the deletion of the generic file by email (typically done by a different user)
Once the deletion request has been confirmed, you will see a single delete request under the Processed Items tab.
- You can delete only objects and files that belong to your institution.
- You must be an institutional admin to delete a file or object.
- If a pending request exists to restore an intellectual object, or to send that object to DPN, the system will not allow you to delete the intellectual object or any of its files until those pending operations are complete.
- The deletion request must be reviewed and confirmed by another institutional administrator from that same institution.
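The confirmation rules above can be summarized in a small sketch (a hypothetical helper, not APTrust's actual code):

```python
def can_confirm_deletion(requester, confirmer, institution_admins):
    """Two-step deletion rule as described above: a request must be confirmed
    by an institutional admin, normally a different person than the requester.
    An institution's sole admin may confirm their own request."""
    if confirmer not in institution_admins:
        return False  # only institutional admins may confirm deletions
    if confirmer != requester:
        return True   # the normal two-person case
    return len(institution_admins) == 1  # sole admin may self-confirm
```

The last line is why having at least two institutional administrators is strongly encouraged: with only one, the two-person safeguard degrades to a single approver.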
Effect of Deletion on Metadata
When you delete a file, Pharos creates a deletion PREMIS event for the file that includes the date and time of deletion and the email address of the user who requested the deletion. Pharos keeps the generic file record, changing its state from 'Active' to 'Deleted.' Pharos also keeps all prior PREMIS events and fixity records related to the file, and the file record remains accessible through both the Web UI and the member API. In addition, Pharos keeps a Work Item record that shows when the deletion was requested and when it was completed. Work Items are also available through the Web UI and the member API.
Deletion requests for both files and objects can be made by members of the APTrust technical staff. However, those deletion requests are subject to the same requirements; specifically, that deletion request must still be reviewed by an institutional administrator at the institution to which the file or object belongs. APTrust staff do not receive deletion confirmation emails.
To store an item in APTrust, simply bag it in valid APTrust format and upload it to your institution's APTrust receiving bucket. Your receiving bucket for the demo system is:
The demo system is for testing your workflows and bagging tools. It currently accepts bags no larger than 100MB. It will ignore files larger than that in the receiving buckets.
The receiving bucket for the production system is:
Replace <institution.domain> with your organization's domain name. For example, virginia.edu, jhu.edu, etc.
You'll need AWS keys to complete the upload. If you don't yet have them, contact email@example.com. You can upload bags with [Amazon's S3 CLI tools], or by integrating one of [Amazon's S3 client libraries] into your own tools, or by using APTrust's partner tools.
Note that APTrust does not take responsibility for your bag until it is fully ingested. Fully ingested means we have stored all of the bag's contents in our long-term storage areas and have recorded an Intellectual Object record for the bag, Generic File records for each file in the bag's payload, and PREMIS events describing the ingest of the object and the ingest, identifier assignment, and initial fixity values for each of the object's files.
Monitoring the Ingest Process
APTrust periodically scans the receiving buckets and adds new bags to the ingest queue. You can follow the status of all work requests by clicking the Processed Items tab in our Web UI.
Click on an item to view status details.
The ingest process follows these steps:
- Fetch - We retrieve the bag (the tar file) from your receiving bucket.
- Unpack - We untar the bag.
- Validate - We make sure all files are present and match the checksums in the manifests. (The BagIt spec allows you to include some custom tag files without mentioning them in the tag manifests. If we find these files, we allow them, but we obviously can't validate their checksums.)
- Store - We copy the files to long-term storage in Virginia.
- Record - We tell Pharos what we ingested, recording all generic files and PREMIS events.
- Replicate - We copy all files to Glacier in Oregon and tell Pharos where we put them.
- Cleanup - If all succeeded, we delete the original tar file from your receiving bucket.
Once a bag enters the work queue, ingest can take anywhere from one minute to several hours. The two factors that affect ingest time are bag size and system load. When APTrust is flooded with ingest requests, several hours may elapse between your bag appearing in the work queue and Step 1 of the ingest process. Bags containing large amounts of data always take a long time to process, because we have to retrieve the bag from the receiving bucket and calculate md5 and sha256 checksums on all of the bag's contents.
Uploading Multiple Versions of the Same Bag
The ingest process can take anywhere from one minute to 24 hours to complete. During periods when the system is under sustained heavy load, ingest may take 2-3 days.
The two factors that determine how long an ingest will take are:
- The size of the bag.
- The overall load on the ingest servers.
When you upload a bag to your receiving bucket for ingest, our ingest services may take up to one hour to notice the new bag. At that point, ingest services add the new bag to a work queue for processing, which includes the following steps:
- Copy the bag from the S3 receiving bucket to a processing area (a hard drive attached to the ingest server). This can take anywhere from a few seconds for a small bag (under 10 MB) to several hours for a large bag (over 1 TB).
- Validation can take a few seconds for a smaller bag (under 10 MB) or several hours for a larger bag (more than 200 GB). The most time-consuming part of bag validation is fixity calculation. We calculate both md5 and sha256 checksums on every file. When APTrust validates the bag, it ensures the following:
- all files are present,
- all checksums match the payload and tag manifests (either an md5 or sha256 payload manifest is required; tag manifests are optional),
- the bag contains no extraneous or illegal files,
- no payload files are missing from the payload manifests.
- Store the bag. This involves copying the bag's files to S3 in Northern Virginia and to Glacier in Oregon. This step can take a minute or less for smaller bags (less than 10 MB) and over 24 hours for large bags containing many files (more than 1TB and more than 5000 files). A 1TB bag containing 30,000 files usually takes longer to store than a 1 TB bag containing 2 files because of limitations in the number of concurrent upload connections supported by our current S3 network library.
- Record the bag. This step involves recording metadata in Pharos about the bag and its contents. We store an Intellectual Object record for the bag, a Generic File record for each file in the bag, and a number of PREMIS events describing ingest, fixity generation, identifier assignments, etc. Since each file generates six PREMIS events, a bag of 100 files will generate about 700 records (100 files, plus 600 events), while a bag of 10,000 files will generate around 70,000 records. The record phase can take anywhere from 1 second to 20 minutes, depending on the number of files in the bag.
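The manifest checks in the validation step can be reproduced locally with a short script before you upload (a simplified sketch; real validation also enforces bag structure and naming rules):

```python
import hashlib
import os

def validate_payload_manifest(bag_dir, manifest_name="manifest-md5.txt"):
    """Check an unpacked bag's payload manifest: every listed file must exist
    and match its checksum, and every payload file must be listed.
    Returns a list of problems; an empty list means the manifest checks pass."""
    algorithm = manifest_name.split("-")[1].split(".")[0]  # "md5" or "sha256"
    problems, listed = [], set()
    with open(os.path.join(bag_dir, manifest_name)) as manifest:
        for line in manifest:
            digest, path = line.strip().split(None, 1)
            listed.add(path)
            full = os.path.join(bag_dir, path)
            if not os.path.exists(full):
                problems.append(f"missing file: {path}")
                continue
            h = hashlib.new(algorithm)
            with open(full, "rb") as f:
                h.update(f.read())
            if h.hexdigest() != digest:
                problems.append(f"checksum mismatch: {path}")
    # every payload file must appear in the manifest
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), bag_dir)
            if rel not in listed:
                problems.append(f"not in manifest: {rel}")
    return problems
```

Validating locally is much cheaper than waiting hours for a large upload to fail server-side validation.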
In addition to the four steps listed above, system load may have an impact on ingest time. If you upload a 5MB bag that would normally ingest in under a minute, and there are five 1TB bags in the work queue ahead of your bag, you may have to wait 48 hours or more for some of those large bags to process before the system even attempts to ingest your 5MB bag. This kind of backlog does not happen often, but it does happen predictably. APTrust often gets large ingests from multiple institutions in the two weeks preceding the annual Spring meeting, and again in the two weeks preceding the Fall meeting. In addition, APTrust depositors who are also DPN members may pass several terabytes through APTrust to DPN in the last two weeks of December to meet DPN's deposit deadlines.
Ingest and Stewardship / Responsibility for Materials
The depositor is responsible for their materials until a bag has been fully ingested into APTrust. Fully ingested means APTrust has validated the bag, stored its contents, and created metadata records in Pharos, including a record for the Intellectual Object, a record for each of the bag's Generic Files, and PREMIS events recording the ingest of the object and each file.
There are several ways to know that a bag has been fully ingested, including:
- The ingest Work Item record for the bag. When this appears in the Web UI or the API results with Stage = Resolve and Status = Success, the bag has been ingested. Note that for bags ingested more than once, you should check the latest Ingest Work Item for the bag.
- A new PREMIS ingestion event has been recorded for the bag. Again, for bags ingested more than once, check for an ingestion event recorded AFTER you uploaded the bag to the receiving bucket.
- You can run the apt_check_ingest tool in the APTrust Partner Tools on your own computer to check our API. The tool will list ALL ingest attempts for a specified bag, and will tell you whether and when each ingest succeeded or failed. apt_check_ingest can print results in plain text for human consumption, or in JSON format for scripted machine use.
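If you script against Work Item data (from the member API or from apt_check_ingest's JSON output), the "latest ingest succeeded" check looks roughly like this (field names in the sample are illustrative, not the API's exact schema):

```python
def latest_ingest_succeeded(work_items):
    """Given Work Item dicts for one bag, report whether the most recent
    ingest attempt finished with Stage = Resolve and Status = Success,
    per the completion criteria described above."""
    ingests = [w for w in work_items if w["action"] == "Ingest"]
    if not ingests:
        return False
    latest = max(ingests, key=lambda w: w["date"])  # ISO dates sort as strings
    return latest["stage"] == "Resolve" and latest["status"] == "Success"
```

Note how the function ignores Restore items and looks only at the newest ingest, which matters for bags ingested more than once.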
APTrust is responsible for the contents of the bag after the bag has been fully ingested.
If your workflows include deleting your local copy of the bag after upload to APTrust, they should not delete the bag until after APTrust shows it has been successfully ingested. If you delete your local copy of the bag, you should still keep a copy of the files that were packaged in the bag. APTrust was designed to be a component of your preservation strategy, not the sole steward or conservator of any content.
We currently support DPN ingest for APTrust member institutions who are also members of DPN.
More complete documentation will be coming soon.
Restoration of content can be triggered using the Pharos web interface or an API call with the appropriate object identifier. This initiates a process that repackages the objects back into a bag using the current APTrust BagIt profile, allowing depositors to maintain a single set of packaging/restoration scripts based on the current bag format. As with submission bags, content in the data directory always remains as it was originally sent to APTrust. Tag files and top-level files that are part of the wrapping bag will always reflect the current APTrust BagIt profile, so depositors do not have to keep track of which bag version was used in the past for a specific item. This achieves the following goals:
- To allow institutional admins to request that a specific Intellectual Object and all of its active child Generic Files be repackaged for download to their local institution.
- To provide a simple downloadable distribution package that conforms to the current APTrust BagIt profile, so depositors only have to maintain a means of handling the current version of the APTrust BagIt spec.
- To return the exact same bits for files deposited in APTrust for preservation.
- To return files in the data directory with the same name and relative path used in the original submission bag.
When content restoration is requested, a distribution package is created consisting of the original files and metadata, written to a BagIt bag conforming to the current APTrust BagIt profile. The files in the data directory will have the exact same names and bits that were sent to APTrust in the submission bag, and the metadata will be written to tag files adhering to the current APTrust BagIt format. This means partners only have to be able to parse a bag in the current APTrust format, and it gives us the flexibility to migrate our content models and metadata more freely in the future.
Note: The files in the restored bag will not have the same owner id, group id, and permissions as the files in the original submission.
Restoration using the Pharos Web UI
To restore a bag, locate the intellectual object to retrieve and click the Restore Object button.
This adds a restoration task to the work queue, which you can monitor by clicking the Processed Items tab, or by calling the Items endpoint of the Member API.
The restoration process can take anywhere from a few minutes to a few hours, depending on the amount of traffic our system is handling and the size of the bag. Larger bags always take longer to restore, because we calculate md5 and sha256 checksums on the bag's contents.
When we restore a bag, we retrieve all of the intellectual object's files from long-term storage, verify the checksums, reassemble the bag, write the manifests, tar up the bag, and leave it in your restoration bucket. When that's done, the Processed Items list will show your bag with a green background, with Action = Restore, Stage = Resolve, and Status = Success.
It's up to you then to retrieve and delete the item from your restoration bucket. Your restoration bucket for the demo system is:
For the production system, it's:
Replace <institution.domain> with your organization's domain name. For example, virginia.edu, jhu.edu, etc. You can download your bags using Amazon's S3 CLI tools, or by integrating one of Amazon's S3 client libraries into your own tools, or by using APTrust's partner tools.
Format and Content of Restored Bags
According to the "Other Files" section of our old documentation, APTrust did not preserve or restore custom tag files. That changed as of March 29, 2016.
When you restore a bag that was ingested after March 29, 2016, the version you get back will have the same contents and format as the bag you originally uploaded, with these exceptions: individual files you deleted through our Web UI will not be restored, and the restored bag will include the following manifests, even if they were not present in the original bag:
All tag files will be listed in the tag manifests, even those that were omitted from tag manifests in the original upload.
If you uploaded multiple versions of the same bag with the same name, you'll see the following:
- The restored bag contains the latest (last uploaded) version of each file.
- Files that were not included in later versions of a bag you uploaded multiple times, but were present in earlier versions, will be present in the restored version unless you manually deleted them through our Web UI. (Our policy is not to delete your content. You have to do that deliberately.)
- We regenerate the bag-info.txt file to prevent possible conflicts in the Payload-Oxum value. Because files in the bag may have been updated or deleted between initial ingest and restoration, the Payload-Oxum value of the original bag-info.txt file may no longer be valid, so we regenerate the entire bag-info.txt file, including only the minimum required tags. This means you may lose some valuable tag data. If you want to preserve tags, put them in a tag file other than bag-info.txt and bagit.txt.
Bags ingested prior to March 29, 2016, will be restored with the same contents as above, except:
- No custom tag files will be restored, because we didn't preserve them.
The time it takes to restore a bag depends on these three items:
- The size of the bag. Bigger bags take longer.
- The overall load on the APTrust servers.
- The number of files in the bag being restored.
A small bag (under 10 MB) with a handful of files may be restored in less than a minute. A large bag (1 TB), may take 12 hours, and even longer if it contains tens of thousands of files.
When you click the restore button to restore a bag/intellectual object, the request goes into a work queue. The system usually acts on the requests within 20 minutes. So if you're restoring a 50MB bag, the process may take 30 minutes: twenty minutes before the system begins to restore the bag, plus ten minutes to assemble it and put it into your restoration bucket.
If the restoration work queue contains many other pending requests, or if APTrust's servers are under heavy load, it may take longer to start working on your restoration request. The system's busiest times tend to be during the two weeks preceding the annual APTrust spring meeting, the two weeks preceding the fall meeting, and the last two weeks of December. During those times of heavy load, a restoration that takes 30 minutes at any other time of year may take 8 hours or more.
APTrust does not version bags. If you want to keep multiple versions of a bag, use a naming convention. For example: virginia.edu.bag_of_photos, virginia.edu.bag_of_photos_V2, virginia.edu.bag_of_photos_V3
When you upload a bag that has the same name as an existing bag, this is what happens:
- If a file in the new bag has the same name as a file in the old bag and the size or the md5 checksum or the sha256 checksum has changed, we overwrite the old file with the new one. You cannot recover the old file.
- If a file in the new bag has the same name as a file in the old bag and the size and checksums have not changed, we do nothing.
- If a file in the new bag did not exist in the old bag, we save it. If a file in the old bag is not present in the new bag, we do not delete it.
This table shows what happens when you upload a new version of a previously ingested bag.
|Old Bag||New Bag||What is stored||Reason|
|bag-info.txt||bag-info.txt (changed)||new version||Contents in new version have changed|
|data/document.pdf||data/document.pdf (unchanged)||old version||The document did not change|
|(file not present)||data/new_image.jpg||new version||File did not exist in old bag, but it's here now|
|data/old_image.jpg||(file not present)||old version||Although this file has been deleted from the new bag, we will not assume you want to delete it from storage. File deletion must be a deliberate act of the depositor.|
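The table's rules can be expressed as a small decision function (an illustrative sketch, not APTrust's implementation):

```python
def update_action(old, new):
    """Decide what happens to one file when a bag is re-uploaded, following
    the table above. Each argument is a dict with "size", "md5", and
    "sha256" keys, or None when the file is absent from that version."""
    if old is None:
        return "store new file"      # file did not exist in the old bag
    if new is None:
        return "keep old file"       # deletion must be a deliberate act
    if (old["size"] != new["size"] or old["md5"] != new["md5"]
            or old["sha256"] != new["sha256"]):
        return "overwrite with new version"  # old version is unrecoverable
    return "do nothing"              # identical file, nothing to store
```

The overwrite branch is the one to remember: once a changed file is re-ingested under the same name, the previous version cannot be recovered.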
This update policy has three important implications:
- If you want to delete files from an ingested bag / intellectual object, you must do so deliberately. Currently, you can delete only through our Web UI.
- When you restore the bag described in the table above, you'll get back both old_image.jpg and new_image.jpg (unless you manually delete one of them before you restore).
- You can update metadata in a bag by uploading only the metadata, as long as there's at least one file in the data directory and the bag is otherwise valid. This may be useful for bags that contain 100GB of data and 100KB of frequently-updated metadata.
When a file is overwritten during re-ingest:
- A new PREMIS ingestion event appears with the date of the new ingest.
- Two new PREMIS fixity generation events appear, one with the md5 checksum of the new file, and one with the sha256 checksum.
The Pharos file page for the updated file will show the new checksums at the bottom of the page, along with the date on which they were calculated. Below those will be the older checksums and the dates on which they were calculated.
Future fixity checks on an updated file test against its latest fixity value.
Ongoing Fixity Checks
APTrust automatically performs fixity checks on each preserved file every ninety days. We perform the check against the sha256 checksum that was calculated upon ingest. If a fixity check succeeds, we record a PREMIS event with the date of the check, the outcome marked as "succeeded", and the actual fixity value from the check preserved in the outcome detail. If the check fails, we record a PREMIS event with the same information, but with the outcome marked as "failed."
Our daily alerts notify administrators at depositing institutions of any failed fixity events related to their files. Our reports allow depositors to find failed fixity events at any time through the Pharos Web UI.
Each member institution requires security credentials to access their S3 storage buckets for ingest and restore. AWS identity and access management (IAM) is used to create and manage keys.
An institution is provided with two sets of access credentials: one for the production system and one for the demo system. At institution setup, the credentials are shared with the designated institutional administrator via privnote.com, which provides a one-time link to an encrypted message that destroys itself after being opened. From that point it is the institution's responsibility to keep the access credentials sufficiently secure. The institution is required to notify us if a key is compromised so that it can be revoked and a new one issued.
As stated in our security principles, we grant least privilege, that is, only the permissions required to perform a task. Hence access keys only allow institutions to access their four ingest and restore buckets (demo and production), with limited permission to write, read, and download objects (files) in those buckets. The keys do not grant access to any other AWS service.
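A least-privilege bucket policy of this kind can be sketched as an IAM policy statement. This is an illustrative sketch only; the bucket name is a placeholder, and APTrust's actual policies also cover the restore bucket (LIST, GET, DELETE):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::aptrust.receiving.example.edu"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": ["arn:aws:s3:::aptrust.receiving.example.edu/*"]
    }
  ]
}
```

Because no other resources or actions are listed, every other AWS request from this account falls through to the default DENY.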
An institution has access to APTrust's repository frontend, Pharos, which allows various read operations through its member API. During onboarding, an institution provides a list of users (name and email) who shall have access to the repository. An APTrust administrator creates user accounts on the demo and production Pharos systems and notifies the institution's users to enter their provided email address and click the "Forgot Password?" link on the repository's home page. This generates an email with a one-time link to create their password. We advise users to choose a secure password but do not require any minimum password complexity.
At the time of writing the API only provides read endpoints, so no data modification is possible through API access.
Future developments will improve security and practices as stated in https://www.pivotaltracker.com/story/show/154729518
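A member API call is an authenticated HTTP GET. The sketch below only builds the request headers; the header names are assumptions based on common Pharos client usage and are not verified here:

```python
def pharos_headers(email, api_key):
    """Build authentication headers for a Pharos member API request.

    NOTE: the X-Pharos-* header names are assumed, not confirmed against
    the Pharos API documentation.
    """
    return {
        "Accept": "application/json",
        "X-Pharos-API-User": email,
        "X-Pharos-API-Key": api_key,
    }

# A client would pass these to an HTTP library, e.g.:
#   requests.get("https://repo.aptrust.org/member-api/...", headers=pharos_headers(...))
```

Remember that demo and production use separate API keys, so the headers must be built with the key matching the environment you are calling.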
Types of access from APTrust Staff
There are various types of access to the material deposited into APTrust. These can be broken down into authorized and unauthorized uses by APTrust staff.
Authorized:
- Deletion (only with permission from admins at the depositor institution)
- Creating and revoking legitimate user accounts (Pharos)
- Creating and revoking legitimate AWS access accounts
  - All AWS access accounts resolve to an individual
Unauthorized:
- Deleting/altering production materials without depositor request/authorization
- Viewing restricted metadata and files from depositors; by policy, this is not authorized (unless required for system administration and maintenance)
  - Recommended practice: depositors use encryption for sensitive materials
- Altering any depositor content (i.e. metadata and files)
- Moving/exposing/distributing a depositor's content without express authorization from the depositor
In general UVA staff adheres to the following UVA policy:
- EC2 Instances: Serves the foundational server environment where all non-native AWS services, software and applications are deployed. Major components of the infrastructure running on EC2 instances are:
- Exchange (Content Processing Scripts): Go language services that manage work queues to monitor for the arrival of content, process content, register metadata with Pharos in its PostgreSQL database, and move content to preservation storage. Additional Go services manage work queues for file life-cycle processing (fixity) and for restoration, re-packaging intellectual objects and related generic files back into an APTrust BagIt bag.
- Pharos: Our Rails app provides a "registry" describing what IntellectualObjects and GenericFiles are in our repository, along with checksums for those files and PREMIS events describing actions taken on those objects and files. Pharos also keeps track of WorkItems, which are requests to do something with an object or file. DPN replication and restore requests have their own table in Pharos, dpn_work_items. Finally, Pharos stores blobs of compressed JSON in the WorkItemState table, which is described below. Pharos stores its data in a Multi-AZ PostgreSQL RDS instance that contains local workflow data about object processing, user data for admin interface authentication, and generic file metadata.
APTrust services run on apt-demo-services for our demo environment, and on apt-prod-services for our production environment. Each environment has separate receiving, restoration, and storage areas in S3.
- S3 (Northern VA) - Partners upload new bags to the receiving buckets here, which we query every hour or so. Partners download restored bags from the restore buckets here, which is where our apt_restore service drops restored bags for depositors. The bucket called aptrust.preservation.storage is where we store ingested files for the long term. That bucket is accessible to the APTrust admin account only. The buckets are distinguished by use:
- Receiving Buckets: Each APTrust member has an individual S3 bucket designated for the upload of submission packages to APTrust and to facilitate the hand-off of content. Access to each bucket is restricted to the designated institution, which has PUT and LIST permissions, and to the APTrust processing scripts, which have full permissions.
- Preservation Bucket: A single S3 bucket is used for central preservation storage. Files to be preserved are placed here with pointers to the file stored in the corresponding Pharos object along with any relevant metadata.
- Restoration Buckets: Each APTrust member has an individual S3 bucket designated for the download of distribution packages for restoration. Access to each bucket is restricted to the designated institution, which has LIST, GET and DELETE permissions on that bucket; the APTrust processing scripts have full access.
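The per-institution bucket layout can be pictured with a small naming helper. The naming pattern below is purely an assumption for illustration; confirm your institution's actual bucket names with APTrust:

```python
def bucket_names(domain, environment="production"):
    """Sketch of per-institution receiving/restore bucket names.

    NOTE: the naming pattern is an illustrative assumption, not a
    documented APTrust convention; real bucket names may differ.
    """
    prefix = "aptrust" if environment == "production" else "aptrust.demo"
    return {
        "receiving": f"{prefix}.receiving.{domain}",
        "restore": f"{prefix}.restore.{domain}",
    }
```

Each institution thus has four buckets in total: receiving and restore, in both demo and production.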
- AWS Glacier (Oregon) - This is where we store replication copies of all ingested files, and is accessible to the APTrust admin account only.
- apt-prod-services & apt-demo-services - These servers run the processes that perform ingest, file deletion, bag restoration, DPN ingest, and ongoing fixity checks.
- NSQ (runs on apt-prod-services and apt-demo-services) - cron jobs like apt_queue query Pharos for outstanding WorkItems and push the WorkItem IDs into the proper NSQ topics. For example, the ID of a WorkItem requesting a deletion goes into the apt_file_delete topic in NSQ.
- NSQ pushes WorkItem IDs to the workers that subscribe to its channels. apt_file_delete subscribes to the apt_file_delete channel, so it gets the IDs of file deletion WorkItems.
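The queueing step is essentially a mapping from WorkItem action to NSQ topic. Only apt_file_delete is named in the text above; the other topic names in this sketch are assumptions:

```python
# Illustrative action-to-topic routing. Only "apt_file_delete" appears in
# the documentation above; the other topic names are invented placeholders.
NSQ_TOPICS = {
    "Ingest": "apt_fetch_topic",     # assumption
    "Delete": "apt_file_delete",     # named in the text
    "Restore": "apt_restore_topic",  # assumption
}

def topic_for(work_item_action):
    """Return the NSQ topic a WorkItem ID should be pushed into."""
    try:
        return NSQ_TOPICS[work_item_action]
    except KeyError:
        raise ValueError(f"no NSQ topic for action {work_item_action!r}")
```

Workers then subscribe to channels on these topics, so each WorkItem ID reaches exactly the worker type that handles that action.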
- The workers query Pharos to get the full WorkItem record associated with the ID that NSQ pushed to them. The workers also query Pharos for data associated with that WorkItem, including:
- The IntellectualObject associated with the WorkItem, if there is one.
- The GenericFile associated with the WorkItem, if there is one.
- The WorkItemState associated with the item, if there is one.
For example, each of the ingest workers adds information to an ingest manifest as it does its work. The first worker, apt_fetch, records where on the local file system it stored the tar file it just downloaded. It also records the names and checksums of all the GenericFiles it found inside the tar file, and any validation errors it encountered.
When a service like apt_fetch is done working on a WorkItem, it converts its manifest to JSON and sends that data back to Pharos to be saved in the WorkItemState table. Pharos compresses the data before saving it, because it can get quite large, especially in the case of ingest, and tends to compress to about 10% of its full size.
When the next worker picks up the WorkItem, it pulls the WorkItemState record from Pharos, and from that it knows all it needs to do its job intelligently. For example, both apt_store, which copies files to long-term storage, and apt_record, which tells Pharos what files have been stored, can and do stop work on an item after encountering too many transient errors. (Transient errors are almost always network errors or problems with disk I/O.) They record the WorkItemState in Pharos before requeuing the task in NSQ.
The next worker to pick up the task then loads the WorkItemState from Pharos and knows what work has been done and what has not. For example, it's common for the system to record 500 of an IntellectualObject's 1000 files on the first ingest attempt before running into some network problem. The next apt_store worker to pick up the ingest request will know it doesn't have to store the first 500 files, and it starts its work at file #501.

Most services write to two logs in the /mnt/efs/apt/logs directory: one called <service>.log, and one called <service>.json. The .log file is meant to be human-readable; the .json file is meant to be machine-readable. In cases where a service was not able to send a WorkItem's JSON state back to Pharos (in the form of a WorkItemState object), you should be able to find the JSON in the .json log. A special tool called apt_json_extractor can quickly pull individual JSON records out of very large JSON logs.
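The save/compress/resume cycle described above can be sketched in a few lines. This is a simplified Python illustration of what the Go services and Pharos do; the manifest field names are invented:

```python
import json
import zlib

def save_state(manifest):
    """Serialize an ingest manifest and compress it, as Pharos does before
    writing the WorkItemState table (JSON often shrinks to ~10% of full size)."""
    return zlib.compress(json.dumps(manifest).encode("utf-8"))

def load_state(blob):
    """Decompress and parse a WorkItemState blob back into a manifest."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

def files_still_to_store(manifest):
    """A resuming apt_store worker skips files already marked as stored."""
    return [f["name"] for f in manifest["files"] if not f.get("stored")]
```

In the 1000-file example above, a manifest with 500 files flagged as stored would cause the resuming worker to begin at file #501 rather than re-copying everything.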
The Academic Preservation Trust is operated by assignment from its Governing Board to the University of Virginia as a cost-recovery activity of the University of Virginia Library in a spirit of collaboration, openness, and transparency. The underlying preservation-storage infrastructure of APTrust is provided by cloud-based vendors whose services are available to the APTrust through the University of Virginia's access to relevant Internet2 Net+ contracts. Over time, access to cloud services may be available to APTrust through other UVA purchasing "vehicles." APTrust's descriptions of specific services include identification of which contracts are involved with each (see APTrust Services and Fees List). Security of vendor-supplied services is generally provided by those vendors and has been reviewed as part of the establishment of ways to purchase the services by the contracting entity (such as Internet2 or the University of Virginia--see, for example, section 5 of the Internet2 contract information related to AWS services). Our focus in this section is on the services and technologies that APTrust directly controls.
- APTrust's IT environment is tightly controlled and access is on a minimum necessary basis.
- Least Privilege: APTrust follows the standard security advice of granting least privilege—that is, granting only the permissions required to perform a task. Otherwise we default to DENY ALL. This means that all access is denied by default and only granted to specific users if absolutely necessary.
- Only designated operators have shell access to servers.
- We use Multi-Factor-Authentication (MFA) wherever available.
- We use password-less logins (SSH keys).
- Password credentials are stored encrypted.
- Defense-in-depth: we employ multiple layers of security controls (defense) to provide redundancy in the event a security control fails.
Authentication & Administration
Authentication is defined as validating a user's identity, and basic-level authorization as determining whether a user is allowed to access the system.
All services that APTrust provides and third-party services that we use require authenticated access. Wherever possible the means of authentication are securely stored using encryption. If encryption is not possible, we strongly limit access to resources that use plain-text authentication (such as environment variables on infrastructure).
Account provisioning consists of the creation of necessary accounts (Pharos user account, AWS IAM accounts for S3 access) to use the APTrust preservation repository.
Only designated APTrust administrators create accounts and provide credentials to the account owners in a secure fashion. At the time of writing there is only one institutional AWS IAM account providing access to S3 storage (ingest and restore buckets). The APTrust administrator provides the credentials via privnote.com.
Accounts are only created for users who have been specifically identified by a designated institutional administrator. User emails are required to be under the institution's domain, e.g. Institution: University of Virginia; User: Christian Dahlhausen; Email: firstname.lastname@example.org
Account Access Termination
Institutional member accounts: Once credentials have been provided, the institution is responsible for storing them safely and for reporting any likely compromise. If a credential is compromised, the APTrust administrator deactivates it and issues new credentials to the institution. The institutional member is responsible for sub-accounts on APTrust Pharos, whether subscriber accounts or institutional member accounts. This includes but is not limited to:
- Pharos accounts and API keys
- AWS access keys
- Google document access
APTrust staff accounts: When there is APTrust staff attrition one of the APTrust administrators deactivates/deletes all accounts that pertain to access of internal and external systems. This includes but is not limited to:
- APTrust server accounts (SSH keys)
- AWS credentials
- Passpack account
- Google Apps account
- Pharos account
Account Deprovisioning details the disposition of the account in the service, and management of the user-generated data (if any) related to the account.
This particularly pertains to subscriber institutions when the member institution leaves APTrust...
Roles and Authorization
The following designated user groups have varying levels of authorization. The table uses CRUD (create-read-update-delete) notation to denote permitted operations.
Access Control Granularity
Access and Identity Tools
AWS Identity and Access Management (IAM)
AWS Access credentials
For access to the ingest and restore storage buckets (S3), a user requires AWS access credentials (AWS ACCESS KEY and AWS SECRET ACCESS KEY). These are generated using AWS IAM. At the time of writing each member institution has one institutional account with two sets of access credentials: one for APTrust's production environment and one for the demo/development environment. The credentials are shared with the institutional administrator (for APTrust) using privnote. After the credentials have been retrieved, the privnote destroys itself and the institutional administrator is responsible for keeping the access credentials secure. APTrust additionally stores each credential in Passpack as a backup.
AWS Access policies
Each IAM account has access policies that describe the level of access to services. Member institutions have limited access to their S3 storage buckets and are only allowed certain actions on the data objects. Institutional accounts do not have access to any other AWS services besides S3.
Passpack is an online password management tool that stores all credential data encrypted. APTrust has a main business account (aptrust) that manages all credentials pertaining to services used by APTrust. Each member of the APTrust operations and development team has their own account, with which some of the password entries are shared. An admin account has access to all APTrust-related credentials and is only accessible by one of APTrust's designated administrators. Passpack entries are encrypted using Double Lock (AES-128 encryption) by default. It is possible to use stronger encryption per credential (Triple Lock, AES-256 encryption), and we encourage using the stronger standard when new entries are created. APTrust staff also store the AWS access keys of member institutions in this datastore.
Workflow: A user creates a Passpack entry and transfers ownership to the main APTrust admin account. That way the admin can re-share the credentials with other members of the team and manage editing permissions.
Sensitive server and service information is stored using Ansible Vault. Vault encrypts any text file with AES-256 encryption, using a single password per file. When running Ansible playbooks and roles, Ansible prompts for the encryption/decryption password in order to read the vault file(s). For further information on how to use it, read here.
The Ansible Vault password is stored in Passpack.
Monitoring, Reporting, and Alerting
Interoperability and Portability
Risk Management describes the identification and evaluation of risks, together with procedures to avoid or minimize their impact. Due to the repository's architecture and use of Amazon Web Services (AWS), some of the threats and mitigations are covered by a Shared Security Responsibility Model and hence shared between APTrust and AWS as infrastructure provider.
The system architecture and operations policies of APTrust are based on the threat model formalized in a 2005 paper published by the LOCKSS team, Requirements for Digital Preservation Systems: A Bottom-Up Approach, and on periodic reviews of code, configuration, and policies. The identified threats are not unique to digital preservation repositories but sometimes require different mitigation strategies due to the long time horizon preservation systems are built and maintained for.
The paper identified the following threats:
All storage media must be expected to degrade with time, causing irrecoverable bit errors, and to be subject to sudden catastrophic irrecoverable loss of bulk data such as disk crashes or loss of off-line media.
Mitigation:
- APTrust stores multiple copies of data objects in multiple locations (Virginia and Oregon) to mitigate localized failures.
- Regular fixity checks ensure the accuracy and authenticity of the data object.
- The storage backend used, AWS Simple Storage Service (S3), is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities. The second storage tier, Amazon Glacier, also redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives. Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing.
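The durability figure translates into the expected-loss numbers quoted above with simple arithmetic:

```python
# S3's design target: 99.999999999% annual durability per object,
# i.e. an annual loss probability of 1e-11 per object.
annual_loss_probability = 1e-11   # = 1 - 0.99999999999
objects_stored = 10_000

expected_losses_per_year = objects_stored * annual_loss_probability  # 1e-7 objects/year
years_per_single_loss = 1 / expected_losses_per_year                 # 10,000,000 years
```
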
- Storage media is managed by AWS, and underlying "disk crashes" are mitigated by the vendor as part of their shared security responsibilities.
All hardware components must be expected to suffer transient recoverable failures, such as power loss, and catastrophic irrecoverable failures, such as burnt-out power supplies.
Mitigation: The underlying hardware is managed entirely by AWS.
Other than media failures, other components of CLOCKSS boxes can fail. Observed failures include:
Power supplies. CLOCKSS boxes have redundant power supplies, so the failure of one does not bring the box down. Motherboards, whose failure does bring the box down. CLOCKSS boxes can be down for extended periods without impairing the function of the network as a whole, as the box in Tokyo was in the aftermath of the Fukushima disaster.
The APTrust Repository does not maintain service contracts for its hardware. It owns the hardware even though it is located at remote sites. The architecture of the CLOCKSS network means that rapid response to outages at individual sites is not required. All CLOCKSS hardware shipped to remote sites is equipped with redundant power supplies, and undergoes extended burn-in before shipment. Experience shows that failures of hardware components other than disks are rare. CLOCKSS boxes are equipped with warm spare disks to cover for disk failures. Non-disk failures are typically handled by exchanging a complete server with one from the LOCKSS team. Thus service contracts are not economically justified.
Risk: There is a risk that delays in repairing hardware failures would result in enough CLOCKSS boxes being down simultaneously to impair the function of the network. This risk is mitigated by monitoring of box operations and treating hardware repair as urgent. The purchase of spare hardware that could immediately be shipped from Stanford to reduce the delay in repair is being investigated.
All software components must be expected to suffer from bugs that pose a risk to the stored data.
Mitigation: Regular unit and integration tests are run after software updates. Future automation will run these tests before deployment.
Failures in the LOCKSS software are detected and diagnosed using the logging mechanisms. They are reported, and progress in their remediation tracked, as described in LOCKSS: Software Development Process. Because the LOCKSS daemon on each box operates independently, and because its operations are heavily randomized, it is unlikely that the occurrence of failures on multiple boxes would be correlated in time. If necessary, the various processes performed by each or all LOCKSS daemons, such as collecting content and integrity checks can be individually and temporarily disabled by means of the property server.
The LOCKSS and CLOCKSS networks are so large that full-scale testing in an isolated environment is economically infeasible. Testing in available isolated environments is a part of the process, but it cannot be representative of the load encountered by software in production use. This risk is mitigated by releasing new versions of the LOCKSS software to a number of LOCKSS boxes that the LOCKSS team runs as part of the Global LOCKSS Network (GLN) before they are made generally available or released to the CLOCKSS network. The GLN includes a much larger number of boxes than the CLOCKSS network, although on average each has much less content. Experience shows that problems not detected in an isolated testing environment are most likely to be caused by a large number of boxes rather than by a large amount of content per box.
Risk: There is a risk that bugs in the LOCKSS daemon could overwrite or delete content from a CLOCKSS box. This risk is mitigated in two ways:
Exclusion from the LOCKSS daemon of code that overwrites or deletes files from the repository, so that a bug cannot inappropriately execute it. The LOCKSS: Polling and Repair Protocol, which would detect and repair the damage from another box.
Systems cannot assume that the network transfers they use to ingest or disseminate content will either succeed or fail within a specified time period, or will actually deliver the content unaltered. A recent study "suggests that between one (data) packet in every 16 million packets and one packet in 10 billion packets will have an undetected checksum error."
Ingest: HTTP is not a reliable transport protocol so, as described in CLOCKSS: Ingest Pipeline, content is ingested multiple times by different ingest machines, and subsequently the LOCKSS: Polling and Repair Protocol detects and repairs any inconsistencies between the content at the ingest boxes.
Preservation: CLOCKSS boxes are configured to use SSL for all communication between them, specifically for the LOCKSS: Polling and Repair Protocol. Certificates are checked at both ends of all connections. Corruption would thus be detected. Interruptions of communication are normal; messages are re-tried until delivered or until a specified time-out.
Dissemination: Communication problems during dissemination to the re-publishing servers would be detected by checksum verification and re-tried.
Risk: The mitigations are assessed as effective against the threat, so the risk is low.
Failure of Network Services
Systems must anticipate that the external network services they use, including resolvers such as those for domain names and persistent URLs, will suffer both transient and irrecoverable failures, both of the network services and of individual entries in them. As examples, domain names will vanish or be reassigned if the registrant fails to pay the registrar, and a persistent URL will fail to resolve if the resolver service fails to preserve its data with as much care as the digital preservation service.
Ingest: Ingest of content via harvest requires the use of DNS and the publisher's Web server. Failure of either is presumed to be transient, and thus to delay but not prevent ingest. Ingest of content via file transfer requires the use of DNS and a file transfer service such as ftp, either at the publisher or at Stanford. Failure of either is presumed to be transient, and thus to delay but not prevent ingest.
Preservation: A major design goal of the LOCKSS: Polling and Repair Protocol was to avoid all dependencies on external network services, even DNS, since there was no guarantee that the service would continue. Provided it remains possible to route packets to a network address, the failure of other network services would not affect the preservation of CLOCKSS content. Currently, DNS is required during daemon start; a fix has been designed but has yet to be implemented.
Dissemination: Transfer of triggered content to the re-publishing servers requires DNS and a file transfer protocol such as sftp. Failure of either is presumed to be transient, so it would delay but not prevent dissemination.
Risk: If Internet connectivity were to be impossible for many months the content of the individual CLOCKSS boxes would be at significant risk, but this is assessed as a low probability event. The failures would be unlikely to be correlated, so once connectivity was restored the LOCKSS: Polling and Repair Protocol would have a high probability of recovering from them, although it would take some time.
Media & Hardware Obsolescence
All media and hardware components will eventually fail. Before that, they may become obsolete in the sense of no longer being capable of communicating with other system components, or of being replaced when they do fail. This problem is particularly acute for removable media, which have a long history of remaining theoretically readable if only a suitable reader could be found.
This threat is easy to monitor for, since obsolescence would be an industry-wide event, and easy to mitigate by replacing components with newer physical or virtual resources when necessary. The LOCKSS team monitors the state of the hardware inventory as documented in CLOCKSS: Logging and Records. The APTrust Repository has technical and financial plans in place to replace failed or life-expired hardware.
There are three reasons why ingest machines or CLOCKSS boxes might need to be replaced:
- Hardware failure, which would be revealed by the monitoring processes described in CLOCKSS: Logging and Records, CLOCKSS: Ingest Pipeline and CLOCKSS: Box Operations.
- Resource exhaustion, which would be revealed by the same monitoring processes.
- Technological obsolescence, which would be evident through the staff awareness described in Awareness.
Monitoring means the risk of missing hardware failure or resource exhaustion is low, and because either would affect only one of the replicas the impact would be low. The risk of technological obsolescence of the hardware is low since it is all generic PC servers with no specialized components.
The technical specifications for the current hardware were drawn up with incremental upgrade over time in mind. The only components that we expect to upgrade in the next 5 years are the disk media. Beyond 5 years is too far ahead to draw up detailed specifications - for example would we want to use ARM-based micro-servers? Spintronic storage media? Named data networking? We can't know yet.
Risk: There is a risk that, when the time comes, financial resources would be inadequate to replace life-expired hardware. This risk is mitigated by:
- Assuming a service life (typically 5 years) for equipment that is much less than the equipment is capable of.
- Using generic, low-cost equipment.
- The replication inherent in the CLOCKSS PLN, which means that a few boxes could be out of service for some time without impacting the archive's operations.
Similarly, software components will become obsolete. This will often be manifested as format obsolescence when, although the bits in which some data was encoded remain accessible, the information can no longer be decoded from the storage format into a legible form.
The software stack consists of free, open-source, industry-standard software such as Linux and Java; internally developed free, open-source software (the LOCKSS daemon); and internally developed tools used for content testing and diagnosis of the CLOCKSS network's performance. The LOCKSS daemon used to preserve the APTrust Repository's content depends upon:
- A POSIX file system.
- A Java virtual machine, level 6 or above.
- A set of Java libraries.

Changes that would prevent the Linux environment from satisfying these requirements are considered unlikely in the foreseeable future, and if they were to be envisaged by the Linux community, it would only be after open discussion of which the LOCKSS team would be aware (see Awareness above). The LOCKSS software is maintained by the LOCKSS team using processes defined in LOCKSS: Software Development Process. LOCKSS Program technical staff monitor the evolution of the open source ecosystem and, when indicated, routinely migrate the LOCKSS software (for example, from one Java library to another deemed more suitable). The threat of obsolescence is also monitored by the testing processes described in LOCKSS: Software Development Process and CLOCKSS: Ingest Pipeline. Loss of key team members could impact the effectiveness of this; for mitigation see Awareness above. The rest of the stack is maintained by the Linux, Apache and other open source communities. Since all the software is free and open-source, no financial provision other than the normal LOCKSS: Software Development Process funding need be made for its replacement or upgrade.
One consequence of software obsolescence might be format obsolescence. The APTrust Repository implements format migration on access. Doing so depends on the eventual availability of format converters, a topic discussed here.
Risk: The risk of the open source community being unable to sustain the dependencies on the Java virtual machine, some Java libraries, and the availability of a POSIX file system, or the Apache web server, is assessed as low. These basic dependencies have been stable since the LOCKSS prototype nearly 15 years ago. The requirements development process described in LOCKSS: Software Development Process might fail to detect the need for a change from the LOCKSS community or the content being preserved. Changes in the rest of the software stack might trigger a failure of one or more dependencies of the LOCKSS daemon. The unit and functional testing processes described in LOCKSS: Software Development Process are designed to detect this. APTrust Repository income from libraries and publishers might not be adequate for the work needed to adapt to new publishers and conform to the evolution of existing publishers, leading to a backlog of content to be ingested.
One form of software obsolescence would be if the Web evolved in such a way as to prevent the LOCKSS software from collecting, preserving or disseminating current Web content. This is a significant concern for all Web archiving technologies, and the LOCKSS technical staff have been in the forefront of addressing the issue by running workshops at the International Internet Preservation Consortium. See blog posts about the 2013 and 2012 workshops. The Andrew W. Mellon Foundation is funding the LOCKSS Program to work in this area through mid-2014.
Operator actions must be expected to include both recoverable and irrecoverable errors. This applies not merely to the digital preservation application itself, but also to the operating system on which it is running, the other applications sharing the same environment, the hardware underlying them, and the network through which they communicate.
An error by an operator of the Property Server can affect the entire network, but only by:
- Interrupting service, which has no deleterious effect because each box caches the most recent set of properties.
- Distributing a syntactically malformed property file, which will be detected by the boxes and treated as a service interruption.
- Distributing a syntactically correct property file that sets unsuitable property values. The LOCKSS daemon software is skeptical of property values. Critical properties have range checks, and the code takes other defensive measures to ensure that erroneous property values can at worst cause daemon activities such as polling to stop; they cannot cause loss of or damage to content.

An error by a daemon or plugin developer, or the daemon or plugin build master, could affect the entire network, but the testing and release process is as automated as possible and designed to catch such errors before they get to the network.
Other precautions taken include:
- Operator access to CLOCKSS boxes is logged.
- Administrative actions via the LOCKSS daemon's administrative Web interface cause Alerts (see CLOCKSS: Logging and Records).

Risk: Experience of the LOCKSS system in production use shows this is a low risk, in that many such errors have been made with no serious effect.
Since each of the (currently 12) CLOCKSS boxes is configured to contain a complete copy of the Archive's content, a disaster causing the total loss of a few CLOCKSS boxes does not need to be treated as a disaster, merely the routine replacement of a few network nodes as documented in CLOCKSS: Box Operations.
All CLOCKSS triggered content is disseminated via two mirrored Web servers, one at Stanford, California and one at EDINA, Scotland. A disaster at one of these sites would not interrupt service. Mirroring could be easily restored by copying from the unaffected site or from one of the CLOCKSS boxes as documented in CLOCKSS: Extracting Triggered Content. All content triggered from the CLOCKSS network is under a Creative Commons license; there are neither technical nor legal barriers to other, unaffiliated, institutions bringing up additional mirrors.
A disaster affecting the LOCKSS team at Stanford might interrupt service in terms of ingesting new content, since 3 of the 5 presentation ingest machines are located at Stanford. Each of these ingest machines contains the current state of the ingest pipeline, so that replacement machines could be cloned from one of the remaining machines at the cost of a week or two delay in ingest. (See CLOCKSS: Ingest Pipeline). The content of the source ingest pipeline is mirrored off-site.
A disaster affecting the LOCKSS team at Stanford might interrupt service in terms of the "property server" used to manage the CLOCKSS network. The LOCKSS team maintains a hot standby of the property server in Amazon's cloud. (See LOCKSS: Property Server Operations). Each CLOCKSS box caches a complete copy of the contents of the CLOCKSS property server, so a service interruption would be unlikely to affect their operation during the time needed to fail over to the hot standby.
All CLOCKSS documents are preserved in the APTrust Repository, and thus in each of the CLOCKSS boxes.
It appears that the use of the ServeContent servlet to serve most of the triggered content is preventing the Internet Archive's Wayback Machine from preserving it. We plan to investigate possible ways around this issue so that the Internet Archive could also be a re-publishing server for triggered content.
Risk: Given the high risk of a natural disaster in the Bay Area, further attention is needed to maintaining critical data outside the area, not merely off-site.
As described in CLOCKSS: Box Operations, the configuration of each CLOCKSS box was carefully designed to prevent communication except with the other CLOCKSS boxes (enforced using SSL certificate checks at both ends of each connection) and with the CLOCKSS ingest and management machines (using firewall rules). CLOCKSS (and LOCKSS) boxes are single-function servers, there are no other services sharing the machine for an attacker to compromise. An attacker who, perhaps by compromising a machine used by a host institution's administrator, gains access to an individual CLOCKSS box does not compromise the integrity of the network as a whole, since the CLOCKSS boxes do not trust each other. The surface available to an external attacker is thus minimized. An attacker could compromise the CLOCKSS Property Server, and modify the configuration of all boxes in the network. This could impede network operations until control of the property server was restored, but due to the design of the LOCKSS technology it would not result in content in the CLOCKSS boxes being modified or lost permanently. See LOCKSS: Property Server Operations.
Each CLOCKSS box's operating system is maintained current with the CentOS repositories. Some CLOCKSS boxes update automatically from these repositories within 24 hours, some require administrator intervention. This mitigates the risk that an erroneous update from CentOS would impact all CLOCKSS boxes almost simultaneously.
The process by which security requirements for the LOCKSS software are developed and addressed is described in LOCKSS: Software Development Process. Once a security enhancement for the LOCKSS daemon is released, all CLOCKSS boxes install it automatically within 24 hours.
The following precautions are taken to prevent unauthorized access via a CLOCKSS box's administrative Web interface:
- Packet filters prevent access except from the box's host institution's network and from the LOCKSS team's subnet at Stanford.
- Access requires HTTPS.
- Administrative access is logged.
- Administrative actions cause Alerts; see CLOCKSS: Logging and Records.

If an attack compromises one or more ingest boxes, the ingest network should be stopped via the property server, all boxes disconnected from the network, the vulnerability diagnosed, and all boxes wiped and their BIOS and operating system re-installed from scratch. Their content should be re-ingested from the publisher.
If an attack compromises one or more production boxes, the production network should be stopped via the property server, the affected boxes disconnected from the network, the vulnerability diagnosed, and the affected boxes wiped and their BIOS and operating system re-installed from scratch. Unless a majority of the production boxes were compromised, the LOCKSS: Polling and Repair Protocol will detect and repair any corruption of their content.
The open source community maintainers could issue a faulty update to a component of the CLOCKSS box stack. The LOCKSS team could issue a faulty software update. These risks are mitigated by the configuration of the CLOCKSS boxes, which prevents communication except with specifically authorized IP addresses, making it difficult for an attacker to exploit a remote vulnerability, and which prevents login access except by host institution administrators.
Internal attack could take one of two forms:
- Insider abuse at the CLOCKSS host institutions. This is limited to affecting a single box, not the preserved content as a whole, because each of the (currently 12) CLOCKSS boxes is independently administered; insiders at the host institution have access only to their box. The CLOCKSS boxes do not trust each other, only the consensus of the boxes as a whole.
- Insider abuse by the LOCKSS team. The policy is that when a new CLOCKSS box is brought up, the LOCKSS staff managing the network have write and administrative access to it via sudo. All such accesses are logged. Once confidence is achieved in the working relationship with staff at the host institution, this access is terminated. This stage has been achieved with 7/11 remote CLOCKSS boxes. Eventually, the LOCKSS staff will have such access only to the box at Stanford; their access to the other boxes is limited to:
- read-only data collection (see CLOCKSS: Box Operations)
- changes to the LOCKSS daemon configuration (see LOCKSS: Property Server Operations and the discussion of Operator Error above)
- changes to the LOCKSS daemon software (see LOCKSS: Software Development Process), which could introduce malicious code into the network. This risk is mitigated by the use of GitHub's source code control system, which allows code changes to be traced to their authorized committers and easily rescinded; code signing (CLOCKSS boxes verify the signature on all software, whether from the LOCKSS team or from the CentOS repositories, before installing it); and the staged release process.
With the exception of the Stanford CLOCKSS box, staff at the host institution of each CLOCKSS box have access only to their box, not to any of the others. Their role in changing the system is limited to maintaining the operating system of their box current with the requirements of the network. LOCKSS staff have read-only access to all boxes for monitoring and data collection purposes.
Members of the LOCKSS staff have delineated roles, responsibilities and authorizations regarding making changes to the system as follows:
- LOCKSS technical staff can check changes in to the daemon source code repository, but these changes are not deployed to the CLOCKSS boxes until they have been tested, included in a release candidate built and signed by the LOCKSS build master, and approved for release by the LOCKSS technical lead. The LOCKSS source code control system identifies the author of each change to the system. See LOCKSS: Software Development Process.
- LOCKSS content staff can check changes in to the plugin source, but these changes are not deployed to the CLOCKSS boxes until they have been tested, included in a plugin release built and signed by the plugin build master, and approved for release by the plugin lead. See LOCKSS: Software Development Process.
- Access to the server room containing the Stanford CLOCKSS box, the property server and other critical systems is restricted to the LOCKSS sysadmin and the LOCKSS senior engineers. Similar secure physical locations are required of the other CLOCKSS boxes; see CLOCKSS: Box Operations.

Risk: There is a risk that the LOCKSS build master could compromise the build process to introduce malware. Although this would be evident after the damage was done, because the signed package would not correspond to the tagged source, it is hard to see any pro-active mitigation.
The LOCKSS software is maintained by the LOCKSS team, funded jointly by the APTrust Repository and the LOCKSS Alliance. The LOCKSS team has been economically sustainable for more than 5 years solely on this basis without grant funding.
There is a risk that the CLOCKSS administration might commit to preserve publishers whose content is very large without charging them enough to fund the storage necessary for their content. This risk is mitigated by regular reports on system capacity to CLOCKSS administration. Loss of APTrust Repository library members would reduce funding without corresponding reduction in content (as loss of publisher members would) and might make timely hardware replacements difficult. This risk is mitigated by the 30-year history of exponential drops in storage cost per byte, and the existence of 12 complete replicas of the content, which makes the temporary loss of a few replicas while waiting for replacement less important.
For the business aspects of failing over to a successor organization see CLOCKSS: Succession Plan.
If, as part of the CLOCKSS: Succession Plan, it becomes necessary to transfer custody of the content of the APTrust Repository, this could be achieved in multiple ways. The successor organization could take custody of the content and metadata by, among other possible means:
- Importing the content exported by a production CLOCKSS box in one of the packaging formats supported by the LOCKSS daemon, including ZIP, TAR and WARC files.
- Crawling the content from a production CLOCKSS box using a standard Web crawler such as the Internet Archive's Heritrix.
- Using shell scripts to traverse the file systems containing the LOCKSS daemon's repository, described in Definition of AIP, to create a different packaging format, then importing that.
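As an illustrative sketch only (not LOCKSS code), the traverse-and-package approach could look like this, using Python's standard library to bundle a repository directory into a TAR archive:

```python
import os
import tarfile


def package_repository(repo_dir, output_path):
    """Package the contents of repo_dir into a single gzipped TAR archive.

    Hypothetical helper for illustration: the real LOCKSS daemon export
    supports ZIP, TAR and WARC packaging; this sketch shows only the
    simplest TAR case.
    """
    with tarfile.open(output_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to repo_dir's parent
        tar.add(repo_dir, arcname=os.path.basename(repo_dir))
    return output_path
```

A successor organization could then import the resulting archive with any standard TAR tooling.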
- Requirements for Digital Preservation Systems: A Bottom-Up Approach
- LOCKSS: A Peer-to-Peer Digital Preservation System
- Attrition Defenses for a Peer-to-Peer Digital Preservation System
- LOCKSS Talks page
- LOCKSS Publications page
- Dr. David S. H. Rosenthal's blog
- CLOCKSS: Box Operations
- LOCKSS: Polling and Repair Protocol
- CLOCKSS: Extracting Triggered Content
- CLOCKSS: Logging and Records
- CLOCKSS: Ingest Pipeline
- LOCKSS: Property Server Operations
- CLOCKSS: Hardware and Software Inventory
- LOCKSS: Software Development Process
- LOCKSS: Format Migration
- CLOCKSS: Succession Plan
- Definition of AIP
- TALAGALA, N. Characterizing Large Storage Systems: Error Behavior and Performance Benchmarks. PhD thesis, CS Div., Univ. of California at Berkeley, Berkeley, CA, USA, Oct. 1999. <http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1066.pdf>
The member API allows depositors to programmatically query the status of events, files, intellectual objects, or items in a work queue.
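For illustration, an authenticated query against the member API might be built as follows; the endpoint path and header names here are assumptions for the sketch, not confirmed API details (check the member API documentation for the real ones):

```python
import urllib.request

BASE_URL = "https://demo.aptrust.org"  # demo environment


def build_request(path, api_user, api_key):
    """Build an authenticated GET request for the member API.

    The header names below are assumed for this sketch.
    """
    req = urllib.request.Request(BASE_URL + path)
    req.add_header("X-Pharos-API-User", api_user)  # assumed header name
    req.add_header("X-Pharos-API-Key", api_key)    # assumed header name
    req.add_header("Accept", "application/json")
    return req


# To execute, pass the request to urllib.request.urlopen() and parse
# the JSON response body; "/items" below is a hypothetical path.
req = build_request("/items", "email@example.com", "your-api-key")
```

Remember that the demo and production systems use separate API keys.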
APTrust breaks bags into individual files upon ingest and saves each file to Amazon's S3 storage in Northern Virginia and to Glacier storage in Oregon. We maintain a central registry describing all intellectual objects, which files belong to each object, and where those files are stored. The registry is a SQL database that is replicated across multiple AWS availability zones in Virginia, and to Oregon. We also store daily and weekly snapshots of the database on an EBS (disk) drive in Amazon's Virginia data center. (This is the SQL database underlying our Pharos application, which provides access to data through both a Web UI and a REST API.)
In addition to the SQL database, we tag each file stored in S3 and Glacier with the following metadata:
- Content Type - Stored as a MIME type, e.g. image/jpeg, text/plain, etc.
- Bag - This is the name of the intellectual object to which the file belongs.
- Bag Path - This is the original path of the file within the bag (SIP) when we received the bag for ingest.
- Institution - The domain name of the institution that owns the file. For example, virginia.edu, ncsu.edu, etc.
- MD5 - The md5 checksum of the file.
- SHA256 - The sha256 checksum of the file.
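The checksum fields above can be computed client-side as well; a minimal sketch assembling the per-file metadata (the field names follow the list above, but the helper itself is hypothetical, not APTrust's ingest code):

```python
import hashlib
import mimetypes


def build_file_metadata(file_path, bag_name, bag_path, institution):
    """Compute metadata like that attached to each stored file.

    Illustrative sketch only: reads the file in chunks so large
    preservation files do not need to fit in memory.
    """
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)
            sha256.update(chunk)
    content_type, _ = mimetypes.guess_type(file_path)
    return {
        "Content-Type": content_type or "application/octet-stream",
        "Bag": bag_name,
        "Bag-Path": bag_path,
        "Institution": institution,
        "MD5": md5.hexdigest(),
        "SHA256": sha256.hexdigest(),
    }
```

Computing these before upload also lets you verify them against the values APTrust records after ingest.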
If APTrust were to lose its copies of the SQL database in both availability zones in Virginia, and the copy in Oregon, and the backups, we would still know where all of our preservation files are. And based on the metadata attached to those files, we would still be able to reconstruct all ingested bags, and we would know which institutions those bags and files belong to.
The image to the right shows an example of the metadata attached to each file.
The Software Development Styleguide documents practices and policies that APTrust engineering staff adheres to.
APTrust Partner Tools can help you validate bags and manage your AWS buckets. Use the links below to download the tools.
This version fixes some minor bugs and standardizes exit codes across all of the tools. It also replaces the underlying S3 library used for uploads and downloads with the official Amazon S3 library. The apt_upload program includes some breaking changes from the last release. Read the README.txt file included in the package.
The version 2.1 tools are identical to the version 1.03 tools, except for the following additions:
- apt_check_ingest, which can tell you the ingest status of a bag from the command line. (Added Nov. 17, 2017)
  - 2017-11-28: Added -etag option, added etag to text output, and improved readability of text output.
- apt_validate, which can now validate both APTrust and DPN bags. (Added Feb. 17, 2017)
  - 2017-11-28: Fixed EXCHANGE_HOME/GOPATH error message that prevented the tool from running.
  - 2017-11-28: Fixed validation of untarred path on Windows.
The 2.1 packages include json configuration files that tell the validator how to validate APTrust and DPN bags.
You won't need to create or validate DPN bags. APTrust does that for you when you push items through us to DPN. The DPN validation config is included as an example of how to write a configuration file to validate bags whose requirements differ from the standard APTrust requirements. Although the configuration does not cover every possible BagIt option, it is a solid first step toward a fast, stand-alone, configurable bag validation tool. It runs on Mac, Linux, and Windows, has no installation dependencies, and is known to have good performance on large bags.
For help with the new validator, simply run the apt_validate command with no options or parameters.
Each archive contains the executable files listed below. There is no installer. Simply put the executables where you want them and run them from the command line.
| Tool | Description |
| --- | --- |
| apt_check_ingest | Command-line tool to check the ingest status of a bag. |
| apt_validate | Validate bags (tarred or untarred) before uploading them for ingest. (Version 1.03 has a bug that marks some bags with invalid tag manifests as valid, and does not identify some invalid file names. Version 2.0 fixes these bugs.) |
| apt_upload | Upload bags to your receiving bucket for ingest. |
| apt_list | List the contents of your receiving and restoration buckets. |
| apt_download | Download restored bags from your restoration bucket. |
| apt_delete | Delete restored bags from your restoration bucket. |
All of the tools except apt_validate require a simple config file with five name-value pairs. The config file format and requirements are the same for 1.03 and 2.0 tools. Note that quotes are optional, and comment lines begin with a hash mark.
```
# Config for apt_upload and apt_download
AwsAccessKeyId = 123456789XYZ
AwsSecretAccessKey = THIS KEY INCLUDES SPACES AND DOES NOT NEED QUOTES
ReceivingBucket = 'aptrust.receiving.universityname.edu'
RestorationBucket = "aptrust.restore.universityname.edu"
DownloadDir = "/home/josie/downloads"
AptrustApiUser = "email@example.com"
AptrustApiKey = "f887afc5e1624eda92ae1a5aecdf210c"
```
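For illustration, this simple name = value format (hash-mark comments, optional quotes) could be read with a short helper; this is a sketch, not the partner tools' actual parser:

```python
def parse_partner_config(text):
    """Parse a partner-tools style config: name = value pairs,
    optional single or double quotes around values, and comment
    lines beginning with a hash mark. Illustrative sketch only."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if "=" not in line:
            continue  # ignore malformed lines
        name, _, value = line.partition("=")
        value = value.strip()
        # Strip one layer of matching quotes, if present
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        config[name.strip()] = value
    return config
```

Note that, as the example shows, unquoted values may contain spaces and are kept whole.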
If you prefer not to put your AWS keys in the config file, you can put them into environment variables called
ReceivingBucket: S3 bucket that will hold your uploaded APTrust bags that are awaiting ingest.
RestorationBucket: S3 bucket that will hold your restored APTrust bags.
DownloadDir: The local directory in which to save files downloaded from your APTrust restoration bucket. The APTrust config currently does not expand ~ to your home directory, so use an absolute path to be safe.
AptrustApiKey: Your API key for the Pharos REST API. This key must match the user email. (That is, firstname.lastname@example.org cannot log in with a key that was issued to email@example.com.)
If you save your config file as ~/.aptrust_partner.conf on Linux/Mac or as %HOMEPATH%\.aptrust_partner.conf on Windows, you will not have to specify a --config option when you run the tools. Otherwise, run the tools with the --config option pointing to the full path of your configuration file.
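A script that wraps the tools could resolve the default config location described above like this (an illustrative sketch; the partner tools do their own lookup):

```python
import os


def default_config_path():
    """Return the default partner-tools config location:
    ~/.aptrust_partner.conf on Linux/Mac, or a path under
    %HOMEPATH% on Windows. Illustrative sketch only."""
    if os.name == "nt":
        home = os.environ.get("HOMEPATH") or os.path.expanduser("~")
    else:
        home = os.path.expanduser("~")
    return os.path.join(home, ".aptrust_partner.conf")
```

Since the tools do not expand ~ in config values such as DownloadDir, a wrapper like this could also pre-expand paths with os.path.expanduser before writing the config.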
You can view any program's built-in documentation by passing the
If you run into problems, send a message to help [at] aptrust.org.
APTrust reporting comes in several forms, including standard reports, alerts, and various email notifications.
At present, there is one major report available for viewing or download as a PDF. The report provides a breakdown of the content of a single institution. It includes:
- number of ingested intellectual objects,
- number of ingested generic files,
- total number of PREMIS events generated for the institution's content,
- total number of work items generated for the institution's content,
- total number of bytes preserved,
- average file size for that institution,
- a breakdown of the amount of content ingested by file type.
The report can be found at https://repo.aptrust.org/reports/overview/:institution_identifier on the production site and at https://demo.aptrust.org/reports/overview/:institution_identifier on the demo site, under the 'Reports' tab of the top navigation.
There are a variety of alerts available through the Pharos web application for institutional administrators to view. The types of alerts available for viewing are:
- Failed fixity checks
- Failed ingests
- Failed restorations
- Failed deletions
- Failed DPN ingests
- Stalled DPN replications
- Stalled work items
As of September 2017, email notifications inform institutional administrators of any failed fixity checks or successful intellectual object restorations as they happen. The emails come once a day, covering that day's batch of failed fixity checks or successful restorations, rather than one email per event, so as not to clog email inboxes.
Double Fault Deletion
Deletion of an intellectual object triggers an email sent to institutional administrators at the parent institution in order to confirm the deletion request.
APTrust is a node in the Digital Preservation Network (DPN).