Preservation and Storage

This section covers the regular preservation lifecycle and storage considerations for digital files preserved by APTrust.  The overall strategy is to store preservation files in S3, with an additional copy stored in Glacier.

Goals

  • Store digital files deposited with APTrust in a secure location and maintain strict management of file lifecycles and actions.
  • Provide a means to store multiple copies of digital files with diversity in the storage layer to mitigate failures of a particular storage technology.
  • Provide a means to store multiple copies of digital files with geographic diversity to mitigate the loss of content due to large regional disasters.
  • Confirm the continued integrity of digital files by generating and reporting the results of regular fixity checks on preserved files.
  • Record actions and outcomes of major events related to files managed and preserved by APTrust.

Storage

Content preserved by APTrust is stored in a combination of S3 and Glacier.  This provides redundancy, mitigates failure of a particular storage layer, and adds geographic redundancy.

Storage Redundancy

The primary preservation copy is the one maintained in the S3 layer.  The secondary preservation copy is used in case the primary copy is corrupted or the S3 storage layer fails.  Each storage layer enforces its own local redundancy policies within its region and storage layer, as documented in the AWS FAQ.

Geographic Diversity

The primary S3 storage is part of the US-Standard region, which routes copies across Northern Virginia and the Pacific Northwest.

Primary Preservation Storage (S3)

The primary S3 storage bucket is named aptrust.preservation.storage and uses the US-Standard region (see above).  Permissions on this bucket are granted ONLY to the APTrust administrative account, for proper lifecycle management, and are never granted to an external service or application.  Access to preserved items is granted only by copying the content out to an appropriate staging area after the request has been properly authorized.

Preservation Storage Logging

Activity on the preservation bucket is logged using AWS standard logging to a bucket named aptrust.preservation.logging for deeper auditing and security purposes.  This is in addition to any logging already provided by locally coded content services.
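
As a rough illustration, the logging setup described above could be expressed as the structure S3 expects for bucket logging.  This is a minimal sketch that assumes "AWS standard logging" means S3 server access logging.  The bucket names come from this page; the build_logging_status helper and the "preservation/" prefix are hypothetical and are not part of any documented APTrust configuration.

```python
from typing import Dict

# Bucket names as given on this page.
PRESERVATION_BUCKET = "aptrust.preservation.storage"
LOGGING_BUCKET = "aptrust.preservation.logging"


def build_logging_status(target_bucket: str, prefix: str) -> Dict[str, dict]:
    """Build a BucketLoggingStatus structure for S3 server access logging.

    The prefix is an assumption for illustration; this page does not say
    which prefix, if any, is used.
    """
    return {
        "LoggingEnabled": {
            "TargetBucket": target_bucket,
            "TargetPrefix": prefix,
        }
    }


status = build_logging_status(LOGGING_BUCKET, "preservation/")

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_logging(
#       Bucket=PRESERVATION_BUCKET, BucketLoggingStatus=status)
```

The call itself is left as a comment so the sketch can be read and run without AWS credentials.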

Second Tier Preservation Storage (Glacier)

Content stored in the Primary Preservation Storage is replicated to Glacier using AWS Object Lifecycle Management rules, with no expiration date on the corresponding object in Primary Preservation Storage.  As noted in that documentation, all views on content stored in Glacier are through S3, so logging is covered by the Preservation Storage Logging above.

Preservation Activities

Description of the common preservation activities performed on files.

Checksum Requirements

Checksums on Primary Preservation Storage (S3)

Although S3 has its own method of auditing and recovery, a manual process for confirming files provides more flexibility and a greater level of assurance.  A locally implemented service will regularly copy files out of S3 and confirm fixity using both md5 and sha256 values, with the outcomes reported in the administrative interface.  Objects that fail a fixity test will be retried up to 5 times to ensure the failure is a true fixity error and not a copy error.
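
The check-and-retry behaviour described above might look something like the following.  This is only a sketch under stated assumptions: the fetch callable is a hypothetical stand-in for whatever copies the object out of S3, the function names are illustrative, and "retried up to 5 times" is read here as one initial attempt plus at most five retries.

```python
import hashlib
from typing import Callable, Dict

MAX_RETRIES = 5  # retry limit stated on this page


def compute_digests(data: bytes) -> Dict[str, str]:
    """Return the md5 and sha256 hex digests of the given bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def check_fixity(fetch: Callable[[], bytes], expected: Dict[str, str],
                 max_retries: int = MAX_RETRIES) -> Dict[str, object]:
    """Copy the object out and compare both digests to the expected values.

    A mismatch triggers a fresh copy, up to max_retries extra attempts, so
    that a transient copy error is not reported as a fixity failure.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if compute_digests(fetch()) == expected:
            return {"passed": True, "attempts": attempts}
    return {"passed": False, "attempts": attempts}
```

A caller would pass a function that re-reads the object on every call, for example check_fixity(lambda: read_object(key), recorded_digests), where read_object is whatever copy-out routine the local service provides.  The returned dictionary is the kind of outcome that could then be reported to the administrative interface.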

S3 eTag Values

S3 eTag values from AWS will be reported but are NOT used to confirm fixity.  Instead, this value is used as an internal identifier for items stored in S3 and as a convenient way to determine whether files being processed are duplicates or updates of files already preserved in the system.
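
To illustrate the duplicate-or-update decision, here is a small sketch.  The classify_by_etag function, the key-to-eTag index, and the three labels are all hypothetical; this page does not describe how the comparison is actually implemented.  The only AWS-specific detail relied on is that S3 returns eTag values wrapped in double quotes, which are stripped before comparing.

```python
from typing import Dict


def normalize_etag(etag: str) -> str:
    """Strip the double quotes S3 places around eTag values."""
    return etag.strip('"')


def classify_by_etag(key: str, etag: str, known: Dict[str, str]) -> str:
    """Classify an incoming file against eTags already recorded per key.

    Returns "new" if the key is unknown, "duplicate" if the eTag matches
    the recorded one, and "update" if the key exists with another eTag.
    """
    if key not in known:
        return "new"
    if normalize_etag(known[key]) == normalize_etag(etag):
        return "duplicate"
    return "update"
```

Consistent with the text above, the result is only a processing hint; confirming fixity still relies on the md5 and sha256 checks.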

Checksums on Glacier Storage

For regular short-term fixity checking we will rely on Glacier's internal SHA256 checksum reporting as a base-level enhancement to the S3 fixity checks.  Additionally, a service may be developed if needed to manually confirm fixity on a longer timescale (~24 months).  This longer manual fixity confirmation is throttled to use Glacier's slower free IO allotment to recover files from Glacier, confirm their fixity by computing MD5 and SHA256 checksums manually, and register the outcome with the Administrative Interface.
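
One way to think about the throttling is as a batching problem: group files so that each retrieval batch stays within a byte budget.  The sketch below assumes such a budget is known and passed in by the caller; the plan_retrieval_batches function and the budget_bytes parameter are hypothetical, and the actual size of Glacier's free allotment is not stated on this page.

```python
from typing import List, Tuple


def plan_retrieval_batches(files: List[Tuple[str, int]],
                           budget_bytes: int) -> List[List[str]]:
    """Greedily group (name, size) pairs into batches within budget_bytes.

    Files are taken in the order given.  A file larger than the budget is
    placed in a batch of its own rather than being skipped.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, size in files:
        # Close the current batch if adding this file would exceed the budget.
        if current and used + size > budget_bytes:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches
```

Each batch could then be retrieved in one throttling period, checksummed, and its outcome registered, so that a full pass over the collection completes within the longer timescale mentioned above.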

MD5 Checksums

MD5 checksums are generated by depositors as part of their submission preparation process and provided as part of the original submission package to APTrust.

They provide a means for original depositors to confirm that the objects preserved by APTrust continue to match the files they deposited exactly, with bit-level granularity.

SHA256 Checksums

SHA256 checksums are generated as part of content ingestion into APTrust and are used to guard against malicious tampering in which an attacker alters a file and tries to hide the change.  SHA256 mitigates this threat because it is currently considered a cryptographically secure algorithm that is resistant to such tampering.  In the event that this algorithm is compromised in the future, we should switch to a new secure algorithm.
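
Switching algorithms later is easier if each stored checksum records which algorithm produced it.  The following is a minimal sketch of that idea, not a description of how APTrust stores checksums: the "algorithm:hexdigest" format and both function names are assumptions made for illustration.

```python
import hashlib


def tagged_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return a checksum string of the form "<algorithm>:<hexdigest>"."""
    h = hashlib.new(algorithm)
    h.update(data)
    return f"{algorithm}:{h.hexdigest()}"


def verify_tagged(data: bytes, tagged: str) -> bool:
    """Recompute the digest using the algorithm named in the tag."""
    algorithm, _, _ = tagged.partition(":")
    return tagged_digest(data, algorithm) == tagged
```

With this layout, older records tagged "sha256" would remain verifiable while newly ingested content could be tagged with a replacement algorithm such as "sha512".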