5.1.2.1 Mechanisms in place to ensure any/multiple copies of digital objects are synchronized

Status: Ready for review
Compliance Rating: Fully compliant
Responsible:

The repository shall have mechanisms in place to ensure any/multiple copies of digital objects are synchronized.

Supporting Text

This is necessary in order to ensure that multiple copies of a digital object remain identical, within a time established as acceptable by the repository, and that a copy can be used to replace a corrupted copy of the object.

Examples for Meeting the Requirement

Synchronization workflows; system analysis of how long it takes for copies to synchronize; procedures/documentation of synchronization processes.

Discussion

The disaster recovery plan should address what to do should a disaster and an update coincide. For example, if one copy of an object is altered and a disaster occurs while the second is being updated, there needs to be a mechanism to assure that the copy will be updated at the first available opportunity. The mechanisms to synchronize copies of digital objects should be able to detect bit corruption and validate fixity checks before synchronization is attempted.
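The sketch below (in Go, and not APTrust's actual code) illustrates the kind of check described here: the candidate copy's SHA-256 digest is recomputed and compared to the digest recorded at ingest before that copy is allowed to replace a corrupted one. The file path and digest shown are placeholders.

// fixity_check.go: a minimal sketch of validating a known-good copy's
// fixity before using it to replace a corrupted copy.
package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "os"
)

// sha256Of streams a file through SHA-256 and returns the hex digest.
func sha256Of(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()
    h := sha256.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}

// safeToReplace reports whether the candidate copy matches the checksum
// recorded at ingest, so corruption is never propagated during synchronization.
func safeToReplace(candidatePath, recordedSHA256 string) (bool, error) {
    actual, err := sha256Of(candidatePath)
    if err != nil {
        return false, err
    }
    return actual == recordedSHA256, nil
}

func main() {
    ok, err := safeToReplace("/staging/example.pdf", "expected-hex-digest")
    if err != nil {
        fmt.Println("fixity check failed:", err)
        return
    }
    fmt.Println("safe to replace:", ok)
}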

Evidence Provided

The ingest process assures that digital objects are correctly stored in S3 and Glacier. Part of that process is fixity validation before anything is stored in the repository.

The Pharos web application (APTrust's web interface for managing deposits and inspecting deposit outcomes) stores a work item list of items that are queued for ingest, restore, or deletion. Errors during any of these processes are logged there. It is the responsibility of the depositor to check on ingest errors.

Once items are ingested, regular fixity checks occur every 90 days, and depositors are notified by email if errors occur.
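As a rough illustration of the 90-day cycle, the Go sketch below (with assumed field names such as LastFixityCheck) selects the files whose last fixity check is older than the configured interval; a notification would follow only when a recomputed digest fails to match.

// fixity_queue.go: a sketch, with hypothetical field names, of how files
// might be selected for the 90-day fixity check cycle.
package main

import (
    "fmt"
    "time"
)

type GenericFile struct {
    Identifier      string
    SHA256          string
    LastFixityCheck time.Time
}

// dueForFixity returns the files whose last fixity check is older than the
// given interval (90 days in APTrust's case).
func dueForFixity(files []GenericFile, interval time.Duration) []GenericFile {
    cutoff := time.Now().Add(-interval)
    var due []GenericFile
    for _, f := range files {
        if f.LastFixityCheck.Before(cutoff) {
            due = append(due, f)
        }
    }
    return due
}

func main() {
    files := []GenericFile{
        {Identifier: "virginia.edu/bag1/data/a.pdf", LastFixityCheck: time.Now().AddDate(0, 0, -120)},
        {Identifier: "virginia.edu/bag1/data/b.pdf", LastFixityCheck: time.Now().AddDate(0, 0, -10)},
    }
    for _, f := range dueForFixity(files, 90*24*time.Hour) {
        fmt.Println("queue fixity check for", f.Identifier)
        // An email notification would follow only if the recomputed digest
        // does not match f.SHA256.
    }
}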

See below for details.

The ingest process

The ingest process begins when a depositor uploads a bag (a tar file) to their receiving bucket. The receiving buckets follow the naming convention aptrust.receiving.<member identifier>. APTrust uses domain names as member identifiers, so UVA’s receiving bucket is aptrust.receiving.virginia.edu, UNC’s is aptrust.receiving.unc.edu, etc. The demo system has its own set of receiving buckets, whose names follow the pattern aptrust.receiving.test.<member identifier>.
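The naming convention can be expressed as a small helper; the Go sketch below is illustrative only, and the function name is ours.

// bucket_names.go: a small sketch of the receiving-bucket naming convention
// described above.
package main

import "fmt"

// receivingBucket builds the receiving bucket name for a member, using the
// member's domain name as the identifier. Demo buckets insert "test.".
func receivingBucket(memberID string, demo bool) string {
    if demo {
        return fmt.Sprintf("aptrust.receiving.test.%s", memberID)
    }
    return fmt.Sprintf("aptrust.receiving.%s", memberID)
}

func main() {
    fmt.Println(receivingBucket("virginia.edu", false)) // aptrust.receiving.virginia.edu
    fmt.Println(receivingBucket("unc.edu", true))       // aptrust.receiving.test.unc.edu
}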

A cron job called apt_bucket_reader runs every hour or so on apt-prod-services. It scans all of the receiving buckets for new tar files and, for each new file, creates a WorkItem with action "Ingest" and copies that WorkItem's ID into NSQ's apt_fetch topic.
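The Go sketch below gives a simplified picture of what such a bucket reader does: list each receiving bucket, pick out new tar files, and queue an ingest WorkItem ID in NSQ's apt_fetch topic. The bucket list, NSQ address, and WorkItem IDs are stand-ins, and the Pharos call that actually creates the WorkItem is omitted.

// bucket_reader.go: a simplified sketch of a receiving-bucket scan.
package main

import (
    "fmt"
    "log"
    "strconv"
    "strings"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
    "github.com/nsqio/go-nsq"
)

func main() {
    sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
    svc := s3.New(sess)

    producer, err := nsq.NewProducer("127.0.0.1:4150", nsq.NewConfig())
    if err != nil {
        log.Fatal(err)
    }
    defer producer.Stop()

    buckets := []string{"aptrust.receiving.virginia.edu", "aptrust.receiving.unc.edu"}
    nextWorkItemID := 1 // stand-in for the ID Pharos would assign

    for _, bucket := range buckets {
        out, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{Bucket: aws.String(bucket)})
        if err != nil {
            log.Printf("cannot list %s: %v", bucket, err)
            continue
        }
        for _, obj := range out.Contents {
            if !strings.HasSuffix(*obj.Key, ".tar") {
                continue // only tarred bags are ingested
            }
            // In the real workflow a WorkItem with action "Ingest" is created
            // in Pharos here; this sketch just logs it and pushes its ID.
            fmt.Printf("new bag %s/%s -> WorkItem %d\n", bucket, *obj.Key, nextWorkItemID)
            if err := producer.Publish("apt_fetch", []byte(strconv.Itoa(nextWorkItemID))); err != nil {
                log.Printf("publish failed: %v", err)
            }
            nextWorkItemID++
        }
    }
}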

From there, the apt_fetch service on apt-prod-services downloads the file to a staging area, an Elastic Block Storage (EBS) mount attached to apt-prod-services. It reads and validates the bag without untarring it, and if the bag is valid, it pushes the WorkItem ID into NSQ’s apt_store topic.
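Validating without untarring means reading the tar stream entry by entry. The Go sketch below checks only that the required BagIt files are present; the real validator also verifies payload checksums against the manifests. File paths are placeholders.

// bag_check.go: a minimal sketch of checking a BagIt bag inside a tar file
// without extracting it.
package main

import (
    "archive/tar"
    "fmt"
    "io"
    "log"
    "os"
    "strings"
)

func validateBag(tarPath string) error {
    f, err := os.Open(tarPath)
    if err != nil {
        return err
    }
    defer f.Close()

    found := map[string]bool{"bagit.txt": false, "manifest-md5.txt": false}
    tr := tar.NewReader(f)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }
        // Entries look like "bagname/bagit.txt"; strip the top-level directory.
        parts := strings.SplitN(hdr.Name, "/", 2)
        if len(parts) == 2 {
            if _, ok := found[parts[1]]; ok {
                found[parts[1]] = true
            }
        }
    }
    for name, ok := range found {
        if !ok {
            return fmt.Errorf("bag is missing required file %s", name)
        }
    }
    return nil
}

func main() {
    if err := validateBag("/mnt/ebs/staging/example.edu.bag1.tar"); err != nil {
        log.Fatal(err) // invalid bags are rejected before storage
    }
    fmt.Println("bag is valid; push WorkItem ID to apt_store")
}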

apt_store copies individual files from the tarred bag into S3 in Northern Virginia and Glacier in Oregon. We generally do not copy files from the tar archive to disk before uploading them; we read them straight from the tar file into S3. However, due to a poor design choice in the official AWS S3 uploader for Go, we do have to copy files larger than 100MB or so to disk before uploading them to S3 and Glacier. This makes the upload of large files quite slow.
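The Go sketch below approximates this behavior: most entries are streamed from the tar reader straight into the S3 uploader, while entries over roughly 100MB are spooled to a temporary file first. The destination bucket name and threshold constant are assumptions for illustration.

// store_sketch.go: a simplified sketch of streaming tar entries into S3,
// spooling large files to disk first.
package main

import (
    "archive/tar"
    "io"
    "log"
    "os"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

const largeFileThreshold = 100 * 1024 * 1024 // roughly 100MB

func storeBag(tarPath, bucket string) error {
    sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
    uploader := s3manager.NewUploader(sess)

    f, err := os.Open(tarPath)
    if err != nil {
        return err
    }
    defer f.Close()

    tr := tar.NewReader(f)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            return nil
        }
        if err != nil {
            return err
        }
        if hdr.Typeflag != tar.TypeReg {
            continue
        }

        var body io.Reader = tr
        if hdr.Size > largeFileThreshold {
            // Large files are copied to disk first, which slows their upload.
            tmp, err := os.CreateTemp("", "apt_store_*")
            if err != nil {
                return err
            }
            if _, err := io.Copy(tmp, tr); err != nil {
                return err
            }
            tmp.Seek(0, io.SeekStart)
            defer os.Remove(tmp.Name())
            defer tmp.Close()
            body = tmp
        }

        _, err = uploader.Upload(&s3manager.UploadInput{
            Bucket: aws.String(bucket),
            Key:    aws.String(hdr.Name),
            Body:   body,
        })
        if err != nil {
            return err
        }
        log.Printf("stored %s in %s", hdr.Name, bucket)
    }
}

func main() {
    if err := storeBag("/mnt/ebs/staging/example.edu.bag1.tar", "aptrust.preservation"); err != nil {
        log.Fatal(err)
    }
}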

After each file is copied into S3 and then Glacier, apt_store records where and when the file was stored in the in-memory GenericFile record. When all files have been stored to S3 and Glacier, apt_store pushes the WorkItem ID into the apt_record topic of NSQ.
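A rough sketch of that bookkeeping, with assumed field names on the GenericFile record, is shown below; the storage URLs are placeholders.

// generic_file.go: a sketch of recording storage results on the in-memory
// GenericFile record before the WorkItem ID goes into apt_record.
package main

import (
    "fmt"
    "time"
)

type GenericFile struct {
    Identifier   string
    S3URL        string    // where the file landed in S3 (Northern Virginia)
    GlacierURL   string    // where the replica landed in Glacier (Oregon)
    StoredAt     time.Time // when primary storage completed
    ReplicatedAt time.Time // when the Glacier copy completed
}

// recordStorage fills in the storage outcome after both copies are written.
func recordStorage(gf *GenericFile, s3URL, glacierURL string) {
    now := time.Now().UTC()
    gf.S3URL = s3URL
    gf.GlacierURL = glacierURL
    gf.StoredAt = now
    gf.ReplicatedAt = now
}

func main() {
    gf := GenericFile{Identifier: "virginia.edu/bag1/data/report.pdf"}
    recordStorage(&gf,
        "https://s3.amazonaws.com/example-preservation/report.pdf",
        "https://s3-us-west-2.amazonaws.com/example-preservation-oregon/report.pdf")
    fmt.Printf("%+v\n", gf)
    // Once every file in the bag is recorded like this, the WorkItem ID is
    // pushed into the apt_record topic.
}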

apt_record records a new IntellectualObject record in Pharos, along with one new GenericFile record for each ingested file. It also records two checksums for each file (an md5 and a sha256), and it records a series of events for the object and each of its files. These events include creation, ingestion, identifier assignment, message digest calculation, access assignment (for the object), and replication.
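The Go sketch below shows one way such event records could be assembled in memory for an object and its files; the struct and helper are illustrative and do not reflect Pharos's actual data model or API.

// premis_events.go: a sketch of building the event records described above.
package main

import (
    "fmt"
    "time"
)

type PremisEvent struct {
    EventType string
    DateTime  time.Time
    ObjectID  string // IntellectualObject or GenericFile identifier
}

// eventsForFile builds the per-file events recorded at ingest.
func eventsForFile(fileID string) []PremisEvent {
    now := time.Now().UTC()
    types := []string{"creation", "ingestion", "identifier assignment",
        "message digest calculation", "replication"}
    events := make([]PremisEvent, 0, len(types))
    for _, t := range types {
        events = append(events, PremisEvent{EventType: t, DateTime: now, ObjectID: fileID})
    }
    return events
}

func main() {
    // The object itself additionally gets an access assignment event.
    objEvents := append(eventsForFile("virginia.edu/bag1"),
        PremisEvent{EventType: "access assignment", DateTime: time.Now().UTC(), ObjectID: "virginia.edu/bag1"})
    for _, e := range objEvents {
        fmt.Println(e.EventType, "for", e.ObjectID)
    }
    for _, e := range eventsForFile("virginia.edu/bag1/data/report.pdf") {
        fmt.Println(e.EventType, "for", e.ObjectID)
    }
}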

See our Technical Documentation, Technical Workflows, and the Ingest section for additional information.