Architecture

  • EC2 Instances: Serve as the foundational server environment where all services, software, and applications that are not native AWS services are deployed. The major components of the infrastructure running on EC2 instances are:
    • Exchange (Content Processing Scripts): Go language services that manage work queues to monitor for the arrival of content, process content, register metadata with Pharos in its PostgreSQL database, and move content to preservation storage. An additional set of Go services manages the work queues for file life-cycle processing (fixity checks) and for restoration, which re-packages intellectual objects and their related generic files back into an APTrust BagIt bag.
    • Pharos - APTrust's web interface for managing deposits and inspecting deposit outcomes. This Rails app provides a “registry” describing what IntellectualObjects and GenericFiles are in our repository, along with checksums for those files and PREMIS events describing actions taken on those objects and files. Pharos also keeps track of WorkItems, which are requests to do something with an object or file. DPN replication and restore requests have their own table in Pharos, dpn_work_items. Finally, Pharos stores blobs of compressed JSON in the WorkItemState table, which is described below. Pharos stores its data in a Multi-AZ Postgres RDS instance that contains local or workflow data about object processing, user data for admin interface authentication, and generic file metadata.

APTrust services run on apt-demo-services for our demo environment, and on apt-prod-services for our production environment. Each environment has separate receiving, restoration, and storage areas in S3.

  • S3 (Northern VA) - Partners upload new bags to the receiving buckets here, which we query every hour or so. Partners download restored bags from the restore buckets here, which is where our apt_restore service drops restored bags for depositors. The bucket called aptrust.preservation.storage is where we store ingested files for the long term. That bucket is accessible to the APTrust admin account only. The buckets are distinguished by use:
    • Receiving Buckets: Each APTrust member has an individual S3 bucket designated for the upload of submission packages to APTrust and to facilitate the hand-off of content. Access to each bucket is restricted to the designated institution, which has PUT and LIST permissions, and to the APTrust processing scripts, which have full permissions.
    • Preservation Bucket: A single S3 bucket is used for central preservation storage. Files to be preserved are placed here, and a pointer to each file is stored in the corresponding Pharos object along with any relevant metadata.
    • Restoration Buckets: Each APTrust member has an individual S3 bucket designated for the download of distribution packages for restoration. Access to each bucket is restricted to the designated institution, which has LIST, GET, and DELETE permissions on that bucket, and to the APTrust processing scripts, which have full access.
  • AWS Glacier (Oregon) - This is where we store replication copies of all ingested files. It is accessible to the APTrust admin account only.
  • apt-prod-service & apt-demo-services - These servers run the processes that perform ingest, file deletion, bag restoration, DPN ingest, and ongoing fixity checks.
  • NSQ (runs on apt-prod-service and apt-demo-services) - Cron jobs like apt_queue query Pharos for outstanding WorkItems and push the WorkItem IDs into the proper NSQ topics. For example, the ID of a WorkItem requesting a deletion goes into the apt_file_delete topic in NSQ.
    • NSQ pushes WorkItem IDs to the workers that subscribe to its channels. apt_file_delete subscribes to the apt_file_delete channel, so it gets the IDs of file deletion WorkItems.
  • The workers query Pharos to get the full WorkItem record associated with the ID that NSQ pushed to them. The workers also query Pharos for data associated with that WorkItem, including:
    • The IntellectualObject associated with the WorkItem, if there is one.
    • The GenericFile associated with the WorkItem, if there is one.
    • The WorkItemState associated with the item, if there is one.
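The queue flow described in the list above can be sketched in a few lines. This is only an illustration in Python, not the Go services themselves: the PHAROS dictionary is a hypothetical in-memory stand-in for the registry, the field names are invented, and apart from apt_file_delete the topic names are assumptions. The point it shows is that only WorkItem IDs travel through the queue, and the worker fetches everything else from Pharos.

```python
from collections import deque

# Hypothetical in-memory stand-in for Pharos. The real registry is a Rails app
# backed by Postgres; these field names are invented for illustration.
PHAROS = {
    "work_items": {
        101: {"id": 101, "action": "delete", "object_id": None, "file_id": 7, "queued": False},
        102: {"id": 102, "action": "restore", "object_id": 3, "file_id": None, "queued": False},
    },
    "intellectual_objects": {3: {"id": 3, "identifier": "example.edu/bag1"}},
    "generic_files": {7: {"id": 7, "identifier": "example.edu/bag2/data/a.txt"}},
    "work_item_states": {102: {"work_item_id": 102, "state": "{}"}},
}

# Assumed mapping of actions to topics. Only apt_file_delete is named in the text.
TOPIC_FOR_ACTION = {"delete": "apt_file_delete", "restore": "apt_restore"}


def queue_outstanding(pharos, topics):
    """Play the role of a cron job like apt_queue: push only WorkItem IDs."""
    for item in pharos["work_items"].values():
        if not item["queued"]:
            topic = TOPIC_FOR_ACTION[item["action"]]
            topics.setdefault(topic, deque()).append(item["id"])
            item["queued"] = True


def load_context(pharos, work_item_id):
    """Fetch the full WorkItem plus its object, file, and state, if they exist."""
    item = pharos["work_items"][work_item_id]
    return {
        "work_item": item,
        "intellectual_object": pharos["intellectual_objects"].get(item["object_id"]),
        "generic_file": pharos["generic_files"].get(item["file_id"]),
        "work_item_state": pharos["work_item_states"].get(work_item_id),
    }


topics = {}
queue_outstanding(PHAROS, topics)

# A file deletion worker receives an ID from its topic and looks up the rest.
delete_id = topics["apt_file_delete"].popleft()
context = load_context(PHAROS, delete_id)
print(context["work_item"]["action"], context["generic_file"]["identifier"])
```

Keeping the queue messages down to bare IDs means the registry stays the single source of truth: a worker always sees the current WorkItem record, not a copy that was made when the item was queued.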
Pharos DB ER Diagram
WorkItemState is a Rails object that contains a blob of JSON describing the state of a WorkItem as it progresses through the queues and workers. The APTrust services are like a series of conveyor belts in a factory. As a WorkItem and its associated IntellectualObject or GenericFile pass from one worker to the next, they are accompanied by a manifest, which each worker fills out as it goes. The manifests are different for each process (ingest, restoration, deletion, etc.) because the work that needs to be done differs.

For example, each of the ingest workers adds information to an ingest manifest as it does its work. The first worker, apt_fetch, records where on the local file system it stored the tar file it just downloaded. It also records the names and checksums of all the GenericFiles it found inside the tar file, and any validation errors it encountered.

When a service like apt_fetch is done working on a WorkItem, it converts its manifest to JSON and sends that data back to Pharos to be saved in the WorkItemState table. Pharos compresses the data before saving it, because the JSON can get quite large, especially in the case of ingest, and tends to compress to about 10% of its full size.
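As a rough illustration of that round trip, the sketch below serializes a made-up ingest manifest to JSON, compresses it, and restores it. The manifest fields and the use of zlib are assumptions made for the example; the text does not say which compression Pharos uses, and the ratio printed here depends entirely on the sample data.

```python
import hashlib
import json
import zlib


def pack_state(manifest):
    """Serialize a manifest to JSON and compress it for storage."""
    raw = json.dumps(manifest).encode("utf-8")
    return zlib.compress(raw)


def unpack_state(blob):
    """Reverse pack_state: decompress and parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical manifest of the kind a fetch step might produce.
manifest = {
    "tar_path": "/hypothetical/download/dir/example_bag.tar",
    "generic_files": [
        {
            "name": f"data/file_{i:04d}.txt",
            "sha256": hashlib.sha256(f"file {i}".encode("utf-8")).hexdigest(),
        }
        for i in range(1000)
    ],
    "validation_errors": [],
}

raw_size = len(json.dumps(manifest).encode("utf-8"))
blob = pack_state(manifest)
print(f"{raw_size} bytes of JSON -> {len(blob)} bytes compressed")
```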

When the next worker picks up the WorkItem, it pulls the WorkItemState record from Pharos, and from that it knows all it needs to know to do its job intelligently. For example, both apt_store, which copies files to long-term storage, and apt_record, which tells Pharos what files have been stored, can and do stop work on an item after encountering too many transient errors. (Transient errors are almost always network errors or problems with disk I/O.) They record the WorkItemState in Pharos before requeuing the task in NSQ.

The next worker to pick up the task then loads the WorkItemState from Pharos and knows what work has been done and what has not. For example, it’s common for the system to record 500 of an IntellectualObject’s 1000 files on the first ingest attempt before running into some network problem. The next apt_store worker to pick up the ingest request will know it doesn’t have to store the first 500 files, and it starts its work at file #501.
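The stop-and-resume behavior can be sketched as follows. This is a simplified Python illustration, not the apt_store code: the state layout, the TransientError class, the error threshold, and the make_store helper are all hypothetical. It reproduces the scenario above, where the first attempt stores 500 of 1000 files and the second attempt starts at file #501.

```python
class TransientError(Exception):
    """Stand-in for a network or disk I/O failure."""


def store_files(files, state, store_one, max_transient_errors=3):
    """Store every file not already listed in state["stored"].

    Returns True when all files are stored. Returns False when the caller
    should save the state and requeue the task for a later attempt.
    """
    done = set(state["stored"])
    errors = 0
    for name in files:
        if name in done:
            continue  # stored on an earlier attempt, nothing to do
        try:
            store_one(name)
        except TransientError:
            errors += 1
            if errors >= max_transient_errors:
                return False  # too many transient errors: give up for now
            continue
        state["stored"].append(name)
    return len(state["stored"]) == len(files)


def make_store(fail_after=None):
    """Build a fake storage function that fails once fail_after files are stored."""
    calls = []

    def store_one(name):
        if fail_after is not None and len(calls) >= fail_after:
            raise TransientError(f"network problem while storing {name}")
        calls.append(name)

    return store_one, calls


files = [f"data/file_{i:04d}.txt" for i in range(1, 1001)]
state = {"stored": []}

# First attempt: the network fails after 500 files.
flaky_store, first_calls = make_store(fail_after=500)
first_done = store_files(files, state, flaky_store)

# Second attempt: a new worker loads the same state and resumes.
good_store, second_calls = make_store()
second_done = store_files(files, state, good_store)
print(first_done, second_done, second_calls[0])
```

In the real system the state would be sent back to Pharos between the two attempts; here the same dictionary is simply passed to both calls.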

Most services write to two logs in the /mnt/efs/apt/logs directory: one called <service>.log, and one called <service>.json. The .log file is meant to be human-readable. The .json file is meant to be machine-readable. In cases where a service was not able to send a WorkItem’s JSON state back to Pharos (in the form of a WorkItemState object), you should be able to find the JSON in the .json log. A special tool called apt_json_extractor can quickly pull individual JSON records out of very large JSON logs.
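To give a sense of what pulling one record out of a JSON log involves, here is a minimal Python sketch. It does not reproduce apt_json_extractor or its options. It assumes, without confirmation from the text, that the .json log holds one JSON record per line and that each record carries an identifying key; the key name work_item_id and the sample records are invented.

```python
import io
import json


def find_record(log_lines, key, value):
    """Return the last record whose `key` equals `value`, or None.

    Reads line by line so a very large log never has to fit in memory.
    """
    match = None
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if isinstance(record, dict) and record.get(key) == value:
            match = record  # keep the most recent state for this item
    return match


# Invented sample log; a real call would pass an open file object instead.
sample_log = io.StringIO(
    '{"work_item_id": 42, "stored": 500}\n'
    '{"work_item_id": 43, "stored": 10}\n'
    'this line is not JSON\n'
    '{"work_item_id": 42, "stored": 1000}\n'
)
record = find_record(sample_log, "work_item_id", 42)
print(record)
```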
APTrust Software Stack