Technical Documentation

Architecture

APTrust Architecture
  • EC2 Instances: Serve as the foundational server environment where all non-native AWS services, software, and applications are deployed. The major components of the infrastructure running on EC2 instances are:
    • Exchange (Content Processing Scripts): Go language services that manage work queues to monitor for the arrival of content, process content, register metadata with Pharos in its PostgreSQL database, and move content to preservation storage. Additionally, a set of Go scripts manages work queues around file life-cycle and processing (fixity), as well as restoration by re-packaging intellectual objects and related generic files back into an APTrust BagIt bag.
    • Pharos (APTrust's web interface to manage deposits and inspect deposit outcomes) - Our Rails app provides a “registry” describing what IntellectualObjects and GenericFiles are in our repository, along with checksums for those files and PREMIS events describing actions taken on those objects and files. Pharos also keeps track of WorkItems, which are requests to do something with an object or file. DPN replication and restore requests have their own table in Pharos, dpn_work_items. Finally, Pharos stores blobs of compressed JSON in the WorkItemState table, which is described below. Pharos stores its data in a Multi-AZ Postgres RDS instance that contains local or workflow data about object processing, user data for admin interface authentication, as well as generic file metadata.

APTrust services run on apt-demo-services for our demo environment, and on apt-prod-services for our production environment. Each environment has separate receiving, restoration, and storage areas in S3.

  • S3 (Northern VA) - Partners upload new bags to the receiving buckets here, which we query every hour or so. Partners download restored bags from the restore buckets here, which is where our apt_restore service drops restored bags for depositors. The bucket called aptrust.preservation.storage is where we store ingested files for the long term. That bucket is accessible to the APTrust admin account only. The buckets are distinguished by use:
    • Receiving Buckets: Each APTrust member has an individual S3 bucket designated for the upload of submission packages to APTrust and to facilitate the hand-off of content. Access to each bucket is restricted to the designated institution, which has PUT and LIST permissions, and the APTrust processing scripts, which have full permissions.
    • Preservation Bucket: A single S3 bucket is used for central preservation storage. Files to be preserved are placed here, with pointers to each file stored in the corresponding Pharos object along with any relevant metadata.
    • Restoration Buckets: Each APTrust member has an individual S3 bucket designated for the download of distribution packages for restoration. Access to each bucket is restricted to the designated institution, which has LIST, GET, and DELETE permissions on that bucket, and the APTrust processing scripts, which have full access.
  • AWS Glacier (Oregon) - This is where we store replication copies of all ingested files, and is accessible to the APTrust admin account only.
  • apt-prod-service & apt-demo-services - These servers run the processes that perform ingest, file deletion, bag restoration, DPN ingest, and ongoing fixity checks.
  • NSQ (runs on apt-prod-service and apt-demo-services) - Cron jobs like apt_queue query Pharos for outstanding WorkItems and push the WorkItem IDs into the proper NSQ topics. For example, the ID of a WorkItem requesting a deletion goes into the apt_file_delete topic in NSQ.
    • NSQ pushes WorkItem IDs to the workers that subscribe to its channels. apt_file_delete subscribes to the apt_file_delete channel, so it gets the IDs of file deletion WorkItems.
  • The workers query Pharos to get the full WorkItem record associated with the ID that NSQ pushed to them. The workers also query Pharos for data associated with that WorkItem, including:
    • The IntellectualObject associated with the WorkItem, if there is one.
    • The GenericFile associated with the WorkItem, if there is one.
    • The WorkItemState associated with the item, if there is one.
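The queueing flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual apt_queue code: plain in-memory queues stand in for NSQ, a dictionary stands in for Pharos, and the field names (`action`, `queued`), the action labels, and the `apt_restore` topic name are assumptions made for the example. Only the `apt_file_delete` topic is named in the text.

```python
from collections import defaultdict, deque

# Hypothetical mapping from a WorkItem's action to a topic name.
# Only apt_file_delete comes from the text; the rest is illustrative.
TOPIC_FOR_ACTION = {
    "Delete": "apt_file_delete",
    "Restore": "apt_restore",
}


def queue_outstanding(work_items, topics):
    """Push the ID of each outstanding WorkItem into the topic for its action."""
    queued = []
    for item in work_items:
        topic = TOPIC_FOR_ACTION.get(item["action"])
        if topic is None or item["queued"]:
            continue  # unknown action or already queued: leave it alone
        topics[topic].append(item["id"])  # only the ID travels through the queue
        item["queued"] = True
        queued.append(item["id"])
    return queued


def next_work_item(topic, topics, registry):
    """A worker receives only an ID and looks up the full record itself."""
    work_item_id = topics[topic].popleft()
    return registry[work_item_id]


# A dictionary standing in for the WorkItem records held in Pharos.
registry = {
    101: {"id": 101, "action": "Delete", "queued": False},
    102: {"id": 102, "action": "Restore", "queued": False},
    103: {"id": 103, "action": "Delete", "queued": True},
}
topics = defaultdict(deque)

queued_ids = queue_outstanding(registry.values(), topics)
record = next_work_item("apt_file_delete", topics, registry)
```

Passing only IDs through the queue keeps the messages small and means a worker always reads the current WorkItem record rather than a stale copy.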
Pharos DB ER Diagram
WorkItemState is a Rails object that contains a blob of JSON describing the state of a WorkItem as it progresses through the queues and workers. The APTrust services are like a series of conveyor belts in a factory. As a WorkItem and its associated IntellectualObject or GenericFile passes from one worker to the next, it is accompanied by a manifest, which the worker fills out as it goes. The manifests are different for each process (ingest, restoration, deletion, etc.) because the work that needs to be done differs.

For example, each of the ingest workers adds information to an ingest manifest as it does its work. The first worker, apt_fetch, records where on the local file system it stored the tar file it just downloaded. It also records the names and checksums of all the GenericFiles it found inside the tar file, and any validation errors it encountered.

When a service like apt_fetch is done working on a WorkItem, it converts its manifest to JSON and sends that data back to Pharos to be saved in the WorkItemState table. Pharos compresses the data before saving it, because it can get quite large, especially in the case of ingest, and tends to compress to about 10% of its full size.
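To illustrate why compressing the JSON state pays off, here is a small sketch that serializes a manifest and compresses it. It assumes zlib as the compression method and uses made-up field names (`Files`, `Identifier`, `Stored`); the text does not say which algorithm or schema Pharos actually uses.

```python
import json
import zlib


def pack_state(manifest: dict) -> bytes:
    """Serialize a manifest to JSON and compress it for storage."""
    return zlib.compress(json.dumps(manifest).encode("utf-8"))


def unpack_state(blob: bytes) -> dict:
    """Reverse pack_state: decompress and parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# A hypothetical ingest manifest with one entry per file in the bag.
manifest = {
    "WorkItemId": 181485,
    "Files": [
        {"Identifier": f"data/file_{i:04d}.txt", "Stored": False, "Errors": []}
        for i in range(1000)
    ],
}

raw = json.dumps(manifest).encode("utf-8")
blob = pack_state(manifest)
ratio = len(blob) / len(raw)  # fraction of the original size after compression
```

Manifests like this repeat the same keys for every file, which is why they tend to compress well. The exact ratio depends on the data.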

When the next worker picks up the WorkItem, it pulls the WorkItemState record from Pharos and from that it knows all it needs to know to do its job intelligently. For example, both apt_store, which copies files to long-term storage, and apt_record, which tells Pharos what files have been stored, can and do stop work on an item after encountering too many transient errors. (Transient errors are almost always network errors or problems with disk I/O.) They will record the WorkItemState in Pharos before requeuing the task in NSQ.

The next worker to pick up the task then loads the WorkItemState from Pharos and knows what work has been done and what has not. For example, it’s common for the system to record 500 of an IntellectualObject’s 1000 files on the first ingest attempt before running into some network problem. The next apt_store worker to pick up the ingest request will know it doesn’t have to store the first 500 files, and it starts its work at file #501.
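The resume behavior can be sketched as follows. This is an illustration under assumptions, not the apt_store implementation: the manifest layout (`files`, `name`, `stored`), the `TransientError` class, and the error threshold are all hypothetical.

```python
class TransientError(Exception):
    """Stands in for a network or disk I/O failure."""


def store_files(state, store, max_errors=3):
    """Store every file not yet marked as stored; return True when all are done.

    `state` is a hypothetical manifest: {"files": [{"name": ..., "stored": bool}]}.
    """
    errors = 0
    for entry in state["files"]:
        if entry["stored"]:
            continue  # an earlier attempt already handled this file
        try:
            store(entry["name"])
            entry["stored"] = True
        except TransientError:
            errors += 1
            if errors >= max_errors:
                return False  # caller saves the state and requeues the task
    return all(entry["stored"] for entry in state["files"])


state = {"files": [{"name": f"file_{i}", "stored": False} for i in range(1, 1001)]}
calls = []


def flaky_store(name):
    """Succeeds for the first 500 files, then simulates a network outage."""
    if len(calls) >= 500:
        raise TransientError("network unavailable")
    calls.append(name)


first = store_files(state, flaky_store)  # gives up after repeated failures

second_calls = []
second = store_files(state, second_calls.append)  # resumes at the first unstored file
```

Because progress lives in the saved state rather than in the worker's memory, any worker can pick up the task after a requeue.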

Most services write to two logs in the /mnt/efs/apt/logs directory: one called <service>.log, and one called <service>.json. The .log file is meant to be human-readable. The .json file is meant to be machine-readable. In cases where a service was not able to send a WorkItem’s JSON state back to Pharos (in the form of a WorkItemState object), you should be able to find the JSON in the .json log. A special tool called apt_json_extractor can quickly pull individual JSON records out of very large JSON logs.
APTrust Software Stack

Security

Introduction

The Academic Preservation Trust is operated by assignment from its Governing Board to the University of Virginia as a cost-recovery activity of the University of Virginia Library in a spirit of collaboration, openness, and transparency. The underlying preservation-storage infrastructure of APTrust is provided by cloud-based vendors whose services are available to the APTrust through the University of Virginia's access to relevant Internet2 Net+ contracts. Over time, access to cloud services may be available to APTrust through other UVA purchasing "vehicles."  APTrust's descriptions of specific services include identification of which contracts are involved with each (see APTrust Services and Fees List). Security of vendor-supplied services is generally provided by those vendors and has been reviewed as part of the establishment of ways to purchase the services by the contracting entity (such as Internet2 or the University of Virginia--see, for example, section 5 of the Internet2 contract information related to AWS services). Our focus in this section is on the services and technologies that APTrust directly controls.

[CHRISTIAN: AS DEVELOPMENT OF THIS DOCUMENT CONTINUES, PLEASE SEE MY OUTLINE AT https://docs.google.com/document/d/1hxbR0QnfZAnrrWSCqdAoF5w_NMPyN8SUgNgl-zyX2iI/edit?usp=sharing FOR CATEGORIES WE NEED TO ADDRESS BASED ON AN EVOLVING NDSA DOCUMENT]

Security Principles

  • APTrust's IT environment is tightly controlled and access is on a minimum necessary basis.
  • Least Privilege: APTrust follows the standard security advice of granting least privilege—that is, granting only the permissions required to perform a task. Otherwise we default to DENY ALL. This means that all access is denied by default and only granted to specific users if absolutely necessary.
  • Only designated operators have shell access to servers.
  • We use Multi-Factor-Authentication (MFA) wherever available.
  • We use password-less logins (SSH keys).
  • Password credentials are stored encrypted.
  • Defense-in-depth: we employ multiple layers of security controls (defense) to provide redundancy in the event a security control fails.

Authentication & Administration

Authentication is defined as validating a user's identity and basic-level authorization through determining if a user is allowed to access the system.

All services that APTrust provides and third-party services that we use require authenticated access. Wherever possible the means of authentication are securely stored using encryption. If encryption is not possible, we strongly limit access to resources that use plain-text authentication (such as environment variables on infrastructure).

Account Provisioning

Account provisioning consists of the creation of the accounts necessary to use the APTrust preservation repository (a Pharos user account and AWS IAM accounts for S3 access).

Only designated APTrust administrators create accounts and provide credentials to the account owners in a secure fashion. Many institutions have received only two institutional AWS IAM accounts (for demo and production) to provide access to S3 storage (ingest and restore buckets). As of 2018, new institutions get individual user accounts, and generic institution accounts only on request. This provides a better audit trail and enhances security while reducing the blast radius. The APTrust administrator provides credentials via privnote.com.

Accounts are only created for users who have been specifically identified by a designated institutional administrator. User emails are required to be under the institution's domain, e.g. Institution: University of Virginia; User: Christian Dahlhausen; Email: christian.dahlhausen@virginia.edu
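A check for the email-domain rule could look like the sketch below. The function name is hypothetical, and accepting subdomains of the institution's domain is an assumption the text does not address.

```python
def email_matches_institution(email: str, institution_domain: str) -> bool:
    """Return True if the email's domain is the institution's domain or a subdomain of it."""
    local, sep, domain = email.strip().lower().rpartition("@")
    if not sep or not local or not domain:
        return False  # not a usable address
    institution_domain = institution_domain.strip().lower()
    # Require an exact match or a dot boundary, so "notvirginia.edu" is rejected.
    return domain == institution_domain or domain.endswith("." + institution_domain)


ok = email_matches_institution("christian.dahlhausen@virginia.edu", "virginia.edu")
```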

APTrust Staff:

Institutional Staff:

Account Access Termination

Institutional member accounts: Once credentials have been provided, the institution is responsible for storing them safely and reporting any suspected compromise. If a credential is compromised, the APTrust administrator deactivates it and issues new credentials to the institution. The institutional member is responsible for sub-accounts on APTrust Pharos, either subscriber accounts or institutional member accounts. This includes but is not limited to:

  • Pharos accounts and API keys
  • AWS access keys
  • Google document access

APTrust staff accounts: When an APTrust staff member leaves, one of the APTrust administrators deactivates or deletes all of that person's accounts for internal and external systems. This includes but is not limited to:

  • APTrust server accounts (SSH keys)
  • AWS credentials
  • Passpack account
  • Google Apps account
  • Pharos account
Account Deprovisioning

Account Deprovisioning details the disposition of the account in the service, and management of the user-generated data (if any) related to the account.

This particularly pertains to subscriber institutions when the member institution leaves APTrust...

Roles and Authorization

The following user groups are designated with varying levels of authorization. The table uses CRUD (create-read-update-delete) notation to denote the permissions granted.

Role Description Pharos AWS S3 Google Apps Wiki ???
Anonymous Users A non-authenticated or unidentified visitor to the site. - - - R
Authenticated Users Any user that has an account on any of our systems. This includes vendors. CRUD
Institutional Member Any user that is an institutional member of APTrust by definition of the depositor agreement. CRUD1 CRUD2 - CRUD
Institutional Admin Any user that is an institutional member of APTrust by definition of the depositor agreement. Inst. Admin has full access to Pharos and is able to create and delete user accounts for its institution. CRUD1 CRUD2 CRUD
APTrust Staff CRUD
-- General Staff General administrative staff R R R CRUD
-- Technical Staff Technical staff with some access to APTrust systems. CRUD CRUD RU CRUD
-- Administrator/Operator Technical staff with full administrative access. Has full access for troubleshooting and management of the applications. Is limited to a few lead or Sr roles in APTrust CRUD CRUD CRUD CRUD
Access Control Granularity
Access Permissions in APTrust Pharos
Consortial Institution Restricted
APTrust Consortial Members Institution-Specific Restricted None
Role                 Type      Own  Other  Own  Other  Own  Other  Own  Other
Anonymous            Discover  No   No     No   No     No   No     No   No
                     View      No   No     No   No     No   No     No   No
                     Edit      No   No     No   No     No   No     No   No
Authenticated        Discover  No   No     No   No     No   No     No   No
                     View      No   No     No   No     No   No     No   No
                     Edit      No   No     No   No     No   No     No   No
APTrust Admin        Discover  Yes  Yes    Yes  Yes    Yes  Yes    Yes  Yes
                     View      Yes  Yes    Yes  Yes    Yes  Yes    Yes  Yes
                     Edit      Yes  Yes    Yes  Yes    Yes  Yes    Yes  Yes
Institutional Admin  Discover  Yes  Yes    Yes  No     Yes  No     No   No
                     View      Yes  Yes    Yes  No     Yes  No     No   No
                     Edit      No   No     No   No     No   No     No   No
Institutional User   Discover  Yes  Yes    Yes  No     Yes  No     No   No
                     View      Yes  Yes    Yes  No     No   No     No   No
                     Edit      No   No     No   No     No   No     No   No

Access and Identity Tools

AWS Identity and Access Management (IAM)

AWS Access credentials

To access the ingest and restore storage buckets (S3), a user requires AWS access credentials (AWS ACCESS KEY and AWS SECRET ACCESS KEY). These are generated using AWS IAM. At the time of writing, each member institution has one institutional account with two sets of access credentials: one set for APTrust's production environment and one set for the demo/development environment. The credentials are shared with the institutional administrator (for APTrust) using privnote. After the credentials have been provided, the privnote destroys itself, and the institutional administrator is responsible for keeping the access credentials secure. APTrust additionally stores each credential in Passpack as a backup.

AWS Access policies

Each IAM account has access policies that describe the level of access to services. Member institutions have limited access on their S3 storage buckets and are only allowed certain actions on the data objects. Institutional accounts do not have access to any other AWS services besides S3.
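As an illustration of such a policy, the sketch below builds an IAM policy document granting an institution PUT and LIST on its receiving bucket and LIST, GET, and DELETE on its restoration bucket, as described in the architecture section. It is a sketch under assumptions and not APTrust's actual policy: the function name and statement IDs are made up, and the restoration bucket name is a guess modeled on the receiving bucket name that appears in the log excerpt.

```python
import json


def institution_policy(receiving_bucket: str, restore_bucket: str) -> dict:
    """Build a least-privilege S3 policy document for one institution."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListBuckets",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{receiving_bucket}",
                    f"arn:aws:s3:::{restore_bucket}",
                ],
            },
            {
                # Object-level actions target the objects inside the bucket.
                "Sid": "UploadToReceiving",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{receiving_bucket}/*"],
            },
            {
                "Sid": "DownloadFromRestore",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{restore_bucket}/*"],
            },
        ],
    }


policy = institution_policy(
    "aptrust.receiving.templateuniversity.edu",
    "aptrust.restore.templateuniversity.edu",  # assumed name, not from the text
)
policy_json = json.dumps(policy, indent=2)
```

Anything not explicitly allowed is denied by IAM, which matches the default-deny principle stated earlier: the institution cannot, for example, read objects back out of its receiving bucket.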

Passpack

Passpack is an online password management tool that stores all credential data encrypted. APTrust has a main business account (aptrust) that manages all credentials pertaining to the services APTrust uses. Each member of the APTrust operations and development team has their own account, with which some of the password entries are shared. An admin account has access to all APTrust-related credentials and is accessible only by one of APTrust's designated administrators. Passpack entries are encrypted using Double Lock -- AES-128 bit encryption by default. It is possible to use stronger encryption per credential (Triple Lock -- AES-256 bit encryption). We encourage using the higher encryption standard when new entries are created. APTrust staff also store the AWS access keys of member institutions in that datastore.

Workflow: A user creates a Passpack entry and transfers ownership to the main APTrust admin account. That way the admin can re-share the credentials with other members of the team and manage editing permissions.

Ansible Vault

Sensitive server and service information is stored using Ansible Vault. Vault encrypts any text file with AES-256 encryption. It uses a single password to encrypt a file. When running Ansible playbooks and roles, Ansible prompts for the encryption/decryption password in order to read the vault file(s). For further information on how to use it, see the Ansible Vault documentation.

The Ansible Vault password is stored in Passpack. 

Auditing

Logging

Pharos Application Server Logs

The Ruby application server logs activity in its application logs. Most errors are presented to the user on the web front end itself; some are only logged in the application logs, which are not accessible to the end user. APTrust staff occasionally spot-check the server logs, and Logwatch reports errors by email.

Nginx Webserver Logs

The web server logs access to Pharos and errors from the Ruby application server. These logs are also parsed and evaluated by Logwatch, and APTrust staff are notified by email if irregular activity occurs.

Exchange - Ingest/Restore/Fixity Services logging
APTrust keeps extensive logs from each microservice in the Exchange suite. Below is an example excerpt logged while downloading an object from an ingest (receiving) bucket so that it can be processed on the server. Each result dictionary tracks data and timestamps about one step in the ingest process.
-------- BEGIN aptrust.receiving.templateuniversity.edu/templateuniversity.edu.10822_1046084.tar | Etag: fef0a540264e38912e9f6c4bc3268d9e | Time: 2018-04-09T16:16:07Z --------
{
  "WorkItemId": 181485,
  "S3Bucket": "aptrust.receiving.templateuniversity.edu",
  "S3Key": "templateuniversity.edu.10822_1046084.tar",
  "ETag": "fef0a540264e38912e9f6c4bc3268d9e",
  "BagPath": "/mnt/lvm/apt/data/templateuniversity.edu/templateuniversity.edu.10822_1046084.tar",
  "DBPath": "/mnt/lvm/apt/data/templateuniversity.edu/templateuniversity.edu.10822_1046084.valdb",
  "FetchResult": {
    "Attempted": true,
    "AttemptNumber": 1,
    "ErrorIsFatal": false,
    "Errors": [],
    "StartedAt": "2018-04-09T16:16:05.215809131Z",
    "FinishedAt": "2018-04-09T16:16:06.6678436Z",
    "Retry": true
  },
  "UntarResult": {
    "Attempted": false,
    "AttemptNumber": 0,
    "ErrorIsFatal": false,
    "Errors": [],
    "StartedAt": "0001-01-01T00:00:00Z",
    "FinishedAt": "0001-01-01T00:00:00Z",
    "Retry": true
  },
  "ValidateResult": {
    "Attempted": true,
    "AttemptNumber": 1,
    "ErrorIsFatal": false,
    "Errors": [],
    "StartedAt": "2018-04-09T16:16:06.906390067Z",
    "FinishedAt": "2018-04-09T16:16:07.442069031Z",
    "Retry": true
  },
  "StoreResult": {
    "Attempted": false,
    "AttemptNumber": 0,
    "ErrorIsFatal": false,
    "Errors": [],
    "StartedAt": "0001-01-01T00:00:00Z",
    "FinishedAt": "0001-01-01T00:00:00Z",
    "Retry": true
  },
  "RecordResult": {
    "Attempted": false,
    "AttemptNumber": 0,
    "ErrorIsFatal": false,
    "Errors": [],
    "StartedAt": "0001-01-01T00:00:00Z",
    "FinishedAt": "0001-01-01T00:00:00Z",
    "Retry": true
  },
  "CleanupResult": {
    "Attempted": false,
    "AttemptNumber": 0,
    "ErrorIsFatal": false,
    "Errors": [],
    "StartedAt": "0001-01-01T00:00:00Z",
    "FinishedAt": "0001-01-01T00:00:00Z",
    "Retry": true
  },
  "Object": {
    "state": "A",
    "created_at": "0001-01-01T00:00:00Z",
    "updated_at": "0001-01-01T00:00:00Z",
    "ingest_deleted_from_receiving_at": "0001-01-01T00:00:00Z"
  }
}
-------- END aptrust.receiving.templateuniversity.edu/templateuniversity.edu.10822_1046084.tar | Etag: fef0a540264e38912e9f6c4bc3268d9e | Time: 2018-04-09T16:16:07Z --------
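The BEGIN and END delimiter lines in the excerpt above make it straightforward to pull a single record out of a large JSON log. The sketch below shows one way to do that and to list which ingest steps were attempted. It is not the apt_json_extractor implementation; the function names are hypothetical, and it assumes the log file contains the delimiter lines without any leading numbering.

```python
import json


def extract_record(log_text: str, key: str):
    """Return the parsed JSON record whose BEGIN line mentions `key`, or None."""
    body, capturing = [], False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("-------- BEGIN ") and key in stripped:
            body, capturing = [], True  # start collecting this record
        elif stripped.startswith("-------- END ") and capturing:
            return json.loads("\n".join(body))
        elif capturing:
            body.append(line)
    return None


def attempted_steps(record: dict):
    """List the *Result sections whose Attempted flag is true."""
    return [
        name
        for name, value in record.items()
        if name.endswith("Result") and isinstance(value, dict) and value.get("Attempted")
    ]


# A shortened record in the same shape as the excerpt above.
sample = """\
-------- BEGIN aptrust.receiving.templateuniversity.edu/templateuniversity.edu.10822_1046084.tar | Etag: fef0a540264e38912e9f6c4bc3268d9e | Time: 2018-04-09T16:16:07Z --------
{
  "WorkItemId": 181485,
  "FetchResult": {"Attempted": true, "Errors": []},
  "UntarResult": {"Attempted": false, "Errors": []},
  "ValidateResult": {"Attempted": true, "Errors": []}
}
-------- END aptrust.receiving.templateuniversity.edu/templateuniversity.edu.10822_1046084.tar | Etag: fef0a540264e38912e9f6c4bc3268d9e | Time: 2018-04-09T16:16:07Z --------
"""

record = extract_record(sample, "templateuniversity.edu.10822_1046084.tar")
steps = attempted_steps(record)
missing = extract_record(sample, "no-such-bag.tar")
```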
Preservation Storage Logging

Activity on the preservation bucket is logged using AWS standard logging to a bucket named `aptrust.preservation.logging` for deeper auditing purposes and security.  This is in addition to any logging already provided by locally coded content services.

AWS CloudTrail Logs
AWS CloudTrail is a service that enables governance, compliance, operational auditing, and risk auditing of your AWS account. With CloudTrail, you can log, continuously monitor, and retain account activity related to actions across your AWS infrastructure. CloudTrail provides event history of your AWS account activity, including actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services. This event history simplifies security analysis, resource change tracking, and troubleshooting.[1]
APTrust keeps an "AuditTrail" log that stores activity in all regions, all management activities, and all activities on all S3 buckets. The log is continuously stored in an S3 bucket, "cloudtrail-logs", and is used as a complete audit trail of all AWS-related activities. The trail is currently not processed and does not trigger any alarms.

Monitoring and Alerting

APTrust uses multiple monitoring tools to ensure system stability and gain insight into resource usage.

Icinga2

The Icinga2 monitoring suite is based on the de facto industry standard Nagios but has a couple of additional features. It alerts APTrust staff by Slack notification about service interruptions or performance issues, to which the ops team can then react. Each alert can be "acknowledged" using the Icinga2 web interface, and explanatory text can be added to the acknowledgement. Icinga2 also provides a history of past alerts for auditing purposes.

Grafana & InfluxDB

InfluxDB provides time-series data about resource usage and performance. Grafana is a web front end and dashboard to visualize that data. Data is fed from an Icinga2 plugin that runs on every instance. It also polls AWS directly. With the time-series data, the operations team can identify trends in resource usage and act accordingly. This aids scaling decisions and the tracking of resource usage over time.

Fail2ban

Fail2Ban scans SSH and Nginx log files to identify malicious actions. Once IP addresses have been determined to be malicious, the tool puts them in a `jail` (a server-local iptables firewall jail) to prevent any further malicious activity from those IPs.
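The idea behind this kind of log scanning can be sketched in a few lines. This is only an illustration of the technique, not Fail2Ban's implementation or configuration: the function name and threshold are hypothetical, the sketch ignores the time window a real tool would apply, and it only reports offenders rather than touching the firewall.

```python
import re
from collections import defaultdict

# Matches the sshd message for a rejected password and captures the source IP.
FAILED_LOGIN = re.compile(r"Failed password for .* from (\d{1,3}(?:\.\d{1,3}){3})")


def find_offenders(log_lines, max_retries=5):
    """Return the set of IPs with at least `max_retries` failed logins."""
    failures = defaultdict(int)
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            failures[match.group(1)] += 1
    return {ip for ip, count in failures.items() if count >= max_retries}


# Sample lines using documentation-only IP ranges.
lines = (
    ["sshd[1234]: Failed password for invalid user admin from 203.0.113.7 port 22 ssh2"] * 5
    + ["sshd[1234]: Failed password for root from 198.51.100.9 port 22 ssh2"] * 2
    + ["sshd[1234]: Accepted publickey for ubuntu from 198.51.100.20 port 22 ssh2"]
)
offenders = find_offenders(lines)
```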

Logwatch

Logwatch is a customizable log analysis system. It parses a system's logs and creates a report analyzing the areas that you specify. A daily cron job runs the program and sends out a summary email from each node. The operations staff spot-checks the Logwatch report emails. An example follows:

---------- Forwarded message ----------
From: Cron Daemon <ops@aptrust.org>
Date: Sun, Nov 5, 2017 at 7:00 PM
Subject: Cron <root@apt-demo-repo2> /usr/sbin/logwatch
To: ops@aptrust.org



 ################### Logwatch 7.4.0 (05/29/13) ####################
        Processing Initiated: Mon Nov  6 00:00:02 2017
        Date Range Processed: yesterday
                              ( 2017-Nov-05 )
                              Period is day.
        Detail Level of Output: 5
        Type of Output/Format: stdout / text
        Logfiles for Host: apt-demo-repo2
 ##################################################################

 --------------------- Cron Begin ------------------------

 Commands Run:
    User root:
          cd / && run-parts --report /etc/cron.hourly: 24 Time(s)
       /usr/sbin/logwatch: 1 Time(s)
       test -x /usr/bin/certbot -a \! -d /run/systemd/system && perl -e 'sleep int(rand(3600))' && certbot -q renew: 2 Time(s)
       test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ): 1 Time(s)
       test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly ): 1 Time(s)
    User ubuntu:
       . $HOME/.profile; /var/www/demo.aptrust.org/pharos/current/bin/pharos_notify.py >> /var/www/demo.aptrust.org/pharos/current/log/cron_pharos_notify.log 2>&1: 1440 Time(s)
       /etc/psql_backup/pg_backup_rotated.sh >> /var/log/psql_backup.log 2>&1: 1 Time(s)

 ---------------------- Cron End -------------------------


 --------------------- httpd Begin ------------------------

 144.53 MB transferred in 22982 responses  (1xx 0, 2xx 22944, 3xx 36, 4xx 2, 5xx 0)
        4 Images (0.02 MB),
        4 Documents (0.04 MB),
     8463 Content pages (41.26 MB),
        1 Redirects (0.00 MB),
    14510 Other (103.21 MB)

 Requests with error response codes
    400 Bad Request
       /w00tw00t.at.ISC.SANS.DFind:): 1 Time(s)
    404 Not Found
       /a2billing/admin/Public/index.php: 1 Time(s)

 A total of 5 ROBOTS were logged

 ---------------------- httpd End -------------------------


 --------------------- pam_unix Begin ------------------------

 cron:
    Sessions Opened:
       ubuntu: 1441 Time(s)
       root: 29 Time(s)

 sshd:
    Sessions Opened:
       cd3ef: 1 Time(s)

 sudo:
    Sessions Opened:
       root -> root: 36 Time(s)
       root -> ubuntu: 6 Time(s)


 ---------------------- pam_unix End -------------------------


 --------------------- Postfix Begin ------------------------

 ****** Summary *************************************************************************************

    5.010K  Bytes accepted                               5,130
    5.010K  Bytes sent via SMTP                          5,130
 ========   ==================================================

        1   Accepted                                   100.00%
 --------   --------------------------------------------------
        1   Total                                      100.00%
 ========   ==================================================

        1   Removed from queue
        1   Sent via SMTP

 ****** Detail (1) **********************************************************************************

        1   Sent via SMTP ---------------------------------------------------------------------------
        1      aptrust.org

 === Delivery Delays Percentiles ============================================================
                     0%       25%       50%       75%       90%       95%       98%      100%
 --------------------------------------------------------------------------------------------
 Before qmgr       0.01      0.01      0.01      0.01      0.01      0.01      0.01      0.01
 In qmgr           0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
 Conn setup        0.17      0.17      0.17      0.17      0.17      0.17      0.17      0.17
 Transmission      0.29      0.29      0.29      0.29      0.29      0.29      0.29      0.29
 Total             0.48      0.48      0.48      0.48      0.48      0.48      0.48      0.48
 ============================================================================================

 ---------------------- Postfix End -------------------------


 --------------------- SSHD Begin ------------------------


 Users logging in through sshd:
    cd3ef:
       216.197.76.234 (c-va-3d8452d8df-23486-1.tingfiber.com): 1 time


 Received disconnect:
    11: disconnected by user
       216.197.76.234 : 1 Time(s)

 ---------------------- SSHD End -------------------------


 --------------------- Sudo (secure-log) Begin ------------------------


 cd3ef => root
 -------------
 /bin/sh                        -  36 Time(s).

 cd3ef => ubuntu
 ---------------
 /bin/sh                        -   6 Time(s).

 ---------------------- Sudo (secure-log) End -------------------------


 --------------------- Disk Space Begin ------------------------

 Filesystem                                 Size  Used Avail Use% Mounted on
 udev                                       2.0G   12K  2.0G   1% /dev
 /dev/xvda1                                  32G  4.0G   27G  14% /
 fs-97ff5bde.efs.us-east-1.amazonaws.com:/  8.0E  3.7G  8.0E   1% /mnt/efs/apt


 ---------------------- Disk Space End -------------------------


 ###################### Logwatch End #########################
AWS Cloudwatch

By default all instances in Amazon Web Services have basic Cloudwatch monitoring enabled, which provides seven pre-selected metrics at five-minute frequency and three status check metrics at one-minute frequency[2]. All production instances (EC2 and RDS) have detailed Cloudwatch monitoring enabled, which includes additional metrics.

Currently no Cloudwatch alarms are enabled on these metrics; they are used for post-mortem analysis or as a second layer of metrics. APTrust uses a combination of Grafana and Icinga2 metrics for troubleshooting and alarms.

Business Continuity

Interoperability and Portability

Backup

Disaster Recovery

Risk Management, Threats, and Mitigations

Risk management describes the forecasting and evaluation of risks together with the identification of procedures to avoid or minimize their impact. Due to the repository's architecture and its use of Amazon Web Services (AWS), some of the threats and mitigations are covered by a Shared Security Responsibility Model[1] and are hence shared between APTrust and AWS as infrastructure provider.

The system architecture and operations policies of APTrust are based on the threat model that was formalized in a 2005 paper published by the LOCKSS team, Requirements for Digital Preservation Systems: A Bottom-Up Approach, and on periodic reviews of code, configuration and policies. The identified threats are not unique to digital preservation repositories but sometimes require different mitigation strategies due to the long time horizon for which preservation systems are built and maintained.

The paper identified the following threats:

Media Failure

All storage media must be expected to degrade with time, causing irrecoverable bit errors, and to be subject to sudden catastrophic irrecoverable loss of bulk data such as disk crashes[2] or loss of off-line media.
Mitigation

- APTrust stores multiple copies of data objects in multiple locations (Virginia and Oregon) to mitigate localized failures.

- Regular fixity checks ensure the accuracy and authenticity of the data object.

- The storage backend used, AWS Simple Storage Service (S3), is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities[3]. The second storage tier, Amazon Glacier, also redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives. Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing[4].

- Storage media are managed by AWS, and underlying "disk crashes" are mitigated by the vendor as part of their shared security responsibilities[1].
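The durability arithmetic quoted above can be checked with a short calculation. This is only a sketch: it treats the designed durability figure as an independent annual loss probability per object, which is a simplifying assumption, and the function names are hypothetical.

```python
from fractions import Fraction

# Designed annual durability quoted for S3: 99.999999999% (eleven nines).
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_loss(object_count: int) -> Fraction:
    """Expected number of objects lost per year at the quoted durability."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_loss(object_count)


if __name__ == "__main__":
    # Expected objects lost per year when storing 10,000 objects.
    print(float(expected_annual_loss(10_000)))
    # Average number of years until one of the 10,000 objects is lost.
    print(int(years_per_lost_object(10_000)))
```

Exact fractions are used instead of floating point so that the eleven-nines figure is not rounded during the calculation.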

Hardware Failure

All hardware components must be expected to suffer transient recoverable failures, such as power loss, and catastrophic irrecoverable failures, such as burnt-out power supplies.
Mitigation

- AWS manages all of our infrastructure as Infrastructure as a Service (IaaS) which covers monitoring, management and replacement of faulty hardware.

- If hardware outages occur, APTrust is able to recover quickly due to its automated provisioning and configuration management. New server instances can be provisioned quickly.

- Due to continuous monitoring and management of hardware by AWS, failures are rare.

Software Failure

 All software components must be expected to suffer from bugs that pose a risk to the stored data.
Mitigation

- APTrust adheres to software development guidelines that include regular unit and integration tests for all written software. We employ a continuous integration service (Travis CI) that runs unit tests after every git commit. Failing unit tests are fixed as soon as they become apparent. Travis CI notifies the APTrust team via Slack about success or failure.

- In addition to unit tests, APTrust runs integration tests to make sure services are running and interacting as a system as expected. Integration tests are run locally before every git commit.

- The release process consists of a deploy on a demo/staging environment first. After successful deployment there, software is deployed into production.

- For destructive actions such as deletion of ingested intellectual objects, APTrust implemented a two-factor deletion process that requires an institutional administrator to approve a deletion request before the object is deleted.
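The approval step in the deletion process above can be illustrated with a small model. This is a hypothetical sketch, not the Pharos implementation: the class name, field names and the boolean role flag are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeletionRequest:
    """Hypothetical deletion request; not the actual Pharos schema."""

    object_identifier: str
    requested_by: str
    approved_by: Optional[str] = None

    def approve(self, approver: str, is_institutional_admin: bool) -> None:
        # Only an institutional administrator may approve the request.
        if not is_institutional_admin:
            raise PermissionError("approver must be an institutional administrator")
        self.approved_by = approver

    @property
    def may_delete(self) -> bool:
        # Deletion proceeds only after an approval has been recorded.
        return self.approved_by is not None


if __name__ == "__main__":
    request = DeletionRequest("example-bag", requested_by="depositor")
    print(request.may_delete)
    request.approve("institution-admin", is_institutional_admin=True)
    print(request.may_delete)
```

The point of the design is that the request and the approval are separate recorded steps, so a single action cannot delete an object.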

Communication Errors

Systems cannot assume that the network transfers they use to ingest or disseminate content will either succeed or fail within a specified time period, or will actually deliver the content unaltered. A recent study "suggests that between one (data) packet in every 16 million packets and one packet in 10 billion packets will have an undetected checksum error."
- All communication channels the APTrust repository uses are configured to use TLS/SSL[5] to communicate which ensures secure communication as well as data integrity in transit[6].

- Ingest/Submission: Before data is transmitted to APTrust for ingest, a tag manifest file that includes checksums for each individual file in the bag is added to the bag and used to verify data integrity during the ingest process. Bags without manifest files are invalid and are not ingested.

- Restore/Dissemination: When content restoration is requested, a distribution package is created consisting of the original files and metadata written to a BagIt bag conforming to the current APTrust BagIt profile. The files in the data directory have exactly the same names and bits as those sent to APTrust in the submission bag, and the metadata is written to tag files in the bag adhering to the current APTrust BagIt format.
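The manifest check described under Ingest/Submission can be sketched as follows. This is a minimal illustration, not the Exchange ingest code (which is written in Go): it assumes a BagIt-style manifest named manifest-md5.txt whose lines have the form "checksum path", and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def read_manifest(manifest_path: Path) -> dict[str, str]:
    """Parse manifest lines of the form '<checksum> <relative path>'."""
    entries = {}
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, rel_path = line.split(maxsplit=1)
            entries[rel_path.strip()] = checksum.lower()
    return entries


def file_digest(path: Path, algorithm: str) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bag(bag_dir: Path, manifest_name: str = "manifest-md5.txt",
               algorithm: str = "md5") -> list[str]:
    """Return a list of problems; an empty list means every listed file matched."""
    manifest_path = bag_dir / manifest_name
    if not manifest_path.is_file():
        # A bag without a manifest is treated as invalid.
        return [f"missing manifest: {manifest_name}"]
    problems = []
    for rel_path, expected in read_manifest(manifest_path).items():
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(f"missing file: {rel_path}")
        elif file_digest(target, algorithm) != expected:
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

Calling verify_bag(Path("path/to/unpacked/bag")) returns an empty list for a bag whose files all match the manifest, and a list of messages otherwise.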

Failure of Network Services

Systems must anticipate that the external network services they use, including resolvers such as those for domain names [39] and persistent URLs [44], will suffer both transient and irrecoverable failures both of the network services and of individual entries in them. As examples, domain names will vanish or be reassigned if the registrant fails to pay the registrar, and a persistent URL will fail to resolve if the resolver service fails to preserve its data with as much care as the digital preservation service.
APTrust currently uses only one DNS provider (Network Solutions). If that DNS service fails, there is no secondary DNS to fail over to, which may affect the repository services.

Ingest: The ingest services run on a single node that does not rely on DNS, as all services access each other locally. Failure of either is presumed to be transient, and thus to delay but not prevent ingest.

Restore: The restore process is triggered through the repository's web frontend, Pharos, which would be unreachable in case of resolver issues, preventing new restore requests. Failure of either is presumed to be transient, and thus to delay but not prevent restore.

Risk: Since all services and infrastructure reside in a cloud environment that has redundancy and adheres to SLAs, long outages are unlikely. In case of a DNS outage, system administrators could define local resolvers to keep internal services running. The external repository frontend may be compromised.

Media & Hardware Obsolescence

All media and hardware components will eventually fail. Before that, they may become obsolete in the sense of no longer being capable of communicating with other system components or being replaced when they do fail. This problem is particularly acute for removable media, which have a long history of remaining theoretically readable if only a suitable reader could be found.
APTrust runs solely on AWS infrastructure, which is subject to the shared responsibility model[7]. AWS assumes responsibility for replacing underlying hardware and for handling its obsolescence. APTrust is informed of degraded hardware and of any necessary migration.
AWS Shared Responsibility Matrix

Risk: The assumed risk is low, as APTrust uses Infrastructure as a Service (IaaS), which removes the need to manage physical hardware and makes hardware failure a non-issue.

Software Obsolescence

Similarly, software components will become obsolete. This will often be manifested as format obsolescence when, although the bits in which some data was encoded remain accessible, the information can no longer be decoded from the storage format into a legible form.
All software used in the APTrust repository is open source, and most of it is heavily community supported. We regularly update and upgrade software libraries to avoid security issues and out-of-date software. Wherever possible, APTrust keeps a copy of software dependencies with its software.

Data format obsolescence is a risk which mostly lies with the depositors. The repository is kept up-to-date and functional as far as preservation activities go.

Since all the software is free and open-source, no financial provision need be made for its replacement or upgrade.

Risk: The risk of software obsolescence is assessed as low. All software components are managed and updated on a regular basis.

Operator Error

Operator actions must be expected to include both recoverable and irrecoverable errors. This applies not merely to the digital preservation application itself, but also to the operating system on which it is running, the other applications sharing the same environment, the hardware underlying them, and the network through which they communicate.
Software:

Each software component of the repository that is developed by APTrust staff is unit tested with every commit. Before ingest, restore or fixity services are deployed, integration tests are executed to ensure predictable functionality and avoid bugs.

Infrastructure & Configuration:

APTrust strives to use automation wherever possible. With the advent of infrastructure-as-code and the application of software engineering principles to infrastructure management, many risks are mitigated or lowered by carefully crafted and tested automation.

Security:

APTrust employs several security principles throughout the development, deployment, and administrative processes. As part of the defense-in-depth principle, the environments (production, demo, and test) are separated both physically (deployed on different servers) and virtually (AWS security groups prohibit network access). Each environment has its own set of credentials. Even if an operator were to deploy a demo environment on a production instance, none of the access credentials would work, which avoids possible destructive actions.

Risk: Because software engineering principles and a suite of tests are applied before anything is put into production, the risk of error is assessed as low.

Natural Disaster

APTrust's content is stored in geographically diverse locations. The primary S3 storage is part of the US-Standard region and routes copies across Northern Virginia and the Pacific Northwest[8].

A disaster affecting Northern Virginia may interrupt ingest, restore and web frontend services, but because most of the infrastructure is provisioned automatically, it can be completely re-provisioned within an hour in a different geographic location.

Risk: Given that the risk of a natural disaster in Northern Virginia is relatively low,[9] the risk of affecting ingest/restore services is low.

External Attack

As documented in APTrust Security, each instance is configured with minimal outside access (if any) to prevent malicious actions. All communications among nodes and services are encrypted using SSH and TLS. Database instances are not publicly accessible and are protected by firewall rules that allow access only from certain instances. The surface available to an external attacker is thus minimized.

Each server instance is patched on a regular schedule (every 24 hours); security updates are applied immediately. Non-security updates may require administrator intervention but are usually applied bi-weekly.

Firewall rules (AWS security groups) are reviewed monthly.

The following precautions are taken to prevent unauthorized access to APTrust's web interface:

  • Access via SSL/TLS only.
  • Firewall rules deny all traffic except web.
  • Administrative shell access is limited to designated APTrust operators.
  • Access only with an account and password credential.
  • Separation of administrative and user accounts per institution.

Risk: A security update could be released that is faulty. If an attacker gained access to the web interface using someone else's access credentials, they would be able to issue deletion or restore requests. Deletion requests, however, require a second factor (an administrator's permission). The risk is assessed as moderate.

Internal Attack

Members of the APTrust staff have delineated roles, responsibilities and authorizations regarding changes and access.

  • Administrative access to nodes is limited to 3 employees. Only two employees have superuser (sudo) access.
  • All access is logged.
  • Non-technical staff has administrative access to the web interface. Staff that has administrative access has been vetted.

Risk: There is a risk that staff could act maliciously. However, the risk is limited to the two employees with sudo (system-level) access and the three with administrative access. All employees are vetted by background checks before being hired, and malicious intent is unlikely. The risk is moderate.

Economic Failure

APTrust is funded by its sustaining members through a yearly fee, as well as by the University of Virginia, which contributes funds above the nominal yearly fee. APTrust keeps to a budget and maintains enough reserves to sustain operations beyond a 1-year period without any incoming funds. Loss of APTrust Repository sustaining members would reduce funding but not affect its operations.

APTrust has been economically sustainable for the past 4 years.

Risk: The risk of economic failure is low as the organization is healthy and operates economically. However loss of multiple sustaining members would increase the risk of failure. The current risk is assessed as low.

Organizational Failure

APTrust has not formalized a succession plan yet but is well aware of the risk. In case of an organizational failure, deposits could remain in AWS, and custody and costs could be transferred to their owner/sustaining member.

APTrust has drafted a succession policy that, once approved, can mitigate this risk.

Risk: The risk is assessed as moderate.

Relevant Documents

  1. Requirements for Digital Preservation Systems: A Bottom-Up Approach
  1. https://aws.amazon.com/compliance/shared-responsibility-model/
  2. TALAGALA, N. Characterizing Large Storage Systems: Error Behavior and Performance Benchmarks. PhD thesis, CS Div., Univ. of California at Berkeley, Berkeley, CA, USA, Oct. 1999. <http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1066.pdf>
  3. https://aws.amazon.com/s3/faqs/
  4. https://aws.amazon.com/glacier/faqs/
  5. https://en.wikipedia.org/wiki/Transport_Layer_Security
  6. https://en.wikipedia.org/wiki/Message_authentication_code#Message_integrity_codes
  7. This shared model can help relieve customer’s operational burden as AWS operates, manages and controls the components from the host operating system and virtualization layer down to the physical security of the facilities in which the service operates. The customer assumes responsibility and management of the guest operating system (including updates and security patches), other associated application software as well as the configuration of the AWS provided security group firewall. Customers should carefully consider the services they choose as their responsibilities vary depending on the services used, the integration of those services into their IT environment, and applicable laws and regulations. The nature of this shared responsibility also provides the flexibility and customer control that permits the deployment. As shown in the chart below, this differentiation of responsibility is commonly referred to as Security “of” the Cloud versus Security “in” the Cloud. https://aws.amazon.com/compliance/shared-responsibility-model/
  8. Preservation and Storage#Geographic Diversity
  9. https://www.washingtonpost.com/news/capital-weather-gang/wp/2013/08/29/d-c-s-maryland-suburbs-have-among-nations-lowest-disaster-risk/?utm_term=.151d09279e15

Monitoring

Member API

The member-api allows depositors to programmatically query for status of events, files, intellectual objects or items in a work queue.


Preservation and Storage

APTrust breaks bags into individual files upon ingest and saves each file to Amazon's S3 storage in Northern Virginia and to Glacier storage in Oregon. We maintain a central registry describing all intellectual objects, which files belong to each object, and where those files are stored. The registry is a SQL database that is replicated across multiple AWS availability zones in Virginia, and to Oregon. We also store daily and weekly snapshots of the database on an EBS (disk) drive in Amazon's Virginia data center. (This is the SQL database underlying our Pharos application, which provides access to data through both a Web UI and a REST API.)

In addition to the SQL database, we tag each file stored in S3 and Glacier with the following metadata:

  • Content Type - This is stored as a MIME type, e.g. image/jpeg, text/plain, etc.
  • Bag - This is the name of the intellectual object to which the file belongs.
  • Bag Path - This is the original path of the file within the bag (SIP) when we received the bag for ingest.
  • Institution - The domain name of the institution that owns the file. For example, virginia.edu, ncsu.edu, etc.
  • MD5 - The md5 checksum of the file.
  • SHA256 - The sha256 checksum of the file.

If APTrust were to lose its copies of the SQL database in both availability zones in Virginia, and the copy in Oregon, and the backups, we would still know where all of our preservation files are. We can locate and retrieve all AIPs and reconstruct our registry (except for PREMIS events) based on the metadata attached to those files.
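The reconstruction described above amounts to grouping files by the institution and bag recorded in their storage metadata. The following is a rough sketch under stated assumptions: the dictionary keys (institution, bag, bag_path, content_type, md5, sha256) are hypothetical stand-ins for the metadata fields listed above, and fetching the metadata from S3 or Glacier is left out.

```python
from collections import defaultdict


def rebuild_registry(file_metadata):
    """Group per-file metadata records into intellectual objects.

    Each record is a dict with the hypothetical keys 'institution', 'bag',
    'bag_path', 'content_type', 'md5' and 'sha256'.
    """
    registry = defaultdict(list)
    for record in file_metadata:
        # An intellectual object is identified here by (institution, bag).
        object_key = (record["institution"], record["bag"])
        registry[object_key].append(
            {
                "path": record["bag_path"],
                "content_type": record["content_type"],
                "md5": record["md5"],
                "sha256": record["sha256"],
            }
        )
    # Sort files by path so the rebuilt listing is deterministic.
    return {
        key: sorted(files, key=lambda f: f["path"])
        for key, files in registry.items()
    }
```

PREMIS events are not part of the per-file metadata, which is why a rebuild of this kind cannot recover them.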

This is an example of the metadata attached to each of the files stored in our Virginia/S3 and Oregon/Glacier preservation storage areas.

The image to the right shows an example of the metadata attached to each file.

DPN

Software Development Styleguide

The Software Development Styleguide documents practices and policies that APTrust engineering staff adheres to.

Support and Maintenance

Operations procedures and detailed technical documentation.