Risk Management, Threats, and Mitigations


Risk Management describes the forecasting and evaluation of risks together with the identification of procedures to avoid or minimize their impact. Due to the repository's architecture and its use of Amazon Web Services (AWS), some of the threats and mitigations are covered by the Shared Security Responsibility Model[1] and are therefore shared between APTrust and AWS as the infrastructure provider.

The system architecture and operations policies of APTrust are based on the threat model formalized in a 2005 paper published by the LOCKSS team, Requirements for Digital Preservation Systems: A Bottom-Up Approach, and on periodic reviews of code, configuration, and policies. The identified threats are not unique to digital preservation repositories, but they sometimes require different mitigation strategies due to the long time horizon over which preservation systems are built and maintained.

The paper identified the following threats:

Media Failure

All storage media must be expected to degrade with time, causing irrecoverable bit errors, and to be subject to sudden catastrophic irrecoverable loss of bulk data such as disk crashes[2] or loss of off-line media.
Mitigation

- APTrust stores multiple copies of data objects in multiple locations (Virginia and Oregon) to mitigate localized failures.

- Regular fixity checks ensure the accuracy and authenticity of stored data objects (a minimal fixity-check sketch follows this list).

- The storage backend used, AWS Simple Storage Service (S3), is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects; for example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities[3]. The second-tier storage, Amazon Glacier, also redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores data across multiple facilities before returning SUCCESS on uploading archives. Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing[4].

- Storage media are managed by AWS, and underlying "disk crashes" are mitigated by the vendor as part of its shared security responsibilities[1].
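
The fixity checks described above amount to recomputing each stored file's checksum and comparing it against the value recorded at ingest. The following is a minimal, illustrative sketch in Python; the registry format and function names are hypothetical and do not describe APTrust's actual codebase.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_fixity_check(registry: dict[str, str], storage_root: Path) -> list[str]:
    """Compare current checksums against the values recorded at ingest.

    `registry` (hypothetical) maps a relative file path to its expected
    SHA-256 digest. Returns the list of paths whose fixity check failed,
    e.g. so they can be repaired from a replica copy.
    """
    failures = []
    for rel_path, expected in registry.items():
        if sha256_of(storage_root / rel_path) != expected:
            failures.append(rel_path)
    return failures
```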

Hardware Failure

All hardware components must be expected to suffer transient recoverable failures, such as power loss, and catastrophic irrecoverable failures, such as burnt-out power supplies.
Mitigation

- AWS provides all of our infrastructure as Infrastructure as a Service (IaaS), which covers monitoring, management, and replacement of faulty hardware.

- If hardware outages occur, APTrust is able to recover quickly due to its automated provisioning and configuration management; new server instances can be provisioned quickly.

- Due to continuous monitoring and management of hardware by AWS, failures are rare.

Software Failure

 All software components must be expected to suffer from bugs that pose a risk to the stored data.
Mitigation

- APTrust adheres to software development guidelines that include writing unit and integration tests alongside the software. We employ a continuous integration service (Travis CI) that runs unit tests after every git commit. Failing unit tests are fixed as soon as they become apparent. Travis CI actively notifies the APTrust team via Slack about success or failure.

- In addition to unit tests, APTrust runs integration tests to make sure services are running and interacting as a system as expected. Integration tests are run locally before every git commit.

- The release process consists of deploying to a demo/staging environment first; after a successful deployment there, software is deployed into production.

- For destructive actions like deletion of ingested intellectual objects, APTrust has implemented a two-factor deletion process that requires an institutional administrator to approve a deletion request before the object is deleted (see the sketch below).
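
The two-factor deletion workflow can be pictured roughly as follows: a deletion request is recorded, but the object is only removed once a second account belonging to an institutional administrator approves it. This is a hedged sketch; the class and field names are hypothetical and only illustrate the approval gate, not APTrust's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DeletionRequest:
    object_identifier: str
    requested_by: str                  # depositor user who asked for deletion
    approved_by: Optional[str] = None  # institutional admin, set on approval

    def approve(self, admin_user: str, institutional_admins: set[str]) -> None:
        """Record approval only if it comes from a distinct institutional admin."""
        if admin_user not in institutional_admins:
            raise PermissionError("approval must come from an institutional administrator")
        if admin_user == self.requested_by:
            raise PermissionError("requester cannot approve their own deletion request")
        self.approved_by = admin_user

    def execute(self, delete_fn: Callable[[str], None]) -> None:
        """Delete only once both factors (request + admin approval) are present."""
        if self.approved_by is None:
            raise RuntimeError("deletion not approved; object is retained")
        delete_fn(self.object_identifier)
```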

Communication Errors

Systems cannot assume that the network transfers they use to ingest or disseminate content will either succeed or fail within a specified time period, or will actually deliver the content unaltered. A recent study "suggests that between one (data) packet in every 16 million packets and one packet in 10 billion packets will have an undetected checksum error."
- All communication channels the APTrust repository uses are configured to use TLS/SSL[5], which ensures secure communication as well as data integrity in transit[6].

- Ingest/Submission: Before data is transmitted to APTrust for ingest, a tag manifest file that includes checksums for each individual file in the bag is included and used to verify data integrity during the ingest process. Bags without manifest files are invalid and are not ingested (a minimal manifest-verification sketch follows this list).

- Restore/Dissemination: When content restoration is requested, a distribution package is created consisting of the original files and metadata, written to a BagIt bag conforming to the current APTrust BagIt profile. The files in the data directory will have the exact same names and bits that were sent to APTrust in the submission bag, and the metadata is written to tag files in the bag adhering to the current APTrust BagIt format.
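
To make the manifest-based integrity check concrete, the sketch below parses a BagIt-style manifest (lines of the form "<checksum>  <relative path>") and verifies each listed file. It is only an illustration of the idea under those assumptions, not APTrust's ingest code; the function name and default manifest filename are placeholders.

```python
import hashlib
from pathlib import Path


def verify_bag_manifest(bag_dir: Path, manifest_name: str = "manifest-sha256.txt") -> list[str]:
    """Verify every file listed in a BagIt manifest; return the paths that fail.

    Each manifest line has the form "<sha256 checksum>  <relative file path>".
    A bag with no manifest is treated as invalid.
    """
    manifest = bag_dir / manifest_name
    if not manifest.exists():
        raise ValueError("bag has no manifest and is therefore invalid")

    failures = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        digest = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if digest != expected:
            failures.append(rel_path)
    return failures
```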

Failure of Network Services

Systems must anticipate that the external network services they use, including resolvers such as those for domain names and persistent URLs, will suffer both transient and irrecoverable failures, both of the network services and of individual entries in them. As examples, domain names will vanish or be reassigned if the registrant fails to pay the registrar, and a persistent URL will fail to resolve if the resolver service fails to preserve its data with as much care as the digital preservation service.
APTrust currently uses only one DNS provider (Network Solutions); if that DNS service fails, there is no secondary DNS provider to fail over to, which may affect the repository services.

Ingest: The ingest services run on a single node that does not rely on DNS, as all services access each other locally. Such failures are presumed to be transient, and thus to delay but not prevent ingest.

Restore: The restore process is triggered through the repository's web frontend, Pharos (APTrust's web interface for managing deposits and inspecting deposit outcomes), which in case of resolver issues would not be reachable and would prevent new restore requests. Such a failure is presumed to be transient, and thus to delay but not prevent restore.

Risk: Since all services and infrastructure reside in a cloud environment that has redundancy and adheres to SLAs, long outages are unlikely. In case of a DNS outage, system administrators could define local resolvers to keep internal services running; availability of the external repository frontend may still be affected.

Media & Hardware Obsolescence

All media and hardware components will eventually fail. Before that, they may become obsolete in the sense of no longer being capable of communicating with other system components or being replaced when they do fail. This problem is particularly acute for removable media, which have a long history of remaining theoretically readable if only a suitable reader could be found.
APTrust runs solely on AWS infrastructure, which is subject to the shared responsibility model[7]. AWS assumes responsibility for replacement of underlying hardware and for hardware obsolescence. APTrust is informed of degraded hardware and whether migration is necessary.
AWS Shared Responsibility Matrix

Risk: The assumed risk is low, as APTrust uses Infrastructure as a Service (IaaS), which removes the need to manage physical hardware and makes individual hardware failures a non-issue for APTrust.

Software Obsolescence

Similarly, software components will become obsolete. This will often be manifested as format obsolescence when, although the bits in which some data was encoded remain accessible, the information can no longer be decoded from the storage format into a legible form.
All software used in the APTrust repository is open source and most of it is heavily community supported. We regularly update and upgrade software libraries to avoid security issues or out-of-date software. Wherever possible, APTrust keeps a copy of software dependencies alongside its software.

Data format obsolescence is a risk which mostly lies with the depositors. The repository is kept up-to-date and functional as far as preservation activities go.

Since all the software is free and open-source, no financial provision need be made for its replacement or upgrade.

Risk: The risk of software obsolescence is assessed low. All software components are managed and updated on a regular basis.

Operator Error

Operator actions must be expected to include both recoverable and irrecoverable errors. This applies not merely to the digital preservation application itself, but also to the operating system on which it is running, the other applications sharing the same environment, the hardware underlying them, and the network through which they communicate.
Software:

Each software component of the repository developed by APTrust staff is unit tested with every commit. Before ingest, restore, or fixity services are deployed, integration tests are executed to ensure predictable functionality and avoid bugs.

Infrastructure & Configuration:

APTrust strives to use automation wherever possible. With the advent of infrastructure-as-code and the application of software engineering principles to infrastructure management, many risks are mitigated or lowered by carefully crafted and tested automation.

Security:

APTrust employs several security principles throughout the development, deployment, and administrative processes. As part of the defense-in-depth principle, each environment (production, demo, and test) is separated physically (deployed on different servers) and virtually (AWS security groups prohibit network access between them). Each environment has its own set of credentials; even if an operator were to deploy a demo environment on a production instance, none of the access credentials would work, thereby preventing possible destructive actions.

Risk: By applying software engineering principles and running a suite of tests before anything is put into production, the risk of operator error is assessed as low.

Natural Disaster

APTrust's content is stored in geographically diverse locations. The primary S3 storage is part of the US-Standard region, and copies are replicated across Northern Virginia and the Pacific Northwest[8].

A disaster affecting Northern Virginia may interrupt ingest, restore, and web frontend services, but due to automated provisioning, most of the infrastructure can be completely re-provisioned within an hour in a different geographic location.

Risk: Given that the risk of a natural disaster in Northern Virginia is relatively low,[9] the risk of affecting ingest/restore services is low.

External Attack

As documented in APTrust Security, each instance is configured with minimal outside access (if any) to prevent malicious actions. All communications among nodes and services are encrypted using SSH and TLS. Database instances are not publicly accessible from the outside and are firewall-controlled, allowing access only from certain instances. The surface available to an external attacker is thus minimized.

Each server instance is patched on a regular schedule (every 24 hours); security updates are applied immediately. Non-security updates may require administrator intervention but are usually applied bi-weekly.

Firewall rules (AWS security groups) are reviewed monthly.

The following precautions are taken to prevent unauthorized access to APTrust's web interface:

  • Access via SSL/TLS only
  • Firewall rules deny all traffic except web traffic (a minimal security-group sketch follows this list)
  • Administrative shell access is limited to designated APTrust operators
  • Access only with an account and password credential
  • Separation of administrative and user accounts per institution
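
As an illustration of the "deny all except web" rule above: an AWS security group denies all inbound traffic by default, so exposing only HTTPS amounts to adding a single ingress rule. The boto3 snippet below is a hedged sketch; the security group ID is a placeholder, and the exact rule set APTrust uses may differ.

```python
import boto3

ec2 = boto3.client("ec2")

# Security groups deny all inbound traffic unless a rule explicitly allows it,
# so permitting only port 443 implements "deny all except web".
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder ID for the web frontend group
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "HTTPS from anywhere"}],
        }
    ],
)
```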

Risk: A security update could be released that is faulty. If an attacker were to gain access to the web interface using someone else's access credentials, they would be able to issue deletion or restore requests; deletion requests, however, require a second factor (an institutional administrator's approval). The risk is assessed as moderate.

Internal Attack

Members of the APTrust staff have delineated roles, responsibilities and authorizations regarding changes and access.

  • Administrative access to nodes is limited to three employees; only two of them have superuser (sudo) access.
  • All access is logged.
  • Non-technical staff have administrative access to the web interface. Staff with administrative access have been vetted.

Risk: There is a risk that staff could act maliciously. However, the risk is limited to the two employees with sudo (system-level) access and the three with administrative access. All employees are vetted by background checks before being hired, and malicious intent is unlikely. The risk is assessed as moderate.

Economic Failure

APTrust is funded by a yearly fee from its sustaining members as well as by the University of Virginia, which contributes above the nominal yearly fee. APTrust keeps to a budget and maintains enough reserves to sustain operations beyond a one-year period without any incoming funds. Loss of APTrust Repository sustaining members would reduce funding but not affect its operations.

APTrust has been economically sustainable for the past 4 years.

Risk: The risk of economic failure is low, as the organization is financially healthy and operates economically. However, the loss of multiple sustaining members would increase the risk of failure. The current risk is assessed as low.

Organizational Failure

In case of an organizational failure, deposits would still reside in AWS, and custody and costs could be transferred to their owner/sustaining member.

The APTrust Succession Plan, approved in 2018, mitigates this risk.

Risk: The risk is assessed as moderate.

Relevant Documents

  1. Requirements for Digital Preservation Systems: A Bottom-Up Approach

References

  1. https://aws.amazon.com/compliance/shared-responsibility-model/
  2. TALAGALA, N. Characterizing Large Storage Systems: Error Behavior and Performance Benchmarks. PhD thesis, CS Div., Univ. of California at Berkeley, Berkeley, CA, USA, Oct. 1999. http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1066.pdf
  3. https://aws.amazon.com/s3/faqs/
  4. https://aws.amazon.com/glacier/faqs/
  5. https://en.wikipedia.org/wiki/Transport_Layer_Security
  6. https://en.wikipedia.org/wiki/Message_authentication_code#Message_integrity_codes
  7. This shared model can help relieve the customer's operational burden, as AWS operates, manages, and controls the components from the host operating system and virtualization layer down to the physical security of the facilities in which the service operates. The customer assumes responsibility for and management of the guest operating system (including updates and security patches), other associated application software, as well as the configuration of the AWS-provided security group firewall. Customers should carefully consider the services they choose, as their responsibilities vary depending on the services used, the integration of those services into their IT environment, and applicable laws and regulations. This differentiation of responsibility is commonly referred to as Security "of" the Cloud versus Security "in" the Cloud. https://aws.amazon.com/compliance/shared-responsibility-model/
  8. Preservation and Storage#Geographic Diversity
  9. https://www.washingtonpost.com/news/capital-weather-gang/wp/2013/08/29/d-c-s-maryland-suburbs-have-among-nations-lowest-disaster-risk/?utm_term=.151d09279e15