Definition of AIP

OAIS Archival Information Package (AIP)

The OAIS definition of AIP is:
Archival Information Package (AIP): An Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS.
The OAIS discussion of AIP is:
Within the OAIS one or more SIPs are transformed into one or more Archival Information Packages (AIPs) for preservation. The AIP has a complete set of PDI for the associated Content Information. The Packaging Information of the AIP will conform to OAIS internal standards, and it may vary as it is managed by the OAIS.
The OAIS definition of Information Property is:
Information Property: That part of the Content Information as described by the Information Property Description. The detailed expression, or value, of that part of the information content is conveyed by the appropriate parts of the Content Data Object and its Representation Information.
The OAIS discussion of Information Properties is:
The Producer may provide, or the Archive may itself define, as part of the Provenance Information, Information Property Descriptions of Information Properties which should be maintained over time, and indeed may provide Information Property Descriptions of Information Properties which do not need to be maintained over time. An Information Property is that part of the Content Information as described by the Information Property Description. An Information Property Description is a description of a part of the information content of a Content Information object that is highlighted for a particular purpose. The detailed expression, or value, of that part of the information content is conveyed by the appropriate parts of the Content Data Object and its Representation Information. For example, consider a simple digital book which when rendered appears as pages with margins, title, chapter headings, paragraphs, and text lines composed of words and punctuation. Information Property Descriptions for Information Properties that must be preserved could be expressed as ‘paragraph identification’ and ‘characters expressing words and punctuation’. The Information Properties would consist of all the book’s paragraph identifications, words, and punctuation as expressed by the Content Data Object and its Representation Information. This means that all formatting other than the recognition of paragraphs and readable text could be altered while still maintaining required preservation. The Archive may express an evaluation of the Authenticity of its holdings, based on community practice and recommendations (including best practices, guidelines, standards, and legal requirements). For example scientific Archives may have less stringent evaluation criteria than State Archives; however, the Consumer may make his/her own judgment of the Authenticity starting with the evidence obtained from PDI.
The OAIS definition of Representation Information is:
Representation Information: The information that maps a Data Object into more meaningful concepts. An example of Representation Information for a bit sequence which is a FITS file might consist of the FITS standard which defines the format plus a dictionary which defines the meaning in the file of keywords which are not part of the standard. Another example is JPEG software which is used to render a JPEG file; rendering the JPEG file as bits is not very meaningful to humans but the software, which embodies an understanding of the JPEG standard, maps the bits into pixels which can then be rendered as an image for human viewing.
The OAIS discussion of Representation Information is:
In general, it can be said that ‘Data interpreted using its Representation Information yields Information’, ... In order for this Information Object to be successfully preserved, it is critical for an OAIS to identify clearly and to understand clearly the Data Object and its associated Representation Information. For digital information, this means the OAIS must clearly identify the bits and the Representation Information that applies to those bits. ... As a further complication, the recursive nature of Representation Information, which typically is composed of its own data and its own Representation Information, typically leads to a network of Representation Information objects. Since a key purpose of an OAIS is to preserve information for a Designated Community, the OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained. The OAIS should then make a decision between maintaining the minimum Representation Information needed for its Designated Community, or maintaining a larger amount of Representation Information that may allow understanding by a larger Consumer community with a less specialized Knowledge Base, which would be the equivalent of extending the definition of the Designated Community. Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding. The choice, for an OAIS, to collect all the relevant Representation Information or to reference its existence in another trusted or partner OAIS Archive, is an implementation and organization decision.
Other relevant OAIS definitions:
Content Data Object: The Data Object, that together with associated Representation Information, comprises the Content Information.
Content Information: A set of information that is the original target of preservation or that includes part or all of that information. It is an Information Object composed of its Content Data Object and its Representation Information.
Data Object: Either a Physical Object or a Digital Object.
Information Object: A Data Object together with its Representation Information.
Package Description: A structured form of Descriptive Information which enables the Consumer to locate information of potential interest, analyze that information, and order desired information.
Packaging Information: That information which, either actually or logically, binds or relates the components of the package into an identifiable entity on specific media.
Preservation Description Information (PDI): The information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, Context, and Access Rights Information.

APTrust Archival Information Package (AIP)

Definition

An APTrust AIP consists of a unique identifier for an intellectual object (content information), generic files (data objects), characterization data (representation information), and PREMIS events (preservation description information). These AIP components are not stored as discrete packages but are broken apart for management and storage as described below in Transforming SIPs into AIPs.
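As a rough illustration of how these four components relate, the sketch below models them as Python dataclasses. The class and field names are hypothetical and chosen only to mirror the terms above; they are not taken from the APTrust codebase or database schema.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class PremisEvent:
    """Preservation description information: one logged interaction."""
    event_type: str  # illustrative label, e.g. "ingest" or "fixity_check"
    date_time: str   # ISO 8601 timestamp
    outcome: str


@dataclass
class GenericFile:
    """A data object together with its characterization data."""
    identifier: str
    characterization: dict = field(default_factory=dict)  # representation information
    events: List[PremisEvent] = field(default_factory=list)


@dataclass
class IntellectualObject:
    """Content information, addressed by a unique identifier."""
    identifier: str
    title: str
    generic_files: List[GenericFile] = field(default_factory=list)
    events: List[PremisEvent] = field(default_factory=list)


# Hypothetical example values, for illustration only.
obj = IntellectualObject(identifier="example.edu/sample_bag", title="Sample bag")
obj.generic_files.append(GenericFile(identifier=str(uuid.uuid4())))
```

Keeping events on both the object and each file reflects the statement that PREMIS events are recorded for the bag as a whole and for individual generic files.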

Content Information and Information Properties

Content information includes one or more data objects and the representation information that describes the meaning of each data object. APTrust collectively refers to the content information as intellectual objects and to data objects as generic files. Each data object receives file characterization using FITS (File Information Tool Set from Harvard Library); the resulting FITS output forms the representation information.

APTrust defines content information as whatever the member institution includes in the data directory of the SIP. A member institution may include an entire collection of related data objects or a single data object at its discretion, in either case with the relevant representation information. Packaging information is considered to be part of the content information. APTrust SIPs are submitted as uncompressed tar archives. When AIPs are transformed into DIPs for restoration, APTrust commits to returning an exact copy of the SIP, with the exception of using the latest APTrust bagging specification for transporting packages.

APTrust preserves the entire set of information properties associated with the content information as submitted in the SIP. In its core service, APTrust does NOT perform specific file-changing "preservation actions" on the files during or after ingest. (This includes actions such as file format migration, file normalization, and the creation of descriptive metadata and rights management.) In its core service, APTrust does not provide access to deposited data files to any party other than the depositor and APTrust staff.

Preservation Description Information (PDI)

APTrust combines member-submitted data (part of the SIP), FITS-generated representation information, and system-generated events to create PDI.

  • Provenance: All interactions with content are recorded as PREMIS events to provide a complete audit trail. At the time of SIP upload, the member's unique identifier is logged, along with the date and time. Additional events are created for each data object (generic file) when it is captured, assigned access rights, assigned an identifier, has a checksum calculated, or is ingested, restored, or deleted.
  • Reference: As each piece of content information (intellectual object) is deposited into APTrust, it is given a unique identifier. This is composed of the member institution’s domain name and the name of the content information (bag name). The identifier is used to locate specific content within APTrust.
  • Fixity: All SIPs are uploaded with a content manifest that includes checksums for each data object (generic file). This is used to verify fixity upon ingest. An additional checksum using the SHA-256 algorithm is generated when each data object receives file characterization. Both of the checksums are stored in the representation information for each data object. Data objects undergo a fixity audit every 90 days to ensure that content has not been altered in an undocumented manner.
  • Context: The context for content information (intellectual object) is supplied by the member as part of the SIP. It consists of associated metadata, including title and description. Members may include additional descriptive information as it relates to the content information. APTrust stores the related intellectual object (content information) in the metadata for each generic file (data object).
  • Access Rights: APTrust is a dark archive. Access controls define who can contribute content based on the member institution’s unique identifier. A member may have designated users and administrators. Members may designate one of three levels of access for content information (intellectual object) upon submission: consortia, institution, and restricted. These access levels refer only to the content information as a whole, within APTrust. It is up to the member institution to manage any access rights of the general public as it sees fit.
    • Consortia: All APTrust members can view the intellectual object and its associated generic files. Only the owning institution's users and administrators can delete or restore intellectual objects.
    • Institution: Only users and administrators from the intellectual object's owning institution can view the intellectual object and its associated generic files and delete or restore intellectual objects.
    • Restricted: Users from the intellectual object's owning institution can view the intellectual object and its associated generic files. Only administrators from the intellectual object's owning institution can delete or restore intellectual objects.
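
The three access levels can be read as a small permission table. The following is a minimal sketch of that reading, not APTrust's actual authorization code: the function and parameter names are hypothetical, and it assumes that administrators of the owning institution can view restricted objects just as ordinary users can, which the text does not state explicitly.

```python
from enum import Enum


class Access(Enum):
    CONSORTIA = "consortia"
    INSTITUTION = "institution"
    RESTRICTED = "restricted"


def can_view(access, owner, user_institution):
    """Viewing: any member for consortia, otherwise the owning institution only."""
    if access is Access.CONSORTIA:
        return True
    return user_institution == owner


def can_delete_or_restore(access, owner, user_institution, is_admin):
    """Delete/restore is always limited to the owning institution."""
    if user_institution != owner:
        return False
    if access is Access.RESTRICTED:
        return is_admin  # restricted objects require an administrator
    return True  # consortia and institution: users and administrators


# A user from another member institution can view a consortia object
# but cannot delete or restore it.
print(can_view(Access.CONSORTIA, "a.example.edu", "b.example.edu"))
print(can_delete_or_restore(Access.CONSORTIA, "a.example.edu", "b.example.edu", True))
```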

Transforming SIPs into AIPs

  1. Depositor uploads object to S3 receiving bucket. Object is in APTrust BagIt format.
  2. System retrieves the bag from the receiving bucket and validates it.
  3. System copies bag files individually to S3/Virginia. Each file is given a UUID and is tagged with metadata, including original bag name and file name.
  4. System copies bag files individually to Glacier/Oregon with same UUID and metadata as S3.
  5. System copies all bag and file metadata to a SQL database.

APTrust assumes responsibility for the uploaded contents after step 5.

An APTrust member institution will prepare a SIP according to the APTrust Bag-It Specification.

In APTrust, AIPs are not stored as discrete packages. Rather, the Content Information is stored separately from the PDI, which is stored in an SQL database (Postgres).

  1. APTrust member institution uploads SIP (tarred bag) to an Amazon Web Services bucket assigned to the member institution (receiving bucket).
  2. On an hourly basis, APTrust checks the bucket for new content. Upon receipt of a SIP, APTrust will replicate the bag to a working environment where the bag is validated.
  3. APTrust will create an intellectual object for the SIP and assign a unique identifier. Tag files submitted with the bag will be parsed to index PDI, and APTrust then extracts the contents of the tarred bag. A generic file is created for each data object and assigned its own unique identifier. Each generic file will undergo characterization via FITS to generate representation information that is indexed. Then, generic files are replicated thrice to Amazon S3 (Virginia, United States region). Each of the three instances in Amazon S3 will share the same unique identifier and metadata.
  4. From the three replicated copies in Amazon S3, APTrust will replicate three additional copies to Amazon Glacier (Oregon, United States region), again sharing a unique identifier and metadata.
  5. After all six copies have been replicated, APTrust will add AWS-generated metadata to the SQL database. Metadata includes one record for the Intellectual Object (the bag itself) and one Generic File record for each file copied from the bag to long-term storage. In addition, APTrust creates PREMIS events describing the ingestion of the bag, and the ingestion, identifier assignment, initial MD5 and SHA-256 fixities, and storage location of each file in the bag.
  6. At this point, the bag is fully ingested, and APTrust assumes responsibility for its contents. APTrust does not assume responsibility for the bag before it is fully ingested.
  7. The original SIP (tarred bag) is then deleted from the S3 receiving bucket and temporary processing area. The SIPs are not stored in an APTrust repository but can be recreated from the AIP.
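
The metadata-gathering part of this workflow can be sketched with the Python standard library alone. The example below reads an uncompressed tarred bag, builds the object identifier from the member's domain name and the bag name, and produces one record per file with a UUID and initial MD5 and SHA-256 checksums. It is an illustration under stated assumptions, not the APTrust ingest code: the function names, the record layout, and the "/" separator in the identifier are hypothetical, and bag validation, FITS characterization, and replication to S3 and Glacier are omitted.

```python
import hashlib
import io
import tarfile
import uuid


def object_identifier(domain, bag_name):
    # The "/" separator is an assumption; the text only says the
    # identifier combines the domain name and the bag name.
    return f"{domain}/{bag_name}"


def ingest_records(tar_bytes, domain, bag_name):
    """Return one metadata record per regular file in a tarred bag."""
    records = []
    # Mode "r:" reads a tar archive exclusively without compression.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            records.append({
                "object": object_identifier(domain, bag_name),
                "uuid": str(uuid.uuid4()),
                "path": member.name,
                "md5": hashlib.md5(data).hexdigest(),
                "sha256": hashlib.sha256(data).hexdigest(),
            })
    return records


# Build a tiny in-memory bag so the example is self-contained.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    payload = b"hello"
    info = tarfile.TarInfo(name="sample_bag/data/hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

records = ingest_records(buf.getvalue(), "example.edu", "sample_bag")
```

Each record corresponds loosely to a Generic File row in the SQL database, with the two checksums standing in for the initial fixity events.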

The metadata stored in the SQL database forms the Package Description. It contains the Associated Description, which supplies data to the Retrieval Aid (the APTrust repository) so that authorized users can retrieve the Content Information and PDI.

Every 90 days, the Amazon S3 instance undergoes a fixity audit. All three copies are retrieved from Amazon S3 into a working environment where both MD5 and SHA-256 checksums are calculated and compared to the checksums stored in the SQL database. When checksums match, APTrust will then compare the checksums to the stored SHA-256 checksum generated by Amazon Glacier for the instances there. If corruption is found in at least one instance while at least one other instance is valid, APTrust will repair the corrupted copy with a new copy from the valid instance. If all three copies in Amazon S3 are found to be corrupt, the Amazon Glacier instances will be replicated for additional comparison. These fixity audits are logged as PREMIS events and attached to each generic file.
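
The decision logic of the audit described above can be sketched as follows. This is a simplified model in which each copy is an in-memory byte string; the function names and return shape are hypothetical, and retrieval from S3, the comparison against Glacier's stored checksum, and PREMIS event logging are left out.

```python
import hashlib


def checksums(data):
    """Return the (MD5, SHA-256) hex digests for one copy."""
    return hashlib.md5(data).hexdigest(), hashlib.sha256(data).hexdigest()


def audit_copies(copies, stored_md5, stored_sha256):
    """Compare each copy to the stored checksums and repair where possible.

    Returns (copies_after_repair, corrupt_indexes, needs_glacier).
    """
    expected = (stored_md5, stored_sha256)
    valid = [i for i, data in enumerate(copies) if checksums(data) == expected]
    corrupt = [i for i in range(len(copies)) if i not in valid]
    if not valid:
        # No trustworthy copy here: fall back to the Glacier instances.
        return list(copies), corrupt, True
    good = copies[valid[0]]
    repaired = [good if i in corrupt else data for i, data in enumerate(copies)]
    return repaired, corrupt, False


good = b"payload"
md5, sha256 = checksums(good)
repaired, corrupt, needs_glacier = audit_copies([good, b"bit rot", good], md5, sha256)
```

In this run the second copy fails both checksums, so it is replaced from a valid copy and no Glacier retrieval is needed.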

Relevant Documents