This repository has been archived by the owner on Jun 27, 2020. It is now read-only.

Batch Ingest (2013 Redesign)

Jim Coble edited this page Apr 14, 2015 · 1 revision

Duke University Libraries has a number of digitized collections (primarily still images) whose digital master files are stored on a file system and for which descriptive and technical metadata exists in various formats and locations. The initial goal of our batch ingest process is to facilitate the ingest of these master files and their associated metadata into a preservation repository. Public access to these collections is handled through a separate system, so we are not currently concerned with either public access or the ingest of access derivatives.

History (v1.0)

Version 1.0 of the batch ingest process accomplished the intended goal but with code that was not well-designed.

Re-Design (v1.1)

A major refactoring of the process is underway for v1.1. At the time of this writing, the work in progress can be found in the batch-ingest branch.

Conceptually, it may be easier to start at the tail end of the redesigned process, with a set of normalized data ("ingest batch objects") that contains all the information needed to create the corresponding ActiveFedora objects; see the "Ingest Batch Objects" section below. After that, we'll consider one way we plan to produce these ingest batch objects in the first place; see "Ingest Manifests" further below.

Ingest Batch Objects

A key aspect of the new design is the introduction of an IngestBatchObject ActiveRecord model just prior to the actual ingest action. (IngestBatchObject extends a BatchObject model, with an eye toward possible later development of an UpdateBatchObject model.) Each IngestBatchObject object represents a single object that is to be ingested into the repository. It is intended to contain all the information needed to properly create the corresponding ActiveFedora object, including attributes for (ActiveFedora) model and object label, as well as related BatchObjectDatastream and BatchObjectRelationship objects containing information about the datastreams and relationships to be created during the ingest. The IngestBatchObject model also includes an attribute for storing the Fedora PID assigned to the object at ingest.

BatchObjectDatastream objects contain information about each datastream (other than DC and RELS-EXT) that should be created during the ingest. Attributes include the name of the datastream, the "payload" to be put into the datastream, either as a path to a file or as bytes (though we are implementing only the former at the moment), and, optionally, an externally calculated checksum that will be used for verifying the ingest.

BatchObjectRelationship objects contain information about relationships (other than hasModel) that should be created during the ingest. Attributes include the name of the relationship (e.g., "parent", "child", "admin_policy") and a designation of the object of the relationship (i.e., its PID, in our current implementation).
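The relationships among these three models can be sketched in plain Ruby. This is only an illustration, not the actual ActiveRecord schema: the attribute names (`payload_type`, `checksum`, `object`, and so on), the `validation_errors` helper, and the sample model name, path, and PID are all assumptions made for the example.

```ruby
# Plain-Ruby stand-ins for the ActiveRecord models described above.
# Attribute names are illustrative assumptions, not the real schema.
BatchObjectDatastream = Struct.new(:name, :payload, :payload_type, :checksum, keyword_init: true)
BatchObjectRelationship = Struct.new(:name, :object, keyword_init: true)

class BatchObject
  attr_reader :model, :label, :datastreams, :relationships
  attr_accessor :pid # Fedora PID, filled in once the object has been ingested

  def initialize(model:, label: nil)
    @model = model
    @label = label
    @datastreams = []
    @relationships = []
    @pid = nil
  end
end

class IngestBatchObject < BatchObject
  # Returns a list of problems that would prevent ingest (empty when valid).
  def validation_errors
    errors = []
    errors << "model is required" if model.to_s.empty?
    datastreams.each do |ds|
      # Only file-path payloads are handled, mirroring the current implementation.
      next if ds.payload_type == :filename
      errors << "datastream #{ds.name}: unsupported payload type #{ds.payload_type}"
    end
    errors
  end
end

# Hypothetical values throughout.
object = IngestBatchObject.new(model: "Component", label: "Page 1")
object.datastreams << BatchObjectDatastream.new(
  name: "content", payload: "/path/to/master.tif", payload_type: :filename
)
object.relationships << BatchObjectRelationship.new(name: "parent", object: "example:1")
puts object.validation_errors.inspect
```

Keeping the datastream and relationship details in their own records means an object can carry any number of either without changes to the object itself.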

IngestBatchObject objects are grouped into a Batch, which is processed by running BatchProcessor. BatchProcessor validates the IngestBatchObjects in the Batch and then ingests each one in turn by creating the appropriate ActiveFedora model object, setting its properties, and adding the appropriate datastreams and relationships. The BatchProcessor also performs a number of verification steps on each ingested object, including comparing the Fedora-generated checksum with the externally calculated checksum, if one is provided. Running the BatchProcessor produces a BatchRun object, which summarizes statistics related to the ingest and its outcome (success or failure).
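The validate, ingest, and verify sequence might look roughly like the following. This is a minimal sketch under several assumptions: `DatastreamSpec`, `ObjectSpec`, `BatchProcessorSketch`, and `FakeRepository` are hypothetical names, the in-memory repository stands in for Fedora, SHA-256 is chosen arbitrarily as the checksum algorithm, and rejecting the whole batch when any object fails validation is a guess at the intended behavior.

```ruby
require "digest"
require "tempfile"

# Hypothetical stand-ins; the real models are ActiveRecord classes.
DatastreamSpec = Struct.new(:name, :path, :checksum)
ObjectSpec = Struct.new(:model, :label, :datastreams, :pid)
BatchRun = Struct.new(:successes, :failures, :details) do
  def outcome
    failures.zero? ? :success : :failure
  end
end

class BatchProcessorSketch
  # repository: anything responding to #ingest(object) => pid and
  # #checksum_for(pid, datastream_name) => hex digest.
  def initialize(repository)
    @repository = repository
  end

  def process(batch)
    invalid = batch.reject { |obj| valid?(obj) }
    # Assumption: an invalid object stops the batch before anything is ingested.
    return BatchRun.new(0, invalid.size, ["batch failed validation"]) unless invalid.empty?

    run = BatchRun.new(0, 0, [])
    batch.each do |obj|
      obj.pid = @repository.ingest(obj)
      if verified?(obj)
        run.successes += 1
      else
        run.failures += 1
        run.details << "#{obj.pid}: checksum mismatch"
      end
    end
    run
  end

  private

  def valid?(obj)
    !obj.model.to_s.empty? && obj.datastreams.all? { |ds| File.readable?(ds.path) }
  end

  # A datastream with no external checksum is not checksum-verified.
  def verified?(obj)
    obj.datastreams.all? do |ds|
      ds.checksum.nil? || ds.checksum == @repository.checksum_for(obj.pid, ds.name)
    end
  end
end

# In-memory fake used only to make the sketch runnable without Fedora.
class FakeRepository
  def initialize
    @store = {}
    @next_id = 0
  end

  def ingest(obj)
    pid = "test:#{@next_id += 1}"
    obj.datastreams.each { |ds| @store[[pid, ds.name]] = Digest::SHA256.file(ds.path).hexdigest }
    pid
  end

  def checksum_for(pid, name)
    @store.fetch([pid, name])
  end
end

Tempfile.create("master") do |file|
  file.write("example bytes")
  file.flush
  good = Digest::SHA256.file(file.path).hexdigest
  batch = [
    ObjectSpec.new("Component", "first", [DatastreamSpec.new("content", file.path, good)], nil),
    ObjectSpec.new("Component", "second", [DatastreamSpec.new("content", file.path, "not-the-checksum")], nil)
  ]
  run = BatchProcessorSketch.new(FakeRepository.new).process(batch)
  puts "#{run.successes} verified, #{run.failures} failed, outcome: #{run.outcome}"
end
```

Passing the repository in as a collaborator keeps the verification logic testable without a running Fedora instance.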

Ingest Manifests

The normalized "ingest batch objects" described in the section above can be created in any number of ways. This section describes the approach we are taking with our legacy digitized collections. These collections have a good deal in common in terms of where the digital content files are stored, how they are named, where the descriptive and technical metadata reside, and what format the metadata uses. There are also differences between collections, and even within a given collection, due to practices that evolved over a number of years, outsourcing, and so on.

We decided to use the concept of a "manifest" to describe a group of related legacy content and its metadata. A manifest is a YAML file with two primary sections: a "manifest-level" section that applies to all the objects covered by the manifest, and an "objects" section that enumerates those objects and allows any or all of the manifest-level settings to be overridden for a particular object. The keys available for use in a manifest YAML file are documented on the Batch Ingest Manifest File (Revised) page.
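The override behavior can be illustrated with a small sketch. The key names used here (`model`, `checksum_type`, `identifier`) and the `effective_object_settings` helper are placeholders invented for the example; only the `objects` section name comes from the description above, and the real keys are the ones documented on the manifest file page.

```ruby
require "yaml"

# Hypothetical manifest; key names are placeholders, not the documented keys.
MANIFEST_YAML = <<~YAML
  model: Component
  checksum_type: SHA-256
  objects:
    - identifier: item001
    - identifier: item002
      model: Target
YAML

# Returns one settings hash per object: the manifest-level values, overridden
# by any object-level value that uses the same key.
def effective_object_settings(manifest)
  defaults = manifest.reject { |key, _value| key == "objects" }
  manifest.fetch("objects", []).map { |object| defaults.merge(object) }
end

manifest = YAML.safe_load(MANIFEST_YAML)
effective_object_settings(manifest).each { |settings| puts settings.inspect }
```

Here the first object inherits the manifest-level `model`, while the second replaces it. Note that `Hash#merge` is shallow, so a nested object-level value would replace the whole manifest-level value rather than being merged into it.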

Manifest and ManifestObject objects provide an object-oriented representation of the information in a manifest YAML file. The Manifest is processed by a ManifestProcessor to produce Ingest Batch Objects and add them to a new or existing Batch.
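One possible shape for these classes is sketched below. The class internals, the `[]` lookup methods, the `name`, `identifier`, `model`, and `label` keys, and the `BatchStub` and `IngestBatchObjectStub` structs are all assumptions for illustration; the actual classes work with ActiveRecord models and handle far more detail.

```ruby
# Hypothetical sketch; internals and key names are assumptions.
class ManifestObject
  def initialize(object_hash, manifest)
    @object_hash = object_hash
    @manifest = manifest
  end

  # Object-level value if present, otherwise the manifest-level value.
  def [](key)
    @object_hash.fetch(key) { @manifest[key] }
  end
end

class Manifest
  def initialize(manifest_hash)
    @manifest_hash = manifest_hash
  end

  def [](key)
    @manifest_hash[key]
  end

  def objects
    @manifest_hash.fetch("objects", []).map { |hash| ManifestObject.new(hash, self) }
  end
end

# Stand-ins for the ActiveRecord Batch and IngestBatchObject models.
BatchStub = Struct.new(:name, :batch_objects)
IngestBatchObjectStub = Struct.new(:identifier, :model, :label)

class ManifestProcessor
  def initialize(manifest)
    @manifest = manifest
  end

  # Appends to the given batch, or starts a new one when none is supplied.
  def process(batch = nil)
    batch ||= BatchStub.new(@manifest["name"], [])
    @manifest.objects.each do |object|
      batch.batch_objects << IngestBatchObjectStub.new(
        object["identifier"], object["model"], object["label"]
      )
    end
    batch
  end
end

manifest = Manifest.new(
  "name" => "Example batch",
  "model" => "Component",
  "objects" => [{ "identifier" => "item001" }, { "identifier" => "item002", "model" => "Target" }]
)
batch = ManifestProcessor.new(manifest).process
puts batch.batch_objects.map(&:model).inspect
```

Accepting an optional batch argument is one way to let several manifests contribute objects to the same Batch.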

Our Ingest Manifest implementation contains much that is specific to our particular context, including certain assumptions that likely will not hold true in all situations. We would be glad to work with others to make this part of the batch ingest process more general, but we recognize that some aspects of operations involving pre-existing content will likely always be institution-specific.

Also note that nothing in the batch ingest processing described in the "Ingest Batch Objects" section above depends on the ingest batch objects having been created from an ingest manifest as described in this section. They can be created in any way desired. Ingest manifests are just one way we plan to create ingest batch objects for several categories of content.
