
Staging Repositories

Note: A new staging repository (based on Node.js) now exists. The staging repository described on this page will be deprecated.

Background

In a typical Aspire installation, connectors acquire content from content repositories (for example SharePoint, Documentum or file systems). The content is then processed and published to a search engine. By default, the connector does not write a copy of the content to local disk. It is assumed that the content will eventually be passed to a publisher, and when the job finishes everything is discarded, including the metadata and any content. As such, the original content from the repository is never stored on the Aspire server.

A possible problem with this approach arises when a crawl takes a long time. Whether the cause is the speed of the repository or its size, recrawling can be “too expensive”. If we need to “re-process” the content, we must first re-fetch it from the repository. In addition, Aspire may be used in “big-data” installations where data collection and processing must be separated in order to allow “merging” of data from multiple different sources, or to allow multiple, different types of processing to be applied to the same collected data.

The solution to this problem is to split the scanning and processing phases of the connector and store the content between phases. This will allow reprocessing of the content without the need to re-fetch the content from the repository. Once the content is stored locally on disk, it will be possible to process the content as many times as required, without reference back to the source repository.

The Staging Repository

For this purpose, Aspire supports a “Staging Repository” – an intermediate repository where content can be placed, after it has been extracted from the content repository, but before any processing has occurred. The “Staging Repository” architecture is shown below:

[Figure: Staging Repository architecture (File Staging arch.PNG)]

Because this is a two-phase process, two or more content sources are configured. The first content source uses a connector configured to crawl a “source” repository. Rather than publishing to a search engine, the Aspire workflow in this instance publishes to a “Staging Repository”. This can also be done through a message queue.

The second content source is configured with the Staging Repository connector, which crawls the “staging repository”, performs any required processing via workflow rules, and publishes the content to the search engine or another destination such as Hadoop HDFS.

There is also the “Background processing” case, in which items are crawled from the staging repository using the Staging Repository connector, content processing is performed to enhance the content, and the result is published back to the same staging repository using the Staging Repository Publisher.
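The two-phase flow can be illustrated with a plain directory on disk standing in for the staging repository. This is only a minimal sketch: the class `TwoPhaseSketch`, its methods `stage` and `process`, and the one-directory-per-content-source layout are all assumptions made for illustration and are not part of Aspire. The point it shows is that once phase one has written the content locally, phase two can run any number of times without going back to the source.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; names and directory layout are assumptions.
final class TwoPhaseSketch {

    // Phase 1: instead of publishing to a search engine, write the fetched
    // content into the staging directory, one sub-directory per content source.
    static void stage(Path stagingDir, String contentSource, String id, String content)
            throws IOException {
        Path dir = Files.createDirectories(stagingDir.resolve(contentSource));
        Files.write(dir.resolve(id), content.getBytes(StandardCharsets.UTF_8));
    }

    // Phase 2: "crawl" the staging directory and process every staged item.
    // Upper-casing stands in for whatever real processing would be applied.
    static List<String> process(Path stagingDir, String contentSource) throws IOException {
        List<String> results = new ArrayList<>();
        try (DirectoryStream<Path> items =
                 Files.newDirectoryStream(stagingDir.resolve(contentSource))) {
            for (Path item : items) {
                String text = new String(Files.readAllBytes(item), StandardCharsets.UTF_8);
                results.add(item.getFileName() + ":" + text.toUpperCase(Locale.ROOT));
            }
        }
        Collections.sort(results); // directory iteration order is not guaranteed
        return results;
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging-sketch");
        stage(staging, "fileshare", "doc-1", "first document");
        stage(staging, "fileshare", "doc-2", "second document");

        System.out.println(process(staging, "fileshare"));
        // Reprocessing reads from the staging directory again; nothing is re-fetched.
        System.out.println(process(staging, "fileshare"));
    }
}
```

In a real installation the two phases would be separate content sources, and the second could publish to a search engine or back into the staging repository, as in the background processing case.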

The Repository Access Layer

In order to provide an extensible architecture, the Staging Repository is implemented as a library. This library holds all of the functionality required of the Staging Repository (store, get, etc.), but does not provide the Aspire functionality (i.e. the Aspire components that publish to it or crawl from it). Thus, when a new type of Staging Repository is required, a new Access Layer is implemented for the new repository type. This can then be easily integrated with Aspire to provide the connectors and publishers required to allow the repository to be used within Aspire.

Staging Repository Functionality

Any staging repository must support:

  • Multiple content sources, keeping the data separate
  • Multiple “owners” of data within a single content source
    • An owner is a processing unit that has produced content. For example, the connector to the original content source would be one owner, and a background processor that has performed OCR on the original content would be another owner
  • Publishing to the repository, storing metadata and a content stream for a given item based on content source, owner and id
  • Retrieving metadata and/or content for a specific item based on content source, owner and id
  • Removing items from the store based on content source, owner and id
  • Clearing the repository for a particular content source
  • Clearing the repository for a particular owner and content source
  • Clearing the entire repository
  • Multiple concurrent reads and writes
  • Return of an iterator of all content in the repository
  • Return of an iterator of all content in the repository for a specific content source
  • Return of an iterator of all content in the repository for a specific content source and owner
  • Return of an iterator of all updated content in the repository
  • Return of an iterator of all updated content in the repository for a specific content source
  • Return of an iterator of all updated content in the repository for a specific content source and owner
  • Optionally encrypting the data stored in the repository
  • Optionally compressing the data stored in the repository

This functionality is provided via a Java interface.
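To make the list above more concrete, the following is a minimal sketch of what such an interface could look like, together with an in-memory stand-in. It is not the actual Aspire interface: `StagedItem`, `StagingRepositorySketch`, `InMemoryStagingRepository` and every method name are hypothetical. The sketch treats a null content source or owner as a wildcard (an assumption), and it leaves out the "updated content" iterators, encryption and compression.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// All names here are hypothetical; this is not the actual Aspire interface.
final class StagedItem {
    final String contentSource, owner, id;
    final Map<String, String> metadata;
    final byte[] content;

    StagedItem(String contentSource, String owner, String id,
               Map<String, String> metadata, byte[] content) {
        this.contentSource = contentSource;
        this.owner = owner;
        this.id = id;
        this.metadata = metadata;
        this.content = content;
    }
}

interface StagingRepositorySketch {
    void store(StagedItem item);
    StagedItem get(String contentSource, String owner, String id);
    void remove(String contentSource, String owner, String id);
    // Assumption: null acts as a wildcard, so clear(null, null) empties the repository.
    void clear(String contentSource, String owner);
    Iterator<StagedItem> iterate(String contentSource, String owner);
}

// In-memory stand-in for a real backend such as the file system.
final class InMemoryStagingRepository implements StagingRepositorySketch {
    // A concurrent map allows multiple simultaneous readers and writers.
    private final ConcurrentMap<List<String>, StagedItem> items = new ConcurrentHashMap<>();

    private static List<String> key(String contentSource, String owner, String id) {
        return Arrays.asList(contentSource, owner, id);
    }

    private static boolean matches(StagedItem item, String contentSource, String owner) {
        return (contentSource == null || contentSource.equals(item.contentSource))
            && (owner == null || owner.equals(item.owner));
    }

    @Override public void store(StagedItem item) {
        items.put(key(item.contentSource, item.owner, item.id), item);
    }

    @Override public StagedItem get(String contentSource, String owner, String id) {
        return items.get(key(contentSource, owner, id));
    }

    @Override public void remove(String contentSource, String owner, String id) {
        items.remove(key(contentSource, owner, id));
    }

    @Override public void clear(String contentSource, String owner) {
        items.values().removeIf(item -> matches(item, contentSource, owner));
    }

    @Override public Iterator<StagedItem> iterate(String contentSource, String owner) {
        // Iterate over a snapshot so callers are unaffected by concurrent writes.
        List<StagedItem> snapshot = new ArrayList<>();
        for (StagedItem item : items.values()) {
            if (matches(item, contentSource, owner)) snapshot.add(item);
        }
        return snapshot.iterator();
    }
}

final class StagingRepositoryDemo {
    static int count(Iterator<StagedItem> it) {
        int n = 0;
        while (it.hasNext()) { it.next(); n++; }
        return n;
    }

    public static void main(String[] args) {
        StagingRepositorySketch repo = new InMemoryStagingRepository();
        byte[] scan = "scanned page".getBytes(StandardCharsets.UTF_8);
        byte[] text = "recognised text".getBytes(StandardCharsets.UTF_8);
        // The same item id stored by two owners: the connector and an OCR processor.
        repo.store(new StagedItem("sharepoint", "connector", "doc-1",
                Collections.singletonMap("type", "tiff"), scan));
        repo.store(new StagedItem("sharepoint", "ocr", "doc-1",
                Collections.singletonMap("type", "text"), text));
        System.out.println(count(repo.iterate("sharepoint", null))); // both owners
        repo.clear("sharepoint", "ocr");                             // drop only the OCR output
        System.out.println(count(repo.iterate(null, null)));         // connector copy remains
    }
}
```

Keying every item on content source, owner and id is what keeps data from different sources and different processing units separate, and it lets the clear and iterate operations be scoped to a source, to a source and owner, or to the whole repository.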

Types of Staging Repository

Currently, the following types of staging repository are supported:

  • File system

In the future, other types of staging repository may be developed:

  • JCR
  • Amazon S3
  • Hadoop’s HDFS

Additional information

Developer information

See here for information on developing new staging repositories.

Documentation in Microsoft Word format

The complete documentation for Staging Repositories can be found in this document.

Training Material

If you're interested in learning more, here's a recording of the Tech Talk on the Staging repository along with the presentation.



