Staging Repositories


A new staging repository (based on Node.js) now exists; contact Search Technologies for more details. The staging repository described on this page will be deprecated. For the new Staging Repository, see StageR - NodeJS.

Background

In a typical Aspire installation, connectors acquire content from content repositories (for example SharePoint, Documentum or file systems). The content is then processed and published to a search engine. By default, the connector does not write a copy of the content to local disk. It is assumed that the content will eventually be passed to a publisher, and when the job is finished everything is thrown away, including the metadata and any content. As such, the original content from the repository is never stored on the Aspire server.


Processing Challenges and Solution

If a crawl takes a long time, because the repository is either slow or large, recrawling can be too expensive. Yet because no local copy is kept, any need to reprocess the content means fetching it from the repository again.

Similarly, Aspire may be used in “big-data” installations where data collection and processing must be separated in order to allow “merging” of data from multiple different sources, or to allow multiple, different types of processing to be applied to the same collected data.

A solution to this problem is to

  1. Split the scanning and processing phases of the connector and 
  2. Store the content between phases.

Benefit: once the content is stored locally on disk, it can be processed as many times as required, without fetching it from the source repository again.
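
The split between scanning and processing can be illustrated with a minimal sketch. It is not Aspire code: the class names (TwoPhaseSketch, SlowSourceSketch), the temporary directory and the upper-casing "processing" step are all assumptions made for illustration. The point is only that the source is fetched once, while the staged copy can be processed repeatedly.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical two-phase flow: stage once, process many times. */
class TwoPhaseSketch {
    /** Creates a temporary directory that plays the role of the staging area. */
    static Path newStagingDir() {
        try {
            return Files.createTempDirectory("staging-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Phase 1 (scan): fetch every item once and write it to local disk. */
    static void stage(SlowSourceSketch source, Path stagingDir) {
        for (String id : source.ids()) {
            try {
                byte[] bytes = source.fetch(id).getBytes(StandardCharsets.UTF_8);
                Files.write(stagingDir.resolve(id + ".txt"), bytes);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Phase 2 (process): read only the staged copies, never the source. */
    static List<String> process(Path stagingDir) {
        try (Stream<Path> files = Files.list(stagingDir)) {
            List<String> results = new ArrayList<>();
            for (Path file : files.sorted().collect(Collectors.toList())) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                results.add(text.toUpperCase(Locale.ROOT)); // placeholder for real processing
            }
            return results;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SlowSourceSketch source = new SlowSourceSketch();
        Path dir = newStagingDir();
        stage(source, dir);
        System.out.println(process(dir));
        System.out.println(process(dir)); // reprocessing does not touch the source
        System.out.println("source fetches: " + source.fetchCount);
    }
}

/** Hypothetical stand-in for a slow or large content repository. */
class SlowSourceSketch {
    int fetchCount = 0; // how often the "expensive" repository was hit
    private final Map<String, String> docs = new TreeMap<>();

    SlowSourceSketch() {
        docs.put("doc1", "hello world");
        docs.put("doc2", "staging demo");
    }

    List<String> ids() {
        return new ArrayList<>(docs.keySet());
    }

    String fetch(String id) {
        fetchCount++;
        return docs.get(id);
    }
}
```

In this sketch the fetch counter stays at the number of documents however often process is called, which is the property the staging approach is meant to provide.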

The Staging Repository

Aspire supports a “Staging Repository” – an intermediate repository where content can be placed, after it has been extracted from the content repository, and before any processing has occurred.

The “Staging Repository” architecture is shown below:


[Figure: Staging Repository architecture (File Staging arch.PNG)]

Since this is a two-phase process, at least two content sources are configured.

  1. The first content source uses a connector configured to crawl a “source” repository. Rather than publishing to a search engine, the Aspire workflow publishes to a “Staging Repository”. This can also be done through a message queue.
  2. The second content source uses the Staging Repository connector, which crawls the “Staging Repository”, performs any required processing via workflow rules, and publishes the content to the search engine or to another destination such as Hadoop HDFS.

There is also the "Background processing" case, where:

  • items from the staging repository are crawled using the Staging Repository connector,
  • content processing is performed to enhance the content, and
  • the enhanced content is published back to the same staging repository using the Staging Repository Publisher.


The Repository Access Layer

In order to provide an extensible architecture, the Staging Repository is implemented as a library. This library holds all of the functionality required of the Staging Repository (store, get, etc.), but does not provide the Aspire functionality (the Aspire components that publish to it or crawl from it).

Thus, when a new type of Staging Repository is required:

  • A new Access Layer is implemented for the new repository type.
  • The new Access Layer can then be integrated with Aspire to provide the connectors and publishers required to use the repository within Aspire.
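
The separation described above can be sketched as follows. This is an illustration under assumptions, not the actual library: the interface RepositoryAccessSketch, its three methods, and the PublisherSketch and ConnectorSketch classes are hypothetical names. The Aspire-side classes depend only on the interface, so a new repository type amounts to one new implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class AccessLayerDemo {
    public static void main(String[] args) {
        // Swapping the implementation requires no change to publisher or connector.
        RepositoryAccessSketch repo = new FileAccessSketch();
        new PublisherSketch(repo).publish("doc1", "hello");
        System.out.println(new ConnectorSketch(repo).crawl());
    }
}

/** Hypothetical access-layer contract: storage operations only, nothing Aspire-specific. */
interface RepositoryAccessSketch {
    void store(String id, String content);
    String get(String id);
    List<String> listIds(); // ids in sorted order
}

/** One repository type: everything held in memory. */
class InMemoryAccessSketch implements RepositoryAccessSketch {
    private final Map<String, String> items = new ConcurrentSkipListMap<>();
    public void store(String id, String content) { items.put(id, content); }
    public String get(String id) { return items.get(id); }
    public List<String> listIds() { return new ArrayList<>(items.keySet()); }
}

/** Another repository type: one file per item in a temporary directory. */
class FileAccessSketch implements RepositoryAccessSketch {
    private final Path dir = newTempDir();

    private static Path newTempDir() {
        try {
            return Files.createTempDirectory("access-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    public void store(String id, String content) {
        try {
            Files.write(dir.resolve(id), content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    public String get(String id) {
        try {
            return new String(Files.readAllBytes(dir.resolve(id)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    public List<String> listIds() {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Stand-in for an Aspire publisher: knows only the interface. */
class PublisherSketch {
    private final RepositoryAccessSketch repo;
    PublisherSketch(RepositoryAccessSketch repo) { this.repo = repo; }
    void publish(String id, String content) { repo.store(id, content); }
}

/** Stand-in for an Aspire connector: knows only the interface. */
class ConnectorSketch {
    private final RepositoryAccessSketch repo;
    ConnectorSketch(RepositoryAccessSketch repo) { this.repo = repo; }
    List<String> crawl() {
        List<String> out = new ArrayList<>();
        for (String id : repo.listIds()) out.add(id + "=" + repo.get(id));
        return out;
    }
}
```

The design choice illustrated here is dependency on an interface rather than on a storage technology; the file-based variant assumes ids are valid file names.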


Staging Repository Functionality

Any staging repository must support:

  • Multiple content sources, keeping the data separate
  • Multiple “owners” of data within a single content source
    • where an owner is some processing unit that has produced content.
    • For example, the connector to the original content source would be one owner, and a background processor that has performed OCR on the original content would be another owner.
  • Publishing: storing metadata and a content stream for a given item, based on content source, owner and id
  • Retrieving metadata and/or content for a specific item, based on content source, owner and id
  • Removing items from the store, based on content source, owner and id
  • Clearing the repository
    • for a particular content source
    • for a particular owner and content source
  • Clearing the entire repository
  • Multiple concurrent reads and writes
  • Return of an iterator of all content
    • in the repository
    • in the repository for a specific content source
    • in the repository for a specific content source and owner
  • Return of an iterator of all updated content
    • in the repository
    • in the repository for a specific content source
    • in the repository for a specific content source and owner
  • Optionally
    • encrypting the data stored in the repository
    • compressing the data stored in the repository


This functionality is provided via a Java interface.
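
To make the list above more concrete, here is a minimal sketch of what such an interface might look like, with an in-memory implementation. The actual Java interface is not reproduced on this page, so every name and signature below (StagingRepositorySketch, put, getMetadata, iterate and so on) is an assumption. Iterators over updated content, encryption and compression are left out, and the key scheme assumes that content source and owner names contain no "/".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

class StagingInterfaceDemo {
    public static void main(String[] args) {
        StagingRepositorySketch repo = new InMemoryStagingSketch();
        byte[] raw = "raw bytes".getBytes(StandardCharsets.UTF_8);
        byte[] ocr = "recognised text".getBytes(StandardCharsets.UTF_8);
        // Same item, two owners: the original connector and an OCR background processor.
        repo.put("sharepoint", "connector", "doc1", "{\"title\":\"Report\"}", raw);
        repo.put("sharepoint", "ocr", "doc1", "{\"title\":\"Report\"}", ocr);
        Iterator<String> it = repo.iterate("sharepoint");
        while (it.hasNext()) System.out.println(it.next());
    }
}

/** Hypothetical contract; items are addressed by content source, owner and id. */
interface StagingRepositorySketch {
    void put(String contentSource, String owner, String id, String metadata, byte[] content);
    String getMetadata(String contentSource, String owner, String id);
    byte[] getContent(String contentSource, String owner, String id);
    void remove(String contentSource, String owner, String id);
    void clear(String contentSource);
    void clear(String contentSource, String owner);
    void clearAll();
    Iterator<String> iterate();
    Iterator<String> iterate(String contentSource);
    Iterator<String> iterate(String contentSource, String owner);
}

/** In-memory implementation; a concurrent map allows concurrent reads and writes. */
class InMemoryStagingSketch implements StagingRepositorySketch {
    private static final class Item {
        final String metadata;
        final byte[] content;
        Item(String metadata, byte[] content) {
            this.metadata = metadata;
            this.content = content.clone(); // defensive copy
        }
    }

    private final ConcurrentSkipListMap<String, Item> items = new ConcurrentSkipListMap<>();

    private static String key(String cs, String owner, String id) {
        return cs + "/" + owner + "/" + id;
    }

    public void put(String cs, String owner, String id, String metadata, byte[] content) {
        items.put(key(cs, owner, id), new Item(metadata, content));
    }
    public String getMetadata(String cs, String owner, String id) {
        Item item = items.get(key(cs, owner, id));
        return item == null ? null : item.metadata;
    }
    public byte[] getContent(String cs, String owner, String id) {
        Item item = items.get(key(cs, owner, id));
        return item == null ? null : item.content.clone();
    }
    public void remove(String cs, String owner, String id) { items.remove(key(cs, owner, id)); }
    public void clear(String cs) { removePrefix(cs + "/"); }
    public void clear(String cs, String owner) { removePrefix(cs + "/" + owner + "/"); }
    public void clearAll() { items.clear(); }
    public Iterator<String> iterate() { return keysWithPrefix("").iterator(); }
    public Iterator<String> iterate(String cs) { return keysWithPrefix(cs + "/").iterator(); }
    public Iterator<String> iterate(String cs, String owner) {
        return keysWithPrefix(cs + "/" + owner + "/").iterator();
    }

    /** Returns a snapshot of the matching keys, in sorted order. */
    private List<String> keysWithPrefix(String prefix) {
        List<String> keys = new ArrayList<>();
        for (String k : items.keySet()) {
            if (k.startsWith(prefix)) keys.add(k);
        }
        return keys;
    }

    private void removePrefix(String prefix) {
        items.keySet().removeIf(k -> k.startsWith(prefix));
    }
}
```

The iterators in this sketch return item keys rather than the items themselves; a real implementation would more likely return item objects and stream content instead of holding byte arrays in memory.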


Types of Staging Repositories

Currently, the following types of staging repositories are supported:

  • File system


In the future, other types of staging repositories may be developed:

  • JCR
  • Amazon S3
  • Hadoop’s HDFS


Additional Information

Developer Information

See here for information on developing new staging repositories.

Documentation in Microsoft Word Format

The complete documentation for Staging Repositories can be found in this document.

Training Material

If you're interested in learning more, here's a recording of the Tech Talk on the Staging repository along with the presentation.
