Staging Repositories
A new staging repository (based on Node.js) now exists. Contact Search Technologies for more details. The staging repository described on this page will be deprecated. For the new Staging Repository, see StageR - NodeJS.
Background
In a typical Aspire installation, connectors acquire content from content repositories (for example SharePoint, Documentum, or file systems). The content is then processed and published to a search engine. By default, the connector does not write a copy of the content to local disk. The content is assumed to be passed on to a publisher eventually, and when the job finishes it is discarded, including the metadata and any content. As a result, the original content from the repository is never stored on the Aspire server.
Processing Challenges and Solution
If a crawl takes a long time, because the repository is either slow or large, recrawling can be too expensive. Yet without a local copy, any “re-processing” of the content requires re-fetching it from the repository.
Similarly, Aspire may be used in “big-data” installations where data collection and processing must be separated, either to allow “merging” of data from multiple sources or to allow several different types of processing to be applied to the same collected data.
A solution to these problems is to:
- Split the scanning and processing phases of the connector, and
- Store the content between phases.
Benefit: the content can be reprocessed without re-fetching it from the repository.
Once the content is stored locally on disk, it can be processed as many times as required, without reference back to the source repository.
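The split between scanning and processing can be illustrated with a minimal Java sketch. It is not Aspire code: `SlowSource`, `TwoPhaseSketch`, and their methods are hypothetical names, and an in-memory map stands in for the copy held on local disk. The point it shows is that the source is read once while the stored copy is processed repeatedly.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a slow source repository; counts how often it is read. */
final class SlowSource {
    private final Map<String, String> documents = new LinkedHashMap<>();
    int fetchCount = 0;

    SlowSource() {
        documents.put("doc-1", "first document");
        documents.put("doc-2", "second document");
    }

    Map<String, String> fetchAll() {
        fetchCount++;                               // each call models one expensive crawl
        return new LinkedHashMap<>(documents);
    }
}

final class TwoPhaseSketch {
    /** Phase 1: scan the source once and keep a local copy of every item. */
    static Map<String, String> scan(SlowSource source) {
        return source.fetchAll();
    }

    /** Phase 2: process the stored copy; the source is never touched here. */
    static Map<String, String> process(Map<String, String> staged, Function<String, String> step) {
        Map<String, String> out = new LinkedHashMap<>();
        staged.forEach((id, content) -> out.put(id, step.apply(content)));
        return out;
    }
}
```

Calling `scan` once and then `process` several times with different steps leaves `fetchCount` at 1, which is the behaviour the staging approach is after.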
The Staging Repository
Aspire supports a “Staging Repository”: an intermediate repository where content is placed after it has been extracted from the content repository and before any processing has occurred.
The “Staging Repository” architecture is shown below:
Since this is a two-phase process, at least two content sources are configured.
- The first content source uses a connector configured to crawl a “source” repository. Rather than publishing to a search engine, its Aspire workflow publishes to a “Staging Repository”. This can also be done through a message queue.
- The second content source is configured with the Staging Repository connector, which crawls the “staging repository”, performs any required processing via workflow rules, and publishes the content to the search engine or to another destination such as Hadoop HDFS.
There is also the "Background processing" case, where:
- items from the staging repository are crawled using the Staging Repository connector,
- content processing is performed to enhance the content, and
- the enhanced content is published back to the same staging repository using the Staging Repository Publisher.
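The background processing loop can be sketched in Java as follows. This is an illustration under assumptions, not the Aspire connector or publisher: `BackgroundProcessingSketch` is a hypothetical class, a map stands in for the staging repository, and the word-count enrichment is an invented example of "enhancing" content.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BackgroundProcessingSketch {
    /** In-memory stand-in for the staging repository: item id -> metadata fields. */
    final Map<String, Map<String, String>> staging = new HashMap<>();

    /** Stand-in for the Staging Repository connector: snapshot the ids to crawl. */
    List<String> crawl() {
        return new ArrayList<>(staging.keySet());
    }

    /** Hypothetical enrichment step: add a word count derived from the "content" field. */
    static Map<String, String> enrich(Map<String, String> item) {
        Map<String, String> enriched = new HashMap<>(item);
        String content = item.getOrDefault("content", "").trim();
        int words = content.isEmpty() ? 0 : content.split("\\s+").length;
        enriched.put("wordCount", String.valueOf(words));
        return enriched;
    }

    /** Crawl, enrich, and write each item back under the same id. */
    int runOnce() {
        int processed = 0;
        for (String id : crawl()) {
            // Writing back to the same store stands in for the Staging Repository Publisher.
            staging.put(id, enrich(staging.get(id)));
            processed++;
        }
        return processed;
    }
}
```

Because the enriched item replaces the original under the same id, a later crawl of the staging repository sees the added fields alongside the original ones.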
The Repository Access Layer
In order to provide an extensible architecture, the Staging Repository is implemented as a library. This library holds all of the functionality required of the Staging Repository (store, get, etc.) but does not provide the Aspire functionality (the Aspire components that publish to it or crawl from it).
Thus, when a new type of Staging Repository is required:
- A new Access Layer is implemented for the new repository type.
- The new Access Layer is then integrated with Aspire to provide the connectors and publishers that allow the repository to be used within Aspire.
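The separation between the access layer and the Aspire-facing components can be pictured with a small Java sketch. All names here (`RepositoryAccess`, `InMemoryAccess`, `SketchPublisher`, `SketchConnector`) are hypothetical and do not come from the Aspire code base. The publisher and connector stand-ins depend only on the interface, so supporting a new repository type means adding one more implementation of it.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical access-layer contract: storage only, no Aspire-specific code. */
interface RepositoryAccess {
    void store(String id, byte[] content);
    Optional<byte[]> get(String id);
}

/** One possible access layer; a new repository type would add another class like this. */
final class InMemoryAccess implements RepositoryAccess {
    private final Map<String, byte[]> items = new HashMap<>();

    @Override
    public void store(String id, byte[] content) {
        items.put(id, content.clone());
    }

    @Override
    public Optional<byte[]> get(String id) {
        return Optional.ofNullable(items.get(id)).map(bytes -> bytes.clone());
    }
}

/** Stand-in for the publisher side: written once against the interface. */
final class SketchPublisher {
    private final RepositoryAccess access;

    SketchPublisher(RepositoryAccess access) {
        this.access = access;
    }

    void publish(String id, String text) {
        access.store(id, text.getBytes(StandardCharsets.UTF_8));
    }
}

/** Stand-in for the connector side: also depends only on the interface. */
final class SketchConnector {
    private final RepositoryAccess access;

    SketchConnector(RepositoryAccess access) {
        this.access = access;
    }

    Optional<String> fetch(String id) {
        return access.get(id).map(bytes -> new String(bytes, StandardCharsets.UTF_8));
    }
}
```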
Staging Repository Functionality
Any staging repository must support:
Multiple content sources | keeping the data separate |
Multiple “owners” of data | within a single content source |
Publishing to | stored metadata and a content stream |
Retrieving metadata and/or content | for a specific item based on content source, owner and id |
Removing items | from the store based on content source, owner and id |
Clearing the repository | |
Clearing the entire repository | |
Multiple concurrent reads and writes | |
Return of an iterator of all content | |
Return of an iterator of all updated content | |
Optionally | stored in the repository |
This functionality is provided via a Java interface.
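As an illustration, the capabilities listed above could be expressed in a Java interface along the following lines. This is a sketch only, not Aspire's actual interface: every name and signature is an assumption, `clear(source)` is assumed to empty a single content source, and "updated" content is modelled with a simple version counter. An in-memory implementation is included so the sketch can be run.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical interface shape; names and signatures are assumptions, not Aspire's API. */
interface StagingRepositorySketch {
    /** Stores metadata plus a content stream; returns a version stamp for this write. */
    long publish(String source, String owner, String id, Map<String, String> metadata, InputStream content);
    Map<String, String> getMetadata(String source, String owner, String id);  // null if absent
    InputStream getContent(String source, String owner, String id);           // null if absent
    boolean remove(String source, String owner, String id);
    void clear(String source);  // assumed scope: a single content source
    void clearAll();            // the entire repository
    Iterator<String> ids(String source, String owner);
    Iterator<String> updatedIds(String source, String owner, long afterVersion);
}

/** In-memory implementation, for illustration only. */
final class InMemoryStagingRepository implements StagingRepositorySketch {
    private static final class Item {
        final Map<String, String> metadata;
        final byte[] content;
        final long version;

        Item(Map<String, String> metadata, byte[] content, long version) {
            this.metadata = metadata;
            this.content = content;
            this.version = version;
        }
    }

    // ConcurrentHashMap permits concurrent reads and writes without external locking.
    private final ConcurrentHashMap<List<String>, Item> items = new ConcurrentHashMap<>();
    private final AtomicLong clock = new AtomicLong();

    private static List<String> key(String source, String owner, String id) {
        return Arrays.asList(source, owner, id);  // lists compare by value, so this works as a map key
    }

    @Override
    public long publish(String source, String owner, String id, Map<String, String> metadata, InputStream content) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try {
            for (int n; (n = content.read(chunk)) != -1; ) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        long version = clock.incrementAndGet();
        items.put(key(source, owner, id), new Item(new HashMap<>(metadata), buffer.toByteArray(), version));
        return version;
    }

    @Override
    public Map<String, String> getMetadata(String source, String owner, String id) {
        Item item = items.get(key(source, owner, id));
        return item == null ? null : new HashMap<>(item.metadata);
    }

    @Override
    public InputStream getContent(String source, String owner, String id) {
        Item item = items.get(key(source, owner, id));
        return item == null ? null : new ByteArrayInputStream(item.content);
    }

    @Override
    public boolean remove(String source, String owner, String id) {
        return items.remove(key(source, owner, id)) != null;
    }

    @Override
    public void clear(String source) {
        items.keySet().removeIf(k -> k.get(0).equals(source));
    }

    @Override
    public void clearAll() {
        items.clear();
    }

    @Override
    public Iterator<String> ids(String source, String owner) {
        return updatedIds(source, owner, 0L);  // every stored version is greater than 0
    }

    @Override
    public Iterator<String> updatedIds(String source, String owner, long afterVersion) {
        List<String> result = new ArrayList<>();
        items.forEach((k, item) -> {
            if (k.get(0).equals(source) && k.get(1).equals(owner) && item.version > afterVersion) {
                result.add(k.get(2));
            }
        });
        return result.iterator();
    }
}
```

Keying every item on content source, owner, and id keeps data from different sources and owners separate, and passing the version returned by an earlier `publish` to `updatedIds` yields only the items written after it.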
Types of Staging Repositories
Currently, the following types of staging repositories are supported:
- File system
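To give a feel for what a file-system repository involves, here is a minimal Java sketch. It does not describe how Aspire's file-system staging repository is actually laid out: the class name `FileSystemStagingSketch`, the directory layout `<root>/<contentSource>/<owner>/<id>.bin`, and the method names are all assumptions made for illustration, and metadata storage is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical file-system layout: <root>/<contentSource>/<owner>/<id>.bin */
final class FileSystemStagingSketch {
    private final Path root;

    FileSystemStagingSketch(Path root) {
        this.root = root;
    }

    private Path pathFor(String source, String owner, String id) {
        // Assumes the three names are already safe to use as file names.
        return root.resolve(source).resolve(owner).resolve(id + ".bin");
    }

    /** Writes the content, creating the source and owner directories on demand. */
    void store(String source, String owner, String id, byte[] content) {
        try {
            Path target = pathFor(source, owner, id);
            Files.createDirectories(target.getParent());
            Files.write(target, content);  // creates the file or overwrites an existing one
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the stored bytes, or null when the item is not present. */
    byte[] get(String source, String owner, String id) {
        try {
            Path target = pathFor(source, owner, id);
            return Files.exists(target) ? Files.readAllBytes(target) : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes the item; returns false when there was nothing to delete. */
    boolean remove(String source, String owner, String id) {
        try {
            return Files.deleteIfExists(pathFor(source, owner, id));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Mapping content source and owner to directories is one simple way to keep their data separate on disk; a real implementation would also need to sanitise ids and store metadata next to the content.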
In the future, other types of staging repositories may be developed:
- JCR
- Amazon S3
- Hadoop’s HDFS
Additional Information
Developer Information
See here for information on developing new staging repositories.
Documentation in Microsoft Word Format
The complete documentation for Staging Repositories can be found in this document.
Training Material
If you're interested in learning more, here's a recording of the Tech Talk on the Staging repository along with the presentation.