Aspire 3.2 introduced a new way of crawling repositories (extracting documents): a framework that distributes scanning and processing among any number of Aspire servers, using a document queue stored in MongoDB.

We use the term "items" for the logical units of information that can be extracted individually from a repository. They can be documents, folders, comments, etc.

The principle behind this architecture is the hierarchical structure of most repositories, in which some items can have children (attachments, sub-files, etc.). This leads us to classify every item as one of two types:

  • Container items
    Can hold zero or more sub-items (which can themselves be containers). For example, folders in the FileSystem Connector can contain either more folders or items.
  • Items
    Cannot contain any children; they only need to be populated with metadata and processed.
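The two item types can be pictured with a small data model. This is only a minimal sketch: the `Item` class, its fields, and the `classify` helper are hypothetical names chosen for illustration and are not part of the Aspire API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A logical unit extracted from a repository (hypothetical model)."""
    name: str
    is_container: bool = False
    children: List["Item"] = field(default_factory=list)


def classify(item: Item) -> str:
    """Return the kind of processing the item needs."""
    return "scan" if item.is_container else "populate"


# A folder holding one file and one sub-folder, which holds another file.
root = Item("docs", is_container=True, children=[
    Item("readme.txt"),
    Item("archive", is_container=True, children=[Item("old.txt")]),
])

kinds = {child.name: classify(child) for child in root.children}
print(kinds)
```

Here the sub-folder is routed to scanning because it may hold further children, while the plain file only needs to be populated.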

 

From these two types of items we derive two different types of processes:

  • Scan
    Literally "scans" the container items and discovers more items to be either scanned or just populated.
  • Populate & Fetch
    Gets each item's metadata and fetches its content from the repository.

 

Every time Scan discovers a new item, that item is enqueued in one of the MongoDB queues to be either scanned or populated by any Aspire server connected to the same MongoDB.
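The interplay between the two processes and the queues can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not Aspire's implementation: the in-memory `deque` objects stand in for the MongoDB queues, and `REPOSITORY` and `crawl` are hypothetical names.

```python
from collections import deque

# Hypothetical repository: container name -> names of its direct children.
# Any name that is not a key here is treated as a plain (non-container) item.
REPOSITORY = {
    "root": ["a.txt", "sub"],
    "sub": ["b.txt", "c.txt"],
}


def crawl(start):
    scan_queue = deque([start])   # stand-in for the MongoDB scan queue
    populate_queue = deque()      # stand-in for the MongoDB populate queue
    processed = []

    # Scan: list each container's children and enqueue every discovery.
    while scan_queue:
        container = scan_queue.popleft()
        for child in REPOSITORY[container]:
            if child in REPOSITORY:
                scan_queue.append(child)      # container: scan it later
            else:
                populate_queue.append(child)  # plain item: populate it later

    # Populate & Fetch: attach metadata and content to each queued item.
    while populate_queue:
        name = populate_queue.popleft()
        processed.append({"id": name, "content": f"<content of {name}>"})

    return processed


ids = [doc["id"] for doc in crawl("root")]
print(ids)
```

In the real framework the queues are shared through MongoDB, so the scanning and populating steps can be picked up by different servers rather than run one after the other in a single function.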

 

The following diagram shows the flow of the items in a connector crawl process.

[Diagram: Aspire 3.2 Connector Framework]

Note that both Scan and Populate & Fetch run in multiple threads, and that more than one Aspire server can run each crawl in parallel.
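The effect of several threads (or servers) draining one shared queue can be illustrated as follows. This is a rough sketch under assumptions: Python's in-process `queue.Queue` stands in for the shared MongoDB queue, and `run_workers` and the `server-N` names are hypothetical.

```python
import queue
import threading


def run_workers(items, worker_count=3):
    """Drain a shared queue with several workers; return (worker, item) pairs."""
    work = queue.Queue()
    for item in items:
        work.put(item)

    results = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # nothing left to process
            with lock:
                results.append((name, item))

    threads = [
        threading.Thread(target=worker, args=(f"server-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


pairs = run_workers(["a.txt", "b.txt", "c.txt", "d.txt"])
print(sorted(item for _, item in pairs))
```

Each item is taken by exactly one worker, but which worker handles which item varies from run to run.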

 

As shown in the diagram, the next step after populating the metadata and fetching the content is to send the items to the Workflow, where they are processed by any custom rules or published to a search engine.

MongoDB Databases & Collections

Each content source installed in Aspire creates its own database in MongoDB, named after the System Name that Aspire assigns to the content source. To find this name, click the content source name to open its configuration page and look for the System Name field in the General section.

 

If you want to run a content source on more than one Aspire server, make sure the content sources have the same System Name, so that every server uses the same MongoDB database.

 

Each database holds several collections for its crawl usage:

 
