Aspire 3.2 introduces a new way of crawling repositories (extracting documents), using a framework that distributes the scanning and processing across any number of Aspire servers by means of a document queue stored in MongoDB.
We call items the logical units of information that can be extracted individually from a repository: documents, folders, comments, and so on.
The principle behind this architecture is the hierarchical structure of most repositories, in which some items can have children (attachments, sub-files, etc.). This leads to classifying any item as one of two types:

Container items: items that can have children, and therefore need to be scanned for them.
Simple items: items without children, which only need their metadata and content extracted.

From these two types of items, we derive two types of processes:

Scan: discovers the children of a container item.
Populate & Fetch: extracts the metadata and content of an item.
Every time the Scan discovers a new item, the item is enqueued in one of the MongoDB queues, to be either scanned or populated by any Aspire server connected to the same NoSQL database.
The following diagram shows the flow of items through a connector crawl. Note that both Scan and Populate & Fetch run in multiple threads, and that more than one Aspire server can take part in the same crawl in parallel.
As shown in the diagram, once an item's metadata has been populated and its content fetched, the item is sent to the Workflow, where it can be processed by custom rules or published to a search engine.
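The scan/populate cycle described above can be sketched as a simplified, in-memory simulation. This is a minimal illustration in plain Python, with deques standing in for the MongoDB queues; the tree, names, and queue layout here are illustrative, not Aspire's actual schema:

```python
from collections import deque

# Illustrative item tree: keys are container items, values are their children.
TREE = {
    "/root": ["/root/a", "/root/b"],
    "/root/a": ["/root/a/1.txt", "/root/a/2.txt"],
    "/root/b": [],
}

def crawl(root):
    scan_queue = deque([root])       # stands in for the "to be scanned" queue
    process_queue = deque([root])    # stands in for the "to be populated" queue
    published = []

    while scan_queue or process_queue:
        # Scan: discover the children of a container item and enqueue them.
        if scan_queue:
            item = scan_queue.popleft()
            for child in TREE.get(item, []):
                if child in TREE:               # container -> needs its own scan
                    scan_queue.append(child)
                process_queue.append(child)     # every item gets populated

        # Populate & Fetch: extract metadata/content, then hand off to the
        # Workflow (represented here by appending to `published`).
        if process_queue:
            published.append(process_queue.popleft())

    return published

print(crawl("/root"))
```

In the real framework both loops run concurrently in multiple threads on multiple servers, which is why the queues live in MongoDB rather than in process memory.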
Each content source installed in Aspire creates its own database in MongoDB, named after the System Name that Aspire assigns to it.
To determine this name, click the content source name to enter its configuration page, and look for the System Name field in the General section.
If you want to run a content source on more than one Aspire server, make sure it has the same System Name on every server.
Each database holds several collections used during the crawl:
audit
Holds an entry for every action performed on each processed item.
This is an example of a document in this collection:
{
  "_id" : ObjectId("571f94498cd956261c112156"),
  "id" : "C:\\dev-temp\\testdata\\A\\0\\0\\0\\3.txt",
  "crawlStart" : NumberLong(1461687363561),
  "url" : "file://C:/dev-temp/testdata/A/0/0/0/3.txt",
  "type" : "job",
  "action" : "ADD",
  "batch" : null,
  "ts" : NumberLong(1461687366086)
}
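The crawlStart and ts fields are NumberLong epoch timestamps in milliseconds. For example, they can be decoded in Python like this (using the crawlStart value from the audit document above):

```python
from datetime import datetime, timezone

# "crawlStart" from the audit document, in milliseconds since the epoch.
crawl_start_ms = 1461687363561

# Convert to seconds and build a timezone-aware UTC datetime.
crawl_start = datetime.fromtimestamp(crawl_start_ms / 1000, tz=timezone.utc)
print(crawl_start.isoformat())  # → 2016-04-26T16:16:03.561000+00:00
```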
errors
Holds all errors raised during the crawls.
This is an example of a document in this collection:
{
  "_id" : ObjectId("571f940e8cd956261c112151"),
  "error" : {
    "@time" : NumberLong(1461687310975),
    "@crawlTime" : NumberLong(1461686819675),
    "@cs" : "File_System_Source",
    "@processor" : "File_System_Source-10.10.20.203:50505",
    "@type" : "S",
    "_$" : "Error starting crawl\ncom.searchtechnologies.aspire.services.AspireException: Bad 'exclude' regex pattern: C:\\dev-temp\\testdata\\A ..."
  }
}
hierarchy
Holds the parent hierarchy for all the container items. This is used to generate the item hierarchy in the Populate & Fetch stage.
This is an example of a document in this collection:
{
  "_id" : "C:\\dev-temp\\testdata\\B\\5\\9\\3",
  "name" : "3",
  "ancestors" : {
    "_id" : "C:\\dev-temp\\testdata\\B\\5\\9",
    "name" : "9",
    "ancestors" : {
      "_id" : "C:\\dev-temp\\testdata\\B\\5",
      "name" : "5",
      "ancestors" : {
        "_id" : "C:\\dev-temp\\testdata\\B",
        "name" : "B",
        "ancestors" : {
          "_id" : "C:\\dev-temp\\testdata",
          "name" : "testdata",
          "ancestors" : null
        }
      }
    }
  }
}
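The nested ancestors chain can be walked to reconstruct an item's path from the root down to the item itself. A minimal sketch in Python, using the hierarchy document above (the helper function name is ours, not part of Aspire):

```python
def hierarchy_path(doc):
    """Collect the 'name' fields from a hierarchy document and its nested
    'ancestors' chain, ordered from the root down to the item."""
    names = []
    while doc is not None:
        names.append(doc["name"])
        doc = doc["ancestors"]
    return list(reversed(names))

# The hierarchy document shown above, as a Python dict.
doc = {
    "_id": "C:\\dev-temp\\testdata\\B\\5\\9\\3", "name": "3",
    "ancestors": {
        "_id": "C:\\dev-temp\\testdata\\B\\5\\9", "name": "9",
        "ancestors": {
            "_id": "C:\\dev-temp\\testdata\\B\\5", "name": "5",
            "ancestors": {
                "_id": "C:\\dev-temp\\testdata\\B", "name": "B",
                "ancestors": {
                    "_id": "C:\\dev-temp\\testdata", "name": "testdata",
                    "ancestors": None,
                },
            },
        },
    },
}

print(hierarchy_path(doc))  # → ['testdata', 'B', '5', '9', '3']
```

This is the same information the Populate & Fetch stage uses to attach the item hierarchy to each document.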