Aspire 3.2 introduced a new way of crawling repositories (extracting documents): a framework that distributes scanning and processing among any number of Aspire servers through a document queue stored in MongoDB.

Items are the logical units of information that can be extracted individually from a repository: documents, folders, comments, and so on.

The architecture builds on the hierarchical structure of most repositories, in which some items can have children (attachments, sub-files, etc.). This leads to classifying every item as one of two types:

Container items
Can hold zero or more sub-items (which can themselves be containers). For example, folders in the FileSystem Connector can contain either more folders or items.
Items
Cannot contain any children; they only need to be populated with metadata and processed.

From these two types of items we derive two different types of processes:

Scan
Literally "scans" the container items and discovers more items to be either scanned or just populated.
Populate & Fetch
Gets each item's metadata and fetches its content from the repository.

Every time the Scan discovers a new item, the item is enqueued in one of the MongoDB queues, to be either scanned or populated by any Aspire server connected to the same MongoDB.
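The queue-driven flow described above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions, not Aspire's implementation: plain arrays stand in for the MongoDB queues, the `repository` object and its paths are made up, and sending containers only to `scanQueue` and leaf items only to `processQueue` is a simplification (the queue names are borrowed from the collection names listed on this page).

```javascript
// In-memory stand-in for a repository (hypothetical layout, for illustration only).
const repository = {
  "/root": { container: true, children: ["/root/a.txt", "/root/sub"] },
  "/root/a.txt": { container: false },
  "/root/sub": { container: true, children: ["/root/sub/b.txt"] },
  "/root/sub/b.txt": { container: false },
};

// Plain arrays stand in for the MongoDB-backed queues (assumption).
const scanQueue = [];
const processQueue = [];
const processed = [];

// Route a discovered item to the queue that matches its type.
function enqueue(id) {
  (repository[id].container ? scanQueue : processQueue).push(id);
}

// Scan: take one container and enqueue everything it holds.
function scanNext() {
  const id = scanQueue.shift();
  for (const child of repository[id].children) {
    enqueue(child);
  }
}

// Populate & Fetch: take one item and attach metadata and content.
function populateNext() {
  const id = processQueue.shift();
  processed.push({
    id,
    metadata: { name: id.split("/").pop() },
    content: `content of ${id}`,
  });
}

// Drain both queues, starting from a seed container.
function crawl(seed) {
  enqueue(seed);
  while (scanQueue.length > 0 || processQueue.length > 0) {
    if (scanQueue.length > 0) {
      scanNext();
    } else {
      populateNext();
    }
  }
  return processed.map((item) => item.id);
}

const crawledIds = crawl("/root");
console.log(crawledIds);
```

In the real framework the two loops run concurrently, in several threads and possibly on several servers; the single loop here only shows how scanning one container feeds both queues.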

The following diagram shows the flow of the items in a connector crawl process.

Figure: Aspire 3.2 Connector Framework

Info
Both Scan and Populate & Fetch run in multiple threads, and more than one Aspire server can run each crawl in parallel.

As shown in the diagram, the next step after populating the metadata and fetching the content is to send the items to the Workflow, where they are processed by any custom rules or published to a search engine.

MongoDB Databases & Collections

Each content source installed in Aspire creates its own database in MongoDB, named after the System Name that Aspire assigns to it.

To find this name, click the content source name to open its configuration page, then look for the System Name field in the General section.

...

Note
If you want to run a content source in more than one Aspire server, make sure the content sources have the same System Name.

Each database holds several collections for its crawl usage:

audit
Holds all the actions performed on each of the items being processed.

This is an example of a document in this collection:

{
    "_id" : ObjectId("571f94498cd956261c112156"),
    "id" : "C:\\dev-temp\\testdata\\A\\0\\0\\0\\3.txt",
    "crawlStart" : NumberLong(1461687363561),
    "url" : "file://C:/dev-temp/testdata/A/0/0/0/3.txt",
    "type" : "job",
    "action" : "ADD",
    "batch" : null,
    "ts" : NumberLong(1461687366086)
}

_id: automatically generated unique ID
id: ID of the document
crawlStart: identifier of the crawl that generated this audit entry (the ID is the time the crawl started, in UNIX format)
url: the URL used to fetch the document
type: either job or batch; identifies whether the audit entry corresponds to a single document or to batch metadata
action: the action performed on the document (ADD in the example above)
batch: the ID of the batch that processed the document
ts: the time when this entry was added
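As a small illustration of how the two timestamps relate, the sketch below converts `crawlStart` and `ts` from the sample audit document into readable dates and computes the gap between them. It assumes both values are milliseconds since the Unix epoch, which their magnitude suggests but this page does not state outright; `describeAudit` is a hypothetical helper, not part of Aspire.

```javascript
// Sample audit entry from this page, with NumberLong values written as plain numbers.
const auditEntry = {
  id: "C:\\dev-temp\\testdata\\A\\0\\0\\0\\3.txt",
  crawlStart: 1461687363561,
  url: "file://C:/dev-temp/testdata/A/0/0/0/3.txt",
  type: "job",
  action: "ADD",
  batch: null,
  ts: 1461687366086,
};

// Hypothetical helper: interpret the timestamps as epoch milliseconds (assumption).
function describeAudit(entry) {
  return {
    crawlStartedAt: new Date(entry.crawlStart).toISOString(),
    loggedAt: new Date(entry.ts).toISOString(),
    msSinceCrawlStart: entry.ts - entry.crawlStart,
  };
}

const auditSummary = describeAudit(auditEntry);
console.log(auditSummary);
// msSinceCrawlStart is 2525 for the sample entry
```

Because `crawlStart` doubles as the crawl identifier, grouping audit entries by that value gives the full action log of one crawl.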

errors
Holds all errors that happened during the crawls.
This is an example of a document in this collection:

{
    "_id" : ObjectId("571f940e8cd956261c112151"),
    "error" : {
        "@time" : NumberLong(1461687310975),
        "@crawlTime" : NumberLong(1461686819675),
        "@cs" : "File_System_Source",
        "@processor" : "File_System_Source-10.10.20.203:50505",
        "@type" : "S",
        "_$" : "Error starting crawl\ncom.searchtechnologies.aspire.services.AspireException: Bad 'exclude' regex pattern: C:\\dev-temp\\testdata\\A ..."
    }
}

_id: automatically generated unique ID
error:
    @time: the time when this error happened
    @crawlTime: identifier of the crawl that generated this error (the ID is the time the crawl started, in UNIX format)
    @cs: the name of the content source that generated this error
    @processor: the name of the server that generated this error
    @type: the type of error: "S" for scanner error, "D" for document error, "B" for batch error, "F" for content source startup failure, or "U" for unknown
    _$: the detailed error message
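The one-letter `@type` codes are easier to read once mapped to labels. The following sketch does that for the sample error document; `ERROR_TYPES` and `summarizeError` are hypothetical names introduced for illustration, and the summary format is an arbitrary choice.

```javascript
// Labels for the @type codes listed on this page.
const ERROR_TYPES = {
  S: "scanner error",
  D: "document error",
  B: "batch error",
  F: "content source startup failure",
  U: "unknown",
};

// Sample error document from this page (message truncated exactly as shown there).
const errorDoc = {
  error: {
    "@time": 1461687310975,
    "@crawlTime": 1461686819675,
    "@cs": "File_System_Source",
    "@processor": "File_System_Source-10.10.20.203:50505",
    "@type": "S",
    "_$": "Error starting crawl\ncom.searchtechnologies.aspire.services.AspireException: Bad 'exclude' regex pattern: C:\\dev-temp\\testdata\\A ...",
  },
};

// Hypothetical helper: build a one-line summary from an error document.
function summarizeError(doc) {
  const err = doc.error;
  const label = ERROR_TYPES[err["@type"]] || "unrecognized type";
  // Keep only the first line of the message; the rest is usually a stack trace.
  const firstLine = err["_$"].split("\n")[0];
  return `[${label}] ${err["@cs"]} on ${err["@processor"]}: ${firstLine}`;
}

const errorSummary = summarizeError(errorDoc);
console.log(errorSummary);
```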

hierarchy
Holds the parent hierarchy for all the container items. This is used to generate the item hierarchy in the Populate & Fetch stage.
This is an example of a document in this collection:

{
    "_id" : "C:\\dev-temp\\testdata\\B\\5\\9\\3",
    "name" : "3",
    "ancestors" : {
        "_id" : "C:\\dev-temp\\testdata\\B\\5\\9",
        "name" : "9",
        "ancestors" : {
            "_id" : "C:\\dev-temp\\testdata\\B\\5",
            "name" : "5",
            "ancestors" : {
                "_id" : "C:\\dev-temp\\testdata\\B",
                "name" : "B",
                "ancestors" : {
                    "_id" : "C:\\dev-temp\\testdata",
                    "name" : "testdata",
                    "ancestors" : null
                }
            }
        }
    }
}

_id: the ID of the container item
name: the name of the container item
ancestors: all the ancestors of this item
    _id: ID of the parent container item
    name: the name of the parent container item
    ancestors: all the ancestors of the parent item
        ...: all the grandparents and beyond
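Because each `ancestors` value nests the parent's own document, the full path of a container can be recovered by walking the chain until it reaches `null`. A minimal sketch, using the sample document above; `ancestorNames` is a hypothetical helper name.

```javascript
// Sample hierarchy document from this page.
const hierarchyDoc = {
  _id: "C:\\dev-temp\\testdata\\B\\5\\9\\3",
  name: "3",
  ancestors: {
    _id: "C:\\dev-temp\\testdata\\B\\5\\9",
    name: "9",
    ancestors: {
      _id: "C:\\dev-temp\\testdata\\B\\5",
      name: "5",
      ancestors: {
        _id: "C:\\dev-temp\\testdata\\B",
        name: "B",
        ancestors: { _id: "C:\\dev-temp\\testdata", name: "testdata", ancestors: null },
      },
    },
  },
};

// Hypothetical helper: collect names from the root down to the given container.
function ancestorNames(node) {
  const names = [];
  let current = node;
  while (current !== null && current !== undefined) {
    names.unshift(current.name); // prepend so the root ends up first
    current = current.ancestors;
  }
  return names;
}

const hierarchyPath = ancestorNames(hierarchyDoc);
console.log(hierarchyPath.join(" > "));
// prints "testdata > B > 5 > 9 > 3"
```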

processQueue
scanQueue
snapshots
statistics
status
usersAndGroups
