Aspire 3.2 introduced a new way of crawling repositories (extracting documents): a framework that distributes scanning and processing among any number of Aspire servers, using a document queue stored in MongoDB.

Items are the logical information units that can be extracted individually from a repository; they can be documents, folders, comments, etc.

This architecture rests on the hierarchical structure of most repositories, in which some items can have children (attachments, sub-files, etc.). Every item is therefore classified as one of two types:

  • Container items
    Can hold zero or more sub-items (which can themselves be containers). An example is a folder in the FileSystem Connector, which can contain either more folders or items.
  • Items
    Cannot contain children; they only need to be populated with metadata and processed.

 

From these two types of items we derive two different types of processes:

  • Scan
    Scans the container items and discovers more items to be either scanned or just populated.
  • Populate & Fetch
    Gets each item's metadata and fetches its content from the repository.

 

Every time the Scan discovers a new item, this item gets enqueued in one of the MongoDB queues to be either scanned or populated by any Aspire server connected to the same MongoDB.
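
The flow described above can be sketched with two in-memory queues. This is only an illustration under stated assumptions, not Aspire's implementation: the REPOSITORY dictionary, the function names, and the use of Python deques in place of the MongoDB queues are all hypothetical, and the sketch runs in a single thread on a single machine. It also simplifies by sending containers only to the scan queue and leaf items only to the populate queue.

```python
from collections import deque

# Hypothetical repository: each container id maps to the ids of its children.
# Any id that is not a key of this dict is treated as a leaf item.
REPOSITORY = {
    "root/": ["root/a.txt", "root/sub/"],
    "root/sub/": ["root/sub/b.txt"],
}


def is_container(item_id):
    """Return True if the item can hold children and therefore needs scanning."""
    return item_id in REPOSITORY


def crawl(seed):
    """Scan from a seed container and return the leaf items that were populated."""
    scan_queue = deque([seed])   # stand-in for the queue of containers to scan
    process_queue = deque()      # stand-in for the queue of items to populate
    populated = []

    # Scan: expand each container and enqueue every child that is discovered.
    while scan_queue:
        container = scan_queue.popleft()
        for child in REPOSITORY[container]:
            if is_container(child):
                scan_queue.append(child)     # to be scanned later
            else:
                process_queue.append(child)  # to be populated and fetched

    # Populate & Fetch: here we only record the id; a real connector would
    # read metadata and content from the repository at this point.
    while process_queue:
        populated.append(process_queue.popleft())

    return populated


if __name__ == "__main__":
    print(crawl("root/"))
```

Because discovered work sits in queues rather than in a call stack, any worker that can reach the queues could take the next entry, which is the property that lets several servers share one crawl.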

 

The following diagram shows the flow of the items in a connector crawl process.

[Figure: Aspire 3.2 Connector Framework]

Info

Note that both Scan and Populate & Fetch run in multiple threads, and that more than one Aspire server can work on the same crawl in parallel.

 

As shown in the diagram, the next step after populating the metadata and fetching the content is to send the items to the Workflow, where they are processed by any custom rules or published to a search engine.

MongoDB Databases & Collections

Each content source installed in Aspire will create its own database in MongoDB, using the System Name that Aspire assigns to it.

 

To find this name, click the content source name to open its configuration page, then look for the System Name field in the General section.

...

 

Note

If you want to run a content source on more than one Aspire server, make sure the content source has the same System Name on every server.

 

Each database holds several collections that are used during crawls:

  • audit
    Holds every action performed on each item being processed.

    This is an example of a document in this collection:

    {
        "_id" : ObjectId("571f94498cd956261c112156"),
        "id" : "C:\\dev-temp\\testdata\\A\\0\\0\\0\\3.txt",
        "crawlStart" : NumberLong(1461687363561),
        "url" : "file://C:/dev-temp/testdata/A/0/0/0/3.txt",
        "type" : "job",
        "action" : "ADD",
        "batch" : null,
        "ts" : NumberLong(1461687366086)
    }

     

    • _id: Automatically generated unique id
    • id: Id of the document
    • crawlStart: Identifies the crawl that generated this audit entry (the id is the time the crawl started, in UNIX time format)
    • url: The URL used for fetching the document
    • type: Either job or batch; indicates whether the audit entry corresponds to a single document or to batch metadata
    • batch: The id of the batch that processed the document
    • ts: The time when this entry was added
  • errors
    Holds all errors that happened during the crawls.
    This is an example of a document in this collection:

    {
        "_id" : ObjectId("571f940e8cd956261c112151"),
        "error" : {
            "@time" : NumberLong(1461687310975),
            "@crawlTime" : NumberLong(1461686819675),
            "@cs" : "File_System_Source",
            "@processor" : "File_System_Source-10.10.20.203:50505",
            "@type" : "S",
            "_$" : "Error starting crawl\ncom.searchtechnologies.aspire.services.AspireException: Bad 'exclude' regex pattern: C:\\dev-temp\\testdata\\A ..."
        }
    }

     

    • _id: Automatically generated unique id
    • error
      • @time: The time when this error happened
      • @crawlTime: Identifies the crawl that generated this error (the id is the time the crawl started, in UNIX time format)
      • @cs: The name of the content source that generated this error
      • @processor: The name of the server that generated this error
      • @type: The type of error: "S" for scanner error, "D" for document error, "B" for batch error, "F" for content source startup failure, or "U" for unknown
      • _$: The detailed error message
  • hierarchy
    Holds the parent hierarchy for all the container items. This is used to generate the item hierarchy in the Populate & Fetch stage.
    This is an example of a document in this collection:

    {
        "_id" : "C:\\dev-temp\\testdata\\B\\5\\9\\3",
        "name" : "3",
        "ancestors" : {
            "_id" : "C:\\dev-temp\\testdata\\B\\5\\9",
            "name" : "9",
            "ancestors" : {
                "_id" : "C:\\dev-temp\\testdata\\B\\5",
                "name" : "5",
                "ancestors" : {
                    "_id" : "C:\\dev-temp\\testdata\\B",
                    "name" : "B",
                    "ancestors" : {
                        "_id" : "C:\\dev-temp\\testdata",
                        "name" : "testdata",
                        "ancestors" : null
                    }
                }
            }
        }
    }

     

    • _id: The id of the container item
    • name: The name of the container item
    • ancestors: All the ancestors of this item
      • _id: The id of the parent container item
      • name: The name of the parent container item
      • ancestors: All the ancestors of the parent item
        • ...: All the grandparents and beyond
  • processQueue
  • scanQueue
  • snapshots
  • statistics
  • status
  • usersAndGroups
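
To make the audit, errors, and hierarchy documents above easier to read, here is a minimal sketch of three helpers that interpret them. It works on plain Python dictionaries copied from the sample documents and does not connect to MongoDB. The helper names are hypothetical, and it assumes that the time values are milliseconds since the UNIX epoch, which the 13-digit sample values suggest but the text above does not state.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Meaning of the "@type" codes listed for the errors collection.
ERROR_TYPES = {
    "S": "scanner error",
    "D": "document error",
    "B": "batch error",
    "F": "content source startup failure",
    "U": "unknown",
}

# Fields copied from the sample audit document (ObjectId omitted).
AUDIT_ENTRY = {
    "crawlStart": 1461687363561,
    "action": "ADD",
    "ts": 1461687366086,
}

# Copied from the sample hierarchy document.
HIERARCHY_ENTRY = {
    "_id": "C:\\dev-temp\\testdata\\B\\5\\9\\3",
    "name": "3",
    "ancestors": {
        "_id": "C:\\dev-temp\\testdata\\B\\5\\9",
        "name": "9",
        "ancestors": {
            "_id": "C:\\dev-temp\\testdata\\B\\5",
            "name": "5",
            "ancestors": {
                "_id": "C:\\dev-temp\\testdata\\B",
                "name": "B",
                "ancestors": {
                    "_id": "C:\\dev-temp\\testdata",
                    "name": "testdata",
                    "ancestors": None,
                },
            },
        },
    },
}


def millis_to_utc(millis):
    """Convert a value assumed to be epoch milliseconds into a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


def ancestor_names(entry):
    """Follow the nested "ancestors" field and return names from root to item."""
    names = []
    node = entry
    while node is not None:
        names.append(node["name"])
        node = node["ancestors"]
    names.reverse()
    return names


if __name__ == "__main__":
    print(millis_to_utc(AUDIT_ENTRY["crawlStart"]).isoformat())
    print(AUDIT_ENTRY["ts"] - AUDIT_ENTRY["crawlStart"], "ms after crawl start")
    print(ERROR_TYPES["S"])
    print(ancestor_names(HIERARCHY_ENTRY))
```

The ancestors field nests one parent per level, so a simple loop that follows it until it reaches null is enough to rebuild the full path of a container.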