
Aspire 3.0 introduces a new way of crawling repositories (extracting documents): a framework that distributes the scanning and processing among any number of Aspire servers, using a document queue stored in MongoDB.

We use the term items for the logical units of information that can be extracted individually from a repository. They can be documents, folders, comments, etc.

 

Architecture Design


The principle behind this architecture is the hierarchical structure of most repositories, in which some items can have children (attachments, sub-files, etc.). This leads to classifying any item as one of two different types:

  • Container items
    Can hold zero or more sub-items (which can also be containers). For example, the folders in the FileSystem Connector, which can contain either more folders or items.
  • Items
    Can't contain any children and only need to be populated with metadata and processed.

From these two types of items, we derive two different types of processes.

  • Scan
    Literally "scans" the container items and discovers more items to be either scanned or just populated.
  • Populate & Fetch
    Gets each item's metadata and fetches its content from the repository.

Every time the Scan discovers a new item, the item is enqueued in one of the MongoDB queues (the scanQueue and processQueue collections described below) to be either scanned or populated by any Aspire server connected to the same MongoDB.
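As a rough illustration of this flow, here is a minimal sketch of the scan loop (illustrative JavaScript, not Aspire source code; listChildren, enqueueScan, and enqueueProcess are hypothetical helpers standing in for the repository listing call and the inserts into the scanQueue and processQueue collections described below):

    function scan(container) {
        // Ask the repository for the container's direct children
        listChildren(container).forEach(function (item) {
            if (item.isContainer) {
                enqueueScan(item);      // container: scanned later by any server
            } else {
                enqueueProcess(item);   // leaf item: goes to Populate & Fetch
            }
        });
    }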

 

The following diagram shows the flow of the items in a connector crawl process.

[Diagram: Aspire 3.0 Connector Framework]

Note that both Scan and Populate & Fetch run in multiple threads, and that more than one Aspire server can run each crawl in parallel.

 

As shown in the diagram, the next step after populating metadata and fetching the content is sending the items to the Workflow to be processed by any custom rule or published to a Search Engine.

MongoDB Databases & Collections

Each content source installed in Aspire creates its own database in MongoDB, named with the System Name that Aspire assigns to it.

 

To determine this name, click the content source name to open its configuration page, and look for the System Name field in the General section.

If you want to run a content source in more than one Aspire server, make sure the content sources have the same System Name.
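A quick way to verify this from the mongo shell (assuming a content source whose System Name is File_System_Source, the name that also appears in the error example below):

    use File_System_Source
    show collections    // audit, errors, hierarchy, processQueue, scanQueue, ...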

Each database holds several collections used during crawls:

  • audit
    Holds all the actions performed on each of the items being processed.

    This is an example of a document in this collection:

    {
        "_id" : ObjectId("571f94498cd956261c112156"),
        "id" : "C:\\dev-temp\\testdata\\A\\0\\0\\0\\3.txt",
        "crawlStart" : NumberLong(1461687363561),
        "url" : "file://C:/dev-temp/testdata/A/0/0/0/3.txt",
        "type" : "job",
        "action" : "ADD",
        "batch" : null,
        "ts" : NumberLong(1461687366086)
    }

     

    • _id: Automatically generated unique id
    • id: Id of the document
    • crawlStart: Identifies the crawl that generated this audit entry (the ID is the time the crawl started, in UNIX format)
    • url: The URL used for fetching the document
    • type: Either job or batch; identifies whether the entry corresponds to a single document or to batch metadata
    • action: The action performed on the document (ADD in the example above)
    • batch: The id of the batch that processed the document
    • ts: The time when this entry was added
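
    For reference, here are a few example queries against this collection in the mongo shell, reusing the crawlStart and id values from the document above:

    // All audit entries from one crawl (crawlStart identifies the crawl)
    db.audit.find({ crawlStart: NumberLong(1461687363561) })

    // All actions recorded for a single document in that crawl
    db.audit.find({
        crawlStart: NumberLong(1461687363561),
        id: "C:\\dev-temp\\testdata\\A\\0\\0\\0\\3.txt"
    })

    // crawlStart and ts are UNIX timestamps in milliseconds
    new Date(1461687363561)    // ISODate("2016-04-26T16:16:03.561Z")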
  • errors
    Holds all errors that happened during the crawls.
    This is an example of a document in this collection:

    {
        "_id" : ObjectId("571f940e8cd956261c112151"),
        "error" : {
            "@time" : NumberLong(1461687310975),
            "@crawlTime" : NumberLong(1461686819675),
            "@cs" : "File_System_Source",
            "@processor" : "File_System_Source-10.10.20.203:50505",
            "@type" : "S",
            "_$" : "Error starting crawl\ncom.searchtechnologies.aspire.services.AspireException: Bad 'exclude' regex pattern: C:\\dev-temp\\testdata\\A ..."
        }
    }

     

    • _id: Automatically generated unique id
    • error
      • @time: The time when this error happened
      • @crawlTime: Identifies the crawl that generated this error entry (the ID is the time the crawl started, in UNIX format)
      • @cs: The name of the content source that generated this error
      • @processor: The name of the server that generated this error
      • @type: The type of error: "S" for scanner error, "D" for document error, "B" for batch error, "F" for content source startup failure, or "U" for unknown
      • _$: The detailed error message
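
    Example queries against this collection in the mongo shell, reusing the values from the document above:

    // All scanner errors ("S") recorded for this content source
    db.errors.find({ "error.@type": "S" })

    // All errors from a specific crawl, matched by its start time
    db.errors.find({ "error.@crawlTime": NumberLong(1461686819675) })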
  • hierarchy
    Holds the parent hierarchy for all the container items. This is used to generate the item hierarchy in the Populate & Fetch stage.
    This is an example of a document in this collection:

    {
        "_id" : "C:\\dev-temp\\testdata\\B\\5\\9\\3",
        "name" : "3",
        "ancestors" : {
            "_id" : "C:\\dev-temp\\testdata\\B\\5\\9",
            "name" : "9",
            "ancestors" : {
                "_id" : "C:\\dev-temp\\testdata\\B\\5",
                "name" : "5",
                "ancestors" : {
                    "_id" : "C:\\dev-temp\\testdata\\B",
                    "name" : "B",
                    "ancestors" : {
                        "_id" : "C:\\dev-temp\\testdata",
                        "name" : "testdata",
                        "ancestors" : null
                    }
                }
            }
        }
    }

     

    • _id: The id of the container item
    • name: The name of the container item
    • ancestors: All the ancestors of this item
      • _id: Id of the parent container item
      • name: The name of the parent container item
      • ancestors: All the ancestors of the parent item
        • ...: All the grandparents and beyond
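
    As a sketch of how this nested structure can be used, the following mongo shell function rebuilds an item's full path by walking the ancestors chain upward (ancestorPath is a hypothetical helper, not part of Aspire):

    function ancestorPath(doc) {
        var names = [];
        while (doc !== null) {
            names.unshift(doc.name);   // prepend, so the root comes first
            doc = doc.ancestors;
        }
        return names.join("/");
    }

    ancestorPath(db.hierarchy.findOne({ _id: "C:\\dev-temp\\testdata\\B\\5\\9\\3" }))
    // => "testdata/B/5/9/3"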
  • processQueue
    Queue of items waiting to be populated and fetched.
  • scanQueue
    Queue of container items waiting to be scanned.
  • snapshots
  • statistics
  • status
  • usersAndGroups

 
