Each Connector installed as a Content Source in Aspire creates a new database in MongoDB named after it, along with multiple collections used for crawling and for serving security information. The following describes what each collection is used for, together with an explanation of each field of the documents stored in them.
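
To explore these collections directly, the connector's database can be opened from the MongoDB shell. The following is a minimal sketch; it assumes a content source named File_System_Source (as in the examples on this page) and a MongoDB instance reachable from the shell:

Code Block
languagejs
// Switch to the connector's database (its name matches the content source)
var csDb = db.getSiblingDB("File_System_Source");

// List the collections described on this page
csDb.getCollectionNames();

// Peek at one entry of the process queue
csDb.processQueue.findOne();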


Document Queues and Metadata

  • processQueue

Manages the items that need to be processed by the workflow; these items may or may not also be sent to be scanned.

Field Name | Example | Description
_id | C:\test-folder\folderA\testDocument.txt | The unique id of the document
metadata | [depends on each connector] | The metadata fields the connector needs to fetch or populate this document
type | [depends on each connector] | The serialized version of the ItemType of the document
status | C, P or A | The document processing status:
    C: Completed, the document has already been processed
    P: In Progress, the document is currently being processed
    A: Available, the document is available to be processed
action | add, update or delete | The action to be performed on the search engine for this document
timestamp | 1465334398471 | The timestamp when this document was added to the queue
signature | CBEC1210FE2D51A8166C3E70D38F8A07 | An MD5 signature; when a document changes, this signature should also change
parentId | C:\test-folder\folderA | The id of the parent document, in other words the document that scanned the current document
processor | File_System-192.168.1.15:50505 | The identifier of the Aspire server that processed or is processing the current document
shouldScan | false | Determines whether or not this document should be considered for scanning
shouldProcess | true | Determines whether or not this document should be considered for processing by the workflow
retries | 0 | The number of times this document has been retried
name | testDocument.txt | The name of this document
isCrawlRootItem | false | Indicates if this is one of the root crawl items (for internal control)
hierarchyId | C:\test-folder\folderA\testDocument.txt | Unique id used to generate the hierarchy for this document; it may differ from the _id field

Example:

Code Block
languagejs
{
    "_id" : "C:\\test-folder\\folderA\\testDocument.txt",
    "metadata" : {
        "fetchUrl" : "file://C:/test-folder/folderA/testDocument.txt",
        "url" : "file://C:/test-folder/folderA/testDocument.txt"
    },
    "type" : "vtwqabl6oiadwy3pnuxhgzlbojrwq5dfmnug433mn5twszltfzqxg4djojss4y3pnvyg63tfnz2hglsgnfwgk43zon2gk3kjorsw2vdzobsqaaaaaaaaaaaaciaaa6dsaahguylwmexgyylom4xek3tvnuaaaaaaaaaaaaasaaahq4duaacgm2lmmu",
    "status" : "C",
    "action" : "add",
    "timestamp" : NumberLong(1465334398471),
    "signature" : "CBEC1210FE2D51A8166C3E70D38F8A07",
    "parentId" : "C:\\test-folder\\folderA",
    "processor" : "File_System_Source-192.168.56.1:50505",
    "shouldScan" : false,
    "shouldProcess" : true,
    "retries" : 0,
    "name" : "0.txt",
    "isCrawlRootItem" : false,
    "hiearchyId" : "C:\\test-folder\\folderA\\testDocument.txt"
}
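
As a quick sanity check of the queue state, the process queue can be queried from the mongo shell against the connector's database. This is only a sketch; collection and field names are taken from the table above:

Code Block
languagejs
// Count processQueue entries by processing status: C (completed), P (in progress), A (available)
db.processQueue.aggregate([
    { $group : { _id : "$status", count : { $sum : 1 } } }
]);

// List items still waiting to be picked up by the workflow
db.processQueue.find({ status : "A", shouldProcess : true });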

 

  • scanQueue

Manages the items that need to be scanned by the connector; these items may or may not have been sent for processing previously.

Field Name | Example | Description
_id | C:\test-folder\folderA | The unique id of the document
metadata | [depends on each connector] | The metadata fields the connector needs to fetch or populate this document
type | [depends on each connector] | The serialized version of the ItemType of the document
status | C, P or A | The document processing status:
    C: Completed, the document has already been processed
    P: In Progress, the document is currently being processed
    A: Available, the document is available to be processed
action | add, update or delete | The action to be performed on the search engine for this document
timestamp | 1465334398471 | The timestamp when this document was added to the queue
signature | CBEC1210FE2D51A8166C3E70D38F8A07 | An MD5 signature; when a document changes, this signature should also change
parentId | C:\test-folder | The id of the parent document, in other words the document that scanned the current document
processor | File_System-192.168.1.15:50505 | The identifier of the Aspire server that processed or is processing the current document
shouldScan | false | Determines whether or not this document should be considered for scanning
shouldProcess | true | Determines whether or not this document was considered for processing by the workflow
retries | 0 | The number of times this document has been retried
name | folderA | The name of this document
isCrawlRootItem | false | Indicates if this is one of the root crawl items (for internal control)
hierarchyId | C:\test-folder\folderA | Unique id used to generate the hierarchy for this document; it may differ from the _id field

 

Example:

Code Block
languagejs
{
    "_id" : "C:\\test-folder\\folderA",
    "metadata" : {
        "fetchUrl" : "file://C:/test-folder/folderA",
        "url" : "file://C:/test-folder/folderA",
        "displayUrl" : "C:\\test-folder\\folderA",
        "lastModified" : "2016-02-23T17:08:55Z",
        "dataSize" : 0,
        "acls" : null
    },
    "type" : "vtwqabl6oiadwy3pnuxhgzlbojrwq5dfmnug433mn5twszltfzqxg4djojss4y3pnvyg63tfnz2hglsgnfwgk43zon2gk3kjorsw2vdzobsqaaaaaaaaaaaaciaaa6dsaahguylwmexgyylom4xek3tvnuaaaaaaaaaaaaasaaahq4duaadgm33mmrsxe",
    "status" : "C",
    "action" : "add",
    "timestamp" : NumberLong(1465334398103),
    "signature" : "CD2C65824E45BFE94C71970EEEA18A8C",
    "parentId" : "C:\\test-folder",
    "processor" : "File_System_Source-192.168.56.1:50505",
    "shouldScan" : true,
    "shouldProcess" : true,
    "retries" : 0,
    "name" : "folderA",
    "isCrawlRootItem" : false,
    "hiearchyId" : "C:\\test-folder\\folderA"
}
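
The scan queue can be inspected in the same way; for instance, to see which items are still pending or which server is currently scanning them. A sketch based on the fields above:

Code Block
languagejs
// Items still waiting to be scanned
db.scanQueue.find({ status : "A", shouldScan : true });

// Items currently being scanned, grouped by the Aspire server working on them
db.scanQueue.aggregate([
    { $match : { status : "P" } },
    { $group : { _id : "$processor", scanning : { $sum : 1 } } }
]);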

 

  • hierarchy

Holds the hierarchy information for every parent document scanned by the connector; each entry contains the information about all of its ancestors, all the way up to the root document.

 

Field Name | Example | Description
_id | C:\test-folder\folderA | Unique id of the parent document
name | folderA | Name to be used in the hierarchy metadata
ancestors | [parent hierarchy info] | Holds the same information for the parent of this document, or null if this is a root document

 

Example:

Code Block
{
    "_id" : "C:\\test-folder\\folderA",
    "name" : "folderA",
    "ancestors" : {
        "_id" : "C:\\test-folder",
        "name" : "test-folder",
        "ancestors" : null
    }
}
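
Because each entry embeds its full ancestor chain, the hierarchy path of a document can be rebuilt with a small helper in the mongo shell. The hierarchyPath function below is a hypothetical helper written for illustration, not part of Aspire:

Code Block
languagejs
// Walk the nested "ancestors" field to rebuild a path from the root down to the entry
function hierarchyPath(entry) {
    var names = [];
    for (var node = entry; node !== null; node = node.ancestors) {
        names.unshift(node.name);
    }
    return names.join("/");
}

hierarchyPath(db.hierarchy.findOne({ _id : "C:\\test-folder\\folderA" }));
// => "test-folder/folderA"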

 

Statistics and Logging

  • audit

Holds the actions performed by the content source for each document.

Field Name | Example | Description
_id | ObjectId("5750bfa610163e3f58fd7019") | Mongo's internal ID
id | C:\test-folder\folderA\testDocument.txt | Unique id of the document
crawlStart | 1464909728339 | Crawl identifier; each crawl has a different crawlStart time
url | file://C:/test-folder/folderA/testDocument.txt | URL of the document
type | job or batch | Specifies what type of audit log entry the current object is
action | ADD, UPDATE, NOCHANGE, DELETE, BATCH_COMPLETED, BATCH_ERROR, WORKFLOW_COMPLETE, WORKFLOW_TERMINATED, WORKFLOW_ERROR or EXCLUDED | The action recorded for the document:
    ADD: discovered as a new document to be added
    UPDATE: discovered a document with a change
    NOCHANGE: found no change in the document
    DELETE: the document was found to be deleted
    BATCH_COMPLETED: the current batch finished
    BATCH_ERROR: there was an error closing the batch
    WORKFLOW_COMPLETE: the document completed the workflow without errors
    WORKFLOW_TERMINATED: the document was terminated during the workflow
    WORKFLOW_ERROR: the document had an error executing the workflow
    EXCLUDED: the document was excluded by the include/exclude patterns
batch | 10.10.20.203:50506/2016-06-03T16:04:59Z/batch-0 | If present, contains the id of the batch of the current document
ts | 1464970015441 | The time this entry was added to the log

 

Example:

Code Block
{
    "_id" : ObjectId("5751ab210afca2469094bb23"),
    "id" : "C:\\test-folder\\folderA\\testDocument.txt",
    "crawlStart" : NumberLong(1464970009642),
    "url" : "file://C:/test-folder/folderA/testDocument.txt",
    "type" : "job",
    "action" : "WORKFLOW_COMPLETE",
    "batch" : "10.10.20.203:50506/2016-06-03T16:04:59Z/batch-0",
    "ts" : NumberLong(1464970015441)
}
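
A useful way to read this collection is to summarize the actions recorded for a single crawl, identified by its crawlStart time. A minimal mongo shell sketch, using the crawlStart value from the example above:

Code Block
languagejs
// Count audit entries per action for one crawl
db.audit.aggregate([
    { $match : { crawlStart : NumberLong(1464970009642) } },
    { $group : { _id : "$action", count : { $sum : 1 } } }
]);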

 

  • errors

Holds the possible document errors that occur either during scanning or workflow processing.

Field Name | Example | Description
_id | ObjectId("576844914b4ae74664a414bd") | Mongo's internal id
error/@time | 1466451089287 | Time when this error entry was logged
error/@crawlTime | 1466451085183 | Identifier of the crawl
error/@cs | File_System_Source | Identifier of the content source
error/@processor | File_System_Source-192.168.56.1:50505 | The server that processed and reported this error
error/@type | S, D, B, F or U | The error type:
    S: Scanner errors, caused in the connector scanning stages
    D: Document errors, related to fetch, text extraction or workflow processing
    B: Batch errors, related to failed batches of Aspire jobs
    F: Failed errors, not currently used but reserved for later use
    U: Unknown errors, where the source of the error is unknown
error/_$ | Error processing: C:\test-folder/folderA/testDocument2.txt\ncom.searchtechnologies.aspire.services.AspireException: Exception whilst running script: Rule: 1\r\n\tat..... (more) | The error message

Example:

Code Block
languagejs
    {
        "_id" : ObjectId("576844914b4ae74664a414bd"),
        "error" : {
            "@time" : NumberLong(1466451089287),
            "@crawlTime" : NumberLong(1466451085183),
            "@cs" : "File_System_Source",
            "@processor" : "File_System_Source-192.168.56.1:50505",
            "@type" : "D",
            "_$" : "Error processing: C:\\test-folder/folderA/testDocument2.txt\ncom.searchtechnologies.aspire.services.AspireException: Exception whilst running script: Rule: 1\r\n\tat ... (more)"
        }
    }
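
For example, to pull up only the document-level errors of a particular crawl, the @type and @crawlTime attributes can be queried from the mongo shell. A sketch using the values from the example above:

Code Block
languagejs
// Document errors ("D") reported during one crawl
db.errors.find({
    "error.@type" : "D",
    "error.@crawlTime" : NumberLong(1466451085183)
});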
    

     

  • statistics

Holds the crawl statistics per server; what you see in the Administration UI is the sum of all the server statistics associated with the same crawl identifier.

Field Name | Example | Description
_id | 1466450887680-File_System_Source-192.168.56.1:50505 | Unique identifier of each statistics object
statistics/@processor | File_System_Source-192.168.56.1:50505 | The content source name combined with the server identifier
statistics/@server | 192.168.56.1:50505 | The server identifier
statistics/@status | A, S, E, F, L, I, N, IP, IWP, IWR, X, IWS or U | The crawl status:
    A: Aborted
    S: Completed
    E: Errored
    F: Failed
    L: Loading
    I: In-Progress
    N: New
    IP: Paused
    IWP: Pausing
    IWR: Resuming
    X: Stopped
    IWS: Stopping
    U: Unknown
statistics/@mode | F, FR, I, IR, R, T or U | The crawl mode:
    F: Full crawl
    FR: Full recovery
    I: Incremental crawl
    IR: Incremental recovery
    R: Real time
    T: Test
    U: Unknown
statistics/@startTime | 1466450887680 | The time when the crawl started
statistics/@endTime | 1466450905466 | The time when the crawl ended
statistics/@cs | File_System_Source | The identifier of the content source
statistics/queue/scan/@toScan | 0 | Number of documents in the scan queue pending to be scanned
statistics/queue/scan/@scanning | 0 | Number of documents from the scan queue currently being scanned
statistics/queue/scan/@scanned | 11 | Number of documents from the scan queue already scanned
statistics/queue/scan/@total | 11 | Total documents in the scan queue
statistics/queue/process/@toProcess | 0 | Number of documents in the process queue pending to be processed
statistics/queue/process/@processing | 0 | Number of documents from the process queue currently being processed
statistics/queue/process/@processed | 121 | Number of documents from the process queue already processed
statistics/queue/process/@total | 121 | Total documents in the process queue
statistics/inProgress/@adding | 0 | Number of documents currently being processed as "ADD"
statistics/inProgress/@updating | 0 | Number of documents currently being processed as "UPDATE"
statistics/inProgress/@deleting | 0 | Number of documents currently being processed as "DELETE"
statistics/inProgress/@total | 0 | Total documents currently being processed
statistics/processed/@added | 121 | Number of documents processed as "ADD"
statistics/processed/@updated | 0 | Number of documents processed as "UPDATE"
statistics/processed/@deleting | 0 | Number of documents processed as "DELETE"
statistics/processed/@unchanged | 0 | Number of documents processed as "NOCHANGE"
statistics/processed/@excluded | 0 | Number of documents "EXCLUDED" from being processed
statistics/processed/@terminated | 0 | Number of documents processed but ended as "TERMINATED"
statistics/processed/@errored | 0 | Number of documents processed with errors
statistics/processed/@bytes | 129470 | Total bytes processed so far
statistics/processed/@total | 121 | Total number of documents processed
statistics/errors/@batch | 0 | Number of batch errors
statistics/errors/@scan | 0 | Number of scanner errors (errors that happened while scanning for documents)
statistics/errors/@document | 0 | Number of document errors (errors that occurred while processing the document)
statistics/errors/@total | 0 | Total number of errors

Example:

Code Block
languagejs
    {
        "_id" : "1466450887680-File_System_Source-192.168.56.1:50505",
        "statistics" : {
            "@processor" : "File_System_Source-192.168.56.1:50505",
            "@server" : "192.168.56.1:50505",
            "@status" : "S",
            "@mode" : "F",
            "@startTime" : NumberLong(1466450887680),
            "@endTime" : NumberLong(1466450905466),
            "@cs" : "File_System_Source",
            "queue" : {
                "scan" : {
                    "@toScan" : 0,
                    "@scanning" : 0,
                    "@scanned" : 11,
                    "@total" : 11
                },
                "process" : {
                    "@toProcess" : 0,
                    "@processing" : 0,
                    "@processed" : 121,
                    "@total" : 121
                }
            },
            "inProgress" : {
                "@adding" : 0,
                "@updating" : 0,
                "@deleting" : 0,
                "@total" : 0
            },
            "processed" : {
                "@added" : 121,
                "@updated" : 0,
                "@deleting" : 0,
                "@unchanged" : 0,
                "@excluded" : 0,
                "@terminated" : 0,
                "@errored" : 0,
                "@bytes" : 129470,
                "@total" : 121
            },
            "errors" : {
                "@batch" : 0,
                "@scan" : 0,
                "@document" : 0,
                "@total" : 0
            }
        }
    }
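
Since the Administration UI shows the sum of the per-server statistics of a crawl, the same totals can be recomputed from the mongo shell. A minimal sketch, matching on the crawl's start time as in the example above:

Code Block
languagejs
// Sum per-server counters for one crawl (identified by its start time)
db.statistics.aggregate([
    { $match : { "statistics.@startTime" : NumberLong(1466450887680) } },
    { $group : {
        _id : "$statistics.@cs",
        processed : { $sum : "$statistics.processed.@total" },
        errors : { $sum : "$statistics.errors.@total" },
        bytes : { $sum : "$statistics.processed.@bytes" }
    } }
]);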

     

Controlling and Incremental

  • status

Holds all the crawl control information and its status; this determines when a crawl should be started, paused, stopped, or marked as successfully completed.

Field Name | Example | Description
_id | ObjectId("5768448d4b4ae74664a41495") | Mongo's internal ID
connectorSource | [depends on specific connector] | Contains the configuration set for running a new crawl; it depends on what each specific connector needs as configuration
@action | start | Property from the scheduler specifying the action to be performed for this content source. For crawls it should always be "start"
@actionProperties | full or incremental | Property from the scheduler specifying whether the crawl should be incremental or full
@crawlId | 0 | Aspire's internal ID for the crawl
@normalizedCSName | File_System_Source | Aspire's internal name for the current content source
displayName | File System Source | The content source name as the user entered it
@scheduler | AspireSystemScheduler | Identifies the scheduler that created the crawl request. By default it should be "AspireSystemScheduler"
@scheduleId | 0 | The schedule id corresponding to the current crawl request
@jobNumber | 5 | A sequential counter of how many jobs the scheduler has served
@sourceId | File_System_Source | The content source identifier
@actionType | manual or scheduled | Determines whether the crawl was started by a periodic schedule or a manual request
@dbId | 1 | Legacy property from the scheduler
crawlStart | 1466460900181 | The time in milliseconds at which this request was created; this is used as the crawl identifier for the rest of the crawl's life
crawlStatus | A, S, E, F, L, I, N, IP, IWP, IWR, X, IWS or U | The crawl status:
    A: Aborted
    S: Completed
    E: Errored
    F: Failed
    L: Loading
    I: In-Progress
    N: New
    IP: Paused
    IWP: Pausing
    IWR: Resuming
    X: Stopped
    IWS: Stopping
    U: Unknown
processDeletes | none | If any, holds the ID of the server that is scanning through the snapshots to find the deletes at the end of the crawl
processingDeletesStatus | finished | This flag is only present when the deletes processing is finished
crawlEnd | 1466460912343 | If any, the time in milliseconds at which this crawl finished

Example:

Code Block
languagejs
    {
        "_id" : ObjectId("57686d0d4b4ae74664a417a8"),
        "connectorSource" : {
            "url" : "C:\\test-folder",
            "partialScan" : "false",
            "subDirUrl" : null,
            "indexContainers" : "true",
            "scanRecursively" : "true",
            "scanExcludedItems" : "false",
            "useACLs" : "false",
            "acls" : null,
            "includes" : null,
            "excludes" : null
        },
        "@action" : "start",
        "@actionProperties" : "full",
        "@crawlId" : "0",
        "@normalizedCSName" : "File_System_Source",
        "displayName" : "File System Source",
        "@scheduler" : "AspireSystemScheduler",
        "@scheduleId" : "2",
        "@jobNumber" : "7",
        "@sourceId" : "File_System_Source",
        "@actionType" : "manual",
        "@dbId" : "2",
        "crawlStart" : NumberLong(1466461453589),
        "crawlStatus" : "S",
        "processDeletes" : "none",
        "processingDeletesStatus" : "finished",
        "crawlEnd" : NumberLong(1466461465352)
    }
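
To check how the most recent crawl of the content source ended, the status collection can be queried by crawlStart. This is only a sketch of one possible query:

Code Block
languagejs
// Most recent crawl request and its outcome ("S" means the crawl completed)
db.status.find({}, { crawlStart : 1, crawlStatus : 1, "@actionProperties" : 1, crawlEnd : 1 })
         .sort({ crawlStart : -1 })
         .limit(1);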

     

  • snapshots

Holds the incremental information needed to determine when a document has changed, has been added, or has been deleted. This is only used by connectors whose repository APIs don't provide a way of getting the updates in a single call, without having to scan through all the documents again.

     

Field Name | Example | Description
_id | C:\test-folder\folderA\testDocument.txt | The unique ID of each document
container | true or false | true if this document can contain other documents, false if it cannot
crawlId | 0 | The id of the crawl that introduced this entry
signature | CBEC1210FE2D51A8166C3E70D38F8A07 | An MD5 digest of the main metadata of each document, used to determine changes
timestamp | 1466461845637 | The crawlStart time
parentId | C:\test-folder\folderA | The id of the parent of this item
error | true or false | true if this document had an error, false otherwise

Example:

Code Block
languagejs
    {
        "_id" : "C:\\test-folder\\folderA\testDocument.txt",
        "container" : false,
        "crawlId" : 0,
        "signature" : "CBEC1210FE2D51A8166C3E70D38F8A07",
        "timestamp" : NumberLong(1466461845637),
        "parentId" : "C:\\test-folder\\folderA",
        "error" : false
    }
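
The snapshots collection can also be inspected directly from the mongo shell; for example, to list the entries under one container or the documents that ended with an error. A sketch based on the fields above:

Code Block
languagejs
// All snapshot entries under one container
db.snapshots.find({ parentId : "C:\\test-folder\\folderA" });

// Documents whose last processing ended with an error
db.snapshots.find({ error : true });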

     
