
Each connector installed as a content source in Aspire creates a new database in MongoDB, named after it, along with multiple collections that it uses for crawling and for serving security information.
The sections below describe what each collection is used for and explain each field of the documents stored in it.



Document Queues and Metadata


processQueue

Manages the items that need to be processed by the workflow. These items may or may not be sent to be scanned.

Field Name | Example | Description
_id | C:\test-folder\folderA\testDocument.txt | The unique id of the document
metadata | [depends on each connector] | The metadata fields the connector needs to fetch or populate this document
type | [depends on each connector] | The serialized version of the ItemType of the document
status | C, P or A | The document processing status. C: Completed, the document has already been processed. P: In Progress, the document is currently being processed. A: Available, the document is available to be processed.
action | add, update, delete | The action to be performed on the search engine for the document
timestamp | 1465334398471 | The timestamp when this document was added to the queue
signature | CBEC1210FE2D51A8166C3E70D38F8A07 | An MD5 signature; when a document changes, this signature should also change
parentId | C:\test-folder\folderA | The id of the parent document, in other words the document whose scan discovered the current document
processor | File_System-192.168.1.15:50505 | The identifier of the Aspire server that processed or is processing the current document
shouldScan | false | Determines whether or not this document should be considered for scanning
shouldProcess | true | Determines whether or not this document should be considered for processing by the workflow
retries | 0 | The number of times this document has been retried
name | testDocument.txt | The name of this document
isCrawlRootItem | false | Indicates whether this is one of the root crawl items (for internal control)
hierarchyId | C:\test-folder\folderA\testDocument.txt | Unique id used to generate the hierarchy for this document; it may be different from the _id field

Example

{
    "_id" : "C:\\test-folder\\folderA\\testDocument.txt",
    "metadata" : {
        "fetchUrl" : "file://C:/test-folder/folderA/testDocument.txt",
        "url" : "file://C:/test-folder/folderA/testDocument.txt"
    },
    "type" : "vtwqabl6oiadwy3pnuxhgzlbojrwq5dfmnug433mn5twszltfzqxg4djojss4y3pnvyg63tfnz2hglsgnfwgk43zon2gk3kjorsw2vdzobsqaaaaaaaaaaaaciaaa6dsaahguylwmexgyylom4xek3tvnuaaaaaaaaaaaaasaaahq4duaacgm2lmmu",
    "status" : "C",
    "action" : "add",
    "timestamp" : NumberLong(1465334398471),
    "signature" : "CBEC1210FE2D51A8166C3E70D38F8A07",
    "parentId" : "C:\\test-folder\\folderA",
    "processor" : "File_System_Source-192.168.56.1:50505",
    "shouldScan" : false,
    "shouldProcess" : true,
    "retries" : 0,
    "name" : "testDocument.txt",
    "isCrawlRootItem" : false,
    "hiearchyId" : "C:\\test-folder\\folderA\\testDocument.txt"
}
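
The status field suggests a simple lifecycle: an item starts as available (A), is claimed by a server (P), and ends as completed (C). The following is a minimal in-memory sketch of that lifecycle using plain Python dictionaries. It is not how Aspire implements its queues: the function names claim_next and mark_completed are hypothetical, and picking the oldest item by timestamp is an assumption made only for illustration.

```python
# Status codes from the processQueue table.
AVAILABLE, IN_PROGRESS, COMPLETED = "A", "P", "C"


def claim_next(queue, processor):
    """Claim the oldest available item for `processor`, or return None.

    Ordering by timestamp is an assumption, not documented Aspire behavior.
    """
    candidates = [doc for doc in queue if doc["status"] == AVAILABLE]
    if not candidates:
        return None
    item = min(candidates, key=lambda doc: doc["timestamp"])
    item["status"] = IN_PROGRESS
    item["processor"] = processor
    return item


def mark_completed(item):
    """Record that the item has been processed."""
    item["status"] = COMPLETED


queue = [
    {"_id": "C:\\test-folder\\folderA\\testDocument.txt", "status": "A",
     "timestamp": 1465334398471, "processor": None},
    {"_id": "C:\\test-folder\\folderA", "status": "C",
     "timestamp": 1465334398103,
     "processor": "File_System_Source-192.168.56.1:50505"},
]

item = claim_next(queue, "File_System_Source-192.168.56.1:50505")
print(item["_id"], item["status"])
mark_completed(item)
print(item["status"])
```

Against a real MongoDB collection, the claim step would need to be atomic so that two servers cannot claim the same item.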

scanQueue

Manages the items that need to be scanned by the connector. These items may or may not have been sent for processing previously.

Field Name | Example | Description
_id | C:\test-folder\folderA | The unique id of the document
metadata | [depends on each connector] | The metadata fields the connector needs to fetch or populate this document
type | [depends on each connector] | The serialized version of the ItemType of the document
status | C, P or A | The document processing status. C: Completed, the document has already been processed. P: In Progress, the document is currently being processed. A: Available, the document is available to be processed.
action | add, update, delete | The action to be performed on the search engine for the document
timestamp | 1465334398471 | The timestamp when this document was added to the queue
signature | CBEC1210FE2D51A8166C3E70D38F8A07 | An MD5 signature; when a document changes, this signature should also change
parentId | C:\test-folder | The id of the parent document, in other words the document whose scan discovered the current document
processor | File_System-192.168.1.15:50505 | The identifier of the Aspire server that processed or is processing the current document
shouldScan | false | Determines whether or not this document should be considered for scanning
shouldProcess | true | Determines whether or not this document was considered for processing by the workflow
retries | 0 | The number of times this document has been retried
name | folderA | The name of this document
isCrawlRootItem | false | Indicates whether this is one of the root crawl items (for internal control)
hierarchyId | C:\test-folder\folderA | Unique id used to generate the hierarchy for this document; it may be different from the _id field

Example

{
    "_id" : "C:\\test-folder\\folderA",
    "metadata" : {
        "fetchUrl" : "file://C:/test-folder/folderA",
        "url" : "file://C:/test-folder/folderA",
        "displayUrl" : "C:\\test-folder\\folderA",
        "lastModified" : "2016-02-23T17:08:55Z",
        "dataSize" : 0,
        "acls" : null
    },
    "type" : "vtwqabl6oiadwy3pnuxhgzlbojrwq5dfmnug433mn5twszltfzqxg4djojss4y3pnvyg63tfnz2hglsgnfwgk43zon2gk3kjorsw2vdzobsqaaaaaaaaaaaaciaaa6dsaahguylwmexgyylom4xek3tvnuaaaaaaaaaaaaasaaahq4duaadgm33mmrsxe",
    "status" : "C",
    "action" : "add",
    "timestamp" : NumberLong(1465334398103),
    "signature" : "CD2C65824E45BFE94C71970EEEA18A8C",
    "parentId" : "C:\\test-folder",
    "processor" : "File_System_Source-192.168.56.1:50505",
    "shouldScan" : true,
    "shouldProcess" : true,
    "retries" : 0,
    "name" : "folderA",
    "isCrawlRootItem" : false,
    "hiearchyId" : "C:\\test-folder\\folderA"
}

hierarchy

Holds the hierarchy information for every parent document scanned by the connector. Each parent contains the information about all of its ancestors, all the way up to the root document.

Field Name | Example | Description
_id | C:\test-folder\folderA | Unique id of the parent document
name | folderA | Name to be used in the hierarchy metadata
ancestors | [parent hierarchy info] | Holds the same information for the parent of this document, or null if this is a root document

Example

{
    "_id" : "C:\\test-folder\\folderA",
    "name" : "folderA",
    "ancestors" : {
        "_id" : "C:\\test-folder",
        "name" : "test-folder",
        "ancestors" : null
    }
}
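
Because each entry nests its parent under ancestors, the full path to the root can be recovered by following that field until it is null. The sketch below walks a hierarchy document represented as a plain Python dictionary; the helper name ancestor_names is hypothetical.

```python
def ancestor_names(entry):
    """Return the names from the root document down to `entry`."""
    names = []
    node = entry
    while node is not None:
        names.append(node["name"])
        node = node["ancestors"]  # None marks the root document
    names.reverse()
    return names


entry = {
    "_id": "C:\\test-folder\\folderA",
    "name": "folderA",
    "ancestors": {
        "_id": "C:\\test-folder",
        "name": "test-folder",
        "ancestors": None,
    },
}

print(ancestor_names(entry))  # ['test-folder', 'folderA']
```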

Statistics and Logging


audit

Holds the actions performed by the content source on each document.

Field Name | Example | Description
_id | ObjectId("5750bfa610163e3f58fd7019") | Mongo's internal id
id | C:\\test-folder\\folderA\\testDocument.txt | Unique id of the document
crawlStart | 1464909728339 | Crawl identifier; each crawl has a different crawlStart time
url | file://C:/test-folder/folderA/testDocument.txt | URL of the document
type | job or batch | Specifies what type of audit log entry this object is
action | ADD, UPDATE, NOCHANGE, DELETE, BATCH_COMPLETED, BATCH_ERROR, WORKFLOW_COMPLETE, WORKFLOW_TERMINATED, WORKFLOW_ERROR or EXCLUDED | ADD: Discovered as a new document to be added. UPDATE: Discovered a document with a change. NOCHANGE: Found no change in the document. DELETE: The document was found to be deleted. BATCH_COMPLETED: The current batch finished. BATCH_ERROR: There was an error closing the batch. WORKFLOW_COMPLETE: The document completed the workflow without errors. WORKFLOW_TERMINATED: The document was terminated during the workflow. WORKFLOW_ERROR: The document had an error executing the workflow. EXCLUDED: The document was excluded by the include/exclude patterns.
batch | 10.10.20.203:50506/2016-06-03T16:04:59Z/batch-0 | If any, contains the id of the batch of the current document
ts | 1464970015441 | The time this entry was added to the log

Example

{
    "_id" : ObjectId("5751ab210afca2469094bb23"),
    "id" : "C:\\test-folder\\folderA\\testDocument.txt",
    "crawlStart" : NumberLong(1464970009642),
    "url" : "file://C:/test-folder/folderA/testDocument.txt",
    "type" : "job",
    "action" : "WORKFLOW_COMPLETE",
    "batch" : "10.10.20.203:50506/2016-06-03T16:04:59Z/batch-0",
    "ts" : NumberLong(1464970015441)
}
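
Since every audit entry carries the document id, the crawl identifier (crawlStart) and a log time (ts), the last recorded action for each document in a crawl can be derived by keeping the entry with the highest ts. The sketch below works on plain dictionaries shaped like the example above. The helper name is hypothetical, and the first sample entry (an ADD with an earlier ts) is invented for illustration.

```python
def latest_action_per_document(audit_entries, crawl_start):
    """Map each document id to its most recent action within one crawl."""
    latest = {}
    for entry in audit_entries:
        # Only "job" entries of the requested crawl are considered here.
        if entry["crawlStart"] != crawl_start or entry["type"] != "job":
            continue
        current = latest.get(entry["id"])
        if current is None or entry["ts"] > current["ts"]:
            latest[entry["id"]] = entry
    return {doc_id: entry["action"] for doc_id, entry in latest.items()}


doc_id = "C:\\test-folder\\folderA\\testDocument.txt"
audit_entries = [
    # Hypothetical earlier entry for the same document.
    {"id": doc_id, "crawlStart": 1464970009642, "type": "job",
     "action": "ADD", "ts": 1464970012000},
    {"id": doc_id, "crawlStart": 1464970009642, "type": "job",
     "action": "WORKFLOW_COMPLETE", "ts": 1464970015441},
]

result = latest_action_per_document(audit_entries, 1464970009642)
print(result[doc_id])  # WORKFLOW_COMPLETE
```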

errors

Holds the document errors that occur during either scanning or workflow processing.

Field Name | Example | Description
_id | ObjectId("576844914b4ae74664a414bd") | Mongo's internal id
error/@time | 1466451089287 | Time when this error entry was logged
error/@crawlTime | 1466451085183 | Identifier of the crawl
error/@cs | File_System_Source | Identifier of the content source
error/@processor | File_System_Source-192.168.56.1:50505 | The server that processed and reported this error
error/@type | S, D, B, F or U | S: Scanner errors, caused in the connector scanning stages. D: Document errors, related to fetch, text extraction or workflow processing. B: Batch errors, related to failed batches of Aspire jobs. F: Failed errors, not currently used but possibly used later. U: Unknown errors, where the source is unknown.
error/_$ | Error processing: C:\\test-folder/folderA/testDocument2.txt\ncom.searchtechnologies.aspire.services.AspireException: Exception whilst running script: Rule: 1\r\n\tat..... (more) | The error message

Example

{
    "_id" : ObjectId("576844914b4ae74664a414bd"),
    "error" : {
        "@time" : NumberLong(1466451089287),
        "@crawlTime" : NumberLong(1466451085183),
        "@cs" : "File_System_Source",
        "@processor" : "File_System_Source-192.168.56.1:50505",
        "@type" : "D",
        "_$" : "Error processing: C:\\test-folder/folderA/testDocument2.txt\ncom.searchtechnologies.aspire.services.AspireException: Exception whilst running script: Rule: 1\r\n\tat ... (more)"
    }
}
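
The @type codes and the multi-line _$ message lend themselves to a short per-crawl summary. The sketch below groups the first line of each message by error type, using plain dictionaries shaped like the example above. The function name and the lower-case labels are hypothetical choices for illustration.

```python
# Labels derived from the error/@type descriptions in the table.
ERROR_TYPES = {
    "S": "scanner",
    "D": "document",
    "B": "batch",
    "F": "failed",
    "U": "unknown",
}


def summarize_errors(entries, crawl_time):
    """Group the first line of each error message by type for one crawl."""
    summary = {}
    for entry in entries:
        error = entry["error"]
        if error["@crawlTime"] != crawl_time:
            continue
        label = ERROR_TYPES.get(error["@type"], "unknown")
        first_line = error["_$"].split("\n", 1)[0]
        summary.setdefault(label, []).append(first_line)
    return summary


entries = [
    {"error": {
        "@time": 1466451089287,
        "@crawlTime": 1466451085183,
        "@cs": "File_System_Source",
        "@processor": "File_System_Source-192.168.56.1:50505",
        "@type": "D",
        "_$": "Error processing: C:\\test-folder/folderA/testDocument2.txt\n"
              "com.searchtechnologies.aspire.services.AspireException: "
              "Exception whilst running script: Rule: 1",
    }},
]

summary = summarize_errors(entries, 1466451085183)
print(summary["document"][0])
```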

statistics

Holds the crawl statistics per server. What you see in the Administration UI is the sum of all the server statistics associated with the same crawl identifier.

Field Name | Example | Description
_id | 1466450887680-File_System_Source-192.168.56.1:50505 | Unique identifier of each statistics object
statistics/@processor | File_System_Source-192.168.56.1:50505 | The server and content source name
statistics/@server | 192.168.56.1:50505 | The server identifier
statistics/@status | A, S, E, F, L, I, N, IP, IWP, IWR, X, IWS or U | The crawl status. A: Aborted. S: Completed. E: Errored. F: Failed. L: Loading. I: In-Progress. N: New. IP: Paused. IWP: Pausing. IWR: Resuming. X: Stopped. IWS: Stopping. U: Unknown.
statistics/@mode | F, FR, I, IR, R, T, U | The crawl mode. F: Full crawl. FR: Full recovery. I: Incremental crawl. IR: Incremental recovery. R: Real time. T: Test. U: Unknown.
statistics/@startTime | 1466450887680 | The time when the crawl started
statistics/@endTime | 1466450905466 | The time when the crawl ended
statistics/@cs | File_System_Source | The identifier of the content source
statistics/queue/scan/@toScan | 0 | Number of documents in the scan queue pending to be scanned
statistics/queue/scan/@scanning | 0 | Number of documents from the scan queue currently being scanned
statistics/queue/scan/@scanned | 11 | Number of documents from the scan queue already scanned
statistics/queue/scan/@total | 11 | Total documents in the scan queue
statistics/queue/process/@toProcess | 0 | Number of documents in the process queue pending to be processed
statistics/queue/process/@processing | 0 | Number of documents from the process queue currently being processed
statistics/queue/process/@processed | 121 | Number of documents from the process queue already processed
statistics/queue/process/@total | 121 | Total documents in the process queue
statistics/inProgress/@adding | 0 | Number of documents currently being processed as "ADD"
statistics/inProgress/@updating | 0 | Number of documents currently being processed as "UPDATE"
statistics/inProgress/@deleting | 0 | Number of documents currently being processed as "DELETE"
statistics/inProgress/@total | 0 | Total documents currently being processed
statistics/processed/@added | 121 | Number of documents processed as "ADD"
statistics/processed/@updated | 0 | Number of documents processed as "UPDATE"
statistics/processed/@deleting | 0 | Number of documents processed as "DELETE"
statistics/processed/@unchanged | 0 | Number of documents processed as "NOCHANGE"
statistics/processed/@excluded | 0 | Number of documents "EXCLUDED" from being processed
statistics/processed/@terminated | 0 | Number of documents processed but ended as "TERMINATED"
statistics/processed/@errored | 0 | Number of documents processed with errors
statistics/processed/@bytes | 129470 | Total bytes processed so far
statistics/processed/@total | 121 | Total number of documents processed
statistics/errors/@batch | 0 | Number of batch errors
statistics/errors/@scan | 0 | Number of scanner errors (errors that happened while scanning for documents)
statistics/errors/@document | 0 | Number of document errors (errors that occurred while processing the document)
statistics/errors/@total | 0 | Total number of errors

Example

{
    "_id" : "1466450887680-File_System_Source-192.168.56.1:50505",
    "statistics" : {
        "@processor" : "File_System_Source-192.168.56.1:50505",
        "@server" : "192.168.56.1:50505",
        "@status" : "S",
        "@mode" : "F",
        "@startTime" : NumberLong(1466450887680),
        "@endTime" : NumberLong(1466450905466),
        "@cs" : "File_System_Source",
        "queue" : {
            "scan" : {
                "@toScan" : 0,
                "@scanning" : 0,
                "@scanned" : 11,
                "@total" : 11
            },
            "process" : {
                "@toProcess" : 0,
                "@processing" : 0,
                "@processed" : 121,
                "@total" : 121
            }
        },
        "inProgress" : {
            "@adding" : 0,
            "@updating" : 0,
            "@deleting" : 0,
            "@total" : 0
        },
        "processed" : {
            "@added" : 121,
            "@updated" : 0,
            "@deleting" : 0,
            "@unchanged" : 0,
            "@excluded" : 0,
            "@terminated" : 0,
            "@errored" : 0,
            "@bytes" : 129470,
            "@total" : 121
        },
        "errors" : {
            "@batch" : 0,
            "@scan" : 0,
            "@document" : 0,
            "@total" : 0
        }
    }
}
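
As noted above, the Administration UI shows the sum of the per-server statistics for one crawl. The sketch below illustrates that aggregation on plain dictionaries by recursively adding the numeric counters under queue, inProgress, processed and errors. It assumes that @startTime identifies the crawl across servers, which the _id prefix in the example suggests but the page does not state outright. The helper names and the second server's numbers are hypothetical.

```python
COUNTER_SECTIONS = ("queue", "inProgress", "processed", "errors")


def add_counters(total, part):
    """Recursively add the numeric counters of `part` into `total`."""
    for key, value in part.items():
        if isinstance(value, dict):
            add_counters(total.setdefault(key, {}), value)
        else:
            total[key] = total.get(key, 0) + value
    return total


def crawl_totals(stat_docs, start_time):
    """Sum the counters of all per-server documents of one crawl."""
    totals = {}
    for doc in stat_docs:
        stats = doc["statistics"]
        # Assumption: @startTime identifies the crawl across servers.
        if stats["@startTime"] != start_time:
            continue
        for section in COUNTER_SECTIONS:
            add_counters(totals.setdefault(section, {}), stats.get(section, {}))
    return totals


stat_docs = [
    {"statistics": {"@startTime": 1466450887680,
                    "queue": {"scan": {"@scanned": 11, "@total": 11}},
                    "processed": {"@added": 121, "@total": 121}}},
    # Hypothetical second server taking part in the same crawl.
    {"statistics": {"@startTime": 1466450887680,
                    "queue": {"scan": {"@scanned": 2, "@total": 2}},
                    "processed": {"@added": 4, "@total": 4}}},
]

totals = crawl_totals(stat_docs, 1466450887680)
print(totals["processed"]["@added"])  # 125
```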

Controlling and Incremental


status

Holds all the crawl control information and the crawl status. This determines when a crawl should be started, paused, stopped, or marked as successfully completed.

Field Name | Example | Description
_id | ObjectId("5768448d4b4ae74664a41495") | Mongo's internal id
connectorSource | [depends on specific connector] | Contains the configuration set for running a new crawl; its contents depend on what each specific connector needs as configuration
@action | start | Property from the scheduler specifying the action to be done for this content source. For crawls it should always be "start"
@actionProperties | full or incremental | Property from the scheduler specifying whether the crawl should be incremental or full
@crawlId | 0 | Aspire's internal id for the crawl
@normalizedCSName | File_System_Source | Aspire's internal name for the current content source
displayName | File System Source | The content source name as the user entered it
@scheduler | AspireSystemScheduler | Identifies the scheduler that created the crawl request. By default it should be "AspireSystemScheduler"
@scheduleId | 0 | The schedule id corresponding to the current crawl request
@jobNumber | 5 | A sequential counter of how many jobs the scheduler has served
@sourceId | File_System_Source | The content source identifier
@actionType | manual or scheduled | Determines whether the crawl was started by a periodic schedule or a manual request
@dbId | 1 | Legacy property from the scheduler
crawlStart | 1466460900181 | The time in milliseconds at which this request was created; it is used as the crawl identifier for the rest of the crawl's life
crawlStatus | A, S, E, F, L, I, N, IP, IWP, IWR, X, IWS or U | The crawl status. A: Aborted. S: Completed. E: Errored. F: Failed. L: Loading. I: In-Progress. N: New. IP: Paused. IWP: Pausing. IWR: Resuming. X: Stopped. IWS: Stopping. U: Unknown.
processDeletes | none | If any, holds the id of the server that is scanning through the snapshots to find the deletes at the end of the crawl
processingDeletesStatus | finished | This flag is only present when the deletes processing is finished
crawlEnd | 1466460912343 | If any, the time in milliseconds at which this crawl finished

Example

{
    "_id" : ObjectId("57686d0d4b4ae74664a417a8"),
    "connectorSource" : {
        "url" : "C:\\test-folder",
        "partialScan" : "false",
        "subDirUrl" : null,
        "indexContainers" : "true",
        "scanRecursively" : "true",
        "scanExcludedItems" : "false",
        "useACLs" : "false",
        "acls" : null,
        "includes" : null,
        "excludes" : null
    },
    "@action" : "start",
    "@actionProperties" : "full",
    "@crawlId" : "0",
    "@normalizedCSName" : "File_System_Source",
    "displayName" : "File System Source",
    "@scheduler" : "AspireSystemScheduler",
    "@scheduleId" : "2",
    "@jobNumber" : "7",
    "@sourceId" : "File_System_Source",
    "@actionType" : "manual",
    "@dbId" : "2",
    "crawlStart" : NumberLong(1466461453589),
    "crawlStatus" : "S",
    "processDeletes" : "none",
    "processingDeletesStatus" : "finished",
    "crawlEnd" : NumberLong(1466461465352)
}
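
The crawlStatus code and the millisecond timestamps are easier to read once decoded. The sketch below turns a status document, represented as a plain dictionary, into a readable summary. The function name describe_crawl and the output keys are hypothetical, and interpreting the millisecond values as Unix epoch times is an assumption based on the magnitude of the example values.

```python
from datetime import datetime, timezone

# Codes taken from the crawlStatus row of the table.
CRAWL_STATUS = {
    "A": "Aborted", "S": "Completed", "E": "Errored", "F": "Failed",
    "L": "Loading", "I": "In-Progress", "N": "New", "IP": "Paused",
    "IWP": "Pausing", "IWR": "Resuming", "X": "Stopped",
    "IWS": "Stopping", "U": "Unknown",
}


def describe_crawl(status_doc):
    """Return a readable summary of one status document."""
    start = status_doc["crawlStart"]
    end = status_doc.get("crawlEnd")  # absent while the crawl is running
    return {
        "status": CRAWL_STATUS.get(status_doc["crawlStatus"], "Unknown"),
        "mode": status_doc["@actionProperties"],
        # Assumption: the values are milliseconds since the Unix epoch.
        "started": datetime.fromtimestamp(start / 1000, tz=timezone.utc),
        "duration_ms": None if end is None else end - start,
    }


status_doc = {
    "@actionProperties": "full",
    "crawlStart": 1466461453589,
    "crawlStatus": "S",
    "crawlEnd": 1466461465352,
}

info = describe_crawl(status_doc)
print(info["status"], info["mode"], info["duration_ms"])  # Completed full 11763
```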

snapshots

Holds the incremental information needed to determine when a document has changed, has been added, or has been deleted. This is only used by connectors whose repository APIs do not provide a way to get the updates in a single call without scanning through all the documents again.

Field Name | Example | Description
_id | C:\test-folder\folderA\testDocument.txt | The unique id of each document
container | true or false | true: this document can contain other documents. false: it cannot.
crawlId | 0 | The id of the crawl that introduced this entry
signature | CBEC1210FE2D51A8166C3E70D38F8A07 | An MD5 digest of the main metadata of each document, used to determine changes
timestamp | 1466461845637 | The crawlStart time
parentId | C:\test-folder\folderA | The name of the parent of this item
error | true or false | true: this document had an error. false: otherwise.

Example

{
    "_id" : "C:\\test-folder\\folderA\\testDocument.txt",
    "container" : false,
    "crawlId" : 0,
    "signature" : "CBEC1210FE2D51A8166C3E70D38F8A07",
    "timestamp" : NumberLong(1466461845637),
    "parentId" : "C:\\test-folder\\folderA",
    "error" : false
}
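
Snapshot-based incremental crawling comes down to comparing the signatures stored by the previous crawl with the signatures computed during the current scan. The sketch below illustrates that comparison on plain dictionaries. It is not Aspire's implementation: the function names are hypothetical, and the metadata fields that go into the digest (here lastModified and dataSize) are an assumption, since the page only says the signature covers "the main metadata".

```python
import hashlib


def signature(metadata):
    """Upper-case MD5 hex digest over the sorted metadata (illustrative)."""
    payload = "|".join(f"{key}={metadata[key]}" for key in sorted(metadata))
    return hashlib.md5(payload.encode("utf-8")).hexdigest().upper()


def classify(snapshots, scanned):
    """Compare previous and current signatures, both given as {_id: signature}."""
    actions = {}
    for doc_id, sig in scanned.items():
        if doc_id not in snapshots:
            actions[doc_id] = "add"
        elif snapshots[doc_id] != sig:
            actions[doc_id] = "update"
        else:
            actions[doc_id] = "nochange"
    for doc_id in snapshots:
        if doc_id not in scanned:
            actions[doc_id] = "delete"
    return actions


old = signature({"lastModified": "2016-02-23T17:08:55Z", "dataSize": 10})
new = signature({"lastModified": "2016-03-01T09:00:00Z", "dataSize": 12})

snapshots = {"a.txt": old, "b.txt": old, "c.txt": old}
scanned = {"a.txt": old, "b.txt": new, "d.txt": new}

actions = classify(snapshots, scanned)
print(actions)
```

Documents missing from the current scan can only be detected after the whole scan has finished, which matches the processDeletes step that runs at the end of a crawl.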