
Each connector installed as a content source in Aspire creates its own database in MongoDB, named after it, along with several collections that it uses for crawling and for serving security information. This page describes the purpose of each collection and explains the fields of the documents stored in it.

Document Queues and Metadata

  • processQueue

Manages the items that need to be processed by the workflow. These items may or may not also be sent to be scanned.

Field Name | Example | Description
_id | C:\test-folder\folderA\testDocument.txt | The unique id of the document
metadata | [depends on each connector] | The metadata fields the connector needs to fetch or populate this document
type | [depends on each connector] | The serialized version of the ItemType of the document
status | C, P or A | The document processing status. C: Completed, it has already been processed. P: In Progress, it is currently being processed. A: Available, it is available to be processed.
action | add, update, delete | The action to be performed on the search engine for the document
timestamp | 1465334398471 | The timestamp of when this document was added to the queue
signature | CBEC1210FE2D51A8166C3E70D38F8A07 | An MD5 signature; when a document changes, this signature should also change
parentId | C:\test-folder\folderA | The id of the parent document, in other words the document whose scan produced the current document
processor | File_System-192.168.1.15:50505 | The identifier of the Aspire server that processed or is processing the current document
shouldScan | false | Determines whether or not this document should be considered for scanning
shouldProcess | true | Determines whether or not this document should be considered for processing by the workflow
retries | 0 | The number of times this document has been retried
name | testDocument.txt | The name of this document
isCrawlRootItem | false | Indicates whether this is one of the root crawl items (for internal control)
hierarchyId | C:\test-folder\folderA\testDocument.txt | Unique id used to generate the hierarchy for this document; it may differ from the _id field

Example:

{
    "_id" : "C:\\test-folder\\folderA\\testDocument.txt",
    "metadata" : {
        "fetchUrl" : "file://C:/test-folder/folderA/testDocument.txt",
        "url" : "file://C:/test-folder/folderA/testDocument.txt"
    },
    "type" : "vtwqabl6oiadwy3pnuxhgzlbojrwq5dfmnug433mn5twszltfzqxg4djojss4y3pnvyg63tfnz2hglsgnfwgk43zon2gk3kjorsw2vdzobsqaaaaaaaaaaaaciaaa6dsaahguylwmexgyylom4xek3tvnuaaaaaaaaaaaaasaaahq4duaacgm2lmmu",
    "status" : "C",
    "action" : "add",
    "timestamp" : NumberLong(1465334398471),
    "signature" : "CBEC1210FE2D51A8166C3E70D38F8A07",
    "parentId" : "C:\\test-folder\\folderA",
    "processor" : "File_System_Source-192.168.56.1:50505",
    "shouldScan" : false,
    "shouldProcess" : true,
    "retries" : 0,
    "name" : "0.txt",
    "isCrawlRootItem" : false,
    "hiearchyId" : "C:\\test-folder\\folderA\\testDocument.txt"
}
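
As a rough illustration of how the status and shouldProcess fields can be read together, the following Python sketch filters in-memory copies of processQueue documents. It is not the connector's own logic: the helper names (awaiting_workflow, queued_at) and the second sample entry are hypothetical, and the timestamp is assumed to be milliseconds since the Unix epoch, which the example value suggests.

```python
from datetime import datetime, timedelta, timezone

# Status codes as described for the processQueue collection.
STATUS_LABELS = {"A": "Available", "P": "In Progress", "C": "Completed"}


def awaiting_workflow(queue_docs):
    """Return entries that are Available and flagged with shouldProcess."""
    return [
        doc for doc in queue_docs
        if doc.get("status") == "A" and doc.get("shouldProcess", False)
    ]


def queued_at(doc):
    """Convert `timestamp` (assumed epoch milliseconds) to a UTC datetime."""
    seconds, millis = divmod(doc["timestamp"], 1000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base + timedelta(milliseconds=millis)


sample_queue = [
    # Values taken from the example document above.
    {"_id": r"C:\test-folder\folderA\testDocument.txt", "status": "C",
     "shouldProcess": True, "timestamp": 1465334398471},
    # Hypothetical second entry, added only for illustration.
    {"_id": r"C:\test-folder\folderA\pending.txt", "status": "A",
     "shouldProcess": True, "timestamp": 1465334398500},
]

for doc in awaiting_workflow(sample_queue):
    print(doc["_id"], STATUS_LABELS[doc["status"]], queued_at(doc).isoformat())
```

Only the hypothetical pending.txt entry is printed, because the documented example already has status C.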

 

  • scanQueue

Manages the items that need to be scanned by the connector. These items may or may not have been sent for processing previously.

Field Name | Example | Description
_id | C:\test-folder\folderA | The unique id of the document
metadata | [depends on each connector] | The metadata fields the connector needs to fetch or populate this document
type | [depends on each connector] | The serialized version of the ItemType of the document
status | C, P or A | The document processing status. C: Completed, it has already been processed. P: In Progress, it is currently being processed. A: Available, it is available to be processed.
action | add, update, delete | The action to be performed on the search engine for the document
timestamp | 1465334398471 | The timestamp of when this document was added to the queue
signature | CBEC1210FE2D51A8166C3E70D38F8A07 | An MD5 signature; when a document changes, this signature should also change
parentId | C:\test-folder | The id of the parent document, in other words the document whose scan produced the current document
processor | File_System-192.168.1.15:50505 | The identifier of the Aspire server that processed or is processing the current document
shouldScan | false | Determines whether or not this document should be considered for scanning
shouldProcess | true | Determines whether or not this document was considered for processing by the workflow
retries | 0 | The number of times this document has been retried
name | folderA | The name of this document
isCrawlRootItem | false | Indicates whether this is one of the root crawl items (for internal control)
hierarchyId | C:\test-folder\folderA | Unique id used to generate the hierarchy for this document; it may differ from the _id field

 

Example:

{
    "_id" : "C:\\test-folder\\folderA",
    "metadata" : {
        "fetchUrl" : "file://C:/test-folder/folderA",
        "url" : "file://C:/test-folder/folderA",
        "displayUrl" : "C:\\test-folder\\folderA",
        "lastModified" : "2016-02-23T17:08:55Z",
        "dataSize" : 0,
        "acls" : null
    },
    "type" : "vtwqabl6oiadwy3pnuxhgzlbojrwq5dfmnug433mn5twszltfzqxg4djojss4y3pnvyg63tfnz2hglsgnfwgk43zon2gk3kjorsw2vdzobsqaaaaaaaaaaaaciaaa6dsaahguylwmexgyylom4xek3tvnuaaaaaaaaaaaaasaaahq4duaadgm33mmrsxe",
    "status" : "C",
    "action" : "add",
    "timestamp" : NumberLong(1465334398103),
    "signature" : "CD2C65824E45BFE94C71970EEEA18A8C",
    "parentId" : "C:\\test-folder",
    "processor" : "File_System_Source-192.168.56.1:50505",
    "shouldScan" : true,
    "shouldProcess" : true,
    "retries" : 0,
    "name" : "folderA",
    "isCrawlRootItem" : false,
    "hiearchyId" : "C:\\test-folder\\folderA"
}
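
The signature field is described as an MD5 value that changes whenever the document changes. The Python sketch below shows one way such a comparison could work. It assumes the signature is computed over the document's metadata serialized as canonical JSON. This page does not state which inputs each connector actually hashes, so the function names and the hashing input are illustrative only.

```python
import hashlib
import json


def md5_signature(metadata):
    """Hash a canonical JSON rendering of the metadata (assumed input)."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest().upper()


def decide_action(previous_signature, metadata):
    """Return (action, signature); action is None when nothing changed."""
    current = md5_signature(metadata)
    if previous_signature is None:
        return "add", current      # never seen before
    if previous_signature != current:
        return "update", current   # signature differs, content changed
    return None, current           # same signature, no change


# Metadata values taken from the scanQueue example above.
folder_metadata = {
    "fetchUrl": "file://C:/test-folder/folderA",
    "url": "file://C:/test-folder/folderA",
    "lastModified": "2016-02-23T17:08:55Z",
    "dataSize": 0,
}

action, first_signature = decide_action(None, folder_metadata)
print(action, first_signature)

# Hypothetical later crawl in which the folder was modified.
changed = dict(folder_metadata, lastModified="2016-03-01T09:00:00Z")
print(decide_action(first_signature, changed)[0])
```

Sorting the keys before hashing keeps the signature stable when the same metadata is serialized in a different field order.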

 

  • hierarchy

Holds the hierarchy information for every parent document scanned by the connector. Each entry contains the information for all of its ancestors, all the way up to the root document.

 

Field Name | Example | Description
_id | C:\test-folder\folderA | Unique id of the parent document
name | folderA | Name to be used in the hierarchy metadata
ancestors | [parent hierarchy info] | Holds the same information for the parent of this document, or null if this is a root document

 

Example:

{
    "_id" : "C:\\test-folder\\folderA",
    "name" : "folderA",
    "ancestors" : {
        "_id" : "C:\\test-folder",
        "name" : "test-folder",
        "ancestors" : null
    }
}
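
Because ancestors nests the same structure recursively, the full path from the root can be recovered by following it until it is null. A minimal Python sketch, using the example document above as input (the function name hierarchy_path is hypothetical):

```python
def hierarchy_path(entry):
    """Return the names from the root document down to the given entry."""
    names = []
    node = entry
    while node is not None:
        names.append(node["name"])
        node = node.get("ancestors")  # None (null) marks the root
    names.reverse()
    return names


# In-memory copy of the hierarchy example above.
folder_a = {
    "_id": "C:\\test-folder\\folderA",
    "name": "folderA",
    "ancestors": {
        "_id": "C:\\test-folder",
        "name": "test-folder",
        "ancestors": None,
    },
}

print(hierarchy_path(folder_a))  # ['test-folder', 'folderA']
```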

 

Statistics and Logging

  • audit

Holds the actions performed by the content source on each document.

Field Name | Example | Description
_id | ObjectId("5750bfa610163e3f58fd7019") | MongoDB internal id
id | C:\test-folder\folderA\testDocument.txt | Unique id of the document
crawlStart | 1464909728339 | Crawl identifier; each crawl has a different crawlStart time
url | file://C:/test-folder/folderA/testDocument.txt | URL of the document
type | job or batch | Specifies the type of this audit log entry
action | ADD, UPDATE, NOCHANGE, DELETE, BATCH_COMPLETED, BATCH_ERROR, WORKFLOW_COMPLETE, WORKFLOW_TERMINATED, WORKFLOW_ERROR or EXCLUDED | ADD: discovered as a new document to be added. UPDATE: discovered a document with a change. NOCHANGE: found no change in the document. DELETE: the document was found to be deleted. BATCH_COMPLETED: the current batch finished. BATCH_ERROR: there was an error closing the batch. WORKFLOW_COMPLETE: the document completed the workflow without errors. WORKFLOW_TERMINATED: the document was terminated during the workflow. WORKFLOW_ERROR: the document had an error while executing the workflow. EXCLUDED: the document was excluded by the include/exclude patterns.
batch | 10.10.20.203:50506/2016-06-03T16:04:59Z/batch-0 | The id of the batch containing the current document, if any
ts | 1464970015441 | The time this entry was added to the log

 

Example:

{
    "_id" : ObjectId("5751ab210afca2469094bb23"),
    "id" : "C:\\test-folder\\folderA\\testDocument.txt",
    "crawlStart" : NumberLong(1464970009642),
    "url" : "file://C:/test-folder/folderA/testDocument.txt",
    "type" : "job",
    "action" : "WORKFLOW_COMPLETE",
    "batch" : "10.10.20.203:50506/2016-06-03T16:04:59Z/batch-0",
    "ts" : NumberLong(1464970015441)
}
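
Since every audit entry carries the crawlStart of the crawl that produced it, entries can be grouped per crawl to summarize what happened. The Python sketch below counts actions for entries of type job over in-memory copies of audit documents. The function name and every sample entry other than the first are hypothetical, and skipping batch entries is a choice made for this illustration only.

```python
from collections import Counter


def actions_per_crawl(audit_entries):
    """Map each crawlStart to a Counter of actions for 'job' entries."""
    summary = {}
    for entry in audit_entries:
        if entry.get("type") != "job":
            continue  # skip batch-level entries in this sketch
        summary.setdefault(entry["crawlStart"], Counter())[entry["action"]] += 1
    return summary


audit_entries = [
    # Taken from the audit example above.
    {"id": "C:\\test-folder\\folderA\\testDocument.txt",
     "crawlStart": 1464970009642, "type": "job",
     "action": "WORKFLOW_COMPLETE", "ts": 1464970015441},
    # Hypothetical entries, added only for illustration.
    {"id": "C:\\test-folder\\folderA\\testDocument.txt",
     "crawlStart": 1464970009642, "type": "job",
     "action": "ADD", "ts": 1464970012000},
    {"id": None, "crawlStart": 1464970009642, "type": "batch",
     "action": "BATCH_COMPLETED", "ts": 1464970016000},
]

for crawl_start, counts in actions_per_crawl(audit_entries).items():
    print(crawl_start, dict(counts))
```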

 

  • errors

  • statistics

 

Controlling and Incremental

  • status

  • snapshots

 

 
