Each database holds several collections that are used during crawls:

  • audit
    Holds every action performed on each item being processed.

    This is an example of a document in this collection:

    {
        "_id" : ObjectId("571f94498cd956261c112156"),
        "id" : "C:\\dev-temp\\testdata\\A\\0\\0\\0\\3.txt",
        "crawlStart" : NumberLong(1461687363561),
        "url" : "file://C:/dev-temp/testdata/A/0/0/0/3.txt",
        "type" : "job",
        "action" : "ADD",
        "batch" : null,
        "ts" : NumberLong(1461687366086)
    }

    • _id: Automatically generated unique ID
    • id: ID of the document
    • crawlStart: Identifier of the crawl that generated this audit entry (the ID is the time the crawl started, in UNIX format)
    • url: The URL used to fetch the document
    • type: Either job or batch; identifies whether the audit entry corresponds to a single document or to batch metadata
    • action: The action performed on the item (ADD in the example above)
    • batch: The ID of the batch that processed the document
    • ts: The time when this entry was added
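As a rough illustration of how the audit fields relate to each other, the sketch below summarizes the sample entry above. It assumes that crawlStart and ts are milliseconds since the UNIX epoch (inferred from the 13-digit sample values, not stated explicitly) and writes the NumberLong values as plain numbers so that it runs outside the mongo shell. The helper name describeAuditEntry is hypothetical.

```javascript
// Plain-object copy of the sample audit entry; NumberLong values are written
// as ordinary numbers so the snippet runs in Node.js without a MongoDB driver.
const auditEntry = {
  id: "C:\\dev-temp\\testdata\\A\\0\\0\\0\\3.txt",
  crawlStart: 1461687363561,
  url: "file://C:/dev-temp/testdata/A/0/0/0/3.txt",
  type: "job",
  action: "ADD",
  batch: null,
  ts: 1461687366086,
};

// Hypothetical helper: turns the raw fields into a readable summary.
// Assumption: crawlStart and ts are epoch milliseconds.
function describeAuditEntry(entry) {
  return {
    action: entry.action,
    crawlStartedAt: new Date(entry.crawlStart).toISOString(),
    loggedAt: new Date(entry.ts).toISOString(),
    msSinceCrawlStart: entry.ts - entry.crawlStart,
    isBatchEntry: entry.type === "batch",
  };
}

console.log(describeAuditEntry(auditEntry));
```

Because crawlStart doubles as the crawl identifier, entries that share the same crawlStart value belong to the same crawl.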
  • errors
    Holds all errors that happened during the crawls.
    This is an example of a document in this collection:

    {
        "_id" : ObjectId("571f940e8cd956261c112151"),
        "error" : {
            "@time" : NumberLong(1461687310975),
            "@crawlTime" : NumberLong(1461686819675),
            "@cs" : "File_System_Source",
            "@processor" : "File_System_Source-10.10.20.203:50505",
            "@type" : "S",
            "_$" : "Error starting crawl\ncom.searchtechnologies.aspire.services.AspireException: Bad 'exclude' regex pattern: C:\\dev-temp\\testdata\\A ..."
        }
    }

    • _id: Automatically generated unique ID
    • error
      • @time: The time when this error happened
      • @crawlTime: Identifier of the crawl that generated this error entry (the ID is the time the crawl started, in UNIX format)
      • @cs: The name of the content source that generated this error
      • @processor: The name of the server that generated this error
      • @type: The type of error: "S" for scanner error, "D" for document error, "B" for batch error, "F" for content source startup failure, or "U" for unknown
      • _$: The detailed error message
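The @type codes listed above can be decoded with a small lookup table. The following is a minimal sketch that works on a plain-object copy of the sample error document; the names ERROR_TYPE_LABELS and summarizeError are hypothetical, and falling back to "unknown" for unrecognized codes is an assumption rather than documented behavior.

```javascript
// Labels for the @type codes, taken from the field list above.
const ERROR_TYPE_LABELS = new Map([
  ["S", "scanner error"],
  ["D", "document error"],
  ["B", "batch error"],
  ["F", "content source startup failure"],
  ["U", "unknown"],
]);

// Plain-object copy of the sample error document (NumberLong values written
// as ordinary numbers; the truncated message is kept exactly as shown).
const errorDoc = {
  error: {
    "@time": 1461687310975,
    "@crawlTime": 1461686819675,
    "@cs": "File_System_Source",
    "@processor": "File_System_Source-10.10.20.203:50505",
    "@type": "S",
    "_$": "Error starting crawl\ncom.searchtechnologies.aspire.services.AspireException: Bad 'exclude' regex pattern: C:\\dev-temp\\testdata\\A ...",
  },
};

// Hypothetical helper: builds a one-line summary from an error document.
// Assumption: codes missing from the table are reported as "unknown".
function summarizeError(doc) {
  const err = doc.error;
  const label = ERROR_TYPE_LABELS.get(err["@type"]) || ERROR_TYPE_LABELS.get("U");
  const firstLine = err["_$"].split("\n")[0]; // drop the stack trace lines
  return `[${label}] ${err["@cs"]} on ${err["@processor"]}: ${firstLine}`;
}

console.log(summarizeError(errorDoc));
```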
  • hierarchy
    Holds the parent hierarchy for all the container items. This is used to generate the item hierarchy in the Populate & Fetch stage.
    This is an example of a document in this collection:

    {
        "_id" : "C:\\dev-temp\\testdata\\B\\5\\9\\3",
        "name" : "3",
        "ancestors" : {
            "_id" : "C:\\dev-temp\\testdata\\B\\5\\9",
            "name" : "9",
            "ancestors" : {
                "_id" : "C:\\dev-temp\\testdata\\B\\5",
                "name" : "5",
                "ancestors" : {
                    "_id" : "C:\\dev-temp\\testdata\\B",
                    "name" : "B",
                    "ancestors" : {
                        "_id" : "C:\\dev-temp\\testdata",
                        "name" : "testdata",
                        "ancestors" : null
                    }
                }
            }
        }
    }

    • _id: The ID of the container item
    • name: The name of the container item
    • ancestors: All the ancestors of this item
      • _id: ID of the parent container item
      • name: The name of the parent container item
      • ancestors: All the ancestors of the parent item
        • ...: All the grandparents and beyond
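Because each ancestors value nests the parent document, and the topmost container has ancestors set to null, the full chain can be recovered by following that field until it runs out. The sketch below does this for the sample document above; the function name ancestorNames is hypothetical, and this is only an illustration of the data shape, not how the Populate & Fetch stage is implemented.

```javascript
// Plain-object copy of the sample hierarchy document.
const hierarchyDoc = {
  _id: "C:\\dev-temp\\testdata\\B\\5\\9\\3",
  name: "3",
  ancestors: {
    _id: "C:\\dev-temp\\testdata\\B\\5\\9",
    name: "9",
    ancestors: {
      _id: "C:\\dev-temp\\testdata\\B\\5",
      name: "5",
      ancestors: {
        _id: "C:\\dev-temp\\testdata\\B",
        name: "B",
        ancestors: {
          _id: "C:\\dev-temp\\testdata",
          name: "testdata",
          ancestors: null,
        },
      },
    },
  },
};

// Hypothetical helper: returns the container names from the topmost ancestor
// down to the item itself by walking the nested "ancestors" chain.
function ancestorNames(item) {
  const names = [];
  for (let node = item; node !== null && node !== undefined; node = node.ancestors) {
    names.unshift(node.name); // each step moves one level up, so prepend
  }
  return names;
}

console.log(ancestorNames(hierarchyDoc).join("/"));
```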
  • processQueue
  • scanQueue
  • snapshots
  • statistics
  • status
  • usersAndGroups