Incremental Crawling

By default, the Connector Framework lets connectors handle incremental crawling using the snapshot NoSQL database. The snapshot contains one entry for each item discovered by the last crawl, holding an id, a subset of the metadata, a signature, and a crawl id. On the next incremental crawl, the action for each item is determined using the following criteria:

Add: there is no entry in the snapshot with the given item id.
Update: there is an entry in the snapshot with the given id, but with a different signature.
Delete: the crawl id of each item visited during an incremental crawl is updated to the new crawl id. At the end of each incremental crawl, all items that did not get the new crawl id are sent to the delete process.
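
The criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Connector Framework's actual code: the snapshot is modeled as an in-memory dict instead of the NoSQL database, and the names SnapshotEntry and incremental_crawl are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnapshotEntry:
    # Hypothetical shape of a snapshot entry; the metadata subset is omitted.
    signature: str
    crawl_id: int


def incremental_crawl(snapshot, discovered, crawl_id):
    """Classify discovered items against the snapshot and update it in place.

    snapshot:   dict of item id -> SnapshotEntry (stand-in for the NoSQL store)
    discovered: dict of item id -> signature found by the current crawl
    crawl_id:   id of the current crawl
    """
    actions = {"add": [], "update": [], "delete": []}

    for item_id, signature in discovered.items():
        entry = snapshot.get(item_id)
        if entry is None:
            # Add: no entry in the snapshot with this item id.
            actions["add"].append(item_id)
            snapshot[item_id] = SnapshotEntry(signature, crawl_id)
            continue
        if entry.signature != signature:
            # Update: same id, different signature.
            actions["update"].append(item_id)
            entry.signature = signature
        # Every visited item is stamped with the new crawl id.
        entry.crawl_id = crawl_id

    # Delete: entries that did not receive the new crawl id were not visited.
    stale = [i for i, e in snapshot.items() if e.crawl_id != crawl_id]
    for item_id in stale:
        actions["delete"].append(item_id)
        del snapshot[item_id]

    return actions
```

In this sketch an unchanged item produces no action but still receives the new crawl id, which is what keeps it out of the delete pass at the end of the crawl.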

Hierarchy

The Connector Framework can create the hierarchical structure of a seed based on how items are discovered. This feature depends on the specific type of connector in use. To find out whether a connector supports hierarchy generation, check its documentation.

Contact Us: [email protected]
