

Incremental Crawling 


By default, the Connector Framework lets connectors handle incremental crawling using the snapshot NoSQL database. The snapshot contains one entry for each item discovered by the last crawl, consisting of an id, a subset of the metadata, a signature and a crawl id. On the next incremental crawl, the action for each item is determined using the following criteria:

  • Add: there is no entry in the snapshot with the given item id.
  • Update: there is an entry in the snapshot with the given id, but with a different signature.
  • Delete: the crawl id of each item visited during an incremental crawl is updated to the new crawl id. At the end of each incremental crawl, all items that did not receive the new crawl id are sent to the delete process.
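
The criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Connector Framework's actual code: the names SnapshotEntry and incremental_crawl are hypothetical, the snapshot is modelled as an in-memory dictionary instead of a NoSQL database, and items whose signature is unchanged are assumed to produce no action.

```python
from dataclasses import dataclass


@dataclass
class SnapshotEntry:
    """Hypothetical snapshot record: only the fields needed for the decision."""
    signature: str
    crawl_id: int


def incremental_crawl(snapshot, discovered, new_crawl_id):
    """Classify items for one incremental crawl.

    snapshot:   dict of item id -> SnapshotEntry (updated in place)
    discovered: dict of item id -> signature seen during this crawl
    Returns a dict of item id -> "add" | "update" | "delete".
    """
    actions = {}
    for item_id, signature in discovered.items():
        entry = snapshot.get(item_id)
        if entry is None:
            # No entry with this id: the item is new.
            actions[item_id] = "add"
            snapshot[item_id] = SnapshotEntry(signature, new_crawl_id)
        else:
            if entry.signature != signature:
                # Same id, different signature: the item changed.
                actions[item_id] = "update"
                entry.signature = signature
            # Every visited item is stamped with the new crawl id.
            entry.crawl_id = new_crawl_id

    # Items that never received the new crawl id were not visited.
    stale = [i for i, e in snapshot.items() if e.crawl_id != new_crawl_id]
    for item_id in stale:
        actions[item_id] = "delete"
        del snapshot[item_id]
    return actions
```

For example, with a snapshot holding items a, b and c from crawl 1, a second crawl that sees a unchanged, b with a new signature and a new item d would report b as an update, d as an add and c as a delete.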


Hierarchy


The Connector Framework can create the hierarchical structure of a seed based on how items are discovered. This feature depends on the specific type of connector in use. To find out whether a connector supports hierarchy generation, check its documentation.


Fetch Content


Connectors can fetch the content of a document and set the content stream on the job so that it can be processed at a later stage. Each connector allows content fetching only for specific types of items; check the connector's documentation to see which ones are allowed.

Text Extraction

If text extraction is enabled, the content stream opened during the fetch stage is sent to Apache Tika to extract the text content of the document. Note that text extraction consumes the content stream, making it unavailable to other components.
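
The effect of consuming a stream can be shown with a generic Python example. It uses only the standard io module and does not involve Aspire or Tika; the first read simply stands in for the text extraction stage.

```python
import io

# A one-shot content stream, as a fetch stage might provide it.
stream = io.BytesIO(b"document body")

# Stand-in for text extraction: reading drains the stream.
extracted = stream.read()

# A later component reading the same stream gets nothing back.
leftover = stream.read()
```

After the first read, extracted holds the full content and leftover is empty, which is why a component that needs the raw bytes must receive the stream before it has been consumed.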

Non-Text Documents

Connectors that allow content fetching can be configured so that certain document types are not processed in the text extraction stage, leaving the data stream open so it can be processed in a Workflow stage. Non-text documents can be identified using either a comma-separated list of extensions or a file containing a list of regex patterns to match the documents (one regex pattern per line).
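
The two identification options can be sketched as follows. This is a hypothetical illustration, not the connectors' actual matching logic: the function names are invented, and it assumes that extensions are compared case-insensitively and that each regex is searched against the document name. Check the connector documentation for the real matching rules.

```python
import re


def parse_extensions(csv):
    """Turn a comma-separated list such as 'jpg, png,.zip' into a set."""
    return {e.strip().lower().lstrip(".") for e in csv.split(",") if e.strip()}


def parse_patterns(lines):
    """Compile one regex per non-blank line, as read from a patterns file."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def is_non_text(name, extensions=frozenset(), patterns=()):
    """Return True if the document should skip text extraction."""
    _, dot, ext = name.rpartition(".")
    if dot and ext.lower() in extensions:
        return True
    # Assumption: a pattern matches if it is found anywhere in the name.
    return any(p.search(name) for p in patterns)


extensions = parse_extensions("jpg, PNG ,.zip")
patterns = parse_patterns([r".*\.tar\.gz$", "", r"^backup-\d+"])
```

With these settings, photo.JPG and data.tar.gz would be treated as non-text documents and keep their data stream open, while report.docx would go through text extraction as usual.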


Identity Crawling and Group Expansion




Failed Documents Processing



