job:

Processing unit representing a single document or entity within a crawl. May hold metadata.

seed:

[ Config Entity ] Crawl starting point, generally a URL relative to a particular server.

connection:

Config Entity ] Details regarding how to connect to a given repository server or service.

credential:

Config Entity ] Authentication details to access a given repository server or service, generally Username/password or AccessKey/SecretKey pairs.

connector instance:

Config Entity ] Base config entity determining type of source repository, and common crawling behavior

workflow:

Config Entity ] Set of rules (grouped by workflow event) to be executed sequentially for every given job being processed. See Workflows

workflow event:

A virtual set of rules to be executed sequentially that lives inside a workflow object.

throttle policy:

Config Entity ] Set of properties determining how often should jobs be processed, can be assigned in credentials, connections or seeds. See Throttling Policies

routing policy:

Config Entity ] Property that determines where a job should be processed (in which worker node). See Routing Policies

schedule:

Config Entity ] Set of properties that determines how frequently should a crawl start, can be associated with a set of seeds. See Schedules

sequence schedule:

Config Entity ] Special type of schedule that determines the sequence of crawl starts (after which crawls should other crawls start). See

identity crawl:

Crawl type ] Crawl that connects to Identity directories (Ldap, Azure Directory) to cache and process the identities as jobs

full crawl:

Crawl type ] Crawl that retrieves all documents starting at a given point (seed)

incremental crawl:

Crawl type ] Crawl that retrieves only the changed documents relative to previous crawls

  • No labels