Introduction


The REST connector can retrieve data from any JSON-based REST endpoint. It is configured to query a base endpoint, extract JSON elements from the response, and send each element as an individual document. Each extracted entity can be enriched with additional metadata from the same endpoint, and can also be scanned recursively to discover further content.
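As a rough illustration of the extraction step, the sketch below parses a response body and pulls out the array of elements that would each become a document. The response shape, the key path, and the extract_entities helper are hypothetical; the connector's actual configuration syntax is not shown here.

```python
import json


def extract_entities(response_text, path):
    """Walk `path` (a list of keys) into the parsed JSON and return the list found there."""
    node = json.loads(response_text)
    for key in path:
        node = node[key]
    return list(node)


# Hypothetical response body returned by the base endpoint.
response = '{"data": {"items": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}}'

# Each element of the extracted array is treated as one document.
documents = extract_entities(response, ["data", "items"])
```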

The connector configuration is based on crawl rules. Each rule is evaluated for every entity discovered; if an entity matches a crawl rule, the connector executes the list of requests configured for that rule. There are three types of requests: scan (to discover new entities), metadata extraction (to enrich the current entity with more data), and binary fetch (to fetch documents associated with the current entity).

  • Scan queries support pagination, and pages may be requested concurrently when possible, depending on how the pages can be detected.
  • Metadata extraction can be configured to cache results in memory for better performance.
  • Binary fetching allows further processing, for example by Apache Tika, to extract the contents of unstructured documents (PDFs, MS Office files, HTML pages, etc.).
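The rule evaluation described above can be sketched as follows. This is a minimal model under stated assumptions, not the connector's implementation: the CrawlRule class, the plan_requests function, the entity fields, and the URL paths are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# The three request types named in the text.
SCAN, METADATA, BINARY = "scan", "metadata", "binary"


@dataclass
class CrawlRule:
    """Hypothetical stand-in for a crawl rule: a match predicate plus its requests."""
    name: str
    matches: Callable[[dict], bool]
    requests: List[Tuple[str, str]] = field(default_factory=list)


def plan_requests(entity: dict, rules: List[CrawlRule]) -> List[Tuple[str, str]]:
    """Evaluate every rule against the entity and collect the requests of matching rules."""
    planned = []
    for rule in rules:
        if rule.matches(entity):
            planned.extend(rule.requests)
    return planned


# Hypothetical rules: folders are scanned for children, files are enriched and fetched.
rules = [
    CrawlRule("folders", lambda e: e.get("type") == "folder",
              [(SCAN, "/folders/${id}/children")]),
    CrawlRule("files", lambda e: e.get("type") == "file",
              [(METADATA, "/files/${id}"), (BINARY, "/files/${id}/content")]),
]

file_plan = plan_requests({"id": "7", "type": "file"}, rules)
folder_plan = plan_requests({"id": "3", "type": "folder"}, rules)
```

In this model an entity that matches no rule simply yields no requests.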

Each request may be executed with entity-specific metadata. For example, if a metadata extraction request needs to execute GET /entities/${entityId}, then ${entityId} can be configured to be replaced with a known field from the source entity, such as its ID.
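The placeholder substitution can be mimicked with Python's string.Template, which happens to use the same ${name} syntax. This is only an analogy for how the replacement behaves; the build_request_path helper and the URL-encoding step are assumptions, not documented connector behavior.

```python
from string import Template
from urllib.parse import quote


def build_request_path(template, entity):
    """Replace ${field} placeholders with URL-encoded values from the entity."""
    encoded = {key: quote(str(value), safe="") for key, value in entity.items()}
    return Template(template).substitute(encoded)


# The source entity supplies the value for ${entityId}.
path = build_request_path("/entities/${entityId}", {"entityId": "42"})
```

Template.substitute raises KeyError when a placeholder has no matching field, which makes a misconfigured template fail loudly rather than produce a malformed request.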

Framework and Connector Features


Framework Features

Name                             Supported
Content Crawling                 yes
Identity Crawling                no
Snapshot-based Incrementals      yes
Non-snapshot-based Incrementals  no
Document Hierarchy               yes

Limitations


The connector cannot paginate when each page can only be reached by following a link returned with the previous page. This feature may be added in the future.
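To illustrate why the pagination style matters: when the total number of results is known, every page request can be computed up front and issued independently, whereas a link-following scheme reveals each page only after the previous one has been fetched. The sketch below shows the first case only. The offset and limit parameter names and the page_requests helper are assumptions, not the connector's configuration.

```python
import math


def page_requests(path, total, page_size):
    """Compute every page URL up front from a known total, so pages are independent."""
    pages = math.ceil(total / page_size)
    return [f"{path}?offset={i * page_size}&limit={page_size}" for i in range(pages)]


# 25 results at 10 per page gives three independent requests.
urls = page_requests("/items", 25, 10)
```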

The connector cannot make requests to sites outside the base REST endpoint. This may be added in the future.

The connector does not extract ACLs without explicit configuration, because there is no single standard for how REST endpoints present permissions data.

