The Selenium connector crawls content from websites, using a web browser to retrieve the pages.

Introduction


Some of the features of the Selenium connector include:

  • Use a real browser to retrieve the pages.
  • Avoid compatibility issues with web frameworks such as Angular, React, and Node.

Environment and Access Requirements


Web driver

The Aspire Selenium connector requires the latest version of the web browser to be used, together with its corresponding web driver. Each web driver supports only a range of browser versions; if the installed browser is outside that range, the connector throws an exception when it tries to start the browser.

Before installing the Selenium connector, make sure that:

  • A supported web browser is installed on all Aspire nodes.
  • A web driver that supports the installed version of that browser is available.
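The version check described above can be sketched before installation. This is a minimal illustration, not part of Aspire or Selenium: the function names are hypothetical, and the supported range (`min_major`, `max_major`) is an assumption that you would take from the release notes of the web driver you install.

```python
import re


def major_version(version_output):
    """Extract the major version from text such as 'Mozilla Firefox 115.0.2'."""
    match = re.search(r"(\d+)(?:\.\d+)+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def browser_in_supported_range(version_output, min_major, max_major):
    """Return True when the browser's major version is inside the driver's range.

    min_major and max_major are assumptions supplied by the caller; they are
    not read from the web driver itself.
    """
    return min_major <= major_version(version_output) <= max_major
```

The version text would typically come from running the browser with its version flag on each node (for example `firefox --version` on Linux); the exact command depends on the browser and operating system.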

Environment Requirements

The Selenium connector runs on either Windows or Linux. The web drivers are available in a version appropriate for each operating system.

Framework and Connector Features


Framework Features

Name                             Supported
Content Crawling                 Yes
Identity Crawling                No
Snapshot-based Incrementals      Yes
Non-snapshot-based Incrementals  No
Document Hierarchy               Yes

Connector Features

Retrieve Data per Batch

This mode takes the SQL statements from the seed configuration (<discoverySQL> and <extractSQL>) and executes them against the configured database. Each resulting row is formed into a result object using the column names as document elements, and this document is submitted to a pipeline manager using the event configured for inserts. As the document is created, the value of the column identified in the seed configuration (<idColumn>) is recorded as the primary key of the document. The value "insert" is placed in the action attribute of the document.

Column names from the extractSQL query are added to the result object inside the "connectorSpecific" field. If the column names are standard Aspire fields, they are added to the root level.
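The mapping from rows to result objects can be illustrated with a small sketch. It is not Aspire code: it uses an in-memory SQLite database, a plain dictionary in place of the real result object, and a hypothetical `STANDARD_FIELDS` set standing in for the standard Aspire fields, whose actual list is not given here.

```python
import sqlite3

# Assumption: a made-up subset of "standard Aspire fields" for illustration only.
STANDARD_FIELDS = {"title", "content"}


def rows_to_documents(connection, extract_sql, id_column):
    """Turn each row returned by extract_sql into a dictionary-based document."""
    cursor = connection.execute(extract_sql)
    columns = [description[0] for description in cursor.description]
    documents = []
    for row in cursor:
        values = dict(zip(columns, row))
        # The id column value becomes the primary key; the action is "insert".
        document = {
            "id": values[id_column],
            "action": "insert",
            "connectorSpecific": {},
        }
        for name, value in values.items():
            if name in STANDARD_FIELDS:
                document[name] = value  # standard field: root level
            else:
                document["connectorSpecific"][name] = value
        documents.append(document)
    return documents


def demo():
    """Build a tiny sample table and convert its rows."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE pages (page_id INTEGER, title TEXT, author TEXT)")
    connection.execute("INSERT INTO pages VALUES (1, 'Home', 'ana')")
    return rows_to_documents(
        connection, "SELECT page_id, title, author FROM pages", "page_id"
    )
```

In this sketch `title` lands at the root of the document because it is in the assumed standard set, while `page_id` and `author` are placed under "connectorSpecific".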

Any change detected in the results of the query set in the discoverySQL field is compared with the snapshot file, and the change is reported if required.

Retrieve Everything

This mode takes the SQL from the seed configuration (<fullSQL> or configuration) and executes it against the configured database. Each resulting row is formed into a result object using the column names as document elements, and this document is submitted to a pipeline manager using the event configured for inserts. As the document is created, the value of the column identified in the seed configuration (<idColumn>) is recorded as the primary key of the document. The value "insert" is placed in the action attribute of the document.

Column names from SQL queries are added to the result object inside the "connectorSpecific" field. If the column names are standard Aspire fields, they are added to the root level.

Any change detected in the results of the query set in the fullSQL field is compared with the snapshot file, and the change is reported if required.
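Snapshot comparison of this kind can be sketched as follows. This is an assumption-laden illustration rather than the connector's actual algorithm or snapshot file format: each row is reduced to a hash keyed by its id, and two snapshots are compared to decide what to report. The action names "update" and "delete" are assumed here; only "insert" appears in the text above.

```python
import hashlib
import json


def row_signature(row):
    """Hash a row (a dict of column name to value) in a stable key order."""
    encoded = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def build_snapshot(rows, id_column):
    """Map each row id to the signature of that row's content."""
    return {str(row[id_column]): row_signature(row) for row in rows}


def diff_snapshots(previous, current):
    """Return {id: action} for rows that differ between two snapshots."""
    changes = {}
    for key, signature in current.items():
        if key not in previous:
            changes[key] = "insert"
        elif previous[key] != signature:
            changes[key] = "update"
    for key in previous:
        if key not in current:
            changes[key] = "delete"
    return changes
```

Rows whose signature is unchanged between crawls produce no entry, which is what keeps an incremental crawl from resubmitting unchanged documents.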

Content Crawled


The content retrieved by the connector is entirely defined using SQL statements, so you can select all or subsets of columns from one or more tables. Initially, the data is inserted into Aspire using the returned column names, but this may be changed by further Aspire processing.

The RDB via Snapshots connector is able to crawl the following objects:

Name          Type   Relevant Metadata  Content Fetch & Extraction  Description
database row  table  fields             NA                          Fields requested by SQL

Limitations


Due to Selenium's own limitations, the connector doesn't support:

  • Basic authentication
  • NTLM authentication
  • Custom HTTP headers

Due to API limitations, the Selenium connector is only compatible with browsers that have a WebDriver implementation, for example:

  • Google Chrome
  • Mozilla Firefox

Other features, such as headless mode, also depend on browser support.
