Introduction


The Selenium connector crawls content from websites, using a real web browser to retrieve the pages.

Environment and Access Requirements


Web driver

The Aspire Selenium connector requires the latest version of the web browser to be used, together with its corresponding web driver. Each web driver supports only a range of browser versions; if the installed browser falls outside that range, the connector throws an exception when it tries to start the browser.

Before installing the Selenium connector, make sure that:

  • A supported web browser is installed on all the Aspire nodes.
  • A web driver that supports the installed version of that web browser is available on all the Aspire nodes.
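
As a rough pre-installation check, the sketch below compares the major version reported by a browser with the one reported by its web driver. It assumes a driver that has to match the browser's major version, as ChromeDriver does; other drivers publish their own compatibility ranges. The function names and the sample version strings are illustrative and are not part of the connector.

```python
import re


def major_version(version_output: str) -> int:
    """Return the major number of the first dotted version found in the text."""
    match = re.search(r"(\d+)(?:\.\d+)+", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return int(match.group(1))


def versions_match(browser_output: str, driver_output: str) -> bool:
    """True when browser and driver report the same major version (assumed rule)."""
    return major_version(browser_output) == major_version(driver_output)


# Illustrative strings in the shape printed by `--version` on the browser and driver.
browser = "Google Chrome 120.0.6099.109"
driver = "ChromeDriver 120.0.6099.109 (illustrative build info)"
print(versions_match(browser, driver))
```

Running the same comparison on every Aspire node before installation would surface a mismatch earlier than the exception raised at crawl time.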

Environment Requirements

The Selenium connector runs on either Windows or Linux. The web drivers are available in a version appropriate for each operating system.
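
The following sketch shows one way to pick the driver file name for the current operating system. It assumes the usual naming of driver binaries (an `.exe` suffix on Windows, no suffix on Linux); `driver_executable` is a hypothetical helper, not a connector API.

```python
import platform
from typing import Optional


def driver_executable(base_name: str, system: Optional[str] = None) -> str:
    """Return the driver file name for the given (or current) operating system."""
    system = system or platform.system()
    if system == "Windows":
        return base_name + ".exe"
    if system == "Linux":
        return base_name
    # Only Windows and Linux are listed as supported in this section.
    raise ValueError(f"unsupported operating system: {system}")


print(driver_executable("chromedriver", "Windows"))
print(driver_executable("geckodriver", "Linux"))
```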

Framework and Connector Features


Framework Features

Name | Supported
Content Crawling | Yes
Identity Crawling | No
Snapshot-based Incrementals | Yes
Non-snapshot-based Incrementals | No
Document Hierarchy | Yes

Connector Features

Some features of the Selenium connector include:

  • Uses a real browser to retrieve the pages.
  • Avoids compatibility issues with web frameworks such as Angular, React, and Node.

Content Crawled


The Selenium connector retrieves several types of documents, such as: 

  • Web Pages.
  • Sitemaps.
  • Binary documents (PDF, Word, images).

Name | Type | Relevant Metadata | Content Fetch and Extraction | Description
Web Page | document | HTML Meta tags, HTTP headers | Yes | Pages discovered on the target website

Limitations


Due to Selenium's own limitations, the connector doesn't support:

  • Basic authentication
  • NTLM authentication
  • Custom HTTP headers

Due to API limitations, the Selenium connector is only compatible with browsers that have a web driver implementation, for example:

  • Google Chrome
  • Mozilla Firefox

Other features, such as headless mode, also depend on browser support.
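
To illustrate how a browser-dependent feature might be handled, the sketch below maps each browser to its own headless command-line switch (`--headless` for Chrome, `-headless` for Firefox) and rejects browsers without a known switch. How the connector passes these options internally is not described here; `browser_arguments` and the lookup table are hypothetical.

```python
from typing import Dict, List

# Headless switches accepted by the browsers themselves.
HEADLESS_FLAGS: Dict[str, str] = {
    "chrome": "--headless",
    "firefox": "-headless",
}


def browser_arguments(browser: str, headless: bool) -> List[str]:
    """Build the extra command-line arguments for the requested browser."""
    if not headless:
        return []
    key = browser.lower()
    if key not in HEADLESS_FLAGS:
        raise ValueError(f"headless mode is not known to be supported by {browser}")
    return [HEADLESS_FLAGS[key]]


print(browser_arguments("Chrome", headless=True))
print(browser_arguments("Firefox", headless=False))
```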
