Introduction

The Selenium connector crawls content from websites, using a web browser to retrieve the pages.

Environment and Access Requirements

Web driver

The Aspire Selenium connector requires a current version of the web browser it will use, along with that browser's web driver. Each web driver supports only a range of browser versions; if the installed browser is outside that range, the connector throws an exception when it tries to start the browser.

Before installing the Selenium connector, make sure that:

A supported web browser is installed on all the Aspire nodes.
A web driver that supports the current version of that web browser is available.
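
The version check described above can be sketched in a few lines of Python. This is only an illustration, not part of the connector: the function names and the version range used in the usage note are hypothetical, and the real supported range must be taken from the release notes of the web driver you install.

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted version number from a string such as
    'Google Chrome 120.0.6099.109' and return it as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def browser_in_driver_range(browser_version: str, min_major: int, max_major: int) -> bool:
    """Return True if the browser's major version lies inside the range
    (inclusive) that the web driver is assumed to support."""
    major = parse_version(browser_version)[0]
    return min_major <= major <= max_major
```

For example, with a driver assumed to support major versions 110 through 120, `browser_in_driver_range("Google Chrome 99.0.1", 110, 120)` reports that the browser is too old, which is the situation in which the connector would fail to start the browser.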

Environment Requirements

The Selenium connector runs on either Windows or Linux. The web drivers include a version appropriate for each operating system.

Framework and Connector Features

Framework Features

Name	Supported
Content Crawling	Yes
Identity Crawling	No
Snapshot-based Incrementals	Yes
Non-snapshot-based Incrementals	No
Document Hierarchy	Yes

Connector Features

Some of the features of the Selenium connector include:

Use a real browser to retrieve the pages.
Avoid compatibility issues with web frameworks such as Angular, React, and Node.

Content Crawled

The content retrieved by the connector is entirely defined using SQL statements, so you can select all or subsets of columns from one or more tables. Initially, the data is inserted into Aspire using the returned column names, but this may be changed by further Aspire processing.

The RDB via Snapshots connector is able to crawl the following objects:

Name	Type	Relevant Metadata	Content Fetch & Extraction	Description
database row		table fields	NA	Fields requested by SQL

Limitations

Due to Selenium's own limitations, the connector doesn't support:

Basic authentication
NTLM authentication
Custom HTTP headers

Due to API limitations, the Selenium connector is only compatible with browsers that have a web driver implementation, for example:

Google Chrome
Mozilla Firefox

Other features are also dependent on browser support, such as Headless Mode.
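
As a rough illustration of why headless mode depends on the browser, the sketch below uses the Selenium Python bindings (not the Aspire connector's own configuration) to start Chrome or Firefox with the headless command-line switch each one expects. The helper names `browser_arguments` and `start_browser` are hypothetical, and the sketch assumes the `selenium` package, the browser, and a matching web driver are installed.

```python
# Headless command-line switch for each browser covered by this sketch.
HEADLESS_FLAGS = {"chrome": "--headless", "firefox": "-headless"}


def browser_arguments(browser: str, headless: bool) -> list[str]:
    """Return the command-line arguments to pass to the given browser."""
    name = browser.lower()
    if name not in HEADLESS_FLAGS:
        raise ValueError(f"no headless flag known for browser {browser!r}")
    return [HEADLESS_FLAGS[name]] if headless else []


def start_browser(browser: str, headless: bool = True):
    """Start a browser session through its web driver and return the driver."""
    arguments = browser_arguments(browser, headless)  # rejects unknown browsers
    from selenium import webdriver  # third-party package, imported lazily

    if browser.lower() == "chrome":
        options = webdriver.ChromeOptions()
    else:
        options = webdriver.FirefoxOptions()
    for argument in arguments:
        options.add_argument(argument)

    if browser.lower() == "chrome":
        return webdriver.Chrome(options=options)
    return webdriver.Firefox(options=options)
```

A caller would typically run `driver = start_browser("firefox")`, load a page with `driver.get(url)`, read `driver.page_source`, and finish with `driver.quit()`. A browser without a web driver implementation, or without a headless switch, cannot be used this way.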

Contact Us: [email protected]
