The Aspider web crawler will crawl content from a website.

Introduction

The Aspider web crawler will crawl content from any given website.

It is based on the Heritrix HTML Parser and was implemented to meet growing customer demand for additional features and customization.

Framework and Connector Features

Name	Supported
Content Crawling	yes
Identity Crawling	no
Snapshot-based Incrementals	yes
Non-snapshot-based Incrementals	no
Document Hierarchy	no
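
The idea behind snapshot-based incrementals can be illustrated with a short sketch. It is not Aspider's actual implementation: the function names (signature, diff_snapshots) and the snapshot layout (a mapping from URL to content signature) are assumptions made for illustration. Each crawl records a signature per page, and the next crawl compares its own signatures with the stored snapshot to decide which pages to add, update or delete.

```python
import hashlib


def signature(content: bytes) -> str:
    """Return a stable fingerprint for a page body (hypothetical helper)."""
    return hashlib.sha256(content).hexdigest()


def diff_snapshots(previous: dict, current: dict) -> dict:
    """Compare two snapshots (URL -> signature) and classify each URL.

    A URL is an "add" if it is new, an "update" if its signature changed,
    and a "delete" if it was in the previous snapshot but is now missing.
    """
    added = sorted(url for url in current if url not in previous)
    deleted = sorted(url for url in previous if url not in current)
    updated = sorted(
        url for url in current
        if url in previous and current[url] != previous[url]
    )
    return {"add": added, "update": updated, "delete": deleted}
```

Unchanged pages appear in none of the three lists, which is what lets an incremental crawl skip them.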

The Aspider web crawler has the following features:

The Aspider web crawler is able to crawl the following objects:

Name	Type	Relevant Metadata	Content Fetch and Extraction	Description
Web Page	document	HTML Meta tags, HTTP headers	Yes	Pages discovered on the target website
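
As a rough illustration of the metadata listed for web pages, the sketch below collects HTML meta tags and HTTP headers into one dictionary using only the Python standard library. The class and function names (MetaTagCollector, collect_metadata) and the "meta." and "header." key prefixes are assumptions for this example and do not reflect the field names Aspider produces.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Some pages use "property" (e.g. Open Graph) instead of "name".
        name = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def collect_metadata(html: str, headers: dict) -> dict:
    """Merge HTTP headers and HTML meta tags into a single metadata dict."""
    parser = MetaTagCollector()
    parser.feed(html)
    parser.close()
    metadata = {f"header.{key.lower()}": value for key, value in headers.items()}
    metadata.update({f"meta.{key}": value for key, value in parser.meta.items()})
    return metadata
```

Meta tags without a name or property attribute, such as a charset declaration, are skipped in this sketch.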

The Aspider web crawler has the following limitations:

There is no support for websites with dynamic content. For such websites, please refer to the Selenium web crawler.
There is no support for orphan pages. Any page that falls out of scope because the links pointing to it no longer exist will be removed once the conditions are met.
Selenium-based authentication requires the corresponding web browser to be installed on the server where the crawl will be executed, as well as the web driver that matches the browser version.
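
The orphan page limitation can be pictured with a small sketch. It assumes the crawler discovers pages only by following links from a seed URL; the function names (reachable_pages, orphaned_pages) and the link map are hypothetical and are not part of Aspider. A previously indexed page that can no longer be reached from the seed counts as orphaned and becomes a candidate for removal.

```python
from collections import deque


def reachable_pages(seed: str, links: dict) -> set:
    """Breadth-first traversal of the link graph starting at the seed URL."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def orphaned_pages(seed: str, links: dict, previously_indexed: set) -> set:
    """Return indexed pages that link-following from the seed no longer finds."""
    return set(previously_indexed) - reachable_pages(seed, links)
```

In this model, a page such as "/old" that still exists on the server but is no longer linked from any reachable page is reported as orphaned.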