Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

The Aspire Heritrix connector uses the Heritrix 3.1.1 crawl engine to crawl seed URLs based on a Heritrix job configuration file (a Spring application context CXML file). Instead of saving the crawled content to a WARC file as Heritrix normally does, Aspire implements its own processor that forwards all content extracted by the crawl engine to an Aspire workflow.
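The idea of replacing the archive writer with a forwarding processor can be pictured with a minimal Java sketch. It is not the connector's actual code: the names CrawledPage, WorkflowSink, and ForwardingProcessor are hypothetical stand-ins invented for this illustration, and the real Heritrix and Aspire classes are not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ForwardingProcessorSketch {

    /** Hypothetical stand-in for one fetched URL and its content. */
    static final class CrawledPage {
        final String url;
        final byte[] content;

        CrawledPage(String url, byte[] content) {
            this.url = url;
            this.content = content;
        }
    }

    /** Hypothetical stand-in for the entry point of an Aspire workflow. */
    interface WorkflowSink {
        void submit(CrawledPage page);
    }

    /** Hypothetical processor: hands each fetched page to a sink instead of writing an archive file. */
    static final class ForwardingProcessor {
        private final WorkflowSink sink;

        ForwardingProcessor(WorkflowSink sink) {
            this.sink = sink;
        }

        /** Returns true if the page was forwarded, false if it was skipped as empty. */
        boolean process(CrawledPage page) {
            if (page.content == null || page.content.length == 0) {
                return false; // nothing to forward
            }
            sink.submit(page);
            return true;
        }
    }

    public static void main(String[] args) {
        // Collect forwarded pages in a list to stand in for a workflow.
        List<CrawledPage> received = new ArrayList<>();
        ForwardingProcessor processor = new ForwardingProcessor(received::add);

        processor.process(new CrawledPage("http://example.com/",
                "<html></html>".getBytes(StandardCharsets.UTF_8)));
        processor.process(new CrawledPage("http://example.com/empty", new byte[0]));

        System.out.println(received.size() + " page(s) forwarded");
    }
}
```

In this sketch the sink is just a list, so the forwarding step can be checked without any crawl engine or workflow running.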

Features

The Heritrix connector crawls web pages, starting from a list of seed URLs that you supply and applying any other criteria you define for including or excluding pages, and forwards all content extracted by the crawl engine to an Aspire workflow.
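As a rough illustration of how seed URLs and exclusion criteria together decide which pages are crawled, the following Java sketch accepts a URL only if it falls under a seed prefix and matches no exclude pattern. This is a simplified assumption for illustration: in Heritrix the scope is defined in the job configuration file, and the class CrawlScopeSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CrawlScopeSketch {
    private final List<String> seedPrefixes;
    private final List<Pattern> excludePatterns = new ArrayList<>();

    /** Builds a scope from seed URL prefixes and regular expressions for pages to exclude. */
    CrawlScopeSketch(List<String> seedPrefixes, List<String> excludeRegexes) {
        this.seedPrefixes = new ArrayList<>(seedPrefixes);
        for (String regex : excludeRegexes) {
            excludePatterns.add(Pattern.compile(regex));
        }
    }

    /** Returns true if the URL is under a seed prefix and matches no exclude pattern. */
    boolean accepts(String url) {
        boolean underSeed = false;
        for (String prefix : seedPrefixes) {
            if (url.startsWith(prefix)) {
                underSeed = true;
                break;
            }
        }
        if (!underSeed) {
            return false;
        }
        for (Pattern exclude : excludePatterns) {
            if (exclude.matcher(url).find()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CrawlScopeSketch scope = new CrawlScopeSketch(
                Arrays.asList("http://example.com/"),
                Arrays.asList("\\.pdf$", "/private/"));

        for (String url : Arrays.asList(
                "http://example.com/docs/index.html",
                "http://example.com/docs/manual.pdf",
                "http://example.com/private/notes.html",
                "http://other.example.org/")) {
            System.out.println(url + " -> " + scope.accepts(url));
        }
    }
}
```

Prefix matching on seeds is only one possible inclusion rule; the actual criteria depend on what the job configuration specifies.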

Some of the features of the Heritrix connector include:

Search engine independence
Runs from any machine with access to specified web pages
HTTP authentication
Cookie based authentication

The Aspire Heritrix connector uses a custom version of the Heritrix 3.1.1 crawl engine, which adds the following:

NTLM Authentication
XSL Transformation for link extraction
A fix for a bug in handling whitespace in URLs during JavaScript link extraction

For more information on configuring these custom features, see Using a Custom Heritrix Configuration File.

Access the following resources related to the Heritrix connector:

GitHub repository for this open source connector
Heritrix 3.0 version of the source
Heritrix license