Heritrix Introduction (Aspire 2)

Features

Heritrix is the Internet Archive's open-source, archival-quality web crawler project (https://webarchive.jira.com/wiki/display/Heritrix/Heritrix).

The Heritrix Connector will crawl web pages (based on a list of starting URLs that you supply, along with other criteria for including or excluding pages) and forward all content extracted by the crawl engine to an Aspire pipeline.
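As a rough illustration of how seed URLs and include/exclude criteria can combine to decide which pages are crawled, here is a minimal Python sketch. It is not the Heritrix or Aspire API: the function `in_scope`, the regular-expression patterns, and the example.com URLs are all hypothetical, and the rule order (exclusions first, then inclusions, then seed hosts) is an assumption made for the example.

```python
import re
from urllib.parse import urlparse


def in_scope(url, seeds, include_patterns, exclude_patterns):
    """Return True if a discovered URL should be crawled (hypothetical rule order)."""
    # Assumption: an exclude pattern always wins.
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    # Assumption: an include pattern accepts a URL even when it is off the seed hosts.
    if any(re.search(p, url) for p in include_patterns):
        return True
    # Otherwise accept only URLs on the same host as one of the seeds.
    seed_hosts = {urlparse(s).netloc for s in seeds}
    return urlparse(url).netloc in seed_hosts


# Hypothetical configuration, for illustration only.
seeds = ["http://example.com/"]
include = [r"^http://docs\.example\.com/"]
exclude = [r"\.(jpg|png|gif)$", r"/private/"]

for candidate in [
    "http://example.com/about.html",
    "http://example.com/logo.png",
    "http://docs.example.com/guide.html",
    "http://other.example.org/",
]:
    print(candidate, in_scope(candidate, seeds, include, exclude))
```

In the connector itself these criteria are supplied through its configuration rather than written as code; the sketch only shows the kind of decision made for each discovered link.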

Features of the Heritrix Connector include:

Search engine independent
Runs from any machine with access to the given web pages

The Aspire Heritrix Connector uses a customized version of the Heritrix 3.1.1 engine, which adds the following features:

NTLM Authentication
XSL Transformation for link extraction
Bug fix for whitespace in URLs during JavaScript link extraction

For more information on configuring these custom features, see Using a Custom Heritrix Configuration File.

Future Development Plan

Anything else we should add? Please let us know.
