Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
The Aspire Heritrix connector uses the Heritrix 3.1.1 crawl engine to crawl seed URLs according to a Heritrix job configuration file (a Spring application context CXML file). Instead of saving the crawled content to a WARC file, as Heritrix does by default, Aspire implements its own processor that forwards all content extracted by the crawl engine to an Aspire pipeline or workflow.
The Heritrix connector crawls web pages, starting from a list of seed URLs that you supply and applying any other criteria you define for including or excluding pages, and forwards all content extracted by the crawl engine to an Aspire pipeline or workflow.
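The crawl-and-forward behavior described above can be illustrated with a small conceptual sketch. This is not Heritrix or Aspire code: the names `crawl`, `in_scope`, `fetch`, `forward`, and the in-memory `SITE` are all hypothetical, and the regular-expression include/exclude rules are only an assumed stand-in for the scoping criteria a real job configuration would define. It shows the general idea of starting from seed URLs, skipping out-of-scope pages, and handing each fetched page to a callback instead of writing it to an archive file.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin

# Naive link extractor, good enough for the illustration only.
LINK_RE = re.compile(r'href="([^"]+)"')


def in_scope(url, include, exclude):
    """Crawl a URL only if it matches an include pattern and no exclude pattern."""
    included = any(re.search(p, url) for p in include)
    excluded = any(re.search(p, url) for p in exclude)
    return included and not excluded


def crawl(seeds, fetch, forward, include, exclude):
    """Breadth-first crawl from the seeds.

    Each fetched page is passed to forward() (a stand-in for the
    downstream pipeline) rather than being written to an archive file.
    """
    queue = deque(seeds)
    seen = set(seeds)
    while queue:
        url = queue.popleft()
        if not in_scope(url, include, exclude):
            continue
        content = fetch(url)
        if content is None:  # nothing retrievable at this URL
            continue
        forward({"url": url, "content": content})
        for href in LINK_RE.findall(content):
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)


# Hypothetical in-memory "web site" so the sketch runs without network access.
SITE = {
    "http://example.com/": '<a href="/docs/a.html">A</a> <a href="/private/x.html">X</a>',
    "http://example.com/docs/a.html": '<a href="/">home</a> <a href="http://other.org/">ext</a>',
    "http://example.com/private/x.html": "secret",
}


def run_demo():
    forwarded = []
    crawl(
        seeds=["http://example.com/"],
        fetch=SITE.get,
        forward=forwarded.append,
        include=[r"^http://example\.com/"],
        exclude=[r"/private/"],
    )
    return [doc["url"] for doc in forwarded]
```

Calling `run_demo()` forwards only the seed page and the page under `/docs/`; the page under `/private/` and the external link are discovered but never fetched because they fall outside the configured scope.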
Some of the features of the Heritrix connector include:
The Aspire Heritrix connector uses a custom version of the Heritrix 3.1.1 crawl engine, which includes the following custom features:
For more information on configuring custom features, see Using a Custom Heritrix Configuration File.
Access information related to the Heritrix connector.