Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

The Aspire Heritrix connector uses the Heritrix 3.1.1 crawl engine to crawl seed URLs according to a Heritrix job configuration file (a Spring application context file with the .cxml extension). Instead of saving the crawled content to a WARC file as Heritrix normally does, Aspire implements its own processor that forwards all content extracted by the crawl engine to an Aspire workflow.

Features


The Heritrix connector crawls web pages, starting from a list of seed URLs that you supply and applying any additional criteria for including or excluding pages, and forwards all content extracted by the crawl engine to an Aspire workflow.
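
As a rough illustration of how seed URLs and include/exclude criteria are usually expressed in a stock Heritrix 3 job file, the excerpt below shows a seeds bean and a scope bean. It is a sketch only: the URL and the regular expression are placeholders, the hop limit is an arbitrary example value, and the file that the Aspire connector generates or expects may be organized differently.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Excerpt only: a complete Heritrix job file defines many more beans. -->
<beans xmlns="http://www.springframework.org/schema/beans">

  <!-- Starting URLs for the crawl (placeholder value). -->
  <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
    <property name="textSource">
      <bean class="org.archive.spring.ConfigString">
        <property name="value">
          <value>
            http://www.example.com/
          </value>
        </property>
      </bean>
    </property>
  </bean>

  <!-- Scope rules are applied in order; the last rule that makes a decision wins. -->
  <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="rules">
      <list>
        <!-- Start by rejecting everything. -->
        <bean class="org.archive.modules.deciderules.RejectDecideRule"/>
        <!-- Accept URLs that fall under the seed prefixes. -->
        <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule"/>
        <!-- Reject URLs too many link hops away from a seed (example limit). -->
        <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
          <property name="maxHops" value="5"/>
        </bean>
        <!-- Reject URLs matching an exclusion pattern (placeholder regex). -->
        <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
          <property name="decision" value="REJECT"/>
          <property name="regexList">
            <list>
              <value>.*\.pdf$</value>
            </list>
          </property>
        </bean>
      </list>
    </property>
  </bean>

</beans>
```

Read top to bottom, this scope accepts only URLs under the seed prefix, within five hops of a seed, that do not end in .pdf.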

Some of the features of the Heritrix connector include:

  • Search engine independence
  • Runs from any machine with access to specified web pages
  • HTTP authentication
  • Cookie-based authentication

The Aspire Heritrix connector uses a custom build of the Heritrix 3.1.1 crawl engine, which adds the following:

  • NTLM Authentication
  • XSL Transformation for link extraction
  • A fix for a bug in handling whitespace in URLs during JavaScript link extraction

For more information on configuring these custom features, see Using a Custom Heritrix Configuration File.

Access information related to the Heritrix connector.