Heritrix connector for the Aspire content processing system.

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

The Aspire Heritrix connector uses the Heritrix 3.1 crawl engine to crawl seed URLs according to a Heritrix job configuration file (a Spring application context file in CXML format). Instead of saving the crawled content to a WARC file, as Heritrix would, the connector implements its own processor, which forwards all content extracted by the crawl engine to an Aspire pipeline.
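
The forwarding idea can be illustrated with a minimal, self-contained Java sketch. It does not use the Heritrix or Aspire APIs: `CrawledPage`, `ForwardingProcessor`, and `ForwardingDemo` are hypothetical names invented for this example, the "pipeline" is just a `Consumer`, and the rule of forwarding only non-empty 2xx responses is an assumption, not documented connector behavior.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical stand-in for one fetched URL; not a Heritrix or Aspire type. */
final class CrawledPage {
    final String url;
    final int status;
    final byte[] body;

    CrawledPage(String url, int status, byte[] body) {
        this.url = url;
        this.status = status;
        this.body = body;
    }
}

/** Hypothetical processor that forwards fetched content to a pipeline instead of writing it to an archive file. */
final class ForwardingProcessor {
    private final Consumer<Map<String, String>> pipeline;

    ForwardingProcessor(Consumer<Map<String, String>> pipeline) {
        this.pipeline = pipeline;
    }

    /** Assumed filter: only successful responses with a non-empty body are forwarded. */
    boolean shouldProcess(CrawledPage page) {
        return page.status >= 200 && page.status < 300 && page.body.length > 0;
    }

    /** Builds a simple URL-plus-content document and hands it to the pipeline. Returns true if forwarded. */
    boolean process(CrawledPage page) {
        if (!shouldProcess(page)) {
            return false;
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", page.url);
        doc.put("content", new String(page.body, StandardCharsets.UTF_8));
        pipeline.accept(doc);
        return true;
    }
}

final class ForwardingDemo {
    /** Runs two sample pages through the processor and returns what reached the pipeline. */
    static List<Map<String, String>> run() {
        List<Map<String, String>> received = new ArrayList<>();
        ForwardingProcessor processor = new ForwardingProcessor(received::add);
        processor.process(new CrawledPage("http://example.com/", 200,
                "<html>hello</html>".getBytes(StandardCharsets.UTF_8)));
        processor.process(new CrawledPage("http://example.com/missing", 404, new byte[0]));
        return received;
    }

    public static void main(String[] args) {
        // Prints "forwarded 1 document(s)": the 404 page is skipped.
        System.out.println("forwarded " + run().size() + " document(s)");
    }
}
```

Passing the pipeline in as a plain callback keeps the processor independent of where the documents end up, which mirrors the split described above between the crawl engine and the Aspire pipeline.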

 

Heritrix Connector

AppBundle Name: Heritrix Connector
Maven Coordinates: com.searchtechnologies.appbundles:cws-heritrix-connector
Versions: 1.0-SNAPSHOT
Type Flags: scheduled
Inputs: A Heritrix standard or custom job application context configuration file.
Outputs: An Aspire Object containing the URL and content for each crawled URL.

Features


Access information related to the Heritrix connector.


 
