Heritrix connector for the Aspire content processing system.
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
The Aspire Heritrix connector uses the Heritrix 3.1 crawl engine to crawl seed URLs according to a Heritrix job configuration file (a Spring application context file in CXML format). Instead of writing the crawled content to a WARC file, as Heritrix does by default, Aspire implements its own processor that forwards all content extracted by the crawl engine to an Aspire pipeline.
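The forwarding idea described above can be illustrated with a minimal, self-contained sketch. It does not use the real Heritrix or Aspire APIs: `CrawledPage`, `PipelineSink`, and `ForwardingProcessor` are hypothetical stand-ins, and the rule of forwarding only non-empty 2xx responses is an assumption made for the example, not documented connector behavior.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: every type here is a hypothetical stand-in,
// not a class from Heritrix or Aspire.
class ForwardingDemo {

    // Stand-in for the per-URL record a crawl engine hands to a processor.
    static final class CrawledPage {
        final String url;
        final int httpStatus;
        final byte[] content;

        CrawledPage(String url, int httpStatus, byte[] content) {
            this.url = url;
            this.httpStatus = httpStatus;
            this.content = content;
        }
    }

    // Stand-in for the entry point of a downstream processing pipeline.
    interface PipelineSink {
        void submit(String url, byte[] content);
    }

    // Takes the place of an archive writer: instead of appending the page
    // to a file, it hands the URL and content to the pipeline sink.
    static final class ForwardingProcessor {
        private final PipelineSink sink;

        ForwardingProcessor(PipelineSink sink) {
            this.sink = sink;
        }

        // Assumed rule for this sketch: forward only non-empty 2xx responses.
        boolean shouldProcess(CrawledPage page) {
            return page.httpStatus >= 200 && page.httpStatus < 300
                    && page.content.length > 0;
        }

        void process(CrawledPage page) {
            if (shouldProcess(page)) {
                sink.submit(page.url, page.content);
            }
        }
    }

    public static void main(String[] args) {
        List<String> forwarded = new ArrayList<>();
        ForwardingProcessor processor =
                new ForwardingProcessor((url, content) -> forwarded.add(url));

        processor.process(new CrawledPage("http://example.com/", 200,
                "<html>hello</html>".getBytes(StandardCharsets.UTF_8)));
        processor.process(new CrawledPage("http://example.com/missing", 404,
                "not found".getBytes(StandardCharsets.UTF_8)));

        // Print the URLs that reached the sink.
        forwarded.forEach(System.out::println);
    }
}
```

Keeping the sink behind a small interface means the crawl side never needs to know what the pipeline does with the content, which mirrors the separation between the crawl engine and the Aspire pipeline described above.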
Heritrix Connector | |
---|---|
AppBundle Name | Heritrix Connector |
Maven Coordinates | com.searchtechnologies.appbundles:cws-heritrix-connector |
Versions | 1.0-SNAPSHOT |
Type Flags | scheduled |
Inputs | A standard or custom Heritrix job configuration file (Spring application context). |
Outputs | An Aspire Object containing the URL and content for each crawled URL. |
Access information related to the Heritrix connector.