Heritrix connector for the Aspire content processing system.

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

The Aspire Heritrix connector uses the Heritrix 3.1 crawl engine to crawl seed URLs according to a Heritrix job configuration file (a Spring application context CXML file). Instead of saving the crawled content to a WARC file, as Heritrix does by default, Aspire implements its own processor, which forwards all content extracted by the crawl engine to an Aspire pipeline.
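The forwarding idea described above can be illustrated with a minimal, self-contained sketch. It does not use the real Heritrix or Aspire APIs: `CrawledPage`, `PagePipeline`, `PageProcessor`, and `ForwardingProcessor` are hypothetical stand-ins invented for this illustration, and the actual connector classes may look quite different. The sketch only shows the pattern of a processor that hands each fetched URL and its content to a pipeline rather than writing it to an archive file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one crawled URL and its fetched content. */
final class CrawledPage {
    final String url;
    final byte[] content;

    CrawledPage(String url, byte[] content) {
        this.url = url;
        this.content = content;
    }
}

/** Hypothetical stand-in for the downstream pipeline (an Aspire pipeline in the real connector). */
interface PagePipeline {
    void submit(CrawledPage page);
}

/** Hypothetical stand-in for a processing step in the crawl engine's processor chain. */
interface PageProcessor {
    void process(CrawledPage page);
}

/** Forwards every crawled page to the pipeline instead of writing it to an archive file. */
final class ForwardingProcessor implements PageProcessor {
    private final PagePipeline pipeline;

    ForwardingProcessor(PagePipeline pipeline) {
        this.pipeline = pipeline;
    }

    @Override
    public void process(CrawledPage page) {
        // No file is written here; the page goes straight to the pipeline.
        pipeline.submit(page);
    }
}

final class ForwardingDemo {
    public static void main(String[] args) {
        // A toy pipeline that simply records the URLs it receives.
        List<String> received = new ArrayList<>();
        PageProcessor processor = new ForwardingProcessor(page -> received.add(page.url));

        processor.process(new CrawledPage("http://example.com/",
                "<html>home</html>".getBytes(StandardCharsets.UTF_8)));
        processor.process(new CrawledPage("http://example.com/about",
                "<html>about</html>".getBytes(StandardCharsets.UTF_8)));

        System.out.println(received);
    }
}
```

In the real connector, a processor of this kind would presumably be registered in the job's CXML file in place of the WARC writer bean; the exact bean and class names are not given here and would need to be taken from the connector's source.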

Heritrix Connector
AppBundle Name	Heritrix Connector
Maven Coordinates	com.searchtechnologies.appbundles:cws-heritrix-connector
Versions	1.0-SNAPSHOT
Type Flags	scheduled
Inputs	A Heritrix standard or custom job application context configuration file.
Outputs	An Aspire Object containing the URL and content for each crawled URL.

Features

Access information related to the Heritrix connector:

GitHub repository for this open source connector
Heritrix 3.0 version of the source
Heritrix license
