Heritrix Administration FAQ (Aspire 2)

General

After a few minutes, there are no updates submitted in the content source statistics

This is normally a DNS issue. Check with your network team to see if the site URL(s) you want to crawl are reachable from the server running Aspire.
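
Before involving the network team, a quick name-resolution check run on the Aspire server can confirm whether DNS is the problem. The following is a minimal sketch, not part of Aspire or Heritrix: the function names (host_of, resolve, check_seed_urls) are hypothetical, and it only tests DNS resolution, not firewalls or proxies.

```python
import socket
from urllib.parse import urlsplit


def host_of(url):
    """Extract the host name from a seed URL (hypothetical helper)."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError("no host in URL: %r" % url)
    return host


def resolve(host):
    """Return the sorted unique IP addresses for host, or [] if DNS lookup fails."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_seed_urls(urls):
    """Map each seed URL to the addresses its host resolves to on this machine."""
    return {url: resolve(host_of(url)) for url in urls}
```

Calling check_seed_urls with the seed URLs of the content source returns a dictionary; any URL mapped to an empty list could not be resolved from the server, which is worth reporting to the network team.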

Why does an incremental crawl last as long as a full crawl?

The Heritrix Connector performs incremental crawls using a disk-backed HashMap, which records exactly which documents the connector has indexed to the search engine, each associated with a content digest signature. During an incremental crawl the connector crawls the configured web sites in full, the same way as a full crawl, but it only indexes the documents found to be new, modified or deleted during that crawl. The crawl time is therefore similar; only the amount of indexing work is reduced.
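
The comparison described above can be sketched as follows. This is an illustration only, under stated assumptions: an in-memory dict stands in for the disk-backed HashMap, SHA-1 is assumed as the digest (the connector's actual digest algorithm is not stated here), and the function names are hypothetical.

```python
import hashlib


def digest(content):
    """Content digest signature for a document body (SHA-1 is an assumption)."""
    return hashlib.sha1(content).hexdigest()


def incremental_pass(previous, crawled):
    """Compare a fresh crawl against the digests stored by the previous crawl.

    previous: dict of URL -> digest saved by the last crawl
    crawled:  dict of URL -> raw content fetched by this crawl (every page is fetched)
    Returns the new snapshot plus the URLs to add, update and delete in the index.
    """
    current = {url: digest(body) for url, body in crawled.items()}
    added = sorted(u for u in current if u not in previous)
    modified = sorted(
        u for u in current if u in previous and current[u] != previous[u]
    )
    deleted = sorted(u for u in previous if u not in current)
    return current, added, modified, deleted
```

Note that every page still has to be fetched to compute its digest, which is why the crawl itself takes about as long as a full one; only unchanged documents are skipped at indexing time.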

For a discussion on crawling, see here

Save your content source before creating or editing another one

Failing to save a content source before creating or editing another content source can result in an error.

ERROR [aspire]: Exception received attempting to get execute component command com.searchtechnologies.aspire.services.AspireException: Unable to find content source

Save the initial content source before creating or working on another.

Why am I only getting 6000 documents discovered per URL?

By default, Heritrix extracts a maximum of 6000 links from a single URL; any further links found on that page are discarded and therefore not crawled. You can change this limit by modifying a bean in a custom Heritrix crawler beans file. For details on how to configure that, see Using a Custom Heritrix Configuration File.
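
To judge whether a particular page is affected by this limit, it can help to count the links it contains. The sketch below is a rough diagnostic, not Heritrix's own extraction logic: it assumes that counting tags with a non-empty href or src attribute approximates the number of extracted links, and the names LinkCounter, count_links and links_discarded are hypothetical.

```python
from html.parser import HTMLParser

DEFAULT_LINK_LIMIT = 6000  # the default limit described above


class LinkCounter(HTMLParser):
    """Counts tags carrying a non-empty href or src attribute (an approximation)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # A tag is counted once even if it carries both href and src.
        if any(name in ("href", "src") and value for name, value in attrs):
            self.count += 1


def count_links(html):
    """Return the approximate number of links in an HTML string."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.count


def links_discarded(html, limit=DEFAULT_LINK_LIMIT):
    """Estimate how many links would fall beyond the per-URL limit."""
    return max(0, count_links(html) - limit)
```

If links_discarded returns a value greater than zero for a hub or sitemap-style page, raising the limit in a custom configuration is likely worthwhile.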
